Not so trivially, these classifiers can help each other. What you have is
a form of transduction or example-based learning.
On Fri, May 18, 2012 at 5:24 PM, Sean Owen sro...@gmail.com wrote:
Trivially it's four classifiers. You have just one input here, and
it's binary. That seems like too
Sounds like a class path issue.
Sent from my iPhone
On May 15, 2012, at 2:43 AM, Yohan Chin yohan@gmail.com wrote:
Hi,
Recently, I've tried to utilize elephant-bird for loading mahout results into
pig.
I could install elephant-bird and got the .jar file,
and followed the instructions as
What you are missing is a Linux compatible environment. Running programs under
Cygwin can be pretty difficult because of the path name insanity that often
ensues.
Sent from my iPhone
On May 13, 2012, at 6:33 PM, mahout-newbie raman.sriniva...@gmail.com wrote:
When I try to run the 20
Tim,
Sorry for the confusion and lack of help. Pig-vector is half-done and not
even quite half-baked.
Your help in updating the readme is very much appreciated.
On Mon, May 14, 2012 at 10:17 AM, Timothy Potter thelabd...@gmail.com wrote:
Hi Ted,
Re:
In the readme, there is an example of
I have tried it. And an unnamed large customer of ours has tried it with good
results. That isn't much of a track record yet but it is encouraging.
All of this use so far is as part of k-nearest neighbor work so there isn't a
comparison for pure clustering. Also, this work is all at 10-50
One thing that may be happening here is that the scale of your data varies
from place to place.
Have you tried the upcoming k-means stuff?
On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel p...@farfetchers.com wrote:
One problem I have is that virtually any value for T gives me a very large
number
Roughly.
But it also gives you a small-ish surrogate for your data that would let
you use all kinds of different clustering methods since the surrogate fits
in memory.
On Sat, May 12, 2012 at 9:51 AM, Pat Ferrel p...@occamsmachete.com wrote:
This is why canopy has been frustrating because by
Wish I could be there.
Can you send slides when they are available?
On Sat, May 12, 2012 at 2:58 AM, Sebastian Schelter s...@apache.org wrote:
Hi,
I will give a talk titled Large Scale Graph Processing with Apache
Giraph in Berlin on May 29th. Details are available at:
Yes. It may help with variable scale.
The class technique for dealing with that is to cluster with a small number
of clusters at a gross level and then cluster each set of documents that
belong to a single large cluster. This automatically adapts to different
scales.
The new stuff would
Regarding whether this is classification or clustering, it is clustering
but you have some initial conditions that should be used to prime the
algorithm.
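To make the two-level technique concrete, here is a minimal sketch; the
kmeans() helper is hypothetical, standing in for whatever k-means
implementation you actually use:

import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.math.Vector;

// Two-level clustering sketch: cluster coarsely first, then re-cluster
// each big coarse cluster on its own so the effective scale adapts to
// the local density of the data.
public abstract class TwoLevelClustering {

  // hypothetical helper: returns the input points grouped by cluster
  protected abstract List<List<Vector>> kmeans(List<Vector> points, int k);

  public List<List<Vector>> cluster(List<Vector> docs, int coarseK,
                                    int fineK, int bigThreshold) {
    List<List<Vector>> result = new ArrayList<List<Vector>>();
    for (List<Vector> coarse : kmeans(docs, coarseK)) {
      if (coarse.size() > bigThreshold) {
        result.addAll(kmeans(coarse, fineK));  // sub-cluster the big ones
      } else {
        result.add(coarse);                    // small clusters stay as-is
      }
    }
    return result;
  }
}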
Manuel's links are excellent. The LSH hash based clustering in the new
clustering codes could be competitive with these other methods in the
PigModelStorage stores SGD models.
The elephant bird stuff stores data in the form of vectors.
On Fri, May 11, 2012 at 11:38 AM, Timothy Potter thelabd...@gmail.com wrote:
So my main question is what does the elephant-bird model storage stuff do
that PigModelStorage doesn't?
On Mon, May 7, 2012 at 12:01 AM, Dawid Weiss
dawid.we...@cs.put.poznan.pl wrote:
- it doesn't have the final pass of in-memory clustering so it really just
gives you an indifferent quality clustering with a huge number of weighted
clusters. With the final pass, it will give you a high
As Sean points out, cosine should pick up on this. You will have the usual
problems with small counts that any rating based system has.
And in spite of your last comment, I would strongly recommend that you test
a boolean approach wherein *any* action is considered positive and another
where
Pat,
You may be interested in the code at https://github.com/tdunning/knn
This includes some high speed clustering code that could help you with your
issues. To wit,
- there aren't as many knobs to tweak on the algorithm (you still have data
scaling tricks to do)
- the speed should be 10-100x
On Sat, May 5, 2012 at 12:06 AM, hao wang wang...@huofar.com wrote:
1) is there any way we can dump the weights of the features from a
trained model?
Yes. Use the model dissector or just grab the weights out of the model.
You can also access the weights matrix directly using getBeta()
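For example, a minimal sketch of dumping weights via getBeta(); the model
configuration here (two categories, 1000 hashed features, L1 prior) is made
up for illustration:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Matrix;

public class DumpWeights {
  public static void main(String[] args) {
    OnlineLogisticRegression model =
        new OnlineLogisticRegression(2, 1000, new L1());
    // ... training happens here ...
    Matrix beta = model.getBeta();
    // one row per category (minus one), one column per feature
    for (int row = 0; row < beta.rowSize(); row++) {
      for (int col = 0; col < beta.columnSize(); col++) {
        double w = beta.get(row, col);
        if (w != 0) {
          System.out.printf("category %d, feature %d -> %.4f%n", row, col, w);
        }
      }
    }
  }
}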
Gently here:
You misspelled woWpal wabbit.
I look forward to seeing you at the graphlab workshop and hearing more
about this.
On Thu, May 3, 2012 at 7:06 AM, Nicholas Kolegraff
nickkolegr...@gmail.com wrote:
Hi Everyone,
I'm working on a Linux Distro with a focus around Machine Learning and
Thanks for including Mahout.
As a point of strategy, wouldn't it have been better to just build a debian
package repository and a script for installing packages? That would allow
people to use their own debian or ubuntu based distros for their own special
needs such as hardware virtualization or special
Yes. It is impossible for me to correctly spell when correcting somebody
else's spelling.
I think that this follows from the general karmic principle.
On Thu, May 3, 2012 at 9:36 AM, Sean Owen sro...@gmail.com wrote:
*V*owpal Wabbit ? :)
On Thu, May 3, 2012 at 5:32 PM, Ted Dunning ted.dunn
On Thu, May 3, 2012 at 10:06 AM, Nicholas Kolegraff nickkolegr...@gmail.com
wrote:
... I have this crazy notion that nothing should ever be installed and
bootstrapping is really annoying.
This opinion is more and more in the minority. Yum and apt have made this
much less painful. And
Don't take any of our suggestions as discouragement. At most treat them as an
excuse to reexamine your decisions.
Sent from my iPhone
On May 3, 2012, at 6:58 PM, Nicholas Kolegraff nickkolegr...@gmail.com wrote:
Agree, this could prove insane. If that is the case, it wouldn't be *too*
On Wed, May 2, 2012 at 11:06 AM, Timothy Potter thelabd...@gmail.com wrote:
We're really keen on Ted's pig-vector project
(https://github.com/tdunning/pig-vector) as we're building a number of
classifiers on Mahout's SGD framework, with the bulk of our data being
in Cassandra processed almost
Making a pig module for mahout is a fine idea. The twitter guys may have
something better, though, so we should explore that as well. Andy's
comments make that possibility very interesting.
On Wed, May 2, 2012 at 5:20 PM, Timothy Potter thelabd...@gmail.com wrote:
Thanks Ted! Removing the
On Wed, May 2, 2012 at 9:05 PM, Jake Mannix jake.man...@gmail.com wrote:
On Wed, May 2, 2012 at 8:07 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Making a pig module for mahout is a fine idea. The twitter guys may have
something better, though, so we should explore that as well. Andy's
On Mon, Apr 30, 2012 at 1:36 AM, Amrhal Lelasm arm...@hotmail.com wrote:
I'm wondering how I can combine these two to get the input data for my
recommender engine. Do I start by implementing the JDBCDataModel, or...?
Yes.
I'd appreciate any insight you might have on this.
Sounds like
Yuriy,
Take a look at https://github.com/tdunning/knn to see some upcoming k-means
stuff that may help you out with respect to speed.
On Sat, Apr 28, 2012 at 11:19 AM, Юрий Басов basov.yo1...@gmail.com wrote:
Good day.
My name is Yuriy. I'm working as an engineer at Rambler Internet Holding.
Putting a smaller value here will degrade prediction quality because more
and more features will collide in the hashed feature space. Increasing
this beyond a certain point, however, will not significantly increase
prediction quality and it will increase memory usage.
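A small illustration of the collision effect using the hashed encoders; the
word list and vector sizes here are arbitrary:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashSizeDemo {
  public static void main(String[] args) {
    String[] words = {"alpha", "beta", "gamma", "delta", "epsilon"};
    for (int size : new int[] {20, 1000, 100000}) {
      StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
      encoder.setProbes(1);            // one hash location per word
      Vector v = new RandomAccessSparseVector(size);
      for (String word : words) {
        encoder.addToVector(word, v);
      }
      // fewer occupied slots than words means features collided
      System.out.printf("size %6d -> %d distinct slots used%n",
          size, v.getNumNondefaultElements());
    }
  }
}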
On Fri, Apr 27, 2012 at
It is determined automagically by an evolutionary process.
From what I hear, it has a tendency to do a good job on regularization and
a bad job on learning rate optimization.
On Fri, Apr 27, 2012 at 11:41 PM, Yang tedd...@gmail.com wrote:
when I run
mahout trainlogistic
is there an
The GA is old code and unused and unmaintained for the most part. I would
expect that unless somebody steps up, it is a candidate for removal.
The EP code is an implementation of recorded step meta-mutation as
described here: http://arxiv.org/abs/0803.3838
The EP code is unrelated to genetic
I think that map-reduce has broader applicability than just places where you
need the sort, but I completely agree that other models are far better than
most graph theoretic programs unless you have a problem that is susceptible
to spectral methods. This last proviso applies because map-reduce can
Nicolas,
Are you replying to this? Or asking these questions?
On Tue, Apr 17, 2012 at 11:03 AM, Nicolas Pied nicolas.p...@gmail.com wrote:
Hello,
I would like to implement an application like Like.fm / Pandora (but
simpler) that suggests music close to a given one.
I think
If you really want to recommend music that people will like, you have to
start from the realization that most of musical appreciation is social, not
auditory. This has been substantiated in controlled tests where as much as
60% of appreciation was driven by very weak social cues. In my
Now that I have been all negative, if you want to go developing auditory
features, look up music information retrieval. The ISMIR conferences have
a wealth of information.
http://www.ismir.net/
On Tue, Apr 17, 2012 at 11:03 AM, Nicolas Pied nicolas.p...@gmail.com wrote:
Hello,
I would
So, the first thought that I have is that it sounds like you have dense
variables rather than sparse. This may affect behavior of the Mahout
system. If you have some text-like features of the ad, then you may get
cleaner results.
Secondly, I don't see any interaction features. With as much
Well, this shorter reference does avoid the problem of having a typo in the
abstract.
On Mon, Apr 9, 2012 at 2:35 AM, Sebastian Schelter s...@apache.org wrote:
I use a (not so beautiful) very short reference:
@Unpublished{Mahout,
key = {Apache Mahout},
title = {Apache {Mahout}},
Beautiful, I was just writing up some clustering work and needed exactly
this.
Thanks!
On Sun, Apr 8, 2012 at 4:54 PM, Manuel Blechschmidt
manuel.blechschm...@gmx.de wrote:
Hi Ahmed,
I used the following BibTex entry in my Master Thesis:
@webpage{mahout,
Abstract = {Apache Mahout's
The current state of the art in ad recognition is contextual bandits backed up
by logistic or probit regression. The mahout logistic regression is a decent
first step on this but probably doesn't provide the necessary accuracy.
I have some early work on the bandit algorithms on github but
There is also the stochastic projection code. Search for ssvd in the mailing
list archives.
Sent from my iPhone
On Apr 4, 2012, at 8:36 AM, Sebastian Schelter s...@apache.org wrote:
There is a distributed recommender that uses matrix factorization via
Alternating Least Squares. Due to
With this announcement, this group has a fork in the road facing us.
We can choose the Hadoop path of forcibly excluding anybody with a slightly
wrong commercial taint from discussions (I call this the more GNU than
GNU philosophy).
Or we can choose a real community based approach that includes
I am sorry, but I don't understand the question.
All of the code in Mahout compiles. This is verified several times a day
by the continuous integration testing.
Can you say more specifically what you mean? Line 95 of what?
On Wed, Apr 4, 2012 at 12:18 PM, Ahmed Abdeen Hamed
works, but figured it's as good a time as any to ask.
On Wed, Apr 4, 2012 at 5:35 PM, Ted Dunning ted.dunn...@gmail.com wrote:
With this announcement, this group has a fork in the road facing us.
We can choose the Hadoop path of forcibly excluding anybody with a
slightly wrong
set of items. That makes the computation of similarity between
users imprecise and consequently reduces the accuracy of CF
algorithms.
http://www.jucs.org/jucs_17_4/a_clustering_approach_for
On Sun, Apr 1, 2012 at 1:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Could you say a bit more
preferences? What about a
semi-anonymous model? Very good answer. Thanks, Ted
On Mon, Apr 2, 2012 at 7:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:
This problem is much more commonly referred to as the cold start problem
and is far smaller than many authors assume. Typically a dozen good
Could you say a bit more about what you mean? Which data sparsity problem?
Sent from my iPhone
On Apr 1, 2012, at 6:35 AM, ziad kamel ziad.kame...@gmail.com wrote:
Hi,
Is there any way that mahout CF can overcome the data sparsity problem?
Thanks
It depends.
The large scale systems for item based recommendations definitely do not do
this.
Sent from my iPhone
On Apr 1, 2012, at 7:13 AM, ziad kamel ziad.kame...@gmail.com wrote:
Does Mahout compute the similarity between every pair of users to
determine their neighborhoods?
It is very common that preferences or ratings DECREASE recommendation
performance.
The basic reason is that there is little or no real signal in the ratings
after you account for the fact that the rating exists at all.
In practice, there is the additional reason that if you don't need a
rating,
Split your training data into lots of little files. Depending on the wind,
that may cause more mappers to be invoked.
On Thu, Mar 29, 2012 at 3:05 PM, Jason L Shaw jls...@uw.edu wrote:
Suggestion, indeed. I passed that option, but still only 2 mappers were
created.
On Thu, Mar 29, 2012 at
?
On Thu, Mar 29, 2012 at 5:04 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
It is very common that preferences or ratings DECREASE recommendation
performance.
The basic reason is that there is little or no real signal in the ratings
after you account for the fact that the rating exists
Have you subscribed?
Most readers of the email list will assume that you have subscribed to the
list and they will answer to the list. If you haven't subscribed, you
won't see these answers.
On the other hand, some questions may not be answered if the questions are
difficult to understand or
The smallest eigenvalues are always problematic in large matrices.
Any trick to expose them (such as the diagonal subtraction that you
mention) should work with any of our stuff as well.
On Tue, Mar 27, 2012 at 2:01 AM, Dan Brickley dan...@danbri.org wrote:
If one wanted the *smallest*
recommendations and
models per click of the user (because you need to rebuild the data in
HDFS, run your batch job, and return an answer)
-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Monday, March 26, 2012 00:56
To: user@mahout.apache.org
Subject: Re
It sounds like the original poster isn't clear about the division between
off-line and on-line work.
Almost all production recommendation systems have a large off-line
component which analyzes logs of behavior and produces a recommendation
model. This model typically consists of item-item
Not really. See my previous posting.
The best way to get fast recommendations is to use an item-based
recommender. Pre-computing recommendations for all users is not usually a
win because you wind up doing a lot of wasted work and you still don't have
anything for new users who appear between
On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren oren.ra...@intel.com wrote:
...
The system I need should of course give the recommendation itself in no
time.
...
But because I'm talking about very large scales, I guess that I want to
push much of my model computation to offline mode (which
On Sun, Mar 25, 2012 at 4:02 PM, Razon, Oren oren.ra...@intel.com wrote:
So let's continue with your example... I will build an item-to-item
similarity matrix on Hadoop and then will do online recommendation based on
it and the user's ranked items.
Yes.
So where will the online part sit? Is it
I don't know what you mean by significant any more than Sean.
But serendipity in a recommender comes from two sources. Both must be
present. One source is having enough people who interact with the
recommender. The second source is a judicious injection of exploration
which can come from
they want. good luck
:-)
On 24 March 2012 17:00, Ted Dunning ted.dunn...@gmail.com wrote:
I don't know what you mean by significant any more than Sean.
But serendipity in a recommender comes from two sources. Both must be
present. One source is having enough people who interact
My own recommendation is to reduce both scores to binary form using
whatever sound statistical method you care to adopt and then use OR.
A viable alternative that is relatively good is to convert both scores to
percentiles with the same polarity (i.e. 99-th %-ile is very close or very
similar).
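A little sketch of the percentile conversion; the scores are made up, and
the distance percentile is flipped so both signals point the same way:

import java.util.Arrays;

public class PercentileCombine {

  // fraction of the sorted reference sample at or below the score
  static double percentile(double[] sortedSample, double score) {
    int pos = Arrays.binarySearch(sortedSample, score);
    pos = pos < 0 ? -pos - 1 : pos + 1;
    return 100.0 * pos / sortedSample.length;
  }

  public static void main(String[] args) {
    double[] distances = {0.1, 0.4, 0.7, 1.3, 2.2};    // smaller is closer
    double[] similarities = {0.2, 0.5, 0.6, 0.8, 0.9}; // bigger is closer
    Arrays.sort(distances);
    Arrays.sort(similarities);
    // flip the distance percentile so 99th %-ile means "very close"
    double p1 = 100 - percentile(distances, 0.4);
    double p2 = percentile(similarities, 0.8);
    System.out.printf("combined score: %.1f%n", Math.max(p1, p2));
  }
}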
Session data never needs to be in memory. It can be processed sequentially or
using map reduce.
The item item data is all you need in memory.
Sent from my iPhone
On Mar 18, 2012, at 10:19 PM, Mridul Kapoor mridulkap...@gmail.com wrote:
On 19 March 2012 02:24, Ted Dunning ted.dunn
While I didn't do as nice a job as your friend, TFIDF of n-grams has
consistently done very well for me. The soft TFIDF that they examine is
something that I haven't previously looked at, but everything else seems
just in order based on what I have seen.
On Mon, Mar 19, 2012 at 1:06 PM, Dawid Weiss
On Mon, Mar 19, 2012 at 10:06 PM, Mridul Kapoor mridulkap...@gmail.com wrote:
Is there a way that I can run the ItemSimilarityJob on a single
machine?
Yes. There is a sequential invocation as well.
The last third of the Mahout in Action book covers this pretty extensively.
On Sun, Mar 18, 2012 at 5:25 AM, Felix.徐 ygnhz...@gmail.com wrote:
Hi,all.
I'm new to mahout; it seems that logistic regression is already integrated
into
Mridul,
What is the humongous amount of data in Mongo? Is it really item-item
links? Or is it session information?
With a recommender, it is unusual to have more than a few hundred links to
other items for any given item. This means that even for 10 million items,
you only have about a
This is search, not recommendation.
For search, you need to build an index (which can be built off-line). In
the process of building that index, you can propagate content terms across
highly similar (behaviorally) items and you can include references to and
from similar items.
Content-based
In order to get the time similarity that you want, you can have virtual users
for each session as well as real users for longer time periods. The longer
periods will have weaker statistics so you probably won't have to weight
things.
This will let you use the standard Mahout framework for everything
Sean's comment is dead-on and your design inclinations are just fine.
Hadoop can (eventually) help with the offline item similarity computation.
The existing Mahout recommendation engine can do the actual item
recommendation work at very high speed with an appropriate data store.
On Mon, Mar
Be aware that cluster based recommenders almost never perform as well as
user/item based recommenders.
On Mon, Mar 12, 2012 at 10:03 AM, Ahmed Abdeen Hamed
ahmed.elma...@gmail.com wrote:
This is really great. Thanks so much!
-Ahmed
On Mon, Mar 12, 2012 at 12:13 PM, Sean Owen
Actually I don't think that you will need to implement your own item
similarity.
Just preprocess your input by grouping by user and sorting by time. Then break
user sessions into separate users and emit the standard user,item,pref format
for the mahout processing. The pref will always be 1 in
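Something like this sketch; the Event type, the 30 minute timeout, and the
virtual user id scheme are all placeholders, not anything built into Mahout:

import java.util.List;

public class SessionSplitter {

  static final long SESSION_TIMEOUT_MS = 30 * 60 * 1000;  // assumed timeout

  static class Event {
    final long userId;
    final long itemId;
    final long time;
    Event(long userId, long itemId, long time) {
      this.userId = userId;
      this.itemId = itemId;
      this.time = time;
    }
  }

  // input must already be grouped by user and sorted by time;
  // output is the standard user,item,pref format with pref always 1
  static void emit(List<Event> oneUsersEvents) {
    int session = 0;
    long lastTime = Long.MIN_VALUE;
    for (Event e : oneUsersEvents) {
      if (e.time - lastTime > SESSION_TIMEOUT_MS) {
        session++;                                   // a new virtual user
      }
      lastTime = e.time;
      long virtualUser = e.userId * 10000 + session; // toy id scheme
      System.out.println(virtualUser + "," + e.itemId + ",1");
    }
  }
}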
I would generally recommend using the LLR similarity.
But if you have an itch, scratch it. I do think we have a Tanimoto similarity
already, possibly under a slightly different name.
Sent from my iPhone
On Mar 12, 2012, at 2:00 PM, Mridul Kapoor mridulkap...@gmail.com wrote:
Ah, right.
It is probably worth trying the LLR item-item off-line build. This is more
like what the guy needs than raw counts.
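For reference, the in-memory version of an LLR-based item-item recommender
looks roughly like this; the file name and user id are placeholders, and
prefs.csv is assumed to hold user,item lines:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class LlrItemRecommender {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // top 10 recommendations for user 42
    List<RecommendedItem> recs = recommender.recommend(42L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " " + rec.getValue());
    }
  }
}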
On Sun, Mar 11, 2012 at 5:42 AM, Sean Owen sro...@gmail.com wrote:
No, it's so easy you can do it in about 20 lines of code so I don't
think it really warrants a software
A separate project like this is a better way to package this in any case.
It is bad practice to have developers modifying Mahout itself in order to
build their applications.
Nice work, Manuel!
On Wed, Mar 7, 2012 at 10:00 AM, Manuel Blechschmidt
manuel.blechschm...@gmx.de wrote:
Hi Ben,
I
Business logic like this can be built into the IDRescorer capabilities.
There is a lot of information in the mailing list archive on this kind of
thing.
See
http://www.lucidimagination.com/search/p:mahout?q=IDRescorersearchProvider=lucid
and
And further, linear Markov chains can be expressed as matrix products which
can be computed efficiently using SVD's.
Zoltan, is this literally the problem that you are working on? Or is this
a shadow of the problem that you are interested in?
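For concreteness, the n-step transition probabilities are just the n-th power
of the transition matrix; plain repeated squaring below, though a
decomposition pays off for very large n, and the chain itself is made up:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class MarkovPower {

  // p raised to the n-th power by repeated squaring
  static Matrix power(Matrix p, int n) {
    Matrix result = identityLike(p);
    Matrix base = p;
    while (n > 0) {
      if ((n & 1) == 1) {
        result = result.times(base);
      }
      base = base.times(base);
      n >>= 1;
    }
    return result;
  }

  static Matrix identityLike(Matrix p) {
    Matrix id = new DenseMatrix(p.rowSize(), p.columnSize());
    for (int i = 0; i < id.rowSize(); i++) {
      id.set(i, i, 1);
    }
    return id;
  }

  public static void main(String[] args) {
    Matrix p = new DenseMatrix(new double[][] {{0.9, 0.1}, {0.5, 0.5}});
    System.out.println(power(p, 8));  // 8-step transition probabilities
  }
}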
On Sat, Mar 3, 2012 at 9:55 AM, Jack Tanner
I think that you have an invocation or format bug and you are effectively
giving NB different data than you think.
Note that this is what is called a stopped clock model. That means it is only
getting correct results by putting out a constant value.
Sent from my iPhone
On Feb 28, 2012, at 2:58
This is a tiny dataset. Have you considered just trying R? In fact, in terms
of just diagnosing the problem, it would be good to run a regression in R first.
Sent from my iPhone
On Feb 27, 2012, at 3:57 AM, Naveenchandra naveenchandr...@gmail.com wrote:
Hi guys,
Thanks a lot for your regular
If your synthetic data comes from the same distribution for yellow and purple
then clearly no classifier will help.
Also, naive Bayes wants words, not numbers.
Sent from my iPhone
On Feb 24, 2012, at 5:08 AM, Naveenchandra naveenchandr...@gmail.com wrote:
The Python code which I used is:
No problem.
And thank you for being kind when I used language less moderate than
appropriate.
On Thu, Feb 23, 2012 at 8:13 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:
2012/2/23 Ted Dunning ted.dunn...@gmail.com:
Is this a joke?
new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID
Aye say I.
Sent from my iPhone
On Feb 22, 2012, at 4:24 AM, Jake Mannix jake.man...@gmail.com wrote:
If we're able to wrap this release up cleanly and get quickly moving on to
new features again, maybe we can try this on a more regular basis, with
even releases being feature-work, and odd
Bigger is always better.
But you may be happier if you downsample the negative cases since they will
be providing very little value in this model.
Can you say what you mean by threshold? There is no threshold in Mahout's
logistic regression.
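To make both points concrete, a sketch; the keep rate and the 0.5 cutoff are
arbitrary choices made by the caller, not anything built into Mahout:

import java.util.Random;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class TrainWithDownsampling {

  static final double NEGATIVE_KEEP_RATE = 0.1;  // assumed, tune to taste
  static final Random RAND = new Random(42);

  // drop most negative examples before training
  static void trainOne(OnlineLogisticRegression model, int target, Vector v) {
    if (target == 0 && RAND.nextDouble() > NEGATIVE_KEEP_RATE) {
      return;
    }
    // note: downsampling shifts the intercept, so recalibrate or
    // adjust your decision cutoff accordingly
    model.train(target, v);
  }

  // the model emits a probability; any threshold is the caller's choice
  static boolean decide(OnlineLogisticRegression model, Vector v) {
    return model.classifyScalar(v) > 0.5;
  }
}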
On Tue, Feb 21, 2012 at 5:44 PM, Sagar Sharma
Mahout 0.4 is ancient.
Upgrade!
Nobody can help with such an old version, really.
On Sun, Feb 19, 2012 at 6:34 PM, Peyman Mohajerian mohaj...@gmail.com wrote:
Hi Dmitriy Others,
Dmitriy thanks for your previous response.
I have a follow up question to my LSA project. I have managed to
Efficiency is not normally a term used with classifiers. Can you define it?
From your confusion matrix, it looks like nearly all of your documents are
being classified into one class. That usually indicates that there is some
fundamental formatting difference between your original training data
John,
This is well said and is a critical need.
There are some beginnings to this. The recommender side of the house
already works the way you say. The classifier and hashed encoding API's
are beginning to work that way. The naive Bayes classifiers pretty much do
not and the classifier API's
On Tue, Feb 14, 2012 at 2:25 AM, Lance Norskog goks...@gmail.com wrote:
...
OnlineLogisticRegression allocates DenseVector/DenseMatrix objects- if
it used RandomSparse Vector/Matrix could it operate on million-term
sparse arrays?
Not likely.
The feature vectors that come in are sparse and
Hash coded vectorization *is* a random projection. It is just one that
preserves some degree of sparsity. It definitely loses information when
you use it to decrease dimension of the input. It does not add bogus
information.
SGD doesn't like dense vectors, actually. In fact, one of the nice
On Sun, Feb 12, 2012 at 7:00 AM, Ted Dunning ted.dunn...@gmail.com wrote:
Hash coded vectorization *is* a random projection. It is just one that
preserves some degree of sparsity. It definitely loses information when
you use it to decrease dimension of the input. It does not add bogus
information
Trim the model by setting a minimum term frequency.
On Thu, Feb 2, 2012 at 9:39 PM, SAMIK CHAKRABORTY sam...@gmail.com wrote:
Hi,
I am new to mahout and hadoop.
I have created a model (following the train classifier command) which has a
size of 500MB. Now when I am loading the model for
I think your analysis is correct, but you are also probably correct that
having multiple levels at the same time would be preferable.
On Wed, Feb 1, 2012 at 1:05 PM, Stuart Smith stu24m...@yahoo.com wrote:
Hello,
I was curious about how bayes handles the ngram argument, and how it
could be
So the total size of the data is modest at about 560 M non-zero elements.
Total data should be small compared to your node sizes.
But the distribution of your data can be important as well.
Can you say if any of your rows or columns are extremely dense?
On Wed, Feb 1, 2012 at 4:58 PM, Kate
Matrix inverse is almost never a good idea. The same effect can usually be
had using a decomposition at far less cost. For instance, for solving a
linear system, QR decomposition provides two sub-matrices that can easily
have an inverse multiply operation applied to them avoiding the need for
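For example, solving A x = b with the QR decomposition in the Mahout math
library rather than an explicit inverse; the 2x2 system is a toy:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;

public class QrSolve {
  public static void main(String[] args) {
    Matrix a = new DenseMatrix(new double[][] {{4, 1}, {1, 3}});
    Matrix b = new DenseMatrix(new double[][] {{1}, {2}});
    // x = R^-1 Q^T b, computed without ever forming inverse(a)
    Matrix x = new QRDecomposition(a).solve(b);
    System.out.println(x);
  }
}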
There are a bunch of papers on this. Search "named entity recognizer CRF"
on google.
The basic idea is that an HMM or CRF has internal state that can be used to
mark named entities. We don't have to define what the hidden states mean,
just help the HMM or CRF find an internal representation that
From: Ted Dunning ted.dunn...@gmail.com
To: user@mahout.apache.org; Stuart Smith stu24m...@yahoo.com
Cc: Mahout List mahout-u...@lucene.apache.org
Sent: Monday, January 23, 2012 5:52 PM
Subject: Re: SGD: mismatch in percentCorrect vs classify() on training
data?
Hmm... I am surprised
If you have supervised training data (and it sounds that way), then
classification is likely to be more effective.
On Tue, Jan 24, 2012 at 7:44 PM, Vikas Pandya vika...@yahoo.com wrote:
Thanks. Creating vectors for these three columns and clustering them
doesn't bring the desired results. Here is
The HMM implementations might be of help, but I think that a small CRF
implementation that is oriented around string transduction would be more
helpful.
The Stanford Named Entity Recognizer (NER) has such an implementation. I
think NLTK has one. I think GATE has one as well.
The basic
I doubt if it will work on Hadoop 0.19. Mahout requires 0.20 and pretty
much always has. Changing that will be difficult to check even if it isn't
difficult to do.
In any case, you should probably get off of 0.19 as soon as possible as
well since there are known stability problems with that
Yes.
The use of Hadoop here makes things silly slow.
On Thu, Jan 19, 2012 at 8:07 AM, Daniel Korzekwa
daniel.korze...@gmail.com wrote:
./mahout trainclassifier -i /mnt/hgfs/C/daniel/my_fav_data/test -o model
-type bayes -ng 1 -source hdfs, it takes 40 seconds to train a model for a
file with
Mike,
I think that where you are going is that Mahout might be well served by
non-Hadoop implementations of map-reduce or by non-map-reduce frameworks,
especially where smaller data and experimental use is concerned.
You are right. Or, at least I agree with what I think you are saying.
Sean is
There are lots of QR decomposition algorithms and the results are not
necessarily unique, especially for rank deficient inputs.
If you post your exact results, I could comment more specifically. Without
more details, I really can't answer your question in any specific way.
On Wed, Jan 18, 2012
Time since the last packet from the same source or to the same destination
is another interesting feature.
On Tue, Jan 17, 2012 at 11:10 AM, Harry Potter harry123gr...@yahoo.com wrote:
thanks sir... that was really helpful..
From: Ioan Eugen Stan
On Sun, Jan 15, 2012 at 2:13 PM, Raviv Pavel ra...@gigya-inc.com wrote:
If I understand correctly, in normalization option #2 you mean that each
interest is encoded to a value so that the sum of all interests is 1?
Yes.
Also, What do you mean by normalize the interests to have unit vector
It isn't that bad. Maven is opinionated (that is a feature, not a defect).
But it isn't that hard to deal with.
The first concept to deal with is that maven has pre-defined life cycle
goals. The most important for most programmers are compile, test, package
and install. These pretty much mean
I usually prefer to represent location as an xyz triple on a unit sphere.
That allows Euclidean distance to be useful.
On the 1 of n encoded values. Euclidean works as well. For gender, it also
works fine.
The only issue is how to combine these with reasonable weightings. An easy
way to do
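A sketch of the xyz encoding; the city coordinates are approximate:

public class GeoToXyz {

  // encode latitude/longitude as a point on the unit sphere
  static double[] toXyz(double latDegrees, double lonDegrees) {
    double lat = Math.toRadians(latDegrees);
    double lon = Math.toRadians(lonDegrees);
    return new double[] {
        Math.cos(lat) * Math.cos(lon),
        Math.cos(lat) * Math.sin(lon),
        Math.sin(lat)
    };
  }

  public static void main(String[] args) {
    double[] berlin = toXyz(52.5, 13.4);
    double[] paris = toXyz(48.9, 2.4);
    double d = 0;
    for (int i = 0; i < 3; i++) {
      d += (berlin[i] - paris[i]) * (berlin[i] - paris[i]);
    }
    // chord length through the sphere, monotone in great-circle distance
    System.out.println(Math.sqrt(d));
  }
}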