Block size and input size should not matter for the Reducer. You do have to
explicitly say the number of workers.
It defaults to 1. You do set it with just these methods. Make sure you are
setting on the right object and before you run. Look for other things that
may be overriding it.
I don't
The JAR you ship to Hadoop needs to have all the required class files
including third-party dependencies. Right now you're just sending it
Mahout classes. Use the .job file that is built by the Maven
targets. mvn package should make them. That has all the dependencies
packaged up.
On Sun, Aug 26,
It's the same idea, but yes you'd have to re-implement it for Hadoop.
Randomly select a subset of users. Identify a small number of
most-preferred items for that user -- perhaps the video(s) watched
most often. Hold these data points out as a test set. Run your process
on all the rest.
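The hold-out procedure above could be sketched in plain Java like this; class and method names are illustrative, not Mahout API:

```java
import java.util.*;

/** Hold out each sampled user's single most-preferred item as test data. */
public class HoldoutSplit {

  /** Returns a map userID -> held-out itemID; prefs is mutated to become the training set. */
  public static Map<Long, Long> holdOutTopItem(Map<Long, Map<Long, Double>> prefs,
                                               Set<Long> sampledUsers) {
    Map<Long, Long> test = new HashMap<>();
    for (Long user : sampledUsers) {
      Map<Long, Double> items = prefs.get(user);
      if (items == null || items.size() < 2) {
        continue; // too little data to hold anything out
      }
      // Find the most-preferred item (e.g. the video watched most often)
      Long best = null;
      double bestValue = Double.NEGATIVE_INFINITY;
      for (Map.Entry<Long, Double> e : items.entrySet()) {
        if (e.getValue() > bestValue) {
          bestValue = e.getValue();
          best = e.getKey();
        }
      }
      items.remove(best);   // train on the rest
      test.put(user, best); // quiz the recommender on this later
    }
    return test;
  }
}
```

The held-out items become the "right answers" the recommender is later scored against.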
Make
Most watched by that particular user.
The issue is that the recommender is trying to answer, of all items
the user has not interacted with, which is the user most likely to
interact with? So the 'right answers' to the quiz it gets ought to be
answers to this question. That is why the test data
interesting. This would seem to
work well with our Boolean dataset. We will give this a try.
Thanks again for the help.
-Jonathan
On Sun, Aug 26, 2012 at 3:55 PM, Sean Owen sro...@gmail.com wrote:
Most watched by that particular user.
The issue is that the recommender is trying to answer
a mean
of zero by nature.
Thanks for your time on this question and all of your efforts on
Mahout -- it's a great project.
best,
Francis
On Wed, Aug 22, 2012 at 5:11 PM, Sean Owen sro...@gmail.com wrote:
The similarity is only defined over the dimensions where both series
have a value, yes
MapReduce programs are never installed directly on a Hadoop cluster. Hadoop
deploys the program JAR to workers as needed. This is not specific to
Mahout.
Mahout compiles against 0.20.205 and so needs to be used with 0.20.205. It
will work with 1.0.3 as far as I know, with a recompile, as they are
Hi Ziad, I did answer your last question on this list -- I hadn't seen this one
previously, though.
The relevant items are chosen as those whose pref value exceeds some given
threshold. The default threshold is the mean of all 100 pref values plus
one standard deviation. Assuming the prefs are about
The relevant items, the top 16, are a set. You find how many of the
recommendations fall in that set. For precision, ordering does not matter.
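Both pieces -- the default relevance threshold (mean plus one standard deviation) and precision as plain set overlap -- fit in a few lines of Java. This is a sketch of the idea, not Mahout's actual evaluator classes:

```java
import java.util.*;

/** Default relevance threshold and precision-at-N as set overlap (illustrative). */
public class PrecisionSketch {

  /** Mean of all pref values plus one standard deviation. */
  public static double defaultThreshold(double[] prefs) {
    double mean = 0;
    for (double p : prefs) mean += p;
    mean /= prefs.length;
    double var = 0;
    for (double p : prefs) var += (p - mean) * (p - mean);
    double stdDev = Math.sqrt(var / prefs.length);
    return mean + stdDev;
  }

  /** Fraction of recommended items that fall in the relevant set; ordering is ignored. */
  public static double precision(Set<Long> relevant, List<Long> recommended) {
    int hits = 0;
    for (Long item : recommended) {
      if (relevant.contains(item)) hits++;
    }
    return recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
  }
}
```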
You are right that the metric kind of falls apart for users with very few
data points. You want to use precision at a small number, and perhaps
ignore the
Yes, or else those items would not be eligible for recommendation. And it
would be like giving students the answers to a test before the test.
On Thu, Aug 9, 2012 at 5:41 PM, ziad kamel ziad.kame...@gmail.com wrote:
A related question please.
Does Mahout remove the 16% good items before
are the recommended approaches to evaluate the results ? I assume
IR approach is one of them.
Highly appreciating your help Sean .
On Thu, Aug 9, 2012 at 11:45 AM, Sean Owen sro...@gmail.com wrote:
Yes, or else those items would not be eligible for recommendation. And it
would be like giving
in a classifier ? Does
that mean a recommender becomes a classifier at this case ?
On Thu, Aug 9, 2012 at 12:18 PM, Sean Owen sro...@gmail.com wrote:
Yes, this is a definite weakness of the precision test as applied to
recommenders. It is somewhat flawed; it is easy to apply and has some
use
It depends on what the values really mean. If they are something like
ratings, using the most recent version makes most sense. (This is what the
implementations do now.) If they are some kind of sampled reading it might
make sense to take an average. If the input is based on observed activity,
it
Yes, or anywhere else you want to publish static results to, if you don't
want to expose HDFS. HDFS isn't good at small random reads, so it would be
a question of bulk-loading shards of results. The MapReduce workers are not
relevant to serving. They would have produced the results, offline, at
(You can ask in the book forum if it is specific to the book rather than
the project. Maybe I can follow up with you directly off list.)
Which graph are you referring to?
I made them in PowerPoint if I recall correctly, nothing too exotic.
On Thu, Aug 2, 2012 at 8:52 PM, Matt Mitchell
This sounds a lot like a bug that was fixed by a patch some time ago. Grant
I think it was something I had wanted you to double-check, not sure if you
had a look. But I think it was fixed if it's the same issue.
On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel p.abra...@rambler-co.ru wrote:
Thanks
It overrides older values. Here it would have no effect.
I don't see an error here...? the warning is an ignorable message from
hadoop.
On Fri, Aug 3, 2012 at 4:56 PM, Sears Merritt sears.merr...@gmail.com wrote:
Hi All,
I'm trying to run a kmeans job using mahout 0.8 on my hadoop cluster
(Cloudera's 0.20.2-cdh3u3) and am running into an odd
:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
On Aug 3, 2012, at 3:00 PM, Sean Owen sro...@gmail.com wrote:
I don't see an error here...? the warning is an ignorable message from
hadoop.
On Fri, Aug 3, 2012 at 4:56 PM, Sears Merritt sears.merr...@gmail.com wrote:
Is it reasonable to use 1.5GB of heap for recs? sure -- assuming you can
allow the JVM to use, say, 2GB or more of heap total.
There are more choices in Mahout for non-distributed recs. The primary
distributed version is an item-similarity-based approach but you can choose
from several similarity
Good question. One straightforward way to approach things is to
compute all recommendations offline, in batch, and publish them to some
location, and then simply read them as needed. Yes your front-end would
need to access HDFS if the data were on HDFS. The downside is that you
can't update
to validate POM for project
org.apache.mahout:mahout-integration at
/home/sweiss/mahout-distribution-0.7/integration/pom.xml
...
--
Steve
On Mon, Jul 30, 2012 at 5:17 PM, Sean Owen sro...@gmail.com wrote:
Hmm. what happens if you add this to dependencies in integration?
<dependency>
The backing store doesn't matter much, in the sense that using it for
real-time computation needs it to all end up in memory anyway. It can live
wherever you want before that, like Solr. It's not going to be feasible to
run anything in real-time off Solr or any other store. Yes the trick is to
use
Classifiers are supervised learning algorithms, so you need to provide
a bunch of examples of positive and negative classes. In your example,
it would be fine to label a bunch of articles as about Apple or not,
then use feature vectors derived from TF-IDF as input, with these
labels, to train a
for the clarification. So you are saying that Mahout is not
suitable in this case or did you say clustering is not the right way to go
and If its worth it, I should go for classification?
Secondly are you the same Sean Owen who wrote Mahout in Action? :)
If the data is 'really' there in the DataModel you seem to have ruled
out all the differences. ;) I imagine there is something slightly
amiss. Can you step through with a debugger to see what the
UserSimilarity calculates? look what data it gets and see if it makes
sense. If it seems to,
That may be a typo in the book. I don't know if it was non-abstract in the
past. But try against version 0.5 to be sure. I don't know what the
replacement code is if so but someone else here likely does.
On Wed, Aug 1, 2012 at 9:20 PM, Abhinav M Kulkarni
abhinavkulka...@gmail.com wrote:
Hi,
Yep, just hash to a long, from UUID or String or whatever. The occasional
collision does not cause a real problem. If you mix the tastes of two users
or items once in a billion times, the overall results will hardly be
different.
You have to maintain the reverse mapping of course. Look at the
No, but I'd recommend XORing the top 64 bits with the bottom 64 bits,
something simple like that.
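The XOR-the-two-halves idea, plus the reverse mapping mentioned above, could look like this (the class is illustrative, not a Mahout class):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hash a UUID to a long by XORing its top and bottom 64 bits,
    keeping a reverse map so recommended IDs can be translated back. */
public class IdHasher {

  private final Map<Long, UUID> reverse = new HashMap<>();

  public long toLongID(UUID uuid) {
    long id = uuid.getMostSignificantBits() ^ uuid.getLeastSignificantBits();
    reverse.put(id, uuid); // remember the mapping for later lookups
    return id;
  }

  public UUID toUUID(long id) {
    return reverse.get(id);
  }
}
```

As noted in the thread, a once-in-a-billion collision just mixes two users' tastes slightly and hardly changes the results.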
On Wed, Aug 1, 2012 at 9:40 PM, Matt Mitchell goodie...@gmail.com wrote:
Thanks Sean! That all makes sense. Would you mind recommended a
hashing function for this? Is there something in Mahout I
Mahout currently works against 0.20.205. I do not know if it still
works with 0.20.2. You should not have to downgrade.
The very first thing is to use Cygwin, not the Windows shell.
On Tue, Jul 31, 2012 at 9:23 AM, Julian Ortega
julian.ort...@fredhopper.com wrote:
Not exactly, you will still
I don't know this code too much, but, there is simply a step in front
I believe that vectorizes text with TF-IDF. The results are simple
vectors. You could just inject your vectors (i.e. real-value
attributes) at that stage and skip the TF-IDF. It may need a little
hacking.
On Tue, Jul 31, 2012 at
Extract as in web crawl? No it's nothing to do with that.
Extract as in entity extraction? I don't think there are relevant
implementations here either, though that begins to border on machine
learning.
This is more about clustering and classification of documents than anything
else.
On Mon, Jul
Hmm. what happens if you add this to dependencies in integration?
<dependency>
  <groupId>${project.groupId}</groupId>
  <artifactId>mahout-examples</artifactId>
</dependency>
On Mon, Jul 30, 2012 at 9:59 PM, Stephen Weiss swe...@stylesight.com wrote:
Hi,
I am just getting started with
Are you basically asking how much faster a parallel algorithm is than
non-parallel?
If you're measuring wall-clock time, the answer depends on how many
workers/threads you use to parallelize. The point is the time generally
goes down as more workers are added, so there's not one answer.
If
If you are doing something using Hadoop -- then the question is really, can
you use MongoDB as a data source for Hadoop? I'm sure someone has made an
InputFormat for it, yes. Mahout itself doesn't connect to MongoDB, it uses
Hadoop, which may get data from many sources.
If you're not using
Look at SamplingCandidateItemsStrategy and its arguments. These are the
knobs you can turn to reduce the amount of data considered. You might start
with something low like 10 for each of the first 3 args.
You can set this on an ItemBasedRecommender once configured.
On Tue, Jul 24, 2012 at 11:05
This sounds more like a clustering problem to me -- find a centroid, find
which cluster a new article belongs to.
On Tue, Jul 24, 2012 at 11:04 AM, Alexander Aristov
alexander.aris...@gmail.com wrote:
yes, good point.
What I want to reach is to calculate some average of a group of articles
Unless your data set is tiny (thousands of users / items), you can't really
run straight off a database. It is far too data intensive. Real-time always
means in memory to me.
Look at the ReloadFromJDBCDataModel wrapper, which will cache the DB data
in memory. This should be orders of magnitude
Hmm, that doesn't sound right. This isn't all that big for data.
Any chance you've run a profiler to see the hotspot
My guess is that you need to set a CandidateItemsStrategy to cut down the
number of items considered.
On Tue, Jul 24, 2012 at 10:36 PM, Jonathan Nassau jonathan.nas...@gmail.com
(Assuming that's 'Mahout in Action' but filtered through iPhone
auto-correct...)
On Mon, Jul 23, 2012 at 7:04 PM, Alexander Aristov
alexander.aris...@gmail.com wrote:
Read Nagpur in action :)
Alexander
On 23.07.2012 at 21:53, Wei Shung Chung weish...@gmail.com
wrote:
Hi my
From this, I don't have any good ideas. I think you would need to dig in
with a debugger. First, determine whether the DataModel actually has the
data. I am guessing it does not.
Hmm, call refresh() on reloadModel after it's set up?
On Thu, Jul 19, 2012 at 11:54 AM, Nick Katsipoulakis popa...@gmail.com wrote:
On 07/18/2012 11:56 PM, Sean Owen wrote:
Unless your data set is tiny, like 100K records or less, it is not going
to be feasible to run recommendations off
Oh, that means it's still initializing then. It does take a while to read
all that info from the DB potentially.
On Thu, Jul 19, 2012 at 2:57 PM, Nick Katsipoulakis popa...@gmail.com wrote:
On 07/19/2012 02:50 PM, Sean Owen wrote:
Hmm, call refresh() on reloadModel after it's set up?
Sure, override refresh()? Yes, call refresh() to make it run when you want.
On Wed, Jul 18, 2012 at 1:31 AM, Matt Mitchell goodie...@gmail.com wrote:
Thanks Sean. This makes sense. I'll see how far I can get with the
anonymous user. I wonder, is there any way to hook into when the
refresh
Without the denominator, the prediction is not a weighted average -- it's
some kind of weighted sum. The values will not be in nearly the same range
as the input ratings -- might be in the thousands. It's not a prediction
anymore. You can rank on it, but it will just favor items that co-occur
with
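The difference between the weighted average and the raw weighted sum is easy to see numerically. A minimal sketch (similarities and ratings are made up, not Mahout's implementation):

```java
/** Weighted-average estimate vs. raw weighted sum over neighbors' ratings. */
public class EstimateSketch {

  public static double weightedAverage(double[] sims, double[] ratings) {
    double num = 0, den = 0;
    for (int i = 0; i < sims.length; i++) {
      num += sims[i] * ratings[i];
      den += Math.abs(sims[i]); // denominator keeps the result in rating range
    }
    return num / den;
  }

  public static double weightedSum(double[] sims, double[] ratings) {
    double num = 0;
    for (int i = 0; i < sims.length; i++) {
      num += sims[i] * ratings[i];
    }
    return num; // grows with the number of co-occurring items -- not a rating anymore
  }
}
```

With three neighbors all rating 5.0 at similarity 1.0, the average stays at 5.0 while the bare sum is already 15.0, and it only grows from there.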
If you mean, the user says I like Drama and you return to them Dramas,
sure you can do that -- it's not a recommender then. It's not personalized.
It's very easy, and may be useful.
If you mean, can you prioritize Dramas in recommendations, then, as I've
said several times: use the Rescorer! It
Cardinality is the logical size of the vector, its number of dimensions.
You can only add vectors with the same cardinality -- it's not defined what
the result is to add, say, a 2D and 3D vector. So yes this vector needs to
have a cardinality equal to number of features, it seems.
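The cardinality rule is just a length check before element-wise addition; a plain-array sketch, not Mahout's Vector class:

```java
/** Vector addition is only defined when cardinalities (dimensions) match. */
public class VectorAdd {

  public static double[] add(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "Cardinality mismatch: " + a.length + " vs " + b.length);
    }
    double[] sum = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      sum[i] = a[i] + b[i];
    }
    return sum;
  }
}
```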
On Wed, Jul 18,
Yes that's right, I'll change the docs.
On Wed, Jul 18, 2012 at 4:52 PM, Eyal Allweil eyal_allw...@yahoo.comwrote:
Hello everyone,
I think there's a mistake in the javadoc for
PlusAnonymousConcurrentUserDataModel.
Under the code sample for real time recommendation, it says:
Hi Owen,
is it possible to connect mahout with heterogeneous (parallel database)?
Is there some connector which could facilitate these?
thanks in advance.
-Rizki-
On Thu, Jul 19, 2012 at 5:56 AM, Sean Owen sro...@gmail.com wrote:
Unless your data set is tiny, like 100K records or less
There's not a very clean answer to this. The original design from way back
when was definitely about reloading a fixed model periodically. So that's
always an option -- put the users in your database, or update files, or
whatever backs the model and they'll turn up at the next reload.
The
On Mon, Jul 16, 2012 at 9:33 AM, Cleophus Pereira
cleophus.pere...@mphasis.com wrote:
You mentioned to use IDRescorer to get data based on user preferences. But
in mahout schema we have just itemid (number) and scores(double). How can we
determine purely based on this what is user
This sounds like a target leak, like your test data is actually
getting copied into the training data.
On Sun, Jul 15, 2012 at 1:08 AM, Matt Mitchell goodie...@gmail.com wrote:
One strange thing, and I'm going to dig through the MIA book tonight,
is that my user based recommendation evaluator
this could happen from duplicate
user/pref/score values in my data? How does Mahout handle duplicate
entries in data, whether in a load-once file or coming from a refresh?
On Sun, Jul 15, 2012 at 4:01 AM, Sean Owen sro...@gmail.com wrote:
This sounds like a target leak, like your test data is actually
absolutely right. Things are working nicely now.
- Matt
On Sat, Jul 7, 2012 at 3:48 AM, Sean Owen sro...@gmail.com wrote:
What it really means is that there is not enough data to make a
meaningful test here.
On Sat, Jul 7, 2012 at 1:28 AM, Matt Mitchell goodie...@gmail.com wrote:
Hi,
I have
Ah yes I see that now.
Try increasing evaluation percentage to 1.0. At the moment you're only
using 10% of the data. That's a quick way to make a bigger test!
Also, what happens if you set the threshold to 0.5?
On Sat, Jul 14, 2012 at 4:56 PM, Matt Mitchell goodie...@gmail.com wrote:
Hey Sean,
SlopeOneRecommender does not use an ItemSimilarity; what are you referring to?
User and item ID must be an integer (long).
You use the IDRescorer to do exactly the query-time filtering you describe.
The recommender will give you as many recs as you ask for, unless it
is not possible to
This is too much code to ask people to debug in detail, but I get the
gist of it.
I am guessing that this is happening: the 2 War movies were rated 5.0,
and were only tagged War. This means that any other movie tagged only
War is estimated to be 5.0, given this similarity definition. And then
I don't understand this -- you make a recommender and then throw it
away and make another one. Why do you have two?
Giving recommendations based on user preferences is what all algorithms do.
You use a Rescorer to filter results at query time, yes, based on
anything you like.
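Mahout's IDRescorer interface exposes isFiltered(long) to drop items and rescore(long, double) to adjust scores at query time. A standalone sketch in that same shape (the genre lookup is a stand-in for your own metadata source, and this class does not implement the real interface so it stays dependency-free):

```java
import java.util.Set;

/** Query-time filtering/boosting in the shape of Mahout's IDRescorer. */
public class GenreRescorer {

  private final Set<Long> dramaIDs; // item IDs known to be Dramas

  public GenreRescorer(Set<Long> dramaIDs) {
    this.dramaIDs = dramaIDs;
  }

  /** Exclude anything that is not a Drama from the results. */
  public boolean isFiltered(long itemID) {
    return !dramaIDs.contains(itemID);
  }

  /** Or, instead of excluding, boost Dramas so they rank higher. */
  public double rescore(long itemID, double originalScore) {
    return dramaIDs.contains(itemID) ? originalScore * 1.5 : originalScore;
  }
}
```

An instance of the real IDRescorer is passed to the recommender's recommend() call, so the filtering happens per query rather than in the model.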
On Fri, Jul 13,
Look at doEstimatePreference().
On Fri, Jul 13, 2012 at 5:16 PM, a a uzayiz...@yahoo.com wrote:
Sean
Thanks for your quick reply.
Switching to a Jaccard coefficient based ItemSimilarity already improved
things tremendously.
You can change the estimation to account for certainty in some way.
user-unsubcr...@mahout.apache.org
If it doesn't work it's a question for Apache, not the project. We
don't run this stuff.
On Fri, Jul 13, 2012 at 7:57 PM, Lingxiang Cheng
lingxiangch...@yahoo.com wrote:
I have unsubscribed from Mahout at least 3 times in the past year. Why do I
keep getting
I was going to say set MAHOUT_OPTS... but I just looked at the script
and why does it set the number of mappers/reducers to 1 by default? It
sort of looks like it intends to override the user's setting.
On Fri, Jul 13, 2012 at 11:33 PM, Aniruddha Basak t-aba...@expedia.com wrote:
Hi Sean,
There would not be any point in this. A user or item with no data has no
effect and can't get any recommendations under any algorithm.
What are you trying to achieve or solve?
Sean
On Jul 11, 2012 1:10 AM, Jaspreet Singh jaspr...@usc.edu wrote:
Hi,
Is it possible to add users and items to
It means you don't have enough data to run a meaningful test.
On Wed, Jul 11, 2012 at 9:54 AM, Mugoma Joseph Okomba mug...@yengas.com wrote:
Hello,
While running evaluate () on RecommenderIRStatsEvaluator I am getting the
error:
java.lang.IllegalArgumentException: Illegal nDCG: NaN
Could
I see. If you're not using collaborative filtering then you're not
using Recommender / DataModel. So I don't think your solution includes
adding these users / items to the model. Yes, you can start by
recommending a simple global top-N most popular items, or, do
something reasonable based on
I don't think this result holds in general -- they chose a very CPU
intensive problem, without much data movement. This won't work for,
say, Mahout jobs. I don't really see the point in porting Hadoop to a
GPU. If you're in a GPU you don't need most of what Hadoop does! That
is I imagine this is
(I agree, it's quite a useful approach -- was answering the question
about whether there was any such thing in Mahout. This all assumes you
can fit the data in memory in the GPU but that is true for moderately
large data sets.)
On Mon, Jul 9, 2012 at 9:04 AM, Manuel Blechschmidt
You can derive many metrics based on just co-occurrence, if your data
is 1 and 0. Pearson, cosine similarity, Tanimoto/Jaccard, Euclidean
distance, log-likelihood all just reduce to counting. Why not at least
give the choice?
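"Reduce to counting" is literal: with 1/0 data, each metric is a function of two set sizes and their overlap. A sketch of two of them (illustrative, not Mahout's similarity classes):

```java
/** Similarity metrics over boolean data computed purely from counts. */
public class CountingSimilarity {

  /** Tanimoto/Jaccard: |A and B| / |A or B|. */
  public static double tanimoto(int countA, int countB, int intersection) {
    int union = countA + countB - intersection;
    return union == 0 ? 0.0 : (double) intersection / union;
  }

  /** Cosine over boolean vectors is also just counts. */
  public static double cosine(int countA, int countB, int intersection) {
    return (countA == 0 || countB == 0)
        ? 0.0
        : intersection / Math.sqrt((double) countA * countB);
  }
}
```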
You can keep half the diff matrix since it's symmetric of course.
The factorization is the heavy number crunching. The client of a
recommender needs to do very little computation in comparison, like a
vector-matrix product. While a GPU might make this happen faster, it's
already on the order of microseconds. Compare with the cost of
downloading the whole
than add.
On Mon, Jul 9, 2012 at 7:55 AM, bangbig lizhongliangg...@163.com wrote:
I have thought about this problem before, and I read several posts talking
about this. Sean Owen is right that the math doesn't care about what the
things are. But in practice I think a better way is that you
Hadoop and CUDA are quite at odds -- Hadoop is all about splitting up
a problem across quite remote machines while CUDA/GPU approaches rely
on putting all computation together not only on one machine but within
one graphics card.
It doesn't make sense to combine them. Either you want to
More than that, Mahout is mostly Hadoop-based, which is well up the
stack from Java. No there is nothing CUDA-related in the project. The
closest thing are the pure Java non-Hadoop-based recommender pieces.
But it is still far from CUDA.
I think CUDA is intriguing since a lot of ML is a bunch of
What it really means is that there is not enough data to make a
meaningful test here.
On Sat, Jul 7, 2012 at 1:28 AM, Matt Mitchell goodie...@gmail.com wrote:
Hi,
I have a recommender, with a boolean prefs model. I am following the
instructions in the MIA book, but only get this exception:
Here's one I've been puzzling over for a bit. In a factorization based
on the SVD or what have you, you reconstruct the approximate original
matrix (well, one row) by multiplying the matrices back together and
looking for the largest elements. This is essentially multiplying a
user feature vector
(Changed subject from unrelated thread)
You measure precision / recall, or the related F1 measure, or
normalized discounted cumulative gain, or ROC. They are different,
standard metrics that are less complicated than they sound.
On Fri, Jul 6, 2012 at 6:13 PM, Razon, Oren oren.ra...@intel.com
I don't recall that it has ever caused a problem, no. The values are
just keys in a hashtable, so don't need to be sequential.
On Fri, Jul 6, 2012 at 8:26 PM, Dan Brickley dan...@danbri.org wrote:
I recall having problems with this before, using the non-Mahout Taste
code. I have meaningful
6, 2012 at 3:18 AM, Jens Grivolla j+...@grivolla.net wrote:
Maybe locality-sensitive hashing can help to get candidates before
calculating the exact distance?
Bye,
Jens
On 07/06/2012 11:35 AM, Sean Owen wrote:
Here's one I've been puzzling over for a bit. In a factorization based
That's right, in the formulation you are referring to you are not
predicting the original input values, so you can't compare them with
RMSE or something.
To test precision / recall you hold out some of the top-rated items
(these are the relevant results), and see how many come back in the
If you want Java, the implementation in Commons Math is just fine.
There are others.
Limiting the number of features is just a matter of tossing all but
the first k rows, or columns.
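Both steps -- scoring one item via the user-feature / item-feature dot product, and truncating to the first k features -- are tiny in code. A sketch with made-up toy vectors:

```java
/** Score items for one user from a rank-k factorization. */
public class FactorizationScore {

  /** Dot product of the user's feature vector with one item's feature vector. */
  public static double score(double[] userFeatures, double[] itemFeatures) {
    double s = 0;
    for (int i = 0; i < userFeatures.length; i++) {
      s += userFeatures[i] * itemFeatures[i];
    }
    return s;
  }

  /** Limit to k features -- "tossing all but the first k rows, or columns". */
  public static double[] truncate(double[] features, int k) {
    double[] out = new double[k];
    System.arraycopy(features, 0, out, 0, k);
    return out;
  }
}
```

Computing this score for every candidate item and keeping the largest values reconstructs (approximately) the user's row of the original matrix.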
On Thu, Jul 5, 2012 at 9:46 AM, Lance Norskog goks...@gmail.com wrote:
What is a good factorizer for doing
: Sean Owen [mailto:sro...@gmail.com]
Sent: Wednesday, July 04, 2012 18:39
To: user@mahout.apache.org
Subject: Re: A bunch of SVD questions...
SVD is not the same thing as ALS, though both are factoring matrices.
There is not a distributed SVD-based recommender, though there is a
distributed SVD
Unless you are recommending users to items too, you don't have a cold
start problem for items. If you are, you can apply the same technique.
Using fold-in, you can create a reasonable user or item vector from
the time you have the very first interaction for the user or item,
which solves most of
Well, the other metrics are mostly undefined in this case! so yes.
On Thu, Jul 5, 2012 at 6:36 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
Thanks for the input Sean, one other question, in the scenario where most of
the recommendations are boolean style recommendations (i.e. a csv file
The best default answer is to put them all in one model. The math
doesn't care what the things are. Unless you have a strong reason to
weight one data set I wouldn't. If you do, then two models is best. It
is hard to weight a subset of the data within most similarity
functions. I don't think it
If your input is 10MB then the good news is you are not near the scale
where you need Hadoop. A simple non-distributed Mahout recommender
works well, and includes the Rescorer capability you need. That's a
fine place to start.
The book ought to give a pretty good tour of how that works in chapter
Have a look at the PlusAnonymousUserDataModel, which is a bit of a
hack but a decent sort of solution for this case. It lets you
temporarily add a user to the system and then everything else works as
normal, so you can make recommendations to these new / temp users.
There isn't a way to inject
Sure. It will ignore columns beyond the fourth, which is an optional
timestamp. If you just want it to read some common input file but
ignore the unused columns, that's easy.
You can copy and modify FileDataModel to do whatever you like, if you
want it to use this data. You'd have to change other
Look at the example DataModels in integration. The pattern is the
same: load it all into memory! it's too slow for real-time otherwise.
So there is no point in say moving your data from a DB to Dynamo for
scalability if you're using non-distributed code. If you're using
Hadoop, DataModel is not
I'm not sure if Mridul's suggestion does what you want. Do you want to
recommend items to users? then no, you do not start with item IDs and
recommend to them.
It sounds like your question is how to compute similarity data. The
first answer is that you do not use Hadoop unless you must use
(Please don't ping your questions on the list -- bad form and makes
people less likely to answer.)
You do not have to have equal numbers of positive/negative examples. I
think you need to go back and read up on the basics of how Bayesian
classification works before you dig in to Mahout. This is
Item-item similarity is a property of the information you have on two
items and just those items. Whether there are just those 2 items over
500K users, or 2M items over 500K users, makes no difference. So no I
don't think that this skew implies you should use any particular
algorithm, by itself.
No just set the bias term to 1 in all cases.
On Mon, Jul 2, 2012 at 10:13 AM, damodar shetyo akshay.she...@gmail.com wrote:
Is it required that I set the bias (intercept) equal to one only? Or can I
set it to any constant value x?
Also, how can I choose the value of bias for different types of data
Well the inverse of a diagonal matrix like that is just going to be a
diagonal matrix holding the reciprocals (1/x) of the values. That much
is easy. But you need to invert more than that to fold in.
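The "reciprocals of the diagonal" part really is that simple; the harder fold-in algebra is the part beyond this sketch:

```java
/** Inverting a diagonal matrix is just taking reciprocals of the diagonal entries. */
public class DiagonalInverse {

  public static double[] invert(double[] diagonal) {
    double[] inv = new double[diagonal.length];
    for (int i = 0; i < diagonal.length; i++) {
      if (diagonal[i] == 0.0) {
        throw new ArithmeticException("Singular: zero on the diagonal at " + i);
      }
      inv[i] = 1.0 / diagonal[i]; // (1/x) on the diagonal, zeros elsewhere
    }
    return inv;
  }
}
```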
I admit even I don't know the details of the Mahout implementation
you're using, but I imagine
].
Thanks again for the help,
Chris
[1]
https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
On Fri, Jun 29, 2012 at 4:31 PM, Sean Owen sro...@gmail.com wrote:
Well the inverse of a diagonal matrix like that is just going to be a
diagonal matrix holding
Because equals() is implemented. Two Points that are equals() will not
have the same hashCode(), which is wrong. It only matters, I suppose,
if Point is used in some context where it matters, like a HashMap key.
But it is used as a HashMap key here! It happens to succeed because
get() is only ever
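The fix for the bug described above is to override hashCode() alongside equals(), so equal Points always land in the same HashMap bucket. A minimal corrected version (the Point fields are assumed, since the original class isn't shown):

```java
import java.util.Objects;

/** A Point that overrides equals() must override hashCode() too. */
public final class Point {
  private final double x;
  private final double y;

  public Point(double x, double y) {
    this.x = x;
    this.y = y;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof Point)) {
      return false;
    }
    Point p = (Point) o;
    return x == p.x && y == p.y;
  }

  @Override
  public int hashCode() {
    return Objects.hash(x, y); // consistent with equals()
  }
}
```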
I don't think this is something to do with Mahout. Looks like an error from
EMR. I have not seen anything like this.
On Jun 28, 2012 1:40 PM, Oliver B. Fischer mails...@swe-blog.net wrote:
Hi,
I try to run some test with the pseudodistributed recommender job at AWS
using one of the late 0.7
(The third dimension, 1, is the bias / intercept term. You will
probably see this in the literature -- go have a look at a basic intro
to logistic regression. I found Andrew Ng's videos on Coursera a good
intro-level survey of exactly this.)
On Thu, Jun 28, 2012 at 3:57 PM, Ted Dunning
It would be best to keep discussions about the book itself to the
Manning forum. This has been covered several times there, on this
list, and in the book.
As the error suggests, your input is not in the right format. You need
to convert it or change the mapper to read its format.
On Thu, Jun 28,
Those are both true; they may not be the issue here.
The test point definitely belongs in the first of the two groups you
created. Why is the result surprising?
On Wed, Jun 27, 2012 at 9:15 AM, Lance Norskog goks...@gmail.com wrote:
Not enough samples. Machine learning algorithms in general do
The error doesn't seem to relate to memory anyway:
java.lang.IllegalArgumentException: unresolved address
On Mon, Jun 25, 2012 at 7:06 AM, Something Something
mailinglist...@gmail.com wrote:
Please ignore the latest email. When I increased the memory size to 8g,
all steps worked. Now
:08 PM, Sean Owen sro...@gmail.com wrote:
Using 1 is just fine for the reasons you give. You would be surprised how
OK it is to use this even for dislikes. In fact just omit the third field
in your CSV.
However you need to set the boolean data flag and choose a similarity
metric that is defined over such data. Pearson / cosine is not for