It is a problem -- but should be rare. IDs are hashed to 31-bit
integers, so the probability of any one collision is small. However, you don't
have to have too many items before it's probable that some two have
collided. (IIRC, by the birthday bound that's at about 2^(31/2), around 46,000 items.)
In practice it doesn't hurt much. It just means
It is necessary. We want to support input where IDs are possibly
64-bit longs, for consistency with the non-distributed code.
But, 64-bit values are too large to be used as indexes into a Vector.
So they are hashed and then un-hashed by a dictionary lookup.
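A rough sketch of that hash-and-dictionary scheme (illustrative names, not Mahout's actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Fold a 64-bit ID down to a non-negative 31-bit int usable as a Vector
// index, and keep a reverse dictionary to "un-hash" it later. On a
// collision, the last writer wins -- which is the rare problem above.
class IdIndex {
  private final Map<Integer, Long> reverse = new HashMap<>();

  static int hash64to31(long id) {
    // XOR the two 32-bit halves, then clear the sign bit
    return ((int) id ^ (int) (id >>> 32)) & Integer.MAX_VALUE;
  }

  int index(long id) {
    int h = hash64to31(id);
    reverse.put(h, id);
    return h;
  }

  long original(int index) {
    return reverse.get(index);
  }
}
```
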
On Tue, Sep 20, 2011 at 11:44 AM, 张玉东
Great I will commit this change then.
Lance, in this case it's the code generating the query -- not sure it's
possible or helpful to use a view.
On Sat, Sep 17, 2011 at 3:48 AM, orbz orb...@gmail.com wrote:
Hopefully this will work as I didn't subscribe to the list before posting -
but yes,
That just looks like a bug. Yes you should add those two lines.
Maybe you are in a better position to then debug the syntax error, if you
have MySQL warmed up: does it work if you surround both sub-SELECTs with
parentheses?
On Sat, Sep 17, 2011 at 12:08 AM, orbz orb...@gmail.com wrote:
Java
There isn't. For the recommenders that work by computing an estimated
preference value for items, I suppose you could average their
estimates and rank by that.
More crudely, you could stitch together the recommendations of
recommender 1 and 2 by taking the top 10 amongst each of their top
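A minimal standalone sketch of that cruder stitching approach -- interleave two recommenders' top-N lists, dropping duplicates (illustrative names, not a Mahout API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Blend {
  // Interleave two ranked lists a and b into one top-n list, skipping
  // items already taken from the other list.
  static List<String> stitch(List<String> a, List<String> b, int n) {
    List<String> out = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (int i = 0; out.size() < n && (i < a.size() || i < b.size()); i++) {
      if (i < a.size() && out.size() < n && seen.add(a.get(i))) out.add(a.get(i));
      if (i < b.size() && out.size() < n && seen.add(b.get(i))) out.add(b.get(i));
    }
    return out;
  }
}
```
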
What do you mean by isolate the data model building step? You can
run or re-run any step you want in the chain.
So I guess the answer to 2 is yes, if you mean computed item-item
similarities. But these will change slowly over time and need to be
recomputed sometimes.
MapReduce is never ever
On Tue, Sep 13, 2011 at 6:27 AM, Lance Norskog goks...@gmail.com wrote:
Machine learning has quite a few algorithms where data is processed in a way
foreign to its domain. Running SVD on user/item/preference matrices is a
great example: this makes no sense whatsoever.
(Why?? this is one of the
MapReduce ought to control the number of workers reasonably well, and
you can override with mapred.reduce.tasks if you want. I don't think
any fixed number works: what's right for 2 machines isn't right for
200.
2011/9/13 myn m...@163.com:
private static void startDFCounting(Path input, Path
I think we discussed several of these points on the mailing list.
I am not sure I would ever expect there to be a common format across
all jobs. They just don't all operate on the same information. Even
where two jobs ingest vectors, it doesn't mean vectors for one are
meaningful for another.
If
This is small enough that you can fit this into memory on one machine,
and you do not need Hadoop.
I would simply start with a GenericBooleanPrefItemBasedRecommender,
and attach it to a LogLikelihoodSimilarity similarity metric. Wrap the
LogLikelihoodSimilarity in a CachingItemSimilarity. You can
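For intuition about what LogLikelihoodSimilarity measures: it is based on Dunning's log-likelihood ratio (G^2) over the 2x2 table of co-occurrence counts. A standalone sketch of that statistic -- not Mahout's actual code:

```java
class LLR {
  // k11 = users who prefer both items, k12/k21 = users who prefer only
  // one, k22 = users who prefer neither. G^2 = 2 * sum O * ln(O / E),
  // where E is the count expected under independence. Larger = stronger
  // association; 0 = no association.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double n = k11 + k12 + k21 + k22;
    double[] obs = {k11, k12, k21, k22};
    double[] exp = {
      (k11 + k12) * (k11 + k21) / n,
      (k11 + k12) * (k12 + k22) / n,
      (k21 + k22) * (k11 + k21) / n,
      (k21 + k22) * (k12 + k22) / n
    };
    double sum = 0.0;
    for (int i = 0; i < 4; i++) {
      if (obs[i] > 0) {            // 0 * log(0) contributes 0
        sum += obs[i] * Math.log(obs[i] / exp[i]);
      }
    }
    return 2.0 * sum;
  }
}
```
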
I think he or she is just referring to the method
ItemBasedRecommender.recommendedBecause(). This is as close to an explain
operation as there is in the API.
In reality recommendations are a function of all data. In practice, what you
are asking for is the items most similar to well-liked items.
I see. On EMR, I think the setting you need
is mapred.tasktracker.map.tasks.minimum. At least that's what I see digging
through my old EMR code.
Dhruv, yes a lot of these settings are just suggestions to the framework. I
am not entirely clear on the heuristics used, but I do know that Jake is
You haven't posted any information about the error...
On Wed, Sep 7, 2011 at 1:09 PM, Mani sree.9884516...@gmail.com wrote:
Hi
I'm new to Mahout.
Apache Maven 3.0.3 (r1075438; 2011-02-28 23:01:09+0530)
Maven home: /home/lnx-usr/Desktop/Paddu/apache-maven-3.0.3
Java version: 1.7.0, vendor:
Running on a real cluster increases the amount of work done, and
significantly, as compared to one node: now, data actually has to be
transferred on/off the machine!
Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not sure
what instance type you are using but I got better
That's your biggest issue, certainly. Only 2 mappers are running, even
though you have 20 machines available. Hadoop determines the number of
mappers based on input size, and your input isn't so big that it thinks you
need 20 workers. It's launching 33 reducers, so your cluster is put to use
If you are referring to the non-distributed similarity function, it is quite
easy: implement UserSimilarity or ItemSimilarity and use that as your
similarity function. How you implement is up to you. If you say more about
what you need to do, maybe people can suggest the right logic.
On Sat, Sep
Your input needs to be CSV if you want to use it all as-is. But, it quickly
creates vectors out of things, so really you can comment out the first
mapper that creates user vectors, and just wire it to use yours instead. It
should do all the rest from there.
On Thu, Sep 1, 2011 at 2:58 PM, Grant
Try HEAD from Subversion. I think it's been addressed, but that change of
course would not have gone back and shown up in 0.5.
On Wed, Aug 31, 2011 at 2:06 PM, Andrea Leistra
andrea.leis...@concur.comwrote:
I've seen a number of people reporting this problem on the list in the past
few months,
. In between,
everything works in terms of those ints. It all works fine this way.
On Wed, Aug 31, 2011 at 4:21 PM, Grant Ingersoll gsing...@apache.orgwrote:
On Aug 22, 2011, at 12:14 PM, Sean Owen wrote:
Here are two ideas:
Recommend threads to users.
Users are people, items are threads
mapping.
On Wed, Aug 31, 2011 at 4:44 PM, Grant Ingersoll gsing...@apache.orgwrote:
On Aug 31, 2011, at 11:26 AM, Sean Owen wrote:
Is the problem not just a matter of translating from the original
identifiers to ints, so they can be used as offsets into a vector, and
then
back again
be left to the caller.
On Wed, Aug 31, 2011 at 4:52 PM, Grant Ingersoll gsing...@apache.orgwrote:
On Aug 31, 2011, at 11:47 AM, Sean Owen wrote:
No it still wants user,item[,rating] input. But otherwise yes, it's
translated and un-translated internally as needed.
You could change
These don't come into play, directly, in collaborative filtering. You can
carry them around and do whatever you like with them, separately.
On Tue, Aug 30, 2011 at 8:02 AM, Lance Norskog goks...@gmail.com wrote:
The GroupLens other datasets come with separate itemID-movie name/genre
and
Really, the best thing is to use numeric IDs. Hash the string or otherwise
turn them into numbers first.
if you really need to work with Strings, see the IDMigrator class which
provides a little automatic help in doing so.
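A sketch in the spirit of IDMigrator -- derive a stable numeric ID from a String by hashing, and keep a reverse map so the original String can be recovered (illustrative code, not Mahout's exact implementation):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

class StringIdMigrator {
  private final Map<Long, String> reverse = new HashMap<>();

  // Stable, deterministic String -> long mapping via the first 8 bytes
  // of an MD5 digest.
  static long toLongID(String s) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5")
        .digest(s.getBytes(StandardCharsets.UTF_8));
    long id = 0L;
    for (int i = 0; i < 8; i++) {
      id = (id << 8) | (md5[i] & 0xFF);
    }
    return id;
  }

  long store(String s) throws Exception {
    long id = toLongID(s);
    reverse.put(id, s);          // remember it so we can go back
    return id;
  }

  String toStringID(long id) {
    return reverse.get(id);
  }
}
```
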
On Mon, Aug 29, 2011 at 3:04 PM, Amit Mahale mahale.a...@gmail.com wrote:
No there is not.
On Sat, Aug 27, 2011 at 8:33 AM, Ramo Karahasan
ramo.karaha...@googlemail.com wrote:
Thank you Sean,
I'll try that today.
Is there a similar example for classification/classify with a web
application?
That's correct. Well, you just have to recompose the user row you are
interested in. It will no longer be sparse at all. Those new values are
your estimated ratings.
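A minimal sketch of "recomposing the user row": given the user's factor vector and the item-factor matrix from a low-rank factorization, the dot products are the dense estimated ratings for every item (names here are illustrative):

```java
class Recompose {
  // estimates[i] = dot(userFactors, itemFactors[i]) -- the reconstructed,
  // no-longer-sparse rating estimate for item i.
  static double[] userRow(double[] userFactors, double[][] itemFactors) {
    double[] estimates = new double[itemFactors.length];
    for (int i = 0; i < itemFactors.length; i++) {
      double dot = 0.0;
      for (int k = 0; k < userFactors.length; k++) {
        dot += userFactors[k] * itemFactors[i][k];
      }
      estimates[i] = dot;
    }
    return estimates;
  }
}
```
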
On Fri, Aug 26, 2011 at 12:07 AM, Jeff Hansen dsche...@gmail.com wrote:
I also think I may have missed a big step of the puzzle.
The problem may be your changes; I would start with exactly what's in the
distribution, since it works.
I am not sure that the demo will work when accessed through Jetty, as a web
service. I don't know that Jetty has Axis in it. But the servlet-based API
should work fine.
Really, you'd want to
Sean,
thank you for your hints. I have now used the original pom, where packaging is set
to jar, so no .war file is created. Do you have an idea what to change?
Thanks,
RK
-----Original Message-----
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Friday, 26 August 2011 16:38
The 200x10 matrix is indeed a matrix of 10 singular vectors, which are
eigenvectors of AA'. It's the columns, not rows, that are
eigenvectors.
The rows do mean something. I think it's fair to interpret the 10
singular values / vectors as corresponding to some underlying features
of tastes. The
Here are two ideas:
Recommend users to users.
Your users and items are both e-mail senders. The strength of the
association could be the number of e-mails from A to B (or perhaps the
logarithm). This would find people that people like you e-mail a lot.
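A sketch of that data prep: both "users" and "items" are e-mail senders, and the preference is the log-damped count of e-mails from A to B, in the user,item,pref CSV form the recommenders expect (illustrative code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MailPrefs {
  // counts: sender -> (recipient -> number of e-mails sent).
  // Output: "sender,recipient,pref" lines with log(1 + count) so heavy
  // senders don't dominate and a zero count stays zero.
  static List<String> toPrefs(Map<String, Map<String, Integer>> counts) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, Map<String, Integer>> from : counts.entrySet()) {
      for (Map.Entry<String, Integer> to : from.getValue().entrySet()) {
        double pref = Math.log(1.0 + to.getValue());
        lines.add(from.getKey() + "," + to.getKey() + "," + pref);
      }
    }
    return lines;
  }
}
```
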
Sounds interesting, if not immediately
you could do this check a little earlier, in a prior stage,
and save some computation. But this simple change would definitely
work, as a start.
On Mon, Aug 22, 2011 at 4:05 PM, Varnit Khanna varn...@gmail.com wrote:
On Sat, Aug 20, 2011 at 2:31 AM, Sean Owen sro...@gmail.com wrote:
The non
If you have the project set up in eclipse, it already understands all the
dependencies for whenever you run a class through Eclipse.
However, Hadoop is something else. You need to give it one unified jar file
with all classes. Eclipse can't do this. That's why the supplied Maven-based
build does
You are using it correctly.
I think the API is a little bad in that the method is only declared on the
abstract superclass of common implementations, and not in the interfaces.
So, your cast is needed, but maybe shouldn't be.
In general, I would advise you to avoid preference transforms. They
Use the latest code from Subversion -- this is what the wiki is referring
to. You will find there is now an integration/ module.
On Wed, Aug 17, 2011 at 2:12 AM, Ozgun Erdogan ozg...@gmail.com wrote:
Hi all,
I'm following the instructions on the Mahout wiki for launching a
non-distributed
Yes... though this is almost identical to just retrieving recommendations
from an external server in the first place!
If you download the user's neighborhood, including things those users like,
you have effectively downloaded a list of all recommendable items, and info
to rank them.
This isn't
Yes, I also doubt that the cost of parsing a simple select a,b,c from
x query matters compared to sending 80K records across the network.
On Tue, Aug 16, 2011 at 6:23 AM, Lance Norskog goks...@gmail.com wrote:
The standard advice also applies: use stored procedures if you can. If
not, use
I'm not the authoritative voice here, but I would also agree with your
interpretation -- it's indices rather than values that I'd use.
I can imagine using min-hash on values, but that would not seem to be
the most natural thing to do.
(I don't understand the comment about set and get(). Vectors
We're talking about the first article, and CassandraDataModel?
That is just a mistake, I'll fix it.
On Tue, Aug 16, 2011 at 1:21 PM, Marko Ciric ciric.ma...@gmail.com wrote:
Hi Sean,
Why is only userCache cleared on refresh?
list? Or would it have been better just to go ahead and submit a JIRA?
Thanks!
On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen sro...@gmail.com wrote:
I'm not the authoritative voice here, but I would also agree with your
interpretation -- it's indices rather than values that I'd use.
I can
(Since it's specifically about the book, might be better to post in the
Manning forums.)
The final version, which is a fair bit more up-to-date than the MEAP
version, is synced with 0.5. It was re-read by a technical proofreader to
make sure it all works, so I imagine most of this has been
It's more an artifact of history than design. When this project kicked off
it was pretty open-ended -- large scale machine learning. At some early
stage we merged in my (previous, independent) project called Taste, which
was all collaborative filtering and not Hadoop-based. So that's where this
? That way, we wouldn't have to pull down 80k records on every
reload?
On Mon, Aug 15, 2011 at 1:59 PM, Sean Owen sro...@gmail.com wrote:
That's more reasonable. It sounds a bit long still, but I could believe
it is due to the overhead of reading everything from the database.
It seems very
For the interested, I wrote a follow-up to this article, focusing on
using *Hadoop* with Cassandra and Mahout:
http://acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/
On Tue, Jul 26, 2011 at 6:22 PM, Sean Owen sro...@gmail.com wrote:
http://www.acunu.com/blogs/sean-owen
Something's very wrong there. 80K rows is tiny, and loads in a second
or so from a file. I think you want to figure out where the slow-down
is with some debugging, since I do not think it's the library. Is
something locking a table, excluding reads, for instance?
On Mon, Aug 15, 2011 at 8:02 PM,
very fast for us.
We did experience slow writing back to the database. But after we
disable JDBC auto-commit, then INSERT got very fast again.
On Mon, Aug 15, 2011 at 3:50 PM, Sean Owen sro...@gmail.com wrote:
Something's very wrong there. 80K rows is tiny, and loads in a second
or so from
Yes, it's just that it's much slower and takes up much more memory. You are
strongly encouraged to use numeric IDs and not bother with this adapter at
all. It's not a question of interning strings (and they need not be
consecutive IDs) but of avoiding them entirely.
On Thu, Aug 11, 2011 at 1:02 AM,
You may need to sharpen your terms / problem statement here :
What is a geometric value -- just mean a continuous real value?
So these are item-feature vectors?
The middle bit of the output of an SVD is not a singular vector -- it's a
diagonal matrix containing singular values on the diagonal.
against said tables.
-chuck
On Aug 10, 2011, at 11:34 PM, Sean Owen wrote:
Yes, it's just that it's much slower and takes up much more memory. You
are
strongly encouraged to use numeric IDs and not bother with this adapter
at
all. It's not a question of interning strings, and they need
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
temp/itemIDIndex already exists
Even if I clean out the /tmp/hadoop-clement/ folder beforehand...
And it doesn't seem to run on the cluster.
2011/8/10 Sean Owen sro...@gmail.com
You could just run the main() method with an array of the same arguments
you
passed
Oh my yes. The Java program that launches the job is quite separate from the
cluster. It could be on the same machine -- but by no means needs to be.
2011/8/10 Clément Notin clement.no...@gmail.com
Just to be clear, is it possible to run a Hadoop job from a machine
outside
of the cluster
Just put them in the same directory as the main input file. It will be smart
enough to not reload the main file if it has not changed.
I think the idea is that periodically you will just push and reload the new
complete data file, and remove old update files. But there's no limit to how
many
You don't need ARFF, no. You can write some Java code to write a
SequenceFile directly, one entry at a time. It would take a little study of
the code to understand how it works but it's probably just 10 lines.
What is the OME error?
Results can live wherever you want; HDFS is the most natural
*From:* Sean Owen sro...@gmail.com
*To:* user@mahout.apache.org; Jeffrey mycyber...@yahoo.com
*Sent:* Tuesday, August 9, 2011 2:54 PM
*Subject:* Re: Needs clue to create a Proof of Concept recommender
You don't need ARFF, no. You can write some Java code to write a
SequenceFile directly, one entry
,in fact no settings, I
just download them and unpack them, and then use them to do some test...
anyone meet this kind of problem ? thank you .
2011/8/5 Sean Owen sro...@gmail.com
I think you have changed your classpath to use 0.5, and in your new
classpath, you are not including all
. Unless
you can provide more info, that's about all one can say.
On Fri, Aug 5, 2011 at 4:28 PM, air cnwe...@gmail.com wrote:
But what should I do now? It is a very strange problem: 0.4 works well,
but 0.5 does not.
2011/8/5 Sean Owen sro...@gmail.com
OK, well the underlying problem
(Josh and I had spoken separately.)
I think he's interested in perhaps learning those similarities, indeed.
As a rough-and-ready start, I'd suggested pure collaborative filtering based
on user and item associations only. Later, you can work in user-user
similarity, learned elsewhere, to improve
Great report here. I imagine the answer is to make 'license' into
'licenses'. Let me have a look and file a JIRA with patch.
Sean
On Sun, Jul 31, 2011 at 6:41 PM, Dan Brickley dan...@danbri.org wrote:
With SVN 'At revision 1152597.', and freshly rebuilt:
jar -tvf
This is a problem with the Magenta repo it seems. It is nothing to do with
Mahout.
On Jul 28, 2011 7:00 AM, 张涛 494671...@163.com wrote:
Hi,everyone:
I'm building the Mahout source code, but I've encountered a problem that I cannot
solve myself. I hope to get your help, and I'd appreciate it very much!
This is Crunchbase?
If your goal is to classify on what the company *does*, then I think
you are best ignoring most data (funding, employees, etc.) and cluster
their descriptions and/or text of articles about them as if they are
documents. In this sense it is similar to 20 newsgroups, yes. You'd
(This just posted to the list, but I believe it's a duplicate of a
message from several days ago. See my previous response.)
On Wed, Jul 27, 2011 at 8:33 AM, MT mael.tho...@telecom-bretagne.eu wrote:
I'm working on a common dataset that includes the user id, item id, and
timestamp (the moment
ted.dunn...@gmail.com wrote:
Actually, I think that recommending words to people and then doing the
search may add some mileage.
On Wed, Jul 27, 2011 at 12:38 AM, Sean Owen sro...@gmail.com wrote:
It's just a search problem as Ted says -- minus
even the recommendation phase.
Is that all
The problem you've described is actually simpler than the 'classic'
recommendation problem, which is personalized per user.
All you want is a list of most-similar items. That's a lot easier. You could
easily roll your own by using an ItemSimilarity implementation and iterating
over all items. No
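A rough standalone sketch of rolling your own most-similar-items: score every other item against the target and take the top n. The similarity here is a simple stand-in (user-set overlap over boolean preferences); in Mahout you would plug in an ItemSimilarity implementation instead:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MostSimilar {
  // usersByItem: item -> set of users who preferred it.
  static List<String> topSimilar(String target,
                                 Map<String, Set<String>> usersByItem, int n) {
    List<String> items = new ArrayList<>(usersByItem.keySet());
    items.remove(target);
    Set<String> targetUsers = usersByItem.get(target);
    // Sort descending by number of users in common with the target item
    items.sort((a, b) -> Integer.compare(
        overlap(usersByItem.get(b), targetUsers),
        overlap(usersByItem.get(a), targetUsers)));
    return items.subList(0, Math.min(n, items.size()));
  }

  static int overlap(Set<String> a, Set<String> b) {
    Set<String> s = new HashSet<>(a);
    s.retainAll(b);
    return s.size();
  }
}
```
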
http://www.acunu.com/blogs/sean-owen/recommending-cassandra/
I put together this quick-and-dirty writeup on using Cassandra as a
backend for recommenders. May be of interest to anyone using Cassandra
and/or the non-distributed recommenders.
Sean
(Abhik this is nothing to do with Mahout, but the Manning forum
system. I will reply privately as this is not the place.)
On Tue, Jul 26, 2011 at 6:41 PM, Abhik Banerjee
banerjee.abhik@gmail.com wrote:
I get a message saying your post is more than 80 characters, fix that
On Mon, Jul 25, 2011 at 10:05 AM, MT mael.tho...@telecom-bretagne.euwrote:
In fact, correct me if I'm wrong, but to me the evaluator will invariably
give us the same value for precision and recall. Since the items are all
rated with the binary 1.0 value, we give the recommender a threshold
I think Ted is suggesting augmenting the vectors to (1,0,0,100) and
(10,0,0,100) and projecting onto the unit sphere in 4 dimensions. Then the
distance is not 0 on the surface of that sphere.
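A quick numeric check of that idea: (1,0,0) and (10,0,0) normalize to the same point, but appending a constant fourth component and then projecting onto the unit sphere separates them:

```java
class SphereDistance {
  // Scale v to unit length (project onto the unit sphere)
  static double[] normalize(double[] v) {
    double norm = 0.0;
    for (double x : v) norm += x * x;
    norm = Math.sqrt(norm);
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
    return out;
  }

  // Ordinary Euclidean distance between two equal-length vectors
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}
```
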
On Fri, Jul 22, 2011 at 7:29 AM, Jake Mannix jake.man...@gmail.com wrote:
(1, 0, 0) and (10, 0, 0) have
You mean, have the user specify all items that are considered relevant? yes
that could be useful. Do you have a patch in mind?
Your analysis is correct, and I would not call it a bug. It's a symptom of
how little information the evaluation has to work with here without ratings.
It has to pick
Yes just include the item in the DataModel but filter it at runtime
with a Rescorer. This sounds like exactly what you are looking for.
On Mon, Jul 18, 2011 at 10:26 PM, Jamey Wood jamey.w...@gmail.com wrote:
Is there any best practice for including user preferences for certain items
as a
(PS that's exactly Rescorer's role... just a hook for whatever biz
logic you want to filter by)
On Mon, Jul 18, 2011 at 10:52 PM, Ted Dunning ted.dunn...@gmail.com wrote:
I usually just post process the recommendations using a variety of business
logic rules.
This is simply Euclidean distance squared. Take the square root if you
need the simple Euclidean distance.
On Fri, Jul 15, 2011 at 12:36 PM, marco turchi marco.tur...@gmail.com wrote:
Dear All,
I'm a newcomer to Mahout and I'm trying to compute the cosine similarity
between two sparse vectors.
I
the initial cardinality
of each vector to a big value, but I was wondering if that is a huge waste of
memory, or whether everything is optimized inside the RandomAccessSparseVector
class. In that case, is there an optimal way to set the cardinality?
Thanks again
Marco
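For what it's worth, here is a standalone sketch of cosine similarity over sparse vectors stored as index-to-value maps, which never needs a preallocated cardinality. RandomAccessSparseVector likewise stores only the non-zero entries, so a large cardinality is not by itself a memory cost:

```java
import java.util.HashMap;
import java.util.Map;

class SparseCosine {
  // cosine = dot(a, b) / (|a| * |b|); only indices present in both maps
  // contribute to the dot product.
  static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double bv = b.get(e.getKey());
      if (bv != null) dot += e.getValue() * bv;
    }
    return dot / (norm(a) * norm(b));
  }

  static double norm(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double x : v.values()) sum += x * x;
    return Math.sqrt(sum);
  }
}
```
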
On Fri, Jul 15, 2011 at 1:50 PM, Sean Owen
yes that would probably be just fine for you too.
On Thu, Jul 14, 2011 at 4:14 PM, Abmar Barros abma...@gmail.com wrote:
Thanks for the reply Sean,
Another question: does the ReloadFromJDBCDataModel fit my case? Is it an
all-in-memory strategy?
(for uniqueness) for user_id, item_id pairs. That
way, there cannot be two entries with the same user_id, item_id. I'm
not sure where to go from here.
Thanks for the help!
On Tue, Jul 12, 2011 at 12:47 AM, Sean Owen sro...@gmail.com wrote:
You can ignore it. It just doesn't know for sure you
at 10:58 AM, Sean Owen sro...@gmail.com wrote:
Yes it reloads after a configurable interval, or on demand.
Clearing the cache for a user ID only means that user's data is
recomputed.
It's not bad to call this frequently per se... I suppose you want to let
it
cache as much and for as long
What's in the project now is all I know about. Yes if you want to use it
with the Hadoop-based similarity calculator, that's what you would extend.
How do you apply this metric to vectors?
On Wed, Jul 13, 2011 at 10:09 PM, Ian Upright ian-pub...@upright.netwrote:
Hello,
I'm looking for more
have. Perhaps the
original poster can clarify if this is what they want?
[1] http://en.wikipedia.org/wiki/Hellinger_distance
On Wed, Jul 13, 2011 at 2:14 PM, Sean Owen sro...@gmail.com wrote:
How do you apply this metric to vectors?
You can ignore it. It just doesn't know for sure you have a pool.
I believe I have even removed this in a recent refactoring.
On Tue, Jul 12, 2011 at 2:21 AM, Salil Apte sa...@offlinelabs.com wrote:
So I keep getting this warning from either Mahout or the server (I'm
guessing the former):
That is an error in Maven then, not Mahout. Without seeing any more info it
would be hard to say more.
On Mon, Jul 11, 2011 at 9:04 AM, Sören Dierkes
soeren.dier...@informatik.uni-oldenburg.de wrote:
Hi everyone,
I have a few problems. I just built Mahout with mvn install and got a few
, 2011 at 12:50 AM, Sean Owen sro...@gmail.com wrote:
The tests passed at 0.5, and now at trunk. I have tested on Apple JVMs
and
I
assume Jenkins uses a Sun JVM.
I doubt it is a JVM issue, though possible. But I don't see what it has
to
do with being a product?
On Sat, Jul 9, 2011 at 6
Yeah that's not it. It's just internal exceptions because a test failed.
What were you referring to by org.apache.hadoop.security.UserGroupInformation
(initialization failure) ? That sounds closer to a problem.
(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest):
expected:{[D=0, E=1, A=0, B=0, C]=1} but was:{[A=0, B=0, C=1, D=0,
E]=1}
On Mon, Jul 11, 2011 at 12:05 PM, Sean Owen sro...@gmail.com wrote:
Yeah that's not it. It's just internal exceptions because a test failed.
What were you referring
I strongly suspect it is because we updated to Hadoop 0.20.203.0. 0.5
depended on 0.20.2. You could locally go back to depending on 0.20.2; it
ought to still work fine with the previous version.
On Mon, Jul 11, 2011 at 8:14 PM, Marc Millstone millst...@gmail.com wrote:
Ok. I am starting to
.
Feel free to contact me off list if this is getting to be too detail
oriented, etc for this list.
On Mon, Jul 11, 2011 at 12:18 PM, Sean Owen sro...@gmail.com wrote:
I strongly suspect it is because we updated to Hadoop 0.20.203.0. 0.5
depended on 0.20.2. You could locally go back
. (I can't be the only person using the IBM JVM with
Mahout).
On Mon, Jul 11, 2011 at 12:47 PM, Sean Owen sro...@gmail.com wrote:
It sounds like the actual test failure does not occur in HEAD, so sounds
like whatever it is is already fixed or rendered moot.
The issue about Hadoop
So it sounds like you want the SVD of the item-item similarity matrix? Sure,
you can use Mahout for that. If you are not in Hadoop land then look at
SVDRecommender to crib some related code. It is decomposing the user-item
matrix though.
But for this special case of a symmetric matrix your
The tests passed at 0.5, and now at trunk. I have tested on Apple JVMs and I
assume Jenkins uses a Sun JVM.
I doubt it is a JVM issue, though possible. But I don't see what it has to
do with being a product?
On Sat, Jul 9, 2011 at 6:11 AM, Lance Norskog goks...@gmail.com wrote:
Mahout releases
The link is http://www.occamslab.com/petricek/data/
The KDD or Netflix data are plenty big to play with. How big is big for your
purpose?
On Fri, Jul 8, 2011 at 7:05 AM, web service wbs...@gmail.com wrote:
Is it taken offline as well ?
On Thu, Jul 7, 2011 at 10:40 PM, Alex Kozlov
(I've just removed that old page to avoid confusion.)
On Fri, Jul 8, 2011 at 1:46 PM, Maël Thomas mael.tho...@telecom-bretagne.eu
wrote:
Hello
The page http://taste.sourceforge.net/ about Apache Mahout has multiple
broken links, it could be good to update them to redirect to
A block is a piece of a file. It does not (necessarily) have a meaning, or a
file format, by itself. You would not address HDFS blocks individually
from this level. So I suppose the first answer is, no, they do not have
different formats, though the question is not well-formed.
You can have
Yes, but, my point is that it doesn't quite make sense to do such a thing in
MapReduce. Only one mapper will see the header, but, presumably all mappers
need that info. If it's a bit of metadata, pass it in the Configuration
object as a String. If it's a lot, put it in the DistributedCache (or on
I think it's just CSV, but I don't know.
On Wed, Jul 6, 2011 at 11:32 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote:
OK, that's why Mahout needs a file descriptor. And what's the difference
between CSV and UCI?
On Wed, Jul 6, 2011 at 10:02 PM, Carlos Seminario recsysu...@gmail.comwrote:
Although this is certainly a sound approach, other approaches have been
suggested in the literature as cited in
https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation.
Can you please provide
I committed a change to make the parsing bits I found in .bayes. use space
and tab. You can try again. I confess I don't know this code and there's a
lot of little pieces of parsing here and there so don't know if this is the
heart of the issue.
On Mon, Jul 4, 2011 at 4:08 PM, Vijay Santhanam
Erm, yes. What is your question?
On Tue, Jul 5, 2011 at 1:30 PM, rmx ruimax...@hotmail.com wrote:
Is this project still alive??
Please...
Thanks
This could be my doing. I noticed that various bits of code split
input files in different ways: StringTokenizer, Pattern, Splitter. And
using different delimiters: space, space/tab, or the weird collection
of delimiters from StringTokenizer. (BTW StringTokenizer is all but
deprecated for this
Yes, this is trading memory for speed. If you can fit everything in memory,
then you should. FileDataModel is in memory.
MySQLJDBCDataModel is not in memory and queries the DB every time. This is
pretty slow, though by caching item-item similarity as you do, a lot of the
load is removed. However
Yes. Both are just fine to use in production. For speed and avoiding abuse
of the database, I'd load into memory and tell it to periodically reload.
But that too is a bit of a choice between how often you want to consume new
data and how much work you want to do to recompute new values.
On Mon,
On Sun, Jul 3, 2011 at 8:05 AM, Ted Dunning ted.dunn...@gmail.com wrote:
For instance, if the recommendation engine recommends B if you have seen A
and there is little other way to discover C which is ranked rather low (and
thus never seen), then there is no way for the engine to even get
I don't see why one would believe that the randomly selected items
farther down the list are more likely to engage a user. If anything,
the recommender says they are less likely to be engaging.
(Or put another way, by this reasoning, we ought to pick
recommendations at random.)
I do think that
Yes, it's a good idea. Usually it serves a purpose for evaluation
only. You know the relative strength of recommendations, and know how
much ranking them 1st, 2nd, 3rd, etc biases the user to click on them.
So you can predict how many clicks each should relatively get. And you
can easily pull up
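That prediction step can be sketched in a few lines: given each slot's recommendation strength and a position-bias curve (how much slot 1, 2, 3... inflates clicks), the expected relative click count is their product. The bias numbers in the test are made-up illustrations, not measured data:

```java
class RankBias {
  // strengths[i]   = relative strength of the recommendation in slot i
  // positionBias[i] = click multiplier for slot i (slot 1 highest)
  // Returns the predicted relative click count per slot.
  static double[] expectedClicks(double[] strengths, double[] positionBias) {
    double[] out = new double[strengths.length];
    for (int i = 0; i < strengths.length; i++) {
      out[i] = strengths[i] * positionBias[i];
    }
    return out;
  }
}
```

Comparing these predictions against observed clicks is what tells you whether the ranking itself, beyond position bias, is doing any work.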
is probably
justified.
On Sat, Jul 2, 2011 at 12:56 AM, Sean Owen sro...@gmail.com wrote:
Yes, it's a good idea. Usually it serves a purpose for evaluation
only. You know the relative strength of recommendations, and know how
much ranking them 1st, 2nd, 3rd, etc biases the user to click on them