Re: why use the job 'itemIDIndex' to convert the itemid to index?

2011-09-20 Thread Sean Owen
It is a problem -- but should be rare. IDs are hashed to 31-bit integers, so the probability of collision is small. However you don't have to have too many items before it's probable that some two have collided. (IIRC, that's about 2 ^ (31/2) ? ) In practice it doesn't hurt much. It just means

Re: why use the job 'itemIDIndex' to convert the itemid to index?

2011-09-20 Thread Sean Owen
It is necessary. We want to support input where IDs are possibly 64-bit longs, for consistency with the non-distributed code. But, 64-bit values are too large to be used as indexes into a Vector. So they are hashed and then un-hashed by a dictionary lookup. On Tue, Sep 20, 2011 at 11:44 AM, 张玉东
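
To make the "hashed and then un-hashed by a dictionary lookup" idea above concrete, here is a minimal sketch in plain Java. It is an illustration only, not Mahout's code: the class IdIndexSketch and its methods are hypothetical names, and folding the long with Long.hashCode and masking to 31 bits is an assumed hashing scheme.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of Mahout.
class IdIndexSketch {

    // Dictionary used to "un-hash": index -> original 64-bit ID.
    private final Map<Integer, Long> indexToId = new HashMap<>();

    /** Folds a 64-bit ID into a non-negative 31-bit int usable as a vector index. */
    static int idToIndex(long id) {
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    /** Records the index -> ID mapping and returns the index. */
    int register(long id) {
        int index = idToIndex(id);
        // If two IDs collide, the later one overwrites the earlier one here.
        indexToId.put(index, id);
        return index;
    }

    /** Returns the original ID for an index, or null if the index is unknown. */
    Long lookup(int index) {
        return indexToId.get(index);
    }

    public static void main(String[] args) {
        IdIndexSketch dictionary = new IdIndexSketch();
        long itemId = 123456789012345L;
        int index = dictionary.register(itemId);
        System.out.println(itemId + " -> " + index + " -> " + dictionary.lookup(index));
    }
}
```

The dictionary is what makes the round trip possible: the hash alone cannot be inverted, so the original ID has to be stored next to its index.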

Re: Re: Problem with MySQLJDBCDiffStorage

2011-09-17 Thread Sean Owen
Great I will commit this change then. Lance in this case it's the code generating the query -- not sure it's possible or helpful to use a view. On Sat, Sep 17, 2011 at 3:48 AM, orbz orb...@gmail.com wrote: Hopefully this will work as I didn't subscribe to the list before posting - but yes,

Re: Problem with MySQLJDBCDiffStorage

2011-09-16 Thread Sean Owen
That just looks like a bug. Yes you should add those two lines. Maybe you are in a better position to then debug the syntax error, if you have MySQL warmed up: does it work if you surround both sub-SELECTs with parentheses? On Sat, Sep 17, 2011 at 12:08 AM, orbz orb...@gmail.com wrote: Java

Re: Ensemble recommender?

2011-09-14 Thread Sean Owen
There isn't. For the recommenders that work by computing an estimated preference value for items, I suppose you could average their estimates and rank by that. More crudely, you could stitch together the recommendations of recommender 1 and 2 by taking the top 10 amongst each of their top

Re: RecommenderJob Mahout Creating a data model

2011-09-14 Thread Sean Owen
What do you mean by isolate the data model building step? You can run or re-run any step you want in the chain. So I guess the answer to 2 is yes, if you mean computed item-item similarities. But these will change slowly over time and need to be recomputed sometimes. MapReduce is never ever

Re: Spec for a common import/export service for Mahout jobs

2011-09-13 Thread Sean Owen
On Tue, Sep 13, 2011 at 6:27 AM, Lance Norskog goks...@gmail.com wrote: Machine learning has quite a few algorithms where data is processed in a way foreign to its domain. Running SVD on user/item/preference matrices is a great example: this makes no sense whatsoever. (Why?? this is one of the

Re: why so many place does`t set job.setNumReduceTasks

2011-09-13 Thread Sean Owen
MapReduce ought to control the number of workers reasonably well, and you can override with mapred.reduce.tasks if you want. I don't think any fixed number works: what's right for 2 machines isn't right for 200. 2011/9/13 myn m...@163.com:  private static void startDFCounting(Path input, Path

Re: Spec for a common import/export service for Mahout jobs

2011-09-12 Thread Sean Owen
I think we discussed several of these points on the mailing list. I am not sure I would ever expect there to be a common format across all jobs. They just don't all operate on the same information. Even where two jobs ingest vectors, it doesn't mean vectors for one are meaningful for another. If

Re: Recommendation with a dataset with no/same preference

2011-09-11 Thread Sean Owen
This is small enough that you can fit this into memory on one machine, and you do not need Hadoop. I would simply start with a GenericBooleanPrefItemBasedRecommender, and attach it to a LogLikelihoodSimilarity similarity metric. Wrap the LogLikelihoodSimilarity in a CachingItemSimilarity. You can

Re: Mahout explain process

2011-09-08 Thread Sean Owen
I think he or she is just referring to the method ItemBasedRecommender.recommendedBecause(). This is as close to an explain operation as there is in the API. In reality recommendations are a function of all data. In practice, what you are asking for is the items most similar to well-liked items.

Re: LDA on single node is much faster than 20 nodes

2011-09-07 Thread Sean Owen
I see. On EMR, I think the setting you need is mapred.tasktracker.map.tasks.minimum. At least that's what I see digging through my old EMR code. Dhruv, yes a lot of these settings are just suggestions to the framework. I am not entirely clear on the heuristics used, but I do know that Jake is

Re: Problem in building mahout

2011-09-07 Thread Sean Owen
You haven't posted any information about the error... On Wed, Sep 7, 2011 at 1:09 PM, Mani sree.9884516...@gmail.com wrote: Hi Am new to mahout . Apache Maven 3.0.3 (r1075438; 2011-02-28 23:01:09+0530) Maven home: /home/lnx-usr/Desktop/Paddu/apache-maven-3.0.3 Java version: 1.7.0, vendor:

Re: LDA on single node is much faster than 20 nodes

2011-09-06 Thread Sean Owen
Running on a real cluster increases the amount of work done, and significantly, as compared to one node: now, data actually has to be transferred on/off the machine! Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not sure what instance type you are using but I got better

Re: LDA on single node is much faster than 20 nodes

2011-09-06 Thread Sean Owen
That's your biggest issue, certainly. Only 2 mappers are running, even though you have 20 machines available. Hadoop determines the number of mappers based on input size, and your input isn't so big that it thinks you need 20 workers. It's launching 33 reducers, so your cluster is put to use

Re: question on writing a customized item similiarity function

2011-09-03 Thread Sean Owen
If you are referring to the non-distributed similarity function, it is quite easy: implement UserSimilarity or ItemSimilarity and use that as your similarity function. How you implement is up to you. If you say more about what you need to do, maybe people can suggest the right logic. On Sat, Sep

Re: Email and Collab. Filtering

2011-09-01 Thread Sean Owen
Your input needs to be CSV if you want to use it all as-is. But, it quickly creates vectors out of things, so really you can comment out the first mapper that creates user vectors, and just wire it to use yours instead. It should do all the rest from there. On Thu, Sep 1, 2011 at 2:58 PM, Grant

Re: seq2sparse fails: org.apache.lucene.analysis.Analyzer not found

2011-08-31 Thread Sean Owen
Try HEAD from Subversion. I think it's been addressed, but that change of course would not have gone back and shown up in 0.5. On Wed, Aug 31, 2011 at 2:06 PM, Andrea Leistra andrea.leis...@concur.com wrote: I've seen a number of people reporting this problem on the list on the past few months,

Re: Email and Collab. Filtering

2011-08-31 Thread Sean Owen
. In between, everything works in terms of those ints. It all works fine this way. On Wed, Aug 31, 2011 at 4:21 PM, Grant Ingersoll gsing...@apache.org wrote: On Aug 22, 2011, at 12:14 PM, Sean Owen wrote: Here are two ideas: Recommend threads to users. Users are people, items are threads

Re: Email and Collab. Filtering

2011-08-31 Thread Sean Owen
mapping. On Wed, Aug 31, 2011 at 4:44 PM, Grant Ingersoll gsing...@apache.org wrote: On Aug 31, 2011, at 11:26 AM, Sean Owen wrote: Is the problem not just a matter of translating from the original identifiers to ints, so they can be used as offsets into a vector, and then back again

Re: Email and Collab. Filtering

2011-08-31 Thread Sean Owen
be left to the caller. On Wed, Aug 31, 2011 at 4:52 PM, Grant Ingersoll gsing...@apache.org wrote: On Aug 31, 2011, at 11:47 AM, Sean Owen wrote: No it still wants user,item[,rating] input. But otherwise yes, it's translated and un-translated internally as needed. You could change

Re: Mahout : To read a custom input fileString,String,Double

2011-08-30 Thread Sean Owen
These don't come into play, directly, in collaborative filtering. You can carry them around and do whatever you like with them, separately. On Tue, Aug 30, 2011 at 8:02 AM, Lance Norskog goks...@gmail.com wrote: The GroupLens other datasets come with separate itemID-movie name/genre and

Re: Mahout : To read a custom input fileString,String,Double

2011-08-29 Thread Sean Owen
Really, the best thing is to use numeric IDs. Hash the string or otherwise turn them into numbers first. If you really need to work with Strings, see the IDMigrator class which provides a little automatic help in doing so. On Mon, Aug 29, 2011 at 3:04 PM, Amit Mahale mahale.a...@gmail.com wrote:

Re: How to get recommendation demo example working

2011-08-27 Thread Sean Owen
No there is not. On Sat, Aug 27, 2011 at 8:33 AM, Ramo Karahasan ramo.karaha...@googlemail.com wrote: Thank you Sean, i'll try that today. Is there an similar example for classification/classify with an web application?

Re: Singular vectors of a recommendation Item-Item space

2011-08-26 Thread Sean Owen
That's correct. Well you just have to recompose the user row you are interested in. It will no longer be sparse, at all. Those new values are your estimated ratings. On Fri, Aug 26, 2011 at 12:07 AM, Jeff Hansen dsche...@gmail.com wrote: I also think I may have missed a big step of the puzzle.

Re: How to get recommendation demo example working

2011-08-26 Thread Sean Owen
The problem may be your changes; I would start with exactly what's in the distribution, since it works. I am not sure that the demo will work when accessed through Jetty, as a web service. I don't know that Jetty has Axis in it. But the servlet-based API should work fine. Really, you'd want to

Re: How to get recommendation demo example working

2011-08-26 Thread Sean Owen
Sean, thank you for your hints. I have now used the original pom, where packaging is set to .jar, so no .war file is created. Do you have an idea what to change? Thanks, RK -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Friday, 26 August 2011 16:38

Re: Singular vectors of a recommendation Item-Item space

2011-08-25 Thread Sean Owen
The 200x10 matrix is indeed a matrix of 10 singular vectors, which are eigenvectors of AA'. It's the columns, not rows, that are eigenvectors. The rows do mean something. I think it's fair to interpret the 10 singular values / vectors as corresponding to some underlying features of tastes. The

Re: Email and Collab. Filtering

2011-08-22 Thread Sean Owen
Here are two ideas: Recommend users to users. Your users and items are both e-mail senders. The strength of the association could be the number of e-mails from A to B (or perhaps the logarithm). This would find people that people like you e-mail a lot. Sounds interesting, if not immediately

Re: Recommending items with temporal restrictions

2011-08-22 Thread Sean Owen
you could do this check a little earlier, in a prior stage, and save some computation. But this simple change would definitely work, as a start. On Mon, Aug 22, 2011 at 4:05 PM, Varnit Khanna varn...@gmail.com wrote: On Sat, Aug 20, 2011 at 2:31 AM, Sean Owen sro...@gmail.com wrote: The non

Re: Mahout project running in eclipse

2011-08-19 Thread Sean Owen
If you have the project set up in Eclipse, it already understands all the dependencies for whenever you run a class through Eclipse. However, Hadoop is something else. You need to give it one unified jar file with all classes. Eclipse can't do this. That's why the supplied Maven-based build does

Re: How to use PreferenceTransform?

2011-08-18 Thread Sean Owen
You are using it correctly. I think the API is a little bad in that the method is only declared on the abstract superclass of common implementations, and not in the interfaces. So, your cast is needed, but maybe shouldn't be. In general, I would advise you to avoid preference transforms. They

Re: How to launch a single-node recommender service?

2011-08-17 Thread Sean Owen
Use the latest code from Subversion -- this is what the wiki is referring to. You will find there is now an integration/ module. On Wed, Aug 17, 2011 at 2:12 AM, Ozgun Erdogan ozg...@gmail.com wrote: Hi all, I'm following the instructions on the Mahout wiki for launching a non-distributed

Re: Single-user recommenders?

2011-08-17 Thread Sean Owen
Yes... though this is almost identical to just retrieving recommendations from an external server in the first place! If you download the user's neighborhood, including things those users like, you have effectively downloaded a list of all recommendable items, and info to rank them. This isn't

Re: Slow ReloadFromJDBCDataModel

2011-08-16 Thread Sean Owen
Yes, I also doubt that the cost of parsing a simple select a,b,c from x query matters compared to sending 80K records across the network. On Tue, Aug 16, 2011 at 6:23 AM, Lance Norskog goks...@gmail.com wrote: The standard advice also applies: use stored procedures if you can. If not, use

Re: MinHash implementation

2011-08-16 Thread Sean Owen
I'm not the authoritative voice here, but I would also agree with your interpretation -- it's indices rather than values that I'd use. I can imagine using min-hash on values, but that would not seem to be the most natural thing to do. (I don't understand the comment about set and get(). Vectors

Re: Article on Mahout recommenders and Cassandra

2011-08-16 Thread Sean Owen
We're talking about the first article, and CassandraDataModel? That is just a mistake, I'll fix it. On Tue, Aug 16, 2011 at 1:21 PM, Marko Ciric ciric.ma...@gmail.com wrote: Hi Sean, Why is only userCache cleared on refresh?

Re: MinHash implementation

2011-08-16 Thread Sean Owen
list?  Or would it have been better just to go ahead and submit a JIRA? Thanks! On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen sro...@gmail.com wrote: I'm not the authoritative voice here, but I would also agree with your interpretation -- it's indices rather than values that I'd use. I can

Re: df-count/data does not exist

2011-08-16 Thread Sean Owen
(Since it's specifically about the book, might be better to post in the Manning forums.) The final version, which is a fair bit more up-to-date than the MEAP version, is synced with 0.5. It was re-read by a technical proofreader to make sure it all works, so I imagine most of this has been

Re: Vectors vs Preferences

2011-08-16 Thread Sean Owen
It's more an artifact of history than design. When this project kicked off it was pretty open-ended -- large scale machine learning. At some early stage we merged in my (previous, independent) project called Taste, which was all collaborative filtering and not Hadoop-based. So that's where this

Re: Slow ReloadFromJDBCDataModel

2011-08-16 Thread Sean Owen
? That way, we wouldn't have to pull down 80k records on every reload? On Mon, Aug 15, 2011 at 1:59 PM, Sean Owen sro...@gmail.com wrote: That's more reasonable. It sounds a bit long still but could believe it is due to the overhead of reading everything from the database. It seems very

Re: Article on Mahout recommenders and Cassandra

2011-08-15 Thread Sean Owen
For the interested, I wrote a follow-up to this article, focusing on using *Hadoop* with Cassandra and Mahout: http://acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/ On Tue, Jul 26, 2011 at 6:22 PM, Sean Owen sro...@gmail.com wrote: http://www.acunu.com/blogs/sean-owen

Re: Slow ReloadFromJDBCDataModel

2011-08-15 Thread Sean Owen
Something's very wrong there. 80K rows is tiny, and loads in a second or so from a file. I think you want to figure out where the slow-down is with some debugging, since I do not think it's the library. Is something locking a table, excluding reads, for instance? On Mon, Aug 15, 2011 at 8:02 PM,

Re: Slow ReloadFromJDBCDataModel

2011-08-15 Thread Sean Owen
very fast for us. We did experience slow writing back to the database. But after we disable JDBC auto-commit, then INSERT got very fast again. On Mon, Aug 15, 2011 at 3:50 PM, Sean Owen sro...@gmail.com wrote: Something's very wrong there. 80K rows is tiny, and loads in a second or so from

Re: FileDataModel / FileIDMigrator

2011-08-11 Thread Sean Owen
Yes, it's just that it's much slower and takes up much more memory. You are strongly encouraged to use numeric IDs and not bother with this adapter at all. It's not a question of interning strings, and they need not be consecutive IDs, but avoiding them entirely. On Thu, Aug 11, 2011 at 1:02 AM,

Re: Singular vectors of a recommendation Item-Item space

2011-08-11 Thread Sean Owen
You may need to sharpen your terms / problem statement here: What is a geometric value -- do you just mean a continuous real value? So these are item-feature vectors? The middle bit of the output of an SVD is not a singular vector -- it's a diagonal matrix containing singular values on the diagonal.

Re: FileDataModel / FileIDMigrator

2011-08-11 Thread Sean Owen
against said tables. -chuck On Aug 10, 2011, at 11:34 PM, Sean Owen wrote: Yes, it's just that it's much slower and takes up much more memory. You are strongly encouraged to use numeric IDs and not bother with this adapter at all. It's not a question of interning strings, and they need

Re: How to launch an Hadoop Recommender Job from Java ?

2011-08-10 Thread Sean Owen
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/itemIDIndex already exists Even if I clean before the /tmp/hadoop-clement/ folder... And it don't seems to run on the cluster. 2011/8/10 Sean Owen sro...@gmail.com You could just run the main() method with an array of the same arguments you passed

Re: How to launch an Hadoop Recommender Job from Java ?

2011-08-10 Thread Sean Owen
Oh my yes. The Java program that launches the job is quite separate from the cluster. It could be on the same machine -- but by no means needs to be. 2011/8/10 Clément Notin clement.no...@gmail.com Just to be clear, is it possible to run an hadoop job from a machine outside of the cluster

Re: Update files and refresh

2011-08-10 Thread Sean Owen
Just put them in the same directory as the main input file. It will be smart enough to not reload the main file if it has not changed. I think the idea is that periodically you will just push and reload the new complete data file, and remove old update files. But there's no limit to how many

Re: Needs clue to create a Proof of Concept recommender

2011-08-09 Thread Sean Owen
You don't need ARFF, no. You can write some Java code to write a SequenceFile directly, one entry at a time. It would take a little study of the code to understand how it works but it's probably just 10 lines. What is the OME error? Results can live wherever you want; HDFS is the most natural

Re: Needs clue to create a Proof of Concept recommender

2011-08-09 Thread Sean Owen
:* Sean Owen sro...@gmail.com *To:* user@mahout.apache.org; Jeffrey mycyber...@yahoo.com *Sent:* Tuesday, August 9, 2011 2:54 PM *Subject:* Re: Needs clue to create a Proof of Concept recommender You don't need ARFF, no. You can write some Java code to write a SequenceFile directly, one entry

Re: this is a bug of mahout 0.5 ?

2011-08-05 Thread Sean Owen
,in fact no settings, I just download them and unpack them, and then use them to do some test... anyone meet this kind of problem ? thank you . 2011/8/5 Sean Owen sro...@gmail.com I think you have changed your classpath to use 0.5, and in your new classpath, you are not including all

Re: this is a bug of mahout 0.5 ?

2011-08-05 Thread Sean Owen
. Unless you can provide more info, that's about all one can say. On Fri, Aug 5, 2011 at 4:28 PM, air cnwe...@gmail.com wrote: but, what should I do now , it is a very strange problem, 0.4 works well, but 0.5 does not [?] 2011/8/5 Sean Owen sro...@gmail.com OK, well the underlying problem

Re: Understanding Mahout Algos and Applications

2011-08-04 Thread Sean Owen
(Josh and I had spoken separately.) I think he's interested in perhaps learning those similarities, indeed. As a rough-and-ready start, I'd suggested pure collaborative filtering based on user and item associations only. Later, you can work in user-user similarity, learned elsewhere, to improve

Re: OSX/Hadoop problem: filename 'LICENSE' and dir 'license/' clash in mahout-examples-0.6-SNAPSHOT-job.jar

2011-08-01 Thread Sean Owen
Great report here. I imagine the answer is to make 'license' into 'licenses'. Let me have a look and file a JIRA with patch. Sean On Sun, Jul 31, 2011 at 6:41 PM, Dan Brickley dan...@danbri.org wrote: With SVN 'At revision 1152597.', and freshly rebuilt: jar -tvf

Re: About Building Mahout

2011-07-28 Thread Sean Owen
This is a problem with the Magenta repo it seems. It is nothing to do with Mahout. On Jul 28, 2011 7:00 AM, 张涛 494671...@163.com wrote: Hi,everyone: I'm building Mahout sourcecode , but I encounter a problem which can not be solved by myself , I hope get your help and I'll appreciate very much!

Re: Classification on Techcrunch

2011-07-27 Thread Sean Owen
This is Crunchbase? If your goal is to classify on what the company *does*, then I think you are best ignoring most data (funding, employees, etc.) and cluster their descriptions and/or text of articles about them as if they are documents. In this sense it is similar to 20 newsgroups, yes. You'd

Re: Mahout Binary Recommender Evaluation

2011-07-27 Thread Sean Owen
(This just posted to the list, but I believe it's a duplicate of a message from several days ago. See my previous response.) On Wed, Jul 27, 2011 at 8:33 AM, MT mael.tho...@telecom-bretagne.eu wrote: I'm working on a common dataset that includes the user id, item id, and timestamp (the moment

Re: Advice request

2011-07-27 Thread Sean Owen
ted.dunn...@gmail.com wrote: Actually, I think that recommending words to people and then doing the search may add some mileage. On Wed, Jul 27, 2011 at 12:38 AM, Sean Owen sro...@gmail.com wrote: It's just a search problem as Ted says -- minus even the recommendation phase. Is that all

Re: Item based recommendations

2011-07-26 Thread Sean Owen
The problem you've described is actually simpler than the 'classic' recommendation problem, which is personalized per user. All you want is a list of most-similar items. That's a lot easier. You could easily roll your own by using an ItemSimilarity implementation and iterating over all items. No

Article on Mahout recommenders and Cassandra

2011-07-26 Thread Sean Owen
http://www.acunu.com/blogs/sean-owen/recommending-cassandra/ I put together this quick-and-dirty writeup on using Cassandra as a backend for recommenders. May be of interest to anyone using Cassandra and/or the non-distributed recommenders. Sean

Re: using Integer array with NamedVector or RandomAccessSparseVector

2011-07-26 Thread Sean Owen
(Abhik this is nothing to do with Mahout, but the Manning forum system. I will reply privately as this is not the place.) On Tue, Jul 26, 2011 at 6:41 PM, Abhik Banerjee banerjee.abhik@gmail.com wrote: I get a message saying your post is more than 80 characters, fix that

Re: Mahout Binary Recommender Evaluator

2011-07-25 Thread Sean Owen
On Mon, Jul 25, 2011 at 10:05 AM, MT mael.tho...@telecom-bretagne.eu wrote: In fact, correct me if I'm wrong, but to me the evaluator will invariably give us the same value for precision and recall. Since the items are all rated with the binary 1.0 value, we give the recommender a threshold

Re: Preserving pairwise distances while normalizing vectors

2011-07-22 Thread Sean Owen
I think Ted is suggesting augmenting the vectors to (1,0,0,100) and (10,0,0,100) and projecting onto the unit sphere in 4 dimensions. Then the distance is not 0 on the surface of that sphere. On Fri, Jul 22, 2011 at 7:29 AM, Jake Mannix jake.man...@gmail.com wrote: (1, 0, 0) and (10, 0, 0) have
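
As a worked illustration of the augmentation described above, the following plain-Java sketch appends a constant fourth coordinate of 100 to (1,0,0) and (10,0,0), projects both onto the unit sphere, and compares distances. The class and method names are hypothetical and only the two example vectors and the constant 100 come from the message; nothing here is Mahout API.

```java
// Hypothetical illustration; not part of Mahout.
class AugmentedNormalizeSketch {

    /** Returns a copy of v with one extra trailing coordinate set to c. */
    static double[] augment(double[] v, double c) {
        double[] out = new double[v.length + 1];
        System.arraycopy(v, 0, out, 0, v.length);
        out[v.length] = c;
        return out;
    }

    /** Scales v to unit Euclidean length (v must be non-zero). */
    static double[] normalize(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += x * x;
        }
        double norm = Math.sqrt(sumSquares);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sumSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sumSquares += d * d;
        }
        return Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        double[] a = {1, 0, 0};
        double[] b = {10, 0, 0};
        // Plain normalization collapses both vectors onto the same point.
        System.out.println("plain:     " + distance(normalize(a), normalize(b)));
        // With the extra coordinate, the projected points stay apart.
        System.out.println("augmented: "
            + distance(normalize(augment(a, 100)), normalize(augment(b, 100))));
    }
}
```

The larger the constant relative to the original magnitudes, the less the projection distorts the original pairwise distances, at the cost of compressing all points into a small patch of the sphere.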

Re: Evaluating boolean preference data sets

2011-07-21 Thread Sean Owen
You mean, have the user specify all items that are considered relevant? Yes, that could be useful. Do you have a patch in mind? Your analysis is correct, and I would not call it a bug. It's a symptom of how little information the evaluation has to work with here without ratings. It has to pick

Re: Including Unrecommendable Items

2011-07-18 Thread Sean Owen
Yes just include the item in the DataModel but filter it at runtime with a Rescorer. This sounds like exactly what you are looking for. On Mon, Jul 18, 2011 at 10:26 PM, Jamey Wood jamey.w...@gmail.com wrote: Is there any best practice for including user preferences for certain items as a

Re: Including Unrecommendable Items

2011-07-18 Thread Sean Owen
(PS that's exactly Rescorer's role... just a hook for whatever biz logic you want to filter by) On Mon, Jul 18, 2011 at 10:52 PM, Ted Dunning ted.dunn...@gmail.com wrote: I usually just post process the recommendations using a variety of business logic rules.

Re: Similarity between sparse vectors

2011-07-15 Thread Sean Owen
This is simply Euclidean distance squared. Take the square root if you need the simple Euclidean distance. On Fri, Jul 15, 2011 at 12:36 PM, marco turchi marco.tur...@gmail.com wrote: Dear All, I'm a newcomer in Mahout and I'm try to compute the cosine similarity between two sparse vectors. I
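
The relationship in the reply above (squared distance versus plain Euclidean distance) can be sketched for sparse vectors as follows. This is a stand-alone illustration using java.util maps as a stand-in for a sparse vector; the class and method names are hypothetical and it does not use Mahout's vector classes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; a Map<Integer, Double> stands in for a sparse vector.
class SparseDistanceSketch {

    /** Sum of squared differences over every index set in either vector. */
    static double distanceSquared(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> indices = new HashSet<>(a.keySet());
        indices.addAll(b.keySet());
        double sum = 0.0;
        for (int index : indices) {
            // Missing entries are treated as zero, as in a sparse vector.
            double diff = a.getOrDefault(index, 0.0) - b.getOrDefault(index, 0.0);
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = new HashMap<>();
        a.put(0, 3.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(7, 4.0);
        double squared = distanceSquared(a, b);
        System.out.println("squared distance:   " + squared);
        System.out.println("Euclidean distance: " + Math.sqrt(squared));
    }
}
```

Keeping the squared form is often enough when distances are only compared with each other, since taking the square root does not change their order.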

Re: Similarity between sparse vectors

2011-07-15 Thread Sean Owen
the initial cardinality of each vector to a big value, but I was wondering if it is a huge waste of memory or everything is optimized inside the RandomAccessSparseVector class. In case, is there an optimal way to set the cardinality? Thanks again Marco On Fri, Jul 15, 2011 at 1:50 PM, Sean Owen

Re: ItemSimilarity pre-processing

2011-07-14 Thread Sean Owen
Yes, that would probably be just fine for you too. On Thu, Jul 14, 2011 at 4:14 PM, Abmar Barros abma...@gmail.com wrote: Thanks for the reply Sean, Another doubt: Does the ReloadFromJDBCDataModel fit my case? Is it a all-in-memory strategy?

Re: Connection Pooling

2011-07-13 Thread Sean Owen
(for uniqueness) for user_id, item_id pairs. That way, there cannot be two entries with the same user_id, item_id. I'm not sure where to go from here. Thanks for the help! On Tue, Jul 12, 2011 at 12:47 AM, Sean Owen sro...@gmail.com wrote: You can ignore it. It just doesn't know for sure you

Re: Connection Pooling

2011-07-13 Thread Sean Owen
at 10:58 AM, Sean Owen sro...@gmail.com wrote: Yes it reloads after a configurable interval, or on demand. Clearing the cache for a user ID only means that user's data is recomputed. It's not bad to call this frequently per se... I suppose you want to let it cache as much and for as long

Re: similarity metrics?

2011-07-13 Thread Sean Owen
What's in the project now is all I know about. Yes if you want to use it with the Hadoop-based similarity calculator, that's what you would extend. How do you apply this metric to vectors? On Wed, Jul 13, 2011 at 10:09 PM, Ian Upright ian-pub...@upright.net wrote: Hello, I'm looking for more

Re: similarity metrics?

2011-07-13 Thread Sean Owen
have. Perhaps the original poster can clarify if this is what they want? [1] http://en.wikipedia.org/wiki/Hellinger_distance On Wed, Jul 13, 2011 at 2:14 PM, Sean Owen sro...@gmail.com wrote: How do you apply this metric to vectors?

Re: Connection Pooling

2011-07-12 Thread Sean Owen
You can ignore it. It just doesn't know for sure you have a pool. I believe I have even removed this in a recent refactoring. On Tue, Jul 12, 2011 at 2:21 AM, Salil Apte sa...@offlinelabs.com wrote: So I keep getting this warning from either Mahout or the server (I'm guessing the former):

Re: Build Failure in Math and Core

2011-07-11 Thread Sean Owen
That is an error in Maven then, not Mahout. Without seeing any more info it would be hard to say more. On Mon, Jul 11, 2011 at 9:04 AM, Sören Dierkes soeren.dier...@informatik.uni-oldenburg.de wrote: Hi everyone, I got a few problems, I just build mahout with mvn install and I got a few

Re: Failing unit test: testStartParallelFPGrowth

2011-07-11 Thread Sean Owen
, 2011 at 12:50 AM, Sean Owen sro...@gmail.com wrote: The tests passed at 0.5, and now at trunk. I have tested on Apple JVMs and I assume Jenkins uses a Sun JVM. I doubt it is a JVM issue, though possible. But I don't see what it has to do with being a product? On Sat, Jul 9, 2011 at 6

Re: Failing unit test: testStartParallelFPGrowth

2011-07-11 Thread Sean Owen
Yeah that's not it. It's just internal exceptions because a test failed. What were you referring to by org.apache.hadoop.security.UserGroupInformation (initialization failure) ? That sounds closer to a problem.

Re: Failing unit test: testStartParallelFPGrowth

2011-07-11 Thread Sean Owen
(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest): expected:{[D=0, E=1, A=0, B=0, C]=1} but was:{[A=0, B=0, C=1, D=0, E]=1} On Mon, Jul 11, 2011 at 12:05 PM, Sean Owen sro...@gmail.com wrote: Yeah that's not it. It's just internal exceptions because a test failed. What were you referring

Re: Failing unit test: testStartParallelFPGrowth

2011-07-11 Thread Sean Owen
I strongly suspect it is because we updated to Hadoop 0.20.203.0. 0.5 depended on 0.20.2. You could locally go back to depending on 0.20.2; it ought to still work fine with the previous version. On Mon, Jul 11, 2011 at 8:14 PM, Marc Millstone millst...@gmail.com wrote: Ok. I am starting to

Re: Failing unit test: testStartParallelFPGrowth

2011-07-11 Thread Sean Owen
. Feel free to contact me off list if this is getting to be too detail oriented, etc for this list. On Mon, Jul 11, 2011 at 12:18 PM, Sean Owen sro...@gmail.com wrote: I strongly suspect it is because we updated to Hadoop 0.20.203.0. 0.5 depended on 0.20.2. You could locally go back

Re: Failing unit test: testStartParallelFPGrowth

2011-07-11 Thread Sean Owen
. (I can't be the only person using the IBM JVM with Mahout). On Mon, Jul 11, 2011 at 12:47 PM, Sean Owen sro...@gmail.com wrote: It sounds like the actual test failure does not occur in HEAD, so sounds like whatever it is is already fixed or rendered moot. The issue about Hadoop

Re: Singular vectors of a recommendation Item-Item space

2011-07-10 Thread Sean Owen
So it sounds like you want the SVD of the item-item similarity matrix? Sure, you can use Mahout for that. If you are not in Hadoop land then look at SVDRecommender to crib some related code. It is decomposing the user-item matrix though. But for this special case of a symmetric matrix your

Re: Failing unit test: testStartParallelFPGrowth

2011-07-09 Thread Sean Owen
The tests passed at 0.5, and now at trunk. I have tested on Apple JVMs and I assume Jenkins uses a Sun JVM. I doubt it is a JVM issue, though possible. But I don't see what it has to do with being a product? On Sat, Jul 9, 2011 at 6:11 AM, Lance Norskog goks...@gmail.com wrote: Mahout releases

Re: Available datasets for recommendations

2011-07-08 Thread Sean Owen
The link is http://www.occamslab.com/petricek/data/ The KDD or Netflix data are plenty big to play with. How big is big for your purpose? On Fri, Jul 8, 2011 at 7:05 AM, web service wbs...@gmail.com wrote: Is it taken offline as well ? On Thu, Jul 7, 2011 at 10:40 PM, Alex Kozlov

Re: Broken links

2011-07-08 Thread Sean Owen
(I've just removed that old page to avoid confusion.) On Fri, Jul 8, 2011 at 1:46 PM, Maël Thomas mael.tho...@telecom-bretagne.eu wrote: Hello The page http://taste.sourceforge.net/ about Apache Mahout has multiple broken links, it could be good to update them to redirect to

Re: File format question when write map-reduce applications

2011-07-06 Thread Sean Owen
A block is a piece of a file. It does not (necessarily) have a meaning, or a file format, by itself. You would not address HDFS blocks individually from this level. So I suppose the first answer is, no, they do not have different formats, though the question is not well-formed. You can have

Re: File format question when write map-reduce applications

2011-07-06 Thread Sean Owen
Yes, but, my point is that it doesn't quite make sense to do such a thing in MapReduce. Only one mapper will see the header, but, presumably all mappers need that info. If it's a bit of metadata, pass it in the Configuration object as a String. If it's a lot, put it in the DistributedCache (or on

Re: File format question when write map-reduce applications

2011-07-06 Thread Sean Owen
I think it's just CSV, but I don't know. On Wed, Jul 6, 2011 at 11:32 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote: OK, that's why mahout need a file descriptor, and wha't the difference between CSV and UCI?

Re: Generic Recommender algorithm questions (using Mahout 0.4)

2011-07-06 Thread Sean Owen
On Wed, Jul 6, 2011 at 10:02 PM, Carlos Seminario recsysu...@gmail.com wrote: Although this is certainly a sound approach, other approaches have been suggested in the literature as cited in https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation. Can you please provide

Re: 20news

2011-07-05 Thread Sean Owen
I committed a change to make the parsing bits I found in .bayes. use space and tab. You can try again. I confess I don't know this code and there's a lot of little pieces of parsing here and there so don't know if this is the heart of the issue. On Mon, Jul 4, 2011 at 4:08 PM, Vijay Santhanam

Re: Using with seq2spars org.apache.lucene.analysis.Analyzer

2011-07-05 Thread Sean Owen
Erm, yes. What is your question? On Tue, Jul 5, 2011 at 1:30 PM, rmx ruimax...@hotmail.com wrote: Is this project still alive?? Please... Thanks -- View this message in context:

Re: 20news

2011-07-04 Thread Sean Owen
This could be my doing. I noticed that various bits of code split input files in different ways: StringTokenizer, Pattern, Splitter. And using different delimiters: space, space/tab, or the weird collection of delimiters from StringTokenizer. (BTW StringTokenizer is all but deprecated for this

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Sean Owen
Yes, this is trading memory for speed. If you can fit everything in memory, then you should. FileDataModel is in memory. MySQLJDBCDataModel is not in memory and queries the DB every time. This is pretty slow, though by caching item-item similarity as you do, a lot of the load is removed. However

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Sean Owen
Yes. Both are just fine to use in production. For speed and avoiding abuse of the database, I'd load into memory and tell it to periodically reload. But that too is a bit of a choice between how often you want to consume new data and how much work you want to do to recompute new values. On Mon,

Re: Introducing randomness into my results

2011-07-03 Thread Sean Owen
On Sun, Jul 3, 2011 at 8:05 AM, Ted Dunning ted.dunn...@gmail.com wrote: For instance, if the recommendation engine recommends B if you have seen A and there is little other way to discover C which is ranked rather low (and thus never seen), then there is no way for the engine to even get

Re: Introducing randomness into my results

2011-07-03 Thread Sean Owen
I don't see why one would believe that the randomly selected items farther down the list are more likely to engage a user. If anything, the recommender says they are less likely to be engaging. (Or put another way, by this reasoning, we ought to pick recommendations at random.) I do think that

Re: Introducing randomness into my results

2011-07-02 Thread Sean Owen
Yes, it's a good idea. Usually it serves a purpose for evaluation only. You know the relative strength of recommendations, and know how much ranking them 1st, 2nd, 3rd, etc biases the user to click on them. So you can predict how many clicks each should relatively get. And you can easily pull up

Re: Introducing randomness into my results

2011-07-02 Thread Sean Owen
is probably justified. On Sat, Jul 2, 2011 at 12:56 AM, Sean Owen sro...@gmail.com wrote: Yes, it's a good idea. Usually it serves a purpose for evaluation only. You know the relative strength of recommendations, and know how much ranking them 1st, 2nd, 3rd, etc biases the user to click on them
