Re: Which database should I use with Mahout

2013-05-23 Thread Ted Dunning
I think the simplest implementation is to just get extra results from the recommender and rescore after the rough retrieval. Integrating this into the actual scoring engine is very hard since it depends on global characteristics of the final result. The same applies to result set clustering.
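A minimal sketch of the "get extra results, then rescore" idea, assuming each candidate carries a group label (for example a category or brand) that can be used to discourage floods of near-identical items. The function name `rescore_for_diversity`, the `penalty` parameter, and the multiplicative discount are all illustrative assumptions, not something specified in the thread.

```python
from typing import Dict, List, Tuple

# (item_id, score from the rough retrieval, group label); all hypothetical fields
Candidate = Tuple[str, float, str]


def rescore_for_diversity(candidates: List[Candidate], k: int,
                          penalty: float = 0.5) -> List[Candidate]:
    """Greedily pick k items, discounting groups that are already represented.

    The discount depends on what has been chosen so far, which is why it is
    applied after retrieval rather than inside the per-item scoring.
    """
    remaining = list(candidates)
    chosen: List[Candidate] = []
    seen: Dict[str, int] = {}  # group -> number of items already chosen
    while remaining and len(chosen) < k:
        # Effective score shrinks by `penalty` for each earlier pick in the same group.
        best = max(remaining, key=lambda c: c[1] * penalty ** seen.get(c[2], 0))
        chosen.append(best)
        remaining.remove(best)
        seen[best[2]] = seen.get(best[2], 0) + 1
    return chosen


if __name__ == "__main__":
    # Over-fetch five candidates, keep three.
    rough = [("a1", 1.0, "A"), ("a2", 0.9, "A"), ("a3", 0.8, "A"),
             ("b1", 0.7, "B"), ("c1", 0.5, "C")]
    print([item for item, _, _ in rescore_for_diversity(rough, k=3)])
```

Because each pick changes the effective score of the remaining candidates, the result depends on the whole list, which matches the point that this is hard to fold into the scoring engine itself.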

Re: Which database should I use with Mahout

2013-05-22 Thread Ted Dunning
On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Thanks for the list...as a non native speaker I got problems understanding the meaning of dithering here. Sorry about that. Your English is good enough that I hadn't noticed any deficit. Dithering is

Re: Which database should I use with Mahout

2013-05-22 Thread Ted Dunning
Yes what you are describing with diversification is something that I have called anti-flood. It comes from the fact that we really are optimizing a portfolio of recommendations rather than a batch of independent recommendations. Doing this from first principles is very hard but there are very

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
Johannes, Your summary is good. I would add that the precalculated recommendations can be large enough that the lookup becomes more expensive. Your point about staleness is very on-point. On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte johannes.schu...@gmail.com wrote: I think Pat is

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
Thanks! Could you also add how to learn the weights you talked about, or at least a hint? Learning weights for search engine query terms always sounds like learning to rank to me, but this always seemed pretty complicated and I never managed to try it out. On Tue, May 21, 2013 at 8:01 AM, Ted

Re: Which database should I use with Mahout

2013-05-21 Thread Pat Ferrel
In the interest of getting some empirical data out about various architectures: On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel pat.fer...@gmail.com wrote: ... You use the user history vector as a query? The most recent suffix of the history vector. How much is used varies by the purpose. We

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
I have so far just used the weights that Solr applies natively. In my experience, what makes a recommendation engine work better is, in order of importance, a) dithering so that you gather wider data b) using multiple sources of input c) returning results quickly and reliably d) the actual

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
Inline On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel p...@occamsmachete.com wrote: In the interest of getting some empirical data out about various architectures: On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel pat.fer...@gmail.com wrote: ... You use the user history vector as a query?

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
Thanks for the list...as a non native speaker I got problems understanding the meaning of dithering here. I got the feeling that somewhere between a) and d) there is also diversification of items in the recommendation list, so increasing the distance between the list items according to some

Re: Which database should I use with Mahout

2013-05-20 Thread Pat Ferrel
I certainly have questions about this architecture mentioned below but first let me make sure I understand. You use the user history vector as a query? This will be a list of item IDs and strength-of-preference values (maybe 1s for purchases). The cooccurrence matrix has columns treated like
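The "history vector as a query" idea can be sketched roughly as follows, assuming binary preferences (the "1s for purchases" case) and a plain dictionary standing in for the search index. The names `indicators` and `recommend_from_history`, and the overlap-count scoring, are illustrative assumptions; a real search engine would apply its own term weighting.

```python
from typing import Dict, List, Set


def recommend_from_history(indicators: Dict[str, Set[str]], history: List[str],
                           recent: int = 3, top_n: int = 2) -> List[str]:
    """Score each item by how many recent history items appear in its indicator set.

    `indicators[item]` is assumed to hold the items that cooccur with `item`,
    playing the role of an indexed field; the history suffix plays the query.
    """
    query = set(history[-recent:])   # only the most recent suffix is used
    seen = set(history)              # do not recommend what the user already has
    scores: Dict[str, int] = {}
    for item, related in indicators.items():
        if item in seen:
            continue
        overlap = len(query & related)
        if overlap:
            scores[item] = overlap
    # Highest overlap first; ties broken by item ID for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


if __name__ == "__main__":
    indicators = {
        "tent": {"stove", "boots"},
        "stove": {"tent", "boots"},
        "boots": {"tent", "stove"},
        "lamp": {"tent"},
        "novel": {"bookmark"},
    }
    print(recommend_from_history(indicators, ["boots", "tent"]))
```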

Re: Which database should I use with Mahout

2013-05-20 Thread Ken Krugler
Hi Pat, On May 20, 2013, at 9:46am, Pat Ferrel wrote: I certainly have questions about this architecture mentioned below but first let me make sure I understand. You use the user history vector as a query? This will be a list of item IDs and strength-of-preference values (maybe 1s for

Re: Which database should I use with Mahout

2013-05-20 Thread Ted Dunning
Inline answers. On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel pat.fer...@gmail.com wrote: ... You use the user history vector as a query? The most recent suffix of the history vector. How much is used varies by the purpose. This will be a list of item IDs and strength-of-preference values

Re: Which database should I use with Mahout

2013-05-20 Thread Johannes Schulte
I think Pat is just saying that time(history_lookup) (1) + time(recommendation_calculation) (2) > time(precalc_lookup) (3), since 1 and 3 are assumed to be served by the same system class (key value store, db) with a single key and 2 > 0. Ted is using a lot of information that is available at

Which database should I use with Mahout

2013-05-19 Thread Ahmet Ylmaz
Hi, I would like to use Mahout to make recommendations on my web site. Since the data is going to be big, hopefully, I plan to use hadoop implementations of the recommender algorithms. I'm currently storing the data in mysql. Should I continue with it or should I switch to a nosql database

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway and cache it there. So, whatever is easiest for you. The

Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Thanks Sean, but I could not get your answer. Can you please explain it again? On Sun, May 19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote: It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially, into memory. And in that case, it makes no difference where the data is being read from,
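As a rough illustration of reading everything once, serially, into memory, the sketch below loads `user,item,value` lines into nested dictionaries. The CSV layout and the name `load_preferences` are assumptions made for the example; the source of the lines (a file, a database dump, an export from HDFS) does not matter to the loader.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def load_preferences(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Read user,item,value records in one serial pass into an in-memory map."""
    prefs: Dict[str, Dict[str, float]] = defaultdict(dict)
    for user_id, item_id, value in csv.reader(lines):
        prefs[user_id][item_id] = float(value)
    return dict(prefs)


if __name__ == "__main__":
    # Any iterable of text lines works: an open file, a dump, a list.
    dump = ["u1,i1,5.0", "u1,i2,3.0", "u2,i1,4.0"]
    model = load_preferences(dump)
    print(model["u1"])  # later lookups touch memory only, never the database
```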

Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
ok, got it, thanks. On Sun, May 19, 2013 at 8:20 PM, Sean Owen sro...@gmail.com wrote: I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially,

Re: Which database should I use with Mahout

2013-05-19 Thread Manuel Blechschmidt
Hi Tevfik, one request to the recommender could become more than 1000 queries to the database depending on which recommender you use and the amount of preferences for the given user. The problem is not if you are using SQL, NoSQL, or any other query language. The problem is the latency of the
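To make the latency point concrete, here is a back-of-the-envelope calculation. The 1 ms per-query round trip is an assumed figure chosen only for illustration, and the queries are assumed to run one after another.

```python
def request_latency_ms(queries: int, per_query_ms: float) -> float:
    """Total latency of one recommendation request issuing sequential queries."""
    return queries * per_query_ms


if __name__ == "__main__":
    # 1000 sequential queries at an assumed 1 ms round trip each.
    total = request_latency_ms(1000, 1.0)
    print(f"{total:.0f} ms per recommendation request")
```

Under these assumptions a single request already costs about a second, regardless of which database or query language sits behind it.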

Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Hi Manuel, But if one uses matrix factorization and stores the user and item factors in memory then there will be no database access during recommendation. I thought that the original question was where to store the data and how to give it to hadoop. On Sun, May 19, 2013 at 9:01 PM, Manuel
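A small sketch of serving recommendations from factors held in memory, assuming the factorization has already been computed elsewhere. The dictionaries `user_factors` and `item_factors` and the function `top_items` are hypothetical names; scoring is a plain dot product, with no database access at request time.

```python
from typing import Dict, List


def top_items(user_factors: Dict[str, List[float]],
              item_factors: Dict[str, List[float]],
              user_id: str, n: int = 2) -> List[str]:
    """Rank items for one user by the dot product of user and item factors."""
    u = user_factors[user_id]
    scores = {item: sum(a * b for a, b in zip(u, v))
              for item, v in item_factors.items()}
    # Highest score first; ties broken by item ID for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


if __name__ == "__main__":
    user_factors = {"u1": [1.0, 0.0]}
    item_factors = {"i1": [0.9, 0.1], "i2": [0.1, 0.9], "i3": [0.5, 0.5]}
    print(top_items(user_factors, item_factors, "u1"))
```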

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I think everyone is agreeing that it is essential to only access information in memory at run-time, yes, whatever that info may be. I don't think the original question was about Hadoop, but, the answer is the same: Hadoop mappers are just reading the input serially. There is no advantage to a

Re: Which database should I use with Mahout

2013-05-19 Thread Ahmet Ylmaz
  From: Sean Owen sro...@gmail.com To: Mahout User List user@mahout.apache.org Sent: Sunday, May 19, 2013 9:26 PM Subject: Re: Which database should I use with Mahout I think everyone is agreeing that it is essential to only access information in memory at run-time, yes, whatever that info may

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
for showing the past ratings of a user. Ahmet From: Sean Owen sro...@gmail.com To: Mahout User List user@mahout.apache.org Sent: Sunday, May 19, 2013 9:26 PM Subject: Re: Which database should I use with Mahout I think everyone is agreeing

Re: Which database should I use with Mahout

2013-05-19 Thread Manuel Blechschmidt
Hi Tevfik, I am working with mysql but I would guess that HDFS like Sean suggested would be a good idea as well. There is also a project called sqoop which can be used to transfer data from relational databases to Hadoop. http://sqoop.apache.org/ Scribe might also be an option for transferring

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think which uses factorization to get much smaller models and so can calculate the recs at runtime for fairly large user sets. However if you are using

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
(I had in mind non distributed parts of Mahout but the principle is similar, yes.) On May 19, 2013 6:27 PM, Pat Ferrel pat.fer...@gmail.com wrote: Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 6:26 PM, Pat Ferrel pat.fer...@gmail.com wrote: Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think which uses factorization to get much smaller models and so can calculate the

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Ah, which for completeness, brings up another scaling issue with Mahout. The in-memory Mahout recommenders do not pre-calculate all users' recs. They keep the preference matrix in-memory and calculate the recommendations at runtime. At some point the size of your data will max a single machine.

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel p...@occamsmachete.com wrote: Two basic solutions to this are: factorize (reduces 100s of thousands of items to hundreds of 'features') and continue to calculate recs at runtime, which you have to do with Myrrix since mahout does not have an

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Won't argue with how fast Solr is. It's another fast and scalable lookup engine and another option. Especially if you don't need to look up anything else by user, in which case you are back to a db... Using a cooccurrence matrix means you are doing item similarity since there is no user data in

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 8:34 PM, Pat Ferrel p...@occamsmachete.com wrote: Won't argue with how fast Solr is. It's another fast and scalable lookup engine and another option. Especially if you don't need to look up anything else by user, in which case you are back to a db... But remember, it