Which database should I use with Mahout

2013-05-19 Thread Ahmet Ylmaz
Hi, I would like to use Mahout to make recommendations on my web site. Since the data is, hopefully, going to be big, I plan to use the Hadoop implementations of the recommender algorithms. I'm currently storing the data in MySQL. Should I continue with it or should I switch to a NoSQL database

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway and cache it there. So, whatever is easiest for you. The

Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Thanks Sean, but I could not get your answer. Can you please explain it again? On Sun, May 19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote: It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially, into memory. And in that case, it makes no difference where the data is being read from,
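
A minimal sketch of the "read all of the data, once, serially, into memory" idea described above. It is plain Python rather than Mahout's own data model API, and the `load_preferences` helper and the `user,item,rating` CSV layout are assumptions made purely for illustration. The point is only that the source (a MySQL dump, an export from a NoSQL store, a flat file) is touched once, and every later lookup is answered from memory.

```python
import csv
import io
from collections import defaultdict


def load_preferences(lines):
    """Read (user, item, rating) rows once, serially, into nested dicts.

    Hypothetical helper for illustration; not part of Mahout.
    """
    prefs = defaultdict(dict)
    for user, item, rating in csv.reader(lines):
        prefs[user][item] = float(rating)
    return dict(prefs)


# Hypothetical export; in practice this could be dumped from any database.
export = io.StringIO("u1,i1,5.0\nu1,i2,3.0\nu2,i1,4.0\n")
prefs = load_preferences(export)

# Later lookups never go back to the original data source.
print(prefs["u1"])
```

Because the load is a single serial pass, the choice of storage behind the export mostly affects convenience, not recommendation speed.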

Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
ok, got it, thanks. On Sun, May 19, 2013 at 8:20 PM, Sean Owen sro...@gmail.com wrote: I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially,

Re: Which database should I use with Mahout

2013-05-19 Thread Manuel Blechschmidt
Hi Tevfik, one request to the recommender could become more than 1000 queries to the database, depending on which recommender you use and the number of preferences for the given user. The problem is not whether you are using SQL, NoSQL, or any other query language. The problem is the latency of the
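
To make the latency point concrete, here is a back-of-the-envelope calculation in Python. The figure of 1000 queries per request comes from the message above; the per-query round-trip times are assumed values chosen for illustration, not measurements, and `request_latency_ms` is a hypothetical helper.

```python
def request_latency_ms(num_queries, round_trip_ms):
    """Total waiting time if the queries are issued one after another."""
    return num_queries * round_trip_ms


queries_per_recommendation = 1000  # figure mentioned in the thread

# Assumed round-trip times in milliseconds (illustrative only).
scenarios = [
    ("database over the network", 1.0),
    ("in-memory lookup", 0.001),
]

for label, round_trip_ms in scenarios:
    total = request_latency_ms(queries_per_recommendation, round_trip_ms)
    print(f"{label}: {total:.1f} ms per recommendation request")
```

Under these assumptions the per-query round trip, multiplied by the query count, dominates the request time regardless of which kind of database sits behind it.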

Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Hi Manuel, But if one uses matrix factorization and stores the user and item factors in memory, then there will be no database access during recommendation. I thought that the original question was where to store the data and how to give it to Hadoop. On Sun, May 19, 2013 at 9:01 PM, Manuel
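
A small sketch of what "user and item factors in memory" can look like at recommendation time. This is plain Python, not Mahout or Myrrix code; the factor values, the `seen` sets, and the `recommend` helper are all made up for illustration. Scoring is a dot product between the user's factor vector and each item's factor vector, so no database is consulted while recommending.

```python
def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_id, user_factors, item_factors, seen, top_n=2):
    """Rank items the user has not seen by factor dot product (hypothetical helper)."""
    u = user_factors[user_id]
    already_seen = seen.get(user_id, set())
    scores = {
        item: dot(u, factors)
        for item, factors in item_factors.items()
        if item not in already_seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up factors; a real model would come from a factorization job.
user_factors = {"u1": [1.0, 0.0]}
item_factors = {"i1": [0.9, 0.1], "i2": [0.2, 0.8], "i3": [0.5, 0.5]}
seen = {"u1": {"i1"}}

print(recommend("u1", user_factors, item_factors, seen))
```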

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I think everyone is agreeing that it is essential to only access information in memory at run-time, yes, whatever that info may be. I don't think the original question was about Hadoop, but, the answer is the same: Hadoop mappers are just reading the input serially. There is no advantage to a

Re: Which database should I use with Mahout

2013-05-19 Thread Ahmet Ylmaz
Hi Sean, If I understood you correctly, you are saying that I will not need MySQL. But if I store my data on HDFS, will I be able to make fast queries such as "return all the ratings of a specific user", which will be needed for showing the past ratings of a user? Ahmet

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
(Oh, by the way, I realize the original question was about Hadoop. I can't read carefully.) No, HDFS is not good for anything like random access. For input, that's OK, because you don't need random access. So HDFS is just fine. For output, if you are going to then serve these precomputed results

Re: Which database should I use with Mahout

2013-05-19 Thread Manuel Blechschmidt
Hi Tevfik, I am working with MySQL, but I would guess that HDFS, as Sean suggested, would be a good idea as well. There is also a project called Sqoop which can be used to transfer data from relational databases to Hadoop. http://sqoop.apache.org/ Scribe might also be an option for transferring

mahout colt collections

2013-05-19 Thread Sophie Sperner
Dear, I'm experiencing difficulties with the hppc <http://labs.carrotsearch.com/hppc.html> library that I'm using. My algorithms work perfectly fine for small inputs, but when I move to an Amazon machine and want to compute larger inputs, my code hangs forever as a result of some hidden bugs in that

Re: mahout colt collections

2013-05-19 Thread Stevo Slavić
Hello Sophie, Mahout 0.7 Math module is available on Maven Central repository: http://repo1.maven.org/maven2/org/apache/mahout/mahout-math/0.7/ Besides jar with binaries there is also a javadoc and sources jar. I've just counted, since 0.7 release there have been 60 commits which included math

Re: mahout colt collections

2013-05-19 Thread Sophie Sperner
Dear Stevo, At this link https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/package-summary.html there are no OpenIntHashSet or OpenIntIntHashMap classes, or any with similar names. Do they exist there? Thank you for your reply, Best wishes On 19 May 2013 22:50, Stevo Slavić

Re: mahout colt collections

2013-05-19 Thread Sophie Sperner
I found it here, seems okay now. That link is just strange. On 19 May 2013 23:15, Sophie Sperner sophie.sper...@gmail.com wrote: Dear Stevo, By this link https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/package-summary.html there is no OpenIntHashSet or

Re: mahout colt collections

2013-05-19 Thread Stevo Slavić
They do, but it seems javadoc generation is not configured well: it doesn't generate a report for generated sources. The classes are org.apache.mahout.math.set.OpenIntHashSet and org.apache.mahout.math.map.OpenIntIntHashMap. On Mon, May 20, 2013 at 12:15 AM, Sophie Sperner sophie.sper...@gmail.com wrote: Dear Stevo,

Re: mahout colt collections

2013-05-19 Thread Ted Dunning
Sophie, Can you say a bit more about what you want to do? On Sun, May 19, 2013 at 2:22 PM, Sophie Sperner sophie.sper...@gmail.com wrote: Dear, I'm experiencing difficulties with the hppc <http://labs.carrotsearch.com/hppc.html> library that I'm using. My algorithms work perfectly fine for small

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix, I think, which uses factorization to get much smaller models and so can calculate the recs at runtime for fairly large user sets. However if you are using

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
(I had in mind non distributed parts of Mahout but the principle is similar, yes.) On May 19, 2013 6:27 PM, Pat Ferrel pat.fer...@gmail.com wrote: Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 6:26 PM, Pat Ferrel pat.fer...@gmail.com wrote: Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think which uses factorization to get much smaller models and so can calculate the

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Ah, which, for completeness, brings up another scaling issue with Mahout. The in-memory Mahout recommenders do not pre-calculate all users' recs. They keep the preference matrix in memory and calculate the recommendations at runtime. At some point the size of your data will max out a single machine.

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel p...@occamsmachete.com wrote: Two basic solutions to this are: factorize (reduces 100s of thousands of items to hundreds of 'features') and continue to calculate recs at runtime, which you have to do with Myrrix since mahout does not have an

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Won't argue with how fast Solr is. It's another fast and scalable lookup engine and another option. Especially if you don't need to look up anything else by user, in which case you are back to a db... Using a cooccurrence matrix means you are doing item similarity since there is no user data in
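
A rough sketch of why a cooccurrence matrix amounts to item similarity. This is plain Python with made-up histories and a hypothetical `cooccurrence` helper, not the Mahout or Solr implementation discussed in the thread. User histories are only used to count how often two items appear together; the resulting structure is keyed by item pairs and has no user dimension.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Count, for each pair of items, how many users interacted with both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


# Made-up interaction histories, for illustration only.
histories = {
    "u1": ["i1", "i2"],
    "u2": ["i1", "i2", "i3"],
    "u3": ["i2", "i3"],
}

counts = cooccurrence(histories)

# Items that most often co-occur with i1; note that no user id appears here.
print(dict(counts["i1"]))
```

To recommend for a particular user with such a matrix, that user's own history still has to be supplied at query time, which is where a separate per-user lookup comes in.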

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 8:34 PM, Pat Ferrel p...@occamsmachete.com wrote: Won't argue with how fast Solr is, It's another fast and scalable lookup engine and another option. Especially if you don't need to lookup anything else by user, in which case you are back to a db... But remember, it