(I had in mind the non-distributed parts of Mahout, but the principle is similar, yes.)

On May 19, 2013 6:27 PM, "Pat Ferrel" <pat.fer...@gmail.com> wrote:
> Using a Hadoop version of a Mahout recommender will create some number of
> recs for all users as its output. Sean is talking about Myrrix, I think,
> which uses factorization to get much smaller models and so can calculate
> the recs at runtime for fairly large user sets.
>
> However, if you are using Mahout and Hadoop, the question is how to store
> and look up recommendations in the quickest scalable way. You will have a
> user ID, and perhaps an item ID, as a key to the list of recommendations.
> The fastest thing to do is have a hashmap in memory, perhaps read in from
> HDFS. Remember that Mahout will output the recommendations with internal
> Mahout IDs, so you will have to replace these in the data with your actual
> user and item IDs.
>
> I use a NoSQL DB, either MongoDB or Cassandra, but others are fine too,
> even MySQL if you can scale it to meet your needs. I end up with two
> tables: one has my user ID as a key and recommendations with my item IDs,
> either ordered or with strengths. The second table has my item ID as the
> key with a list of similar items (again sorted or with strengths). At
> runtime I may have both a user ID and an item ID as context, so I get a
> list from both tables and combine them at runtime.
>
> I use a DB for many reasons and let it handle the caching. I never need
> to worry about memory management. If you have scaled your DB properly,
> the lookups will actually be executed like an in-memory hashmap with
> indexed keys for IDs. Scaling the DB can be done as your user base grows,
> when needed, without affecting the rest of the calculation pipeline. Yes,
> there will be overhead due to network traffic in a cluster, but the
> flexibility is worth it for me. If high availability is important, you
> can spread your DB cluster over multiple data centers without affecting
> the API for serving recommendations. I set up the recommendation
> calculation to run continuously in the background, replacing values in
> the two tables as fast as I can. This allows you to scale update speed
> (how many machines in the Mahout/Hadoop cluster) independently from
> lookup performance scaling (how many machines in your DB cluster, how
> much memory the DB machines have).
>
> On May 19, 2013, at 11:45 AM, Manuel Blechschmidt <
> manuel.blechschm...@gmx.de> wrote:
>
> Hi Tevfik,
> I am working with MySQL, but I would guess that HDFS, like Sean
> suggested, would be a good idea as well.
>
> There is also a project called Sqoop which can be used to transfer data
> from relational databases to Hadoop:
>
> http://sqoop.apache.org/
>
> Scribe might also be an option for transferring a lot of data:
> https://github.com/facebook/scribe#readme
>
> I would suggest that you just start with the technology that you know
> best, and then solve the problems as soon as you get them.
>
> /Manuel
>
> On May 19, 2013, at 8:26 PM, Sean Owen wrote:
>
> > I think everyone is agreeing that it is essential to only access
> > information in memory at run-time, yes, whatever that info may be.
> > I don't think the original question was about Hadoop, but the answer
> > is the same: Hadoop mappers are just reading the input serially. There
> > is no advantage to a relational database or NoSQL database; they're
> > just overkill. HDFS is sufficient, and probably even the best of these
> > at allowing fast serial access to the data.
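A minimal sketch of the two-table serving layer Pat describes above, under stated assumptions: plain HashMaps stand in for the MongoDB/Cassandra tables, the Hadoop RecommenderJob is assumed to have written textual output of the form "userID<TAB>[itemID:score,...]" to HDFS, and the userIdMap/itemIdMap dictionaries (hypothetical, built elsewhere) translate internal Mahout IDs back to application IDs.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecLookup {

  // "Table" 1: application user ID -> ranked recommended item IDs.
  private final Map<String, List<String>> userRecs =
      new HashMap<String, List<String>>();
  // "Table" 2: application item ID -> ranked similar item IDs. Would be
  // populated the same way from item-similarity output; omitted here.
  private final Map<String, List<String>> similarItems =
      new HashMap<String, List<String>>();

  // Read RecommenderJob output from HDFS once, translating internal
  // Mahout IDs back to application IDs as Pat suggests.
  public void loadUserRecs(String hdfsPath,
                           Map<Long, String> userIdMap,
                           Map<Long, String> itemIdMap) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path(hdfsPath)), "UTF-8"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        String userId = userIdMap.get(Long.parseLong(parts[0]));
        String body = parts[1].substring(1, parts[1].length() - 1); // strip [ ]
        List<String> items = new ArrayList<String>();
        for (String pair : body.split(",")) {
          long internalItemId = Long.parseLong(pair.split(":")[0]);
          items.add(itemIdMap.get(internalItemId));
        }
        userRecs.put(userId, items);
      }
    } finally {
      in.close();
    }
  }

  // Runtime lookup: merge the user's precomputed recs with items similar
  // to the current context item, keeping rank order and dropping duplicates.
  public List<String> recommend(String userId, String contextItemId, int howMany) {
    Set<String> merged = new LinkedHashSet<String>();
    List<String> fromUser = userRecs.get(userId);
    if (fromUser != null) {
      merged.addAll(fromUser);
    }
    List<String> fromItem = similarItems.get(contextItemId);
    if (fromItem != null) {
      merged.addAll(fromItem);
    }
    List<String> result = new ArrayList<String>(merged);
    return result.subList(0, Math.min(howMany, result.size()));
  }
}

In production the two maps become two keyed tables, which is exactly why the DB lookups behave like an indexed hashmap when scaled properly.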
> >
> > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> > <tevfik.ayte...@gmail.com> wrote:
> >> Hi Manuel,
> >> But if one uses matrix factorization and stores the user and item
> >> factors in memory, then there will be no database access during
> >> recommendation.
> >> I thought that the original question was where to store the data and
> >> how to give it to Hadoop.
> >>
> >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
> >> <manuel.blechschm...@gmx.de> wrote:
> >>> Hi Tevfik,
> >>> one request to the recommender can become more than 1,000 queries to
> >>> the database, depending on which recommender you use and the amount
> >>> of preferences for the given user.
> >>>
> >>> The problem is not whether you are using SQL, NoSQL, or any other
> >>> query language. The problem is the latency of the answers.
> >>>
> >>> An average TCP packet in the same data center takes 500 µs; a main
> >>> memory reference takes 0.1 µs. This means that the main memory of
> >>> your Java process can be accessed 5,000 times faster than any other
> >>> process, such as a database connected via TCP/IP.
> >>>
> >>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
> >>>
> >>> Here you can see a screenshot that shows that database communication
> >>> is by far (99%) the slowest component of a recommender request:
> >>>
> >>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
> >>>
> >>> If you do not want to cache your data in your Java process, you can
> >>> use a completely in-memory database technology like SAP HANA
> >>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
> >>>
> >>> Nevertheless, if you are using these, you do not need Mahout anymore.
> >>>
> >>> An architecture of a Mahout system can be seen here:
> >>>
> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
> >>>
> >>> Hope that helps
> >>> Manuel
> >>>
> >>> On May 19, 2013, at 7:20 PM, Sean Owen wrote:
> >>>
> >>>> I'm first saying that you really don't want to use the database as a
> >>>> data model directly. It is far too slow.
> >>>> Instead you want to use a data model implementation that reads all of
> >>>> the data, once, serially, into memory. And in that case, it makes no
> >>>> difference where the data is being read from, because it is read just
> >>>> once, serially. A file is just as fine as a fancy database. In fact,
> >>>> it's probably easier and faster.
> >>>>
> >>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
> >>>> <tevfik.ayte...@gmail.com> wrote:
> >>>>> Thanks Sean, but I could not get your answer. Can you please
> >>>>> explain it again?
> >>>>>
> >>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen <sro...@gmail.com> wrote:
> >>>>>> It doesn't matter, in the sense that it is never going to be fast
> >>>>>> enough for real time at any reasonable scale if actually run off a
> >>>>>> database directly. One operation results in thousands of queries.
> >>>>>> It's going to read data into memory anyway and cache it there. So,
> >>>>>> whatever is easiest for you. The simplest solution is a file.
> >>>>>>
> >>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
> >>>>>> <ahmetyilmazefe...@yahoo.com> wrote:
> >>>>>>> Hi,
> >>>>>>> I would like to use Mahout to make recommendations on my web site.
> >>>>>>> Since the data is going to be big (hopefully), I plan to use the
> >>>>>>> Hadoop implementations of the recommender algorithms.
> >>>>>>>
> >>>>>>> I'm currently storing the data in MySQL. Should I continue with
> >>>>>>> it, or should I switch to a NoSQL database such as MongoDB, or
> >>>>>>> something else?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Ahmet
> >>>
> >>> --
> >>> Manuel Blechschmidt
> >>> M.Sc. IT Systems Engineering
> >>> Dortustr. 57
> >>> 14467 Potsdam
> >>> Mobil: 0173/6322621
> >>> Twitter: http://twitter.com/Manuel_B
>
> --
> Manuel Blechschmidt
> M.Sc. IT Systems Engineering
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
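To make Sean's "read once, serially, into memory" point concrete: a minimal non-distributed sketch using Mahout's Taste API, assuming a hypothetical ratings.csv of "userID,itemID,preference" lines. The file is loaded a single time when FileDataModel is constructed; after that, every recommend() call, however many similarity lookups it triggers, hits main memory rather than a database over TCP/IP.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class InMemoryRecommenderExample {
  public static void main(String[] args) throws Exception {
    // Read the whole data set once, serially, into memory.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // The "thousands of queries" one recommendation implies are now
    // main-memory references, not round-trips to a database.
    List<RecommendedItem> recs = recommender.recommend(1L, 5);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}

The same reasoning explains Manuel's numbers: at roughly 500 µs per in-datacenter round-trip versus 0.1 µs per memory reference, keeping the model in the Java process is about 5,000 times faster per access.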