Ah, which, for completeness, brings up another scaling issue with Mahout. The in-memory Mahout recommenders do not pre-calculate every user's recs; they keep the preference matrix in memory and calculate recommendations at runtime. At some point the size of your data will max out a single machine. In my experience CPU gets maxed before memory does. I began to hit performance limits at around 200,000 items and 1M users.
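To make that concrete, this is roughly the pattern I mean, as a sketch only (not code from this thread), using the Taste API; the file name and user ID are made up:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class InMemoryRecDemo {
  public static void main(String[] args) throws Exception {
    // The whole preference matrix is loaded into memory here...
    DataModel model = new FileDataModel(new File("prefs.csv")); // userID,itemID,value
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // ...and every call below does the similarity math at request time,
    // which is where CPU becomes the bottleneck as items and users grow.
    List<RecommendedItem> recs = recommender.recommend(12345L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " " + rec.getValue());
    }
  }
}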
Two basic solutions to this are: factorize (reducing hundreds of thousands of items to hundreds of 'features') and continue to calculate recs at runtime, which you have to do with Myrrix since Mahout does not have an in-memory ALS implementation, or move to the Mahout Hadoop recommenders and pre-calculate recs. (Rough sketches of the pre-calculated lookup and of reading a database once into memory follow the quoted thread below.)

On May 19, 2013, at 6:34 PM, Sean Owen <[email protected]> wrote:

(I had in mind the non-distributed parts of Mahout, but the principle is similar, yes.)

On May 19, 2013 6:27 PM, "Pat Ferrel" <[email protected]> wrote:

> Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think, which uses factorization to get much smaller models and so can calculate the recs at runtime for fairly large user sets.
>
> However, if you are using Mahout and Hadoop, the question is how to store and look up recommendations in the quickest scalable way. You will have a user ID and perhaps an item ID as a key to the list of recommendations. The fastest thing to do is to have a hashmap in memory, perhaps read in from HDFS. Remember that Mahout will output the recommendations with internal Mahout IDs, so you will have to replace these in the data with your actual user and item IDs.
>
> I use a NoSQL DB, either MongoDB or Cassandra, but others are fine too, even MySQL if you can scale it to meet your needs. I end up with two tables: one has my user ID as a key and recommendations with my item IDs, either ordered or with strengths. The second table has my item ID as the key with a list of similar items (again sorted or with strengths). At runtime I may have both a user ID and an item ID context, so I get a list from both tables and combine them at runtime.
>
> I use a DB for many reasons and let it handle the caching. I never need to worry about memory management. If you have scaled your DB properly, the lookups will actually be executed like an in-memory hashmap with indexed keys for IDs. Scaling the DB can be done as your user base grows, when needed, without affecting the rest of the calculation pipeline. Yes, there will be overhead due to network traffic in a cluster, but the flexibility is worth it for me. If high availability is important, you can spread your DB cluster over multiple data centers without affecting the API for serving recommendations. I set up the recommendation calculation to run continuously in the background, replacing values in the two tables as fast as I can. This allows you to scale update speed (how many machines in the Mahout/Hadoop cluster) independently from lookup-performance scaling (how many machines in your DB cluster, how much memory the DB machines have).
>
> On May 19, 2013, at 11:45 AM, Manuel Blechschmidt <[email protected]> wrote:
>
> Hi Tevfik,
> I am working with MySQL, but I would guess that HDFS, as Sean suggested, would be a good idea as well.
>
> There is also a project called Sqoop which can be used to transfer data from relational databases to Hadoop.
>
> http://sqoop.apache.org/
>
> Scribe might also be an option for transferring a lot of data:
> https://github.com/facebook/scribe#readme
>
> I would suggest that you just start with the technology you know best and then solve problems as soon as you run into them.
>
> /Manuel
>
> On May 19, 2013, at 8:26 PM, Sean Owen wrote:
>
>> I think everyone is agreeing that it is essential to only access information in memory at run-time, yes, whatever that info may be.
>> I don't think the original question was about Hadoop, but the answer is the same: Hadoop mappers are just reading the input serially. There is no advantage to a relational database or NoSQL database; they're just overkill. HDFS is sufficient, and probably even the best of these at allowing fast serial access to the data.
>>
>> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin <[email protected]> wrote:
>>> Hi Manuel,
>>> But if one uses matrix factorization and stores the user and item factors in memory, then there will be no database access during recommendation.
>>> I thought that the original question was where to store the data and how to give it to Hadoop.
>>>
>>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt <[email protected]> wrote:
>>>> Hi Tevfik,
>>>> One request to the recommender can become more than 1,000 queries to the database, depending on which recommender you use and the number of preferences for the given user.
>>>>
>>>> The problem is not whether you are using SQL, NoSQL, or any other query language. The problem is the latency of the answers.
>>>>
>>>> An average TCP round trip within the same data center takes about 500 µs; a main memory reference takes about 0.1 µs. This means the main memory of your Java process can be accessed roughly 5,000 times faster than any other process, such as a database connected via TCP/IP.
>>>>
>>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>>>
>>>> Here you can see a screenshot showing that database communication is by far (99%) the slowest component of a recommender request:
>>>>
>>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>>>
>>>> If you do not want to cache your data in your Java process, you can use a completely in-memory database technology like SAP HANA http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>>>
>>>> Nevertheless, if you are using these, you do not need Mahout anymore.
>>>>
>>>> An architecture of a Mahout system can be seen here:
>>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>>>
>>>> Hope that helps
>>>> Manuel
>>>>
>>>> On May 19, 2013, at 7:20 PM, Sean Owen wrote:
>>>>
>>>>> I'm first saying that you really don't want to use the database as a data model directly. It is far too slow.
>>>>> Instead you want to use a data model implementation that reads all of the data, once, serially, into memory. And in that case, it makes no difference where the data is being read from, because it is read just once, serially. A file is just as fine as a fancy database. In fact it's probably easier and faster.
>>>>>
>>>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin <[email protected]> wrote:
>>>>>> Thanks Sean, but I did not get your answer. Can you please explain it again?
>>>>>>
>>>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen <[email protected]> wrote:
>>>>>>> It doesn't matter, in the sense that it is never going to be fast enough for real time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read the data into memory anyway and cache it there. So, whatever is easiest for you. The simplest solution is a file.
>>>>>>>
>>>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>> I would like to use Mahout to make recommendations on my web site.
>>>>>>>> Since the data is going to be big (hopefully), I plan to use the Hadoop implementations of the recommender algorithms.
>>>>>>>>
>>>>>>>> I'm currently storing the data in MySQL. Should I continue with it, or should I switch to a NoSQL database such as MongoDB or something else?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ahmet
>>>>
>>>> --
>>>> Manuel Blechschmidt
>>>> M.Sc. IT Systems Engineering
>>>> Dortustr. 57
>>>> 14467 Potsdam
>>>> Mobil: 0173/6322621
>>>> Twitter: http://twitter.com/Manuel_B
>
> --
> Manuel Blechschmidt
> M.Sc. IT Systems Engineering
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
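As mentioned before the quoted thread, here is a rough sketch (mine, not code from the thread) of the pre-calculated lookup pattern described above: two tables, one keyed by user ID holding that user's recs and one keyed by item ID holding similar items, fetched and combined at request time. The class, field, and method names are invented, and plain HashMaps stand in for whatever store (MongoDB, Cassandra, MySQL) actually holds the tables:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrecomputedLookup {

  // One recommended or similar item with a strength, as stored in either table.
  public static class Scored {
    public final String itemId;
    public final double strength;
    public Scored(String itemId, double strength) {
      this.itemId = itemId;
      this.strength = strength;
    }
  }

  // Stand-ins for the two tables: user ID -> recs, item ID -> similar items.
  // In practice these are filled from the Hadoop job's output, after
  // translating Mahout's internal IDs back to your own user and item IDs.
  private final Map<String, List<Scored>> recsByUser = new HashMap<String, List<Scored>>();
  private final Map<String, List<Scored>> similarByItem = new HashMap<String, List<Scored>>();

  // At request time there may be both a user context and an item context:
  // fetch both lists, sum strengths for items appearing in both, sort, cut.
  public List<String> recommend(String userId, String itemId, int howMany) {
    Map<String, Double> combined = new HashMap<String, Double>();
    addAll(combined, recsByUser.get(userId));
    addAll(combined, similarByItem.get(itemId));

    List<Map.Entry<String, Double>> ranked =
        new ArrayList<Map.Entry<String, Double>>(combined.entrySet());
    Collections.sort(ranked, (a, b) -> Double.compare(b.getValue(), a.getValue()));

    List<String> result = new ArrayList<String>();
    for (int i = 0; i < ranked.size() && i < howMany; i++) {
      result.add(ranked.get(i).getKey());
    }
    return result;
  }

  private static void addAll(Map<String, Double> combined, List<Scored> scored) {
    if (scored == null) {
      return;
    }
    for (Scored s : scored) {
      Double current = combined.get(s.itemId);
      combined.put(s.itemId, current == null ? s.strength : current + s.strength);
    }
  }
}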

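And on the original MySQL question, plus Sean's point about reading the data once, serially, into memory: a sketch of keeping MySQL as storage while serving from memory, assuming the mahout-integration classes MySQLJDBCDataModel and ReloadFromJDBCDataModel plus the MySQL Connector/J DataSource are on the classpath. The connection settings and table/column names here are invented, so treat this as an outline rather than a drop-in implementation:

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.JDBCDataModel;

public class MySqlOnceIntoMemory {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details.
    MysqlDataSource dataSource = new MysqlDataSource();
    dataSource.setServerName("localhost");
    dataSource.setDatabaseName("recs");
    dataSource.setUser("mahout");
    dataSource.setPassword("secret");

    // The JDBC model alone would query MySQL on nearly every call...
    JDBCDataModel jdbcModel = new MySQLJDBCDataModel(
        dataSource, "taste_preferences", "user_id", "item_id", "preference", "timestamp");

    // ...so wrap it: this reads the whole table once and serves everything
    // from memory afterwards, which is the one-time serial read Sean describes.
    DataModel inMemory = new ReloadFromJDBCDataModel(jdbcModel);

    System.out.println("Users loaded into memory: " + inMemory.getNumUsers());
  }
}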