(I had in mind the non-distributed parts of Mahout, but the principle is similar, yes.)

On May 19, 2013 6:27 PM, "Pat Ferrel" <pat.fer...@gmail.com> wrote:
> Using a Hadoop version of a Mahout recommender will create some number of
> recs for all users as its output. Sean is talking about Myrrix, I think,
> which uses factorization to get much smaller models and so can calculate
> the recs at runtime for fairly large user sets.
>
> However, if you are using Mahout and Hadoop, the question is how to store
> and look up recommendations in the quickest scalable way. You will have a
> user ID, and perhaps an item ID, as a key to the list of recommendations.
> The fastest thing to do is have a hashmap in memory, perhaps read in from
> HDFS. Remember that Mahout will output the recommendations with internal
> Mahout IDs, so you will have to replace these in the data with your actual
> user and item IDs.
>
> I use a NoSQL DB, either MongoDB or Cassandra, but others are fine too,
> even MySQL if you can scale it to meet your needs. I end up with two
> tables: one has my user ID as a key and recommendations with my item IDs,
> either ordered or with strengths. The second table has my item ID as the
> key with a list of similar items (again sorted or with strengths). At
> runtime I may have both a user ID and an item ID as context, so I get a
> list from both tables and combine them at runtime.
>
> I use a DB for many reasons and let it handle the caching. I never need
> to worry about memory management. If you have scaled your DB properly,
> the lookups will actually be executed like an in-memory hashmap with
> indexed keys for IDs. Scaling the DB can be done as your user base grows,
> when needed, without affecting the rest of the calculation pipeline. Yes,
> there will be overhead due to network traffic in a cluster, but the
> flexibility is worth it for me. If high availability is important, you
> can spread your DB cluster over multiple data centers without affecting
> the API for serving recommendations. I set up the recommendation
> calculation to run continuously in the background, replacing values in
> the two tables as fast as I can. This allows you to scale update speed
> (how many machines in the Mahout/Hadoop cluster) independently from
> lookup performance scaling (how many machines in your DB cluster, how
> much memory the DB machines have).
>
> On May 19, 2013, at 11:45 AM, Manuel Blechschmidt <
> manuel.blechschm...@gmx.de> wrote:
>
> Hi Tevfik,
> I am working with MySQL, but I would guess that HDFS, like Sean
> suggested, would be a good idea as well.
>
> There is also a project called Sqoop which can be used to transfer data
> from relational databases to Hadoop:
>
> http://sqoop.apache.org/
>
> Scribe might also be an option for transferring a lot of data:
> https://github.com/facebook/scribe#readme
>
> I would suggest that you just start with the technology that you know
> best, and then solve the problems as soon as you get them.
>
> /Manuel
>
> On May 19, 2013, at 8:26 PM, Sean Owen wrote:
>
> > I think everyone is agreeing that it is essential to only access
> > information in memory at run-time, yes, whatever that info may be.
> > I don't think the original question was about Hadoop, but the answer
> > is the same: Hadoop mappers are just reading the input serially. There
> > is no advantage to a relational database or NoSQL database; they're
> > just overkill. HDFS is sufficient, and probably even the best of these
> > at allowing fast serial access to the data.
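A minimal sketch of the two-table serving layer Pat describes above, under stated assumptions: plain HashMaps stand in for the MongoDB/Cassandra tables, the Hadoop RecommenderJob is assumed to have written textual output of the form "userID<TAB>[itemID:score,...]" to HDFS, and the userIdMap/itemIdMap dictionaries (hypothetical, built elsewhere) translate internal Mahout IDs back to application IDs.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecLookup {

  // "Table" 1: application user ID -> ranked recommended item IDs.
  private final Map<String, List<String>> userRecs =
      new HashMap<String, List<String>>();
  // "Table" 2: application item ID -> ranked similar item IDs. Would be
  // populated the same way from item-similarity output; omitted here.
  private final Map<String, List<String>> similarItems =
      new HashMap<String, List<String>>();

  // Read RecommenderJob output from HDFS once, translating internal
  // Mahout IDs back to application IDs as Pat suggests.
  public void loadUserRecs(String hdfsPath,
                           Map<Long, String> userIdMap,
                           Map<Long, String> itemIdMap) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path(hdfsPath)), "UTF-8"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        String userId = userIdMap.get(Long.parseLong(parts[0]));
        String body = parts[1].substring(1, parts[1].length() - 1); // strip [ ]
        List<String> items = new ArrayList<String>();
        for (String pair : body.split(",")) {
          long internalItemId = Long.parseLong(pair.split(":")[0]);
          items.add(itemIdMap.get(internalItemId));
        }
        userRecs.put(userId, items);
      }
    } finally {
      in.close();
    }
  }

  // Runtime lookup: merge the user's precomputed recs with items similar
  // to the current context item, keeping rank order and dropping duplicates.
  public List<String> recommend(String userId, String contextItemId, int howMany) {
    Set<String> merged = new LinkedHashSet<String>();
    List<String> fromUser = userRecs.get(userId);
    if (fromUser != null) {
      merged.addAll(fromUser);
    }
    List<String> fromItem = similarItems.get(contextItemId);
    if (fromItem != null) {
      merged.addAll(fromItem);
    }
    List<String> result = new ArrayList<String>(merged);
    return result.subList(0, Math.min(howMany, result.size()));
  }
}

In production the two maps become two keyed tables, which is exactly why the DB lookups behave like an indexed hashmap when scaled properly.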
> >
> > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> > <tevfik.ayte...@gmail.com> wrote:
> >> Hi Manuel,
> >> But if one uses matrix factorization and stores the user and item
> >> factors in memory, then there will be no database access during
> >> recommendation.
> >> I thought that the original question was where to store the data and
> >> how to give it to Hadoop.
> >>
> >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
> >> <manuel.blechschm...@gmx.de> wrote:
> >>> Hi Tevfik,
> >>> one request to the recommender can become more than 1,000 queries to
> >>> the database, depending on which recommender you use and the amount
> >>> of preferences for the given user.
> >>>
> >>> The problem is not whether you are using SQL, NoSQL, or any other
> >>> query language. The problem is the latency of the answers.
> >>>
> >>> An average TCP packet in the same data center takes 500 µs; a main
> >>> memory reference takes 0.1 µs. This means that the main memory of
> >>> your Java process can be accessed 5,000 times faster than any other
> >>> process, such as a database connected via TCP/IP.
> >>>
> >>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
> >>>
> >>> Here you can see a screenshot that shows that database communication
> >>> is by far (99%) the slowest component of a recommender request:
> >>>
> >>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
> >>>
> >>> If you do not want to cache your data in your Java process, you can
> >>> use a completely in-memory database technology like SAP HANA
> >>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
> >>>
> >>> Nevertheless, if you are using these, you do not need Mahout anymore.
> >>>
> >>> An architecture of a Mahout system can be seen here:
> >>>
> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
> >>>
> >>> Hope that helps
> >>> Manuel
> >>>
> >>> On May 19, 2013, at 7:20 PM, Sean Owen wrote:
> >>>
> >>>> I'm first saying that you really don't want to use the database as a
> >>>> data model directly. It is far too slow.
> >>>> Instead you want to use a data model implementation that reads all of
> >>>> the data, once, serially, into memory. And in that case, it makes no
> >>>> difference where the data is being read from, because it is read just
> >>>> once, serially. A file is just as fine as a fancy database. In fact,
> >>>> it's probably easier and faster.
> >>>>
> >>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
> >>>> <tevfik.ayte...@gmail.com> wrote:
> >>>>> Thanks Sean, but I could not get your answer. Can you please
> >>>>> explain it again?
> >>>>>
> >>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen <sro...@gmail.com> wrote:
> >>>>>> It doesn't matter, in the sense that it is never going to be fast
> >>>>>> enough for real time at any reasonable scale if actually run off a
> >>>>>> database directly. One operation results in thousands of queries.
> >>>>>> It's going to read data into memory anyway and cache it there. So,
> >>>>>> whatever is easiest for you. The simplest solution is a file.
> >>>>>>
> >>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
> >>>>>> <ahmetyilmazefe...@yahoo.com> wrote:
> >>>>>>> Hi,
> >>>>>>> I would like to use Mahout to make recommendations on my web site.
> >>>>>>> Since the data is going to be big (hopefully), I plan to use the
> >>>>>>> Hadoop implementations of the recommender algorithms.
> >>>>>>>
> >>>>>>> I'm currently storing the data in MySQL. Should I continue with
> >>>>>>> it, or should I switch to a NoSQL database such as MongoDB, or
> >>>>>>> something else?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Ahmet
> >>>
> >>> --
> >>> Manuel Blechschmidt
> >>> M.Sc. IT Systems Engineering
> >>> Dortustr. 57
> >>> 14467 Potsdam
> >>> Mobil: 0173/6322621
> >>> Twitter: http://twitter.com/Manuel_B
>
> --
> Manuel Blechschmidt
> M.Sc. IT Systems Engineering
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
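To make Sean's "read once, serially, into memory" point concrete: a minimal non-distributed sketch using Mahout's Taste API, assuming a hypothetical ratings.csv of "userID,itemID,preference" lines. The file is loaded a single time when FileDataModel is constructed; after that, every recommend() call, however many similarity lookups it triggers, hits main memory rather than a database over TCP/IP.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class InMemoryRecommenderExample {
  public static void main(String[] args) throws Exception {
    // Read the whole data set once, serially, into memory.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // The "thousands of queries" one recommendation implies are now
    // main-memory references, not round-trips to a database.
    List<RecommendedItem> recs = recommender.recommend(1L, 5);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}

The same reasoning explains Manuel's numbers: at roughly 500 µs per in-datacenter round-trip versus 0.1 µs per memory reference, keeping the model in the Java process is about 5,000 times faster per access.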