Hi Tevfik,

I am working with MySQL, but I would guess that HDFS, as Sean suggested, would be a good idea as well.
There is also a project called Sqoop which can be used to transfer data from
relational databases to Hadoop: http://sqoop.apache.org/

Scribe might also be an option for transferring a lot of data:
https://github.com/facebook/scribe#readme

I would suggest that you just start with the technology that you know best
and then solve the problems as you run into them.

/Manuel

On 19.05.2013 at 20:26, Sean Owen wrote:

> I think everyone agrees that it is essential to only access
> information in memory at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but the answer
> is the same: Hadoop mappers are just reading the input serially. There
> is no advantage to a relational database or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even best of these at
> allowing fast serial access to the data.
>
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> <[email protected]> wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory, then there will be no database access during
>> recommendation.
>> I thought that the original question was where to store the data and
>> how to feed it to Hadoop.
>>
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>> <[email protected]> wrote:
>>> Hi Tevfik,
>>> one request to the recommender could become more than 1,000 queries to
>>> the database, depending on which recommender you use and the amount of
>>> preferences for the given user.
>>>
>>> The problem is not whether you are using SQL, NoSQL, or any other query
>>> language. The problem is the latency of the answers.
>>>
>>> An average TCP packet in the same data center takes 500 µs; a main
>>> memory reference takes 0.1 µs. This means that the main memory of your
>>> Java process can be accessed 5,000 times faster than any other process,
>>> such as a database connected via TCP/IP.
>>>
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>>
>>> Here you can see a screenshot which shows that database communication is
>>> by far (99%) the slowest component of a recommender request:
>>>
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>>
>>> If you do not want to cache your data in your Java process, you can use
>>> a completely in-memory database technology like SAP HANA
>>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>>
>>> Nevertheless, if you are using these, you do not need Mahout anymore.
>>>
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>>
>>> Hope that helps
>>> Manuel
>>>
>>> On 19.05.2013 at 19:20, Sean Owen wrote:
>>>
>>>> I'm saying, first, that you really don't want to use the database as a
>>>> data model directly. It is far too slow.
>>>> Instead you want to use a data model implementation that reads all of
>>>> the data, once, serially, into memory. And in that case, it makes no
>>>> difference where the data is being read from, because it is read just
>>>> once, serially. A file is just as good as a fancy database. In fact
>>>> it's probably easier and faster.
>>>>
>>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>> <[email protected]> wrote:
>>>>> Thanks Sean, but I could not get your answer. Can you please explain
>>>>> it again?
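
A minimal sketch of the in-memory approach Sean describes, using Mahout's
Taste API: the data model reads a plain ratings file once, serially, into
memory, and every recommendation request afterwards is served from RAM. The
file name, CSV layout, and neighborhood size here are assumptions for
illustration, not something specified in the thread.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class FileBasedRecommenderExample {
      public static void main(String[] args) throws Exception {
        // Reads the whole file once, serially, into memory; every
        // recommend() call afterwards touches only RAM, never the disk.
        // ratings.csv (hypothetical): userID,itemID,preference per line
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(25, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 recommendations for user 42
        List<RecommendedItem> items = recommender.recommend(42L, 5);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

If the underlying file is rewritten, the same model can be refreshed
periodically rather than rebuilt, so the serial read stays a one-time cost
per reload.
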
>>>>>
>>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen <[email protected]> wrote:
>>>>>> It doesn't matter, in the sense that it is never going to be fast
>>>>>> enough for real-time at any reasonable scale if actually run off a
>>>>>> database directly. One operation results in thousands of queries.
>>>>>> It's going to read the data into memory anyway and cache it there.
>>>>>> So, whatever is easiest for you. The simplest solution is a file.
>>>>>>
>>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Yılmaz
>>>>>> <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>> I would like to use Mahout to make recommendations on my web site.
>>>>>>> Since the data is (hopefully) going to be big, I plan to use the
>>>>>>> Hadoop implementations of the recommender algorithms.
>>>>>>>
>>>>>>> I'm currently storing the data in MySQL. Should I continue with it,
>>>>>>> or should I switch to a NoSQL database such as MongoDB or something
>>>>>>> else?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ahmet
>>>
>>> --
>>> Manuel Blechschmidt
>>> M.Sc. IT Systems Engineering
>>> Dortustr. 57
>>> 14467 Potsdam
>>> Mobil: 0173/6322621
>>> Twitter: http://twitter.com/Manuel_B

--
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
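
Since the original question was whether to stay with MySQL: a hedged sketch
of the caching setup Manuel and Sean describe, keeping MySQL as the system
of record while serving recommendations from memory. ReloadFromJDBCDataModel
wraps a JDBC-backed model and pulls the whole preference table into RAM
once. The connection details and the table/column names below are
assumptions for illustration, not values taken from the thread.

    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
    import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;
    import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

    public class CachedMySQLModelExample {
      public static void main(String[] args) throws Exception {
        MysqlDataSource dataSource = new MysqlDataSource();
        dataSource.setServerName("localhost");      // hypothetical server
        dataSource.setDatabaseName("recommender");  // hypothetical database
        dataSource.setUser("mahout");
        dataSource.setPassword("secret");

        // Table and column names are illustrative assumptions.
        MySQLJDBCDataModel jdbcModel = new MySQLJDBCDataModel(
            dataSource, "taste_preferences",
            "user_id", "item_id", "preference", "timestamp");

        // Pulls the whole table into memory once; recommendation requests
        // afterwards never touch MySQL. Call model.refresh(null) to re-pull.
        DataModel model = new ReloadFromJDBCDataModel(jdbcModel);
        System.out.println("Cached " + model.getNumUsers() + " users in memory");
      }
    }

With this setup the recommender issues no per-request queries at all, which
sidesteps exactly the TCP/IP latency problem described above; the database
is only touched at load and refresh time.
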
