Hi Tevfik,

I am working with MySQL, but I would guess that HDFS, as Sean suggested, would be a good idea as well.
There is also a project called Sqoop which can be used to transfer data from
relational databases to Hadoop: http://sqoop.apache.org/

Scribe might also be an option for transferring a lot of data:
https://github.com/facebook/scribe#readme

I would suggest that you just start with the technology that you know best
and then solve the problems as you run into them.

/Manuel

On 19.05.2013 at 20:26, Sean Owen wrote:

> I think everyone agrees that it is essential to only access
> information in memory at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but the answer
> is the same: Hadoop mappers are just reading the input serially. There
> is no advantage to a relational database or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even best of these at
> allowing fast serial access to the data.
>
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> <[email protected]> wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory, then there will be no database access during
>> recommendation.
>> I thought that the original question was where to store the data and
>> how to feed it to Hadoop.
>>
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>> <[email protected]> wrote:
>>> Hi Tevfik,
>>> one request to the recommender could become more than 1,000 queries to
>>> the database, depending on which recommender you use and the amount of
>>> preferences for the given user.
>>>
>>> The problem is not whether you are using SQL, NoSQL, or any other query
>>> language. The problem is the latency of the answers.
>>>
>>> An average TCP packet in the same data center takes 500 µs; a main
>>> memory reference takes 0.1 µs. This means that the main memory of your
>>> Java process can be accessed 5,000 times faster than any other process,
>>> such as a database connected via TCP/IP.
>>>
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>>
>>> Here you can see a screenshot which shows that database communication is
>>> by far (99%) the slowest component of a recommender request:
>>>
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>>
>>> If you do not want to cache your data in your Java process, you can use
>>> a completely in-memory database technology like SAP HANA
>>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>>
>>> Nevertheless, if you are using these, you do not need Mahout anymore.
>>>
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>>
>>> Hope that helps
>>> Manuel
>>>
>>> On 19.05.2013 at 19:20, Sean Owen wrote:
>>>
>>>> I'm saying, first, that you really don't want to use the database as a
>>>> data model directly. It is far too slow.
>>>> Instead you want to use a data model implementation that reads all of
>>>> the data, once, serially, into memory. And in that case, it makes no
>>>> difference where the data is being read from, because it is read just
>>>> once, serially. A file is just as good as a fancy database. In fact
>>>> it's probably easier and faster.
>>>>
>>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>> <[email protected]> wrote:
>>>>> Thanks Sean, but I could not get your answer. Can you please explain
>>>>> it again?
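
A minimal sketch of the in-memory approach Sean describes, using Mahout's
Taste API: the data model reads a plain ratings file once, serially, into
memory, and every recommendation request afterwards is served from RAM. The
file name, CSV layout, and neighborhood size here are assumptions for
illustration, not something specified in the thread.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class FileBasedRecommenderExample {
      public static void main(String[] args) throws Exception {
        // Reads the whole file once, serially, into memory; every
        // recommend() call afterwards touches only RAM, never the disk.
        // ratings.csv (hypothetical): userID,itemID,preference per line
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(25, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 recommendations for user 42
        List<RecommendedItem> items = recommender.recommend(42L, 5);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

If the underlying file is rewritten, the same model can be refreshed
periodically rather than rebuilt, so the serial read stays a one-time cost
per reload.
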
>>>>>
>>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen <[email protected]> wrote:
>>>>>> It doesn't matter, in the sense that it is never going to be fast
>>>>>> enough for real-time at any reasonable scale if actually run off a
>>>>>> database directly. One operation results in thousands of queries.
>>>>>> It's going to read the data into memory anyway and cache it there.
>>>>>> So, whatever is easiest for you. The simplest solution is a file.
>>>>>>
>>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Yılmaz
>>>>>> <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>> I would like to use Mahout to make recommendations on my web site.
>>>>>>> Since the data is (hopefully) going to be big, I plan to use the
>>>>>>> Hadoop implementations of the recommender algorithms.
>>>>>>>
>>>>>>> I'm currently storing the data in MySQL. Should I continue with it,
>>>>>>> or should I switch to a NoSQL database such as MongoDB or something
>>>>>>> else?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ahmet
>>>
>>> --
>>> Manuel Blechschmidt
>>> M.Sc. IT Systems Engineering
>>> Dortustr. 57
>>> 14467 Potsdam
>>> Mobil: 0173/6322621
>>> Twitter: http://twitter.com/Manuel_B

--
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
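
Since the original question was whether to stay with MySQL: a hedged sketch
of the caching setup Manuel and Sean describe, keeping MySQL as the system
of record while serving recommendations from memory. ReloadFromJDBCDataModel
wraps a JDBC-backed model and pulls the whole preference table into RAM
once. The connection details and the table/column names below are
assumptions for illustration, not values taken from the thread.

    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
    import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;
    import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

    public class CachedMySQLModelExample {
      public static void main(String[] args) throws Exception {
        MysqlDataSource dataSource = new MysqlDataSource();
        dataSource.setServerName("localhost");      // hypothetical server
        dataSource.setDatabaseName("recommender");  // hypothetical database
        dataSource.setUser("mahout");
        dataSource.setPassword("secret");

        // Table and column names are illustrative assumptions.
        MySQLJDBCDataModel jdbcModel = new MySQLJDBCDataModel(
            dataSource, "taste_preferences",
            "user_id", "item_id", "preference", "timestamp");

        // Pulls the whole table into memory once; recommendation requests
        // afterwards never touch MySQL. Call model.refresh(null) to re-pull.
        DataModel model = new ReloadFromJDBCDataModel(jdbcModel);
        System.out.println("Cached " + model.getNumUsers() + " users in memory");
      }
    }

With this setup the recommender issues no per-request queries at all, which
sidesteps exactly the TCP/IP latency problem described above; the database
is only touched at load and refresh time.
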
