> In this case, there is no data on the grid a priori; the data has to come
> into the grid from a DB. So what would the C/M mappers run on? Is there a
> way to run, say, 5 mappers without having 5 blocks of data on HDFS?
No, but the data doesn't have to come from the DB. The first copy does, but
the other 4 can be replicated within the cluster. That is exactly what
export + dump does - except that it is cumbersome to use, so it would be
better if it could be done automatically (within a Pig loader). That's how I
see it, at least... :)

Anze

On Thursday 04 November 2010, Jai Krishna wrote:
> Ankur,
>
> In this case, there is no data on the grid a priori; the data has to come
> into the grid from a DB. So what would the C/M mappers run on? Is there a
> way to run, say, 5 mappers without having 5 blocks of data on HDFS?
>
> Just trying to wrap my head around this; please excuse me if I'm missing
> something obvious.
>
> Thanks,
> Jai
>
> On 11/3/10 7:48 PM, "Ankur C. Goel" <[email protected]> wrote:
>
> Hitting the database from multiple mappers is not such a great idea IF
> there are hundreds/thousands of mappers involved processing hundreds of
> GBs of data. This could easily saturate the I/O bandwidth of the database
> server, creating a bottleneck in the processing. Export and dump to HDFS
> is a better option.
>
> -...@nkur
>
> On 11/3/10 5:02 PM, "Anze" <[email protected]> wrote:
>
> Sonal,
>
> Thanks for answering!
>
> Hiho sounds nice, but from what I gathered, it is more of a low-level
> interface for efficient loading from and storing to SQL DBs?
> (in other words, there is no loader or storage for Pig yet)
>
> I wrote a batch to export the DB to local files and then copy them to
> HDFS, so there is no gain for me in using another type of export (unless
> it can be used directly from Pig and/or keeps the schema intact), but it's
> nice to know it exists.
>
> It just seems weird that there is no DB loader for Pig yet. I tried
> writing one, but it would take more time than I have at the moment... I
> have a problem to solve ASAP. :)
>
> Thanks,
>
> Anze
>
> On Wednesday 03 November 2010, Sonal Goyal wrote:
> > Anze,
> >
> > You can check hiho as well:
> >
> > http://code.google.com/p/hiho/wiki/DatabaseImportFAQ
> >
> > Let me know if you need any help.
> >
> > Thanks and Regards,
> > Sonal
> >
> > Sonal Goyal | Founder and CEO | Nube Technologies LLP
> > http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
> >
> > 2010/11/3 Anze <[email protected]>
> >
> > > Alejandro, thanks for answering!
> > >
> > > I was hoping it could be done directly from Pig, but... :)
> > >
> > > I'll take a look at Sqoop then, and if that doesn't help, I'll just
> > > write a simple batch to export the data to TXT/CSV. Thanks for the
> > > pointer!
> > >
> > > Anze
> > >
> > > On Wednesday 03 November 2010, Alejandro Abdelnur wrote:
> > > > Not a 100% Pig solution, but you could use Sqoop to get the data in
> > > > as a pre-processing step. And if you want to handle it all as a
> > > > single job, you could use Oozie to create a workflow that does
> > > > Sqoop and then your Pig processing.
> > > >
> > > > Alejandro
> > > >
> > > > On Wed, Nov 3, 2010 at 3:22 PM, Anze <[email protected]> wrote:
> > > > > Hi!
> > > > >
> > > > > Part of the data I have resides in MySQL. Is there a loader that
> > > > > would allow loading directly from it?
> > > > >
> > > > > I can't find anything on the net, but it seems to me this must be
> > > > > a quite common problem.
> > > > > I checked piggybank but there is only DBStorage (and no DBLoader).
> > > > >
> > > > > Is some DBLoader out there too?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Anze
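For completeness: what makes Jai's "5 mappers without 5 blocks" possible is
that input splits come from the job's InputFormat, not from HDFS blocks.
Hadoop ships DBInputFormat (org.apache.hadoop.mapreduce.lib.db), which
slices a table into N ranges based on a COUNT(*) query, so the
export-and-dump job Ankur recommends can run as plain MapReduce with no
input data on HDFS at all. Below is a minimal sketch of such a job under
stated assumptions: the table ("users"), its columns, the JDBC settings,
and the output path are all placeholders, not anything from this thread.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DbToHdfs {

  // One row of the hypothetical "users" table. Implementing Writable as
  // well as DBWritable keeps the class usable across Hadoop versions.
  public static class UserRecord implements Writable, DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      name = rs.getString(2);
    }

    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, name);
    }

    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = Text.readString(in);
    }

    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      Text.writeString(out, name);
    }
  }

  // Each mapper receives a disjoint slice of the table and writes it
  // straight back out as tab-separated text on HDFS: "export + dump".
  public static class DumpMapper
      extends Mapper<LongWritable, UserRecord, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, UserRecord row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), new Text(row.id + "\t" + row.name));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");
    // Ask for 5 splits; DBInputFormat divides the table's COUNT(*) into
    // that many ranges. (The key is "mapreduce.job.maps" on newer Hadoop.)
    conf.setInt("mapred.map.tasks", 5);

    Job job = new Job(conf, "db-to-hdfs");
    job.setJarByClass(DbToHdfs.class);
    job.setMapperClass(DumpMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(DBInputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    // Table "users", no WHERE conditions, ordered by "id" so that the
    // LIMIT/OFFSET ranges handed to the mappers are stable and disjoint.
    DBInputFormat.setInput(job, UserRecord.class,
        "users", null, "id", "id", "name");
    FileOutputFormat.setOutputPath(job, new Path("/data/users"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

On the Pig side, the dumped directory can then be read with a plain
LOAD '/data/users' USING PigStorage() AS (id:long, name:chararray); - which
is exactly the manual two-step dance a DBLoader would collapse into one
statement. Sqoop performs the same kind of range-partitioned import with
far less boilerplate, which is why it keeps coming up in this thread.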
