Yeah, we never used the HBase client API (Puts) for loading a batch of millions of records. Can you tell me where, by default, the o/p HFile(s) from the MR job are stored in HDFS?
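For reference, a minimal command-line sketch of the flow the thread describes (table name and HDFS paths here are hypothetical, and the exact invocation varies by HBase version): when you pass `-Dimporttsv.bulk.output=<dir>`, ImportTsv writes its HFiles to that HDFS directory instead of issuing Puts, and the bulk-load step then moves those files into the table's region store directories.

```shell
# Hypothetical table/paths; requires a running Hadoop + HBase cluster.
# Step 1: MR job writes HFiles to the given HDFS directory (no Puts, no WAL):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,f1:c1 \
  -Dimporttsv.bulk.output=/user/anil/bulk-hfiles \
  mytable /user/anil/input.csv

# Step 2: move the generated HFiles under the table's region stores:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  /user/anil/bulk-hfiles mytable
```

Without `-Dimporttsv.bulk.output`, ImportTsv instead writes directly to the table with normal Puts, which is the slower path discussed below.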
On Tue, Oct 23, 2012 at 11:31 PM, Anoop John <[email protected]> wrote:

> I think, as per your explanation of the need for a unique id, it is okay.
> No need to worry about data loss. As long as you can make sure you generate
> a unique id, things are fine. MR will make sure it runs the job on the whole
> data, and the o/p is persisted in a file. Yes, this file is HFile(s) only.
> Then, finally, the HBase cluster is used for loading the HFiles into the
> region stores. Bulk loading huge data this way will be much, much faster
> than normal put()s.
>
> -Anoop-
>
> On Wed, Oct 24, 2012 at 11:44 AM, anil gupta <[email protected]> wrote:
>
> > Anoop: "Only thing is that some mappers crashed.. so then the MR
> > framework will run that mapper again on the same data set.. Then the
> > unique id will be different?"
> >
> > Anil: Yes, for the same dataset the UniqueId will also be different.
> > The UniqueID does not depend on the data.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Tue, Oct 23, 2012 at 11:07 PM, Anoop John <[email protected]> wrote:
> >
> > > > Is there a way that I can explicitly turn on WAL for bulk loading?
> > > No.
> > > How do you generate the unique id? Remember that the initial steps
> > > won't need the HBase cluster at all. MR generates the HFiles, and the
> > > o/p will be in files only. Mappers will also write their o/p to files.
> > > The only thing is that some mappers crashed.. so then the MR framework
> > > will run that mapper again on the same data set.. Then the unique id
> > > will be different? I think you don't need to worry about data loss
> > > from the HBase side, so the WAL is not required.
> > >
> > > -Anoop-
> > >
> > > On Wed, Oct 24, 2012 at 10:58 AM, anil gupta <[email protected]> wrote:
> > >
> > > > That's a very interesting fact. You made it clear, but my custom
> > > > bulk loader generates a unique ID for every row in the map phase.
> > > > So, not all my data is in csv or text. Is there a way that I can
> > > > explicitly turn on WAL for bulk loading?
> > > >
> > > > On Tue, Oct 23, 2012 at 10:14 PM, Anoop John <[email protected]> wrote:
> > > >
> > > > > Hi Anil,
> > > > > In case of bulk loading, it is not like data is put into HBase one
> > > > > by one. The MR job will create an o/p like an HFile: it will
> > > > > create the KVs and write them to a file in order, just as an HFile
> > > > > would look. Then the file is finally loaded into HBase. Only for
> > > > > this final step is the HBase RS used, so there is no point in a
> > > > > WAL there... Am I making it clear for you? The data is already
> > > > > present in the form of raw data in some txt or csv file :)
> > > > >
> > > > > -Anoop-
> > > > >
> > > > > On Wed, Oct 24, 2012 at 10:41 AM, Anoop John <[email protected]> wrote:
> > > > >
> > > > > > Hi Anil
> > > > > >
> > > > > > On Wed, Oct 24, 2012 at 10:39 AM, anil gupta <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Anoop,
> > > > > > >
> > > > > > > As per your last email, did you mean that the WAL is not used
> > > > > > > while using the HBase Bulk Loader? If yes, then how do we
> > > > > > > ensure "no data loss" in case of a RegionServer failure?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Anil Gupta
> > > > > > >
> > > > > > > On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan <[email protected]> wrote:
> > > > > > >
> > > > > > > > As Kevin suggested, we can make use of a bulk load that goes
> > > > > > > > through the WAL and Memstore. Or the second option will be
> > > > > > > > to use the o/p of the mappers to create HFiles directly.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Ram
> > > > > > > >
> > > > > > > > On Wed, Oct 24, 2012 at 8:59 AM, Anoop John <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > Using the ImportTSV tool, you are trying to bulk load your
> > > > > > > > > data. Can you see and tell how many mappers and reducers
> > > > > > > > > there were? Out of the total time, what is the time taken
> > > > > > > > > by the mapper phase and by the reducer phase? It seems
> > > > > > > > > like an MR-related issue (maybe some conf issue). In this
> > > > > > > > > bulk load case, most of the work is done by the MR job: it
> > > > > > > > > will read the raw data, convert it into Puts, and write
> > > > > > > > > HFiles. The MR o/p is HFiles itself. The next part in
> > > > > > > > > ImportTSV will just put the HFiles under the table region
> > > > > > > > > store. There won't be any WAL usage in this bulk load.
> > > > > > > > >
> > > > > > > > > -Anoop-
> > > > > > > > >
> > > > > > > > > On Tue, Oct 23, 2012 at 9:18 PM, Nick maillard <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > I'm starting with HBase and testing for our needs. I
> > > > > > > > > > have set up a Hadoop cluster of three machines and an
> > > > > > > > > > HBase cluster atop the same three machines: one master,
> > > > > > > > > > two slaves.
> > > > > > > > > >
> > > > > > > > > > I am testing the import of a 5 GB csv file with the
> > > > > > > > > > importTsv tool. I import the file into HDFS and use the
> > > > > > > > > > importTsv tool to import it into HBase.
> > > > > > > > > >
> > > > > > > > > > Right now it takes a little over an hour to complete. It
> > > > > > > > > > creates around 2 million entries in one table with a
> > > > > > > > > > single family. If I use bulk uploading, it goes down to
> > > > > > > > > > 20 minutes.
> > > > > > > > > >
> > > > > > > > > > My Hadoop has 21 map tasks, but they all seem to be
> > > > > > > > > > taking a very long time to finish; many tasks end up
> > > > > > > > > > timing out.
> > > > > > > > > >
> > > > > > > > > > I am wondering what I have missed in my configuration. I
> > > > > > > > > > have followed the different prerequisites in the
> > > > > > > > > > documentation, but I am really unsure as to what is
> > > > > > > > > > causing this slowdown. If I were to apply the wordcount
> > > > > > > > > > example to the same file, it takes only minutes to
> > > > > > > > > > complete, so I am guessing the issue lies in my HBase
> > > > > > > > > > configuration.
> > > > > > > > > >
> > > > > > > > > > Any help or pointers would be appreciated.
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Regards,
> > > > > > > Anil Gupta
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> >
> > --
> > Thanks & Regards,
> > Anil Gupta

--
Thanks & Regards,
Anil Gupta
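An aside on the unique-id point in the thread above: an id generated independently of the row contents (e.g. a random UUID per map call) will come out different when the MR framework re-runs a failed mapper over the same split, whereas an id derived from the row bytes is stable across task retries. A minimal, HBase-free illustration of that difference (the class name, helper, and sample record below are made up for the example):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class RowIdDemo {
    // Hypothetical helper: a deterministic row key derived from the raw
    // record bytes (hex-encoded SHA-256), stable across mapper re-runs.
    static String dataDerivedId(String record) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(record.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String record = "2012-10-23,anil,42"; // one sample CSV row

        // A re-run mapper sees the same record, so data-derived ids match:
        System.out.println(dataDerivedId(record).equals(dataDerivedId(record)));
        // prints "true"

        // ...but two UUID.randomUUID() calls (one per task attempt) differ:
        System.out.println(UUID.randomUUID().equals(UUID.randomUUID()));
        // prints "false"
    }
}
```

So if exactly-once row keys matter, deriving the key from the data (or from a stable attempt-independent input offset) sidesteps the re-run issue Anoop raises; a purely random id relies on the job succeeding in one pass per split.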
