> Another problem is with data locality immediately after bulk loading
> through MR.
You might find this recent discussion about that useful: [1]

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] The start is here:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201207.mbox/%3CCAA7+SiBcu_yB45=wearkcpdw1hgnksuv4cevxhjf8k5yrwv...@mail.gmail.com%3E
but then the thread gets broken because "FWD"/"RES" was added to the
subject. You can also find it here:
http://search-hadoop.com/?q=bulk+import+and+data+locality

On Fri, Jul 27, 2012 at 9:46 AM, Sever Fundatureanu
<[email protected]> wrote:

> After digging a bit I've found that my problem comes from the following
> lines in the Store class:
>
>     void bulkLoadHFile(String srcPathStr) throws IOException {
>       Path srcPath = new Path(srcPathStr);
>
>       // Move the file if it's on another filesystem
>       FileSystem srcFs = srcPath.getFileSystem(conf);
>       if (!srcFs.equals(fs)) {
>         LOG.info("File " + srcPath + " on different filesystem than " +
>             "destination store - moving to this filesystem.");
>         Path tmpPath = getTmpPath();
>         FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
>         LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
>         srcPath = tmpPath;
>       }
>
> The equality check for the two filesystems fails in my case and I get
> the following log:
>
> 2012-07-27 14:47:25,321 INFO org.apache.hadoop.hbase.regionserver.Store:
> File hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357
> on different filesystem than destination store - moving to this
> filesystem.
> 2012-07-27 14:47:27,286 INFO org.apache.hadoop.hbase.regionserver.Store:
> Copied to temporary path on dst filesystem:
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> 2012-07-27 14:47:27,286 DEBUG org.apache.hadoop.hbase.regionserver.Store:
> Renaming bulk load file
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> to
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712
> 2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.StoreFile:
> HFile Bloom filter type for c4bbf70a6654422db81884f15f34c712: NONE, but
> ROW specified in column family configuration
> 2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store:
> Moved hfile
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> into store directory
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F
> - updating store file list.
> 2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store:
> Successfully loaded store file
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> into store F (new location:
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712)
>
> In my hbase-site.xml I have:
>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://fs0.cm.cluster:8020/hbase</value>
>     <description>The directory shared by RegionServers.
>     </description>
>   </property>
>
> and in my hdfs-site.xml I have:
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://fs0.cm.cluster:8020</value>
>   </property>
>
> As you can see, they point to the same namenode, so I really don't
> understand why the above check fails.
>
> Regards,
> Sever
>
> On Fri, Jul 27, 2012 at 1:17 PM, Sever Fundatureanu
> <[email protected]> wrote:
> > Hi Anil,
> >
> > I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed
> > the ones mentioned by Bijeet. I can also add that I am doing the 2nd
> > stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path
> > sourceDir, HTable table) on a LoadIncrementalHFiles object.
> >
> > Best,
> > Sever
> >
> > On Fri, Jul 27, 2012 at 5:40 AM, Anil Gupta <[email protected]> wrote:
> >> Hi Sever,
> >>
> >> That's a very interesting thing. Which Hadoop and HBase versions are
> >> you using? I am going to run bulk loads tomorrow. If you can tell me
> >> which directories in HDFS you compared with /hbase/$table then I will
> >> try to check the same.
> >>
> >> Best Regards,
> >> Anil
> >>
> >> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu
> >> <[email protected]> wrote:
> >>
> >>> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <[email protected]> wrote:
> >>>>>
> >>>>> For the bulk-loading process, the HBase documentation mentions that
> >>>>> in a 2nd stage "the appropriate Region Server adopts the HFile,
> >>>>> moving it into its storage directory and making the data available
> >>>>> to clients." But from my experience the files also remain in the
> >>>>> original location from where they are "adopted". So I guess the
> >>>>> data is actually copied into the HBase directory, right? This means
> >>>>> that, compared to online importing, when bulk loading you
> >>>>> essentially need twice the disk space on HDFS, right?
> >>>>>
> >>>>
> >>>> Yes, if you are generating HFiles on one cluster and loading into a
> >>>> separate HBase cluster. If they are co-located, it's just an HDFS mv.
> >>>
> >>> Hmm, both the HFile generation and the HBase cluster run on top of
> >>> the same HDFS cluster. I did a "du" on both the source HDFS directory
> >>> and the destination "/hbase" directory and I got the same sizes
> >>> (+/- a few bytes).
> >>> I deleted the source directory from HDFS and then scanned
> >>> the table without any problems. Maybe there is a config parameter
> >>> I'm missing?
> >>>
> >>> Sever
> >
> > --
> > Sever Fundatureanu
> >
> > Vrije Universiteit Amsterdam
> > E-mail: [email protected]
>
> --
> Sever Fundatureanu
>
> Vrije Universiteit Amsterdam
> E-mail: [email protected]
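One plausible explanation for the failing equality check discussed above (this is an editorial assumption, not something the thread confirms): Hadoop's FileSystem class does not override equals(), so `!srcFs.equals(fs)` is a reference comparison, and two FileSystem instances for the same NameNode compare unequal whenever they are not the exact same cached object. A more robust check compares the filesystem URIs themselves. A minimal, self-contained sketch of that idea, using only java.net.URI (the class and helper names are invented for illustration):

```java
import java.net.URI;

public class SameFsCheck {
    // Hypothetical helper: treat two paths as being on the same
    // filesystem when their URIs share a scheme and authority
    // (host:port), instead of relying on FileSystem object equality.
    static boolean sameFileSystem(URI a, URI b) {
        return a.getScheme().equalsIgnoreCase(b.getScheme())
                && a.getAuthority().equalsIgnoreCase(b.getAuthority());
    }

    public static void main(String[] args) {
        // The two locations from the log above: same NameNode, so the
        // check should report "same filesystem".
        URI src = URI.create("hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm");
        URI dst = URI.create("hdfs://fs0.cm.cluster:8020/hbase");
        System.out.println(sameFileSystem(src, dst)); // prints "true"
    }
}
```

With a check like this, the two paths in the log (both under hdfs://fs0.cm.cluster:8020) would be recognized as co-located and the copy would be avoided.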

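The copy-versus-move distinction in the exchange above (a co-located load is "just an HDFS mv"; a cross-filesystem load copies the bytes and so temporarily needs twice the space) can be modeled outside Hadoop with plain java.nio. This is a toy sketch, not HBase code; the class and method names are made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class AdoptSketch {
    // Toy model of HFile "adoption": when source and destination live
    // on the same filesystem, adoption is a metadata-only rename and
    // consumes no extra space; across filesystems the bytes must be
    // copied, which is what doubles the disk usage during a load.
    static Path adopt(Path src, Path storeDir) throws IOException {
        Path dst = storeDir.resolve(src.getFileName());
        if (Files.getFileStore(src).equals(Files.getFileStore(storeDir))) {
            return Files.move(src, dst);  // same filesystem: cheap rename
        }
        return Files.copy(src, dst);      // different filesystem: full copy
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("bulkload");
        Path hfile = Files.write(tmp.resolve("hfile"), new byte[] {1, 2, 3});
        Path store = Files.createDirectories(tmp.resolve("store"));
        Path adopted = adopt(hfile, store);
        // Both paths are in the same temp directory, hence the same
        // FileStore, so the source is gone after adoption:
        System.out.println(Files.exists(hfile));   // prints "false"
        System.out.println(Files.exists(adopted)); // prints "true"
    }
}
```

Note that java.nio's FileStore does override equals(), which is why this comparison behaves as expected here; the thread's problem is precisely that the analogous comparison in Hadoop 1.0 did not.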