> Another problem is with data locality immediately after bulk loading
> through MR.
You might find this recent discussion about that useful: [1]

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] The start is here:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201207.mbox/%3CCAA7+SiBcu_yB45=wearkcpdw1hgnksuv4cevxhjf8k5yrwv...@mail.gmail.com%3E
but then the thread gets broken because "FWD"/"RES" was added to the
subject. You can also find it here:
http://search-hadoop.com/?q=bulk+import+and+data+locality

On Fri, Jul 27, 2012 at 9:46 AM, Sever Fundatureanu
<[email protected]> wrote:

> After digging a bit I've found that my problem comes from the following
> lines in the Store class:
>
>     void bulkLoadHFile(String srcPathStr) throws IOException {
>       Path srcPath = new Path(srcPathStr);
>
>       // Move the file if it's on another filesystem
>       FileSystem srcFs = srcPath.getFileSystem(conf);
>       if (!srcFs.equals(fs)) {
>         LOG.info("File " + srcPath + " on different filesystem than " +
>             "destination store - moving to this filesystem.");
>         Path tmpPath = getTmpPath();
>         FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
>         LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
>         srcPath = tmpPath;
>       }
>
> The equality check for the two filesystems fails in my case and I get
> the following log:
>
> 2012-07-27 14:47:25,321 INFO org.apache.hadoop.hbase.regionserver.Store:
> File hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357
> on different filesystem than destination store - moving to this
> filesystem.
> 2012-07-27 14:47:27,286 INFO org.apache.hadoop.hbase.regionserver.Store:
> Copied to temporary path on dst filesystem:
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> 2012-07-27 14:47:27,286 DEBUG org.apache.hadoop.hbase.regionserver.Store:
> Renaming bulk load file
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> to
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712
> 2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.StoreFile:
> HFile Bloom filter type for c4bbf70a6654422db81884f15f34c712: NONE, but
> ROW specified in column family configuration
> 2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store:
> Moved hfile
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> into store directory
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F
> - updating store file list.
> 2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store:
> Successfully loaded store file
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> into store F (new location:
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712)
>
> In my hbase-site.xml I have:
>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://fs0.cm.cluster:8020/hbase</value>
>     <description>The directory shared by RegionServers.
>     </description>
>   </property>
>
> and in my hdfs-site.xml I have:
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://fs0.cm.cluster:8020</value>
>   </property>
>
> As you can see, they point to the same namenode, so I really don't
> understand why the above check fails.
>
> Regards,
> Sever
>
> On Fri, Jul 27, 2012 at 1:17 PM, Sever Fundatureanu
> <[email protected]> wrote:
> > Hi Anil,
> >
> > I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed
> > the ones mentioned by Bijeet. I can also add that I am doing the 2nd
> > stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path
> > sourceDir, HTable table) on a LoadIncrementalHFiles object.
> >
> > Best,
> > Sever
> >
> > On Fri, Jul 27, 2012 at 5:40 AM, Anil Gupta <[email protected]> wrote:
> >> Hi Sever,
> >>
> >> That's a very interesting thing. Which Hadoop and HBase versions are
> >> you using? I am going to run bulk loads tomorrow. If you can tell me
> >> which directories in HDFS you compared with /hbase/$table then I will
> >> try to check the same.
> >>
> >> Best Regards,
> >> Anil
> >>
> >> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu
> >> <[email protected]> wrote:
> >>
> >>> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <[email protected]> wrote:
> >>>>>
> >>>>> For the bulk-loading process, the HBase documentation mentions that
> >>>>> in a 2nd stage "the appropriate Region Server adopts the HFile,
> >>>>> moving it into its storage directory and making the data available
> >>>>> to clients." But from my experience the files also remain in the
> >>>>> original location from where they are "adopted". So I guess the
> >>>>> data is actually copied into the HBase directory, right? This means
> >>>>> that, compared to online importing, when bulk loading you
> >>>>> essentially need twice the disk space on HDFS, right?
> >>>>>
> >>>>
> >>>> Yes, if you are generating HFiles on one cluster and loading into a
> >>>> separate HBase cluster. If they are co-located, it's just an HDFS mv.
> >>>
> >>> Hmm, both the HFile generation and the HBase cluster run on top of
> >>> the same HDFS cluster. I did a "du" on both the source HDFS directory
> >>> and the destination "/hbase" directory and I got the same sizes
> >>> (+/- a few bytes).
> >>> I deleted the source directory from HDFS and then scanned
> >>> the table without any problems. Maybe there is a config parameter
> >>> I'm missing?
> >>>
> >>> Sever
> >
> > --
> > Sever Fundatureanu
> >
> > Vrije Universiteit Amsterdam
> > E-mail: [email protected]
>
> --
> Sever Fundatureanu
>
> Vrije Universiteit Amsterdam
> E-mail: [email protected]
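One plausible explanation for the failing equality check discussed above (this is an editorial assumption, not something the thread confirms): Hadoop's FileSystem class does not override equals(), so `!srcFs.equals(fs)` is a reference comparison, and two FileSystem instances for the same NameNode compare unequal whenever they are not the exact same cached object. A more robust check compares the filesystem URIs themselves. A minimal, self-contained sketch of that idea, using only java.net.URI (the class and helper names are invented for illustration):

```java
import java.net.URI;

public class SameFsCheck {
    // Hypothetical helper: treat two paths as being on the same
    // filesystem when their URIs share a scheme and authority
    // (host:port), instead of relying on FileSystem object equality.
    static boolean sameFileSystem(URI a, URI b) {
        return a.getScheme().equalsIgnoreCase(b.getScheme())
                && a.getAuthority().equalsIgnoreCase(b.getAuthority());
    }

    public static void main(String[] args) {
        // The two locations from the log above: same NameNode, so the
        // check should report "same filesystem".
        URI src = URI.create("hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm");
        URI dst = URI.create("hdfs://fs0.cm.cluster:8020/hbase");
        System.out.println(sameFileSystem(src, dst)); // prints "true"
    }
}
```

With a check like this, the two paths in the log (both under hdfs://fs0.cm.cluster:8020) would be recognized as co-located and the copy would be avoided.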

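The copy-versus-move distinction in the exchange above (a co-located load is "just an HDFS mv"; a cross-filesystem load copies the bytes and so temporarily needs twice the space) can be modeled outside Hadoop with plain java.nio. This is a toy sketch, not HBase code; the class and method names are made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class AdoptSketch {
    // Toy model of HFile "adoption": when source and destination live
    // on the same filesystem, adoption is a metadata-only rename and
    // consumes no extra space; across filesystems the bytes must be
    // copied, which is what doubles the disk usage during a load.
    static Path adopt(Path src, Path storeDir) throws IOException {
        Path dst = storeDir.resolve(src.getFileName());
        if (Files.getFileStore(src).equals(Files.getFileStore(storeDir))) {
            return Files.move(src, dst);  // same filesystem: cheap rename
        }
        return Files.copy(src, dst);      // different filesystem: full copy
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("bulkload");
        Path hfile = Files.write(tmp.resolve("hfile"), new byte[] {1, 2, 3});
        Path store = Files.createDirectories(tmp.resolve("store"));
        Path adopted = adopt(hfile, store);
        // Both paths are in the same temp directory, hence the same
        // FileStore, so the source is gone after adoption:
        System.out.println(Files.exists(hfile));   // prints "false"
        System.out.println(Files.exists(adopted)); // prints "true"
    }
}
```

Note that java.nio's FileStore does override equals(), which is why this comparison behaves as expected here; the thread's problem is precisely that the analogous comparison in Hadoop 1.0 did not.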