Hmmm, we didn't change anything in the config (at least not that we know of), and we certainly didn't change the order in which we perform the load and the distcp off the cluster.

One interesting thing we noticed after the upgrade is that distcp would copy HFiles to the backup cluster, but then the files there would be deleted. That was actually how we first noticed the change: we were tracking the total number of HFiles there, and the count would increase as normal after the distcp but then mysteriously decrease. I presume it was due to the HFile loader marking the files for deletion while distcp was running, and then the remote HDFS completing the deletion.

Some bit of behavior definitely changed between distcp and LoadIncrementalHFiles; I just don't know where. Regardless, we now have a working workaround. If this is the expected behavior rather than a bug, then all is fine.

- Adam

On 4/30/11 10:50 PM, Todd Lipcon wrote:
Hi Adam,

It's always been this way.

The only time you'll see them copied is if you run the load from a
remote filesystem, i.e. if you specify a URL that doesn't match the URL
used in hbase.rootdir.

See the bulkLoadHFile() method in Store.java:
     // Move the file if it's on another filesystem
     FileSystem srcFs = srcPath.getFileSystem(conf);
     if (!srcFs.equals(fs)) {
       LOG.info("File " + srcPath + " on different filesystem than " +
           "destination store - moving to this filesystem.");
       Path tmpPath = getTmpPath();
       FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
       LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
       srcPath = tmpPath;
     }
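
To make the condition above concrete, here is a minimal sketch of a "same filesystem?" check. It is an approximation that only compares the scheme and authority (host:port) of two URIs; it is not HBase's or Hadoop's actual comparison, which goes through FileSystem objects. The class name, method name, and the hosts nn-main and nn-backup are hypothetical and exist only for this illustration.

```java
import java.net.URI;

public class SameFilesystemCheck {

    // Returns true when both URIs name the same filesystem, judged only by
    // scheme and authority (host:port). This is a simplification for
    // illustration; it is not how Hadoop's FileSystem objects are compared.
    static boolean sameFilesystem(URI source, URI rootDir) {
        String srcScheme = source.getScheme();
        String dstScheme = rootDir.getScheme();
        if (srcScheme == null || dstScheme == null
                || !srcScheme.equalsIgnoreCase(dstScheme)) {
            return false;
        }
        String srcAuth = source.getAuthority();
        String dstAuth = rootDir.getAuthority();
        if (srcAuth == null || dstAuth == null) {
            // Equal only if neither URI names a host.
            return srcAuth == null && dstAuth == null;
        }
        return srcAuth.equalsIgnoreCase(dstAuth);
    }

    public static void main(String[] args) {
        // Hypothetical cluster addresses, for illustration only.
        URI rootDir = URI.create("hdfs://nn-main:8020/hbase");
        URI local = URI.create("hdfs://nn-main:8020/user/adam/hfiles");
        URI remote = URI.create("hdfs://nn-backup:8020/user/adam/hfiles");

        System.out.println("local  -> same fs? " + sameFilesystem(local, rootDir));
        System.out.println("remote -> same fs? " + sameFilesystem(remote, rootDir));
    }
}
```

Going by the snippet above, a load whose source passes this kind of check skips the copy step, so the HFiles end up under the HBase root rather than staying in the directory they were generated in. A source on a different filesystem is copied to a temporary path first.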

Perhaps your config changed slightly during the upgrade?

-Todd

On Fri, Apr 29, 2011 at 1:11 PM, Adam Phelps wrote:

    I could believe that, although I was under the impression that these
    files are actually incorporated into the existing region files.
    Still, it's definitely different behavior from what we were seeing
    before our recent upgrade.

    - Adam


    On 4/29/11 10:41 AM, Patrick Angeles wrote:

        Adam,

        They are probably not deleted, but moved to the appropriate region
        subdirectory under /hbase.

        On Fri, Apr 29, 2011 at 1:15 PM, Adam Phelps wrote:

            I just verified this, and the hfiles seem to be deleted one at
            a time as the bulk load runs.

            - Adam


            On 4/28/11 4:28 PM, Stack wrote:

                I took a look through the code and don't see any explicit
                removes, and looking through the history of changes to the
                file, I don't see any change of substance.

                Can you figure out what is doing the delete? At what stage?
                Is it as completebulkload runs?

                St.Ack

                On Thu, Apr 28, 2011 at 10:59 AM, Adam Phelps wrote:

                    We were using a backup scheme for our system where we
                    have map-reduce jobs generating HFiles, which we then
                    loaded using LoadIncrementalHFiles before making a
                    remote copy of them using distcp.

                    However, we just upgraded HBase (we're using Cloudera's
                    package, so we went from CDH3B4 to CDH3U0, both of which
                    are versions of 0.90.1) and discovered that the HFiles
                    now get deleted by the load operation.  Is this a recent
                    change?  Is there a configuration variable to revert
                    this behavior?

                    We can work around it by doing the copy before the load,
                    but that is less than optimal in our scenario, as we'd
                    prefer to have quicker access to the data in HBase.

                    - Adam

--
Todd Lipcon
Software Engineer, Cloudera
