Hi Marc,
Actually, HFileOutputFormat is what you need to target; the
FileOutputFormat calls you quote below apply to other file formats and
their compression. HFileOutputFormat supports compressing the data as
it is written, so either add this to your configuration:
conf.set("hfile.compression", "lzo");
or add this to the job startup command:
-Dhfile.compression=lzo
(or substitute another supported compression codec, obviously).
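
For completeness, here is a minimal sketch of a whole bulk-load job
setup with the codec wired in (assuming the 0.90-style API; the class
name, table name, and paths are placeholders, not something from this
thread):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class CompressedBulkLoad {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // HFileOutputFormat picks this property up and compresses each
      // HFile block with the given codec as it writes.
      conf.set("hfile.compression", "lzo");

      Job job = new Job(conf, "compressed-bulk-load");
      job.setJarByClass(CompressedBulkLoad.class);
      // ... set your mapper (emitting ImmutableBytesWritable/Put)
      // and the input format/path here ...
      FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

      // Sets HFileOutputFormat as the output format and wires the
      // total-order partitioner to the table's region boundaries.
      HTable table = new HTable(conf, "mytable");
      HFileOutputFormat.configureIncrementalLoad(job, table);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

And since you are using importtsv, no code change should be needed at
all; passing the property on the command line is enough, something like
this (the column spec and paths are made-up examples, and this assumes
you run importtsv in its bulk-output mode, since HFileOutputFormat is
only used when -Dimporttsv.bulk.output is given):

  hadoop jar hbase-0.89.20100924+28.jar importtsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col \
    -Dimporttsv.bulk.output=/tmp/hfiles \
    -Dhfile.compression=lzo \
    mytable /user/marc/input.tsv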
Lars
On Tue, Dec 28, 2010 at 2:07 AM, Marc Limotte <[email protected]> wrote:
> Lars, Todd,
>
> Thanks for the info. If I understand correctly, the importtsv command-line
> tool will not compress by default and there is no command-line switch for
> it, but I can modify the source at
> hbase-0.89.20100924+28/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
> to call FileOutputFormat.setCompressOutput()/setOutputCompressorClass() on
> the Job in order to turn on compression.
>
> Does that sound right?
>
> Marc
>
>
> On Thu, Dec 23, 2010 at 2:34 PM, Todd Lipcon <[email protected]> wrote:
>
>> You beat me to it, Lars! Was writing a response when some family arrived
>> for
>> the holidays, and when I came back, you had written just what I had started
>> :)
>>
>> On Thu, Dec 23, 2010 at 1:51 PM, Lars George <[email protected]>
>> wrote:
>>
>> > live ones and then moved into place from their temp location. Not sure
>> > what happens if the local cluster has no /hbase etc.
>> >
>> > Todd, could you help here?
>> >
>>
>> Yep, there is a code path where, if the HFiles are on a different
>> filesystem, the bulk loader will copy them over to the HBase filesystem
>> first. It's not very efficient, though, so it's probably better to
>> distcp them to the local cluster yourself first.
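>>
>> (A hedged illustration of that copy; the NameNode URIs and paths are
>> placeholders:
>>
>>   hadoop distcp hdfs://src-nn:8020/bulk/hfiles \
>>     hdfs://hbase-nn:8020/tmp/hfiles
>>
>> and then run the bulk load against the copied directory.)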
>>
>> -Todd
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>