On Sep 6, 2011, at 6:58am, Kate Ting wrote:

> Hi Ken, you make some good points, to which I've added comments individually.
>
> re: the degree of parallelism during the next step of processing is
> constrained by the number of mappers used during sqooping: does
> https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you
> might want to add your comments there.
Thanks for the ref, and yes that would help.

> re: winding up with unsplittable files and heavily skewed sizes: you
> can file separate JIRAs for those if desired.

That's not an issue for Sqoop - rather just how Hadoop works.

> re: partitioning isn't great: for some databases such as Oracle, the
> problem of heavily skewed sizes can be overcome using row-ids, you can
> file a JIRA for that if you feel it's needed.

Again, not really a Sqoop issue. Things are fine with OraOop. When we fall
back to regular Sqoop, we don't have a good column to use for partitioning,
so the results wind up being heavily skewed. But I don't think there's
anything Sqoop could do to easily solve that problem.

Regards,

-- Ken

> On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>>
>> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>>
>>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kkrugler_li...@transpac.com>
>>> wrote:
>>>> Hi there,
>>>>
>>>> The current documentation says:
>>>>
>>>> By default, data is not compressed. You can compress your data by using the
>>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>>> Hadoop compression codec using the --compression-codec argument. This
>>>> applies to both SequenceFiles or text files.
>>>>
>>>> But I think this is a bit misleading.
>>>>
>>>> Currently if output compression is enabled in a cluster, then the Sqooped
>>>> data is always compressed, regardless of the setting of this flag.
>>>>
>>>> It seems better to actually make compression controllable via --compress,
>>>> which means changing ImportJobBase.configureOutputFormat():
>>>>
>>>> if (options.shouldUseCompression()) {
>>>>   FileOutputFormat.setCompressOutput(job, true);
>>>>   FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>>   SequenceFileOutputFormat.setOutputCompressionType(job,
>>>>       CompressionType.BLOCK);
>>>> }
>>>> // new stuff
>>>> else {
>>>>   FileOutputFormat.setCompressOutput(job, false);
>>>> }
>>>>
>>>> Thoughts?
>>>
>>> This is a good point Ken. However, IMO it is better left as is since
>>> there may be a wider cluster management policy in effect that requires
>>> compression for all output files. One way to look at it is that for
>>> normal use, there is a predefined compression scheme configured
>>> cluster wide, and occasionally when required, Sqoop users can use a
>>> different scheme where necessary.
>>
>> The problem is that when you use text files as Sqoop output, these get
>> compressed at the file level by (typically) deflate, gzip or lzo.
>>
>> So you wind up with unsplittable files, which means that the degree of
>> parallelism during the next step of processing is constrained by the
>> number of mappers used during sqooping. But you typically set the number
>> of mappers based on DB load & size of the data set.
>>
>> And if partitioning isn't great, then you also wind up with heavily skewed
>> sizes for these unsplittable files, which makes things even worse.
>>
>> The current work-around is to use binary or Avro output instead of text,
>> but that's an odd requirement to be able to avoid the above problem.
>>
>> If the argument is to avoid implicitly changing the cluster's default
>> compression policy, then I'd suggest supporting a -nocompression flag.
>>
>> Regards,
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
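For reference, here is the change proposed in the quoted thread fleshed out
as a complete method. This is a minimal sketch only: it assumes Sqoop's
existing options.shouldUseCompression() accessor and the stock Hadoop
output-format APIs, and omits the rest of the surrounding ImportJobBase class.

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    protected void configureOutputFormat(Job job) {
      if (options.shouldUseCompression()) {
        // --compress / -z was given: gzip the output, block-compressed
        // in the case of SequenceFiles.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
            CompressionType.BLOCK);
      } else {
        // The proposed addition: explicitly disable compression rather than
        // silently inheriting the cluster-wide output compression default.
        FileOutputFormat.setCompressOutput(job, false);
      }
    }

The whole disagreement in the thread comes down to that else branch: whether
Sqoop should override a cluster-wide compression policy when the user didn't
ask for compression.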
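To make the splittability point concrete, here is a small standalone check,
purely illustrative: the file name is hypothetical, and the
SplittableCompressionCodec interface requires a reasonably recent Hadoop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplitCheck {
      public static void main(String[] args) {
        CompressionCodecFactory factory =
            new CompressionCodecFactory(new Configuration());
        // A typical Sqoop text output file name once file-level gzip
        // compression kicks in.
        CompressionCodec codec = factory.getCodec(new Path("part-m-00000.gz"));
        // GzipCodec does not implement SplittableCompressionCodec, so this
        // prints false: each such file feeds exactly one downstream map task,
        // and the mapper count chosen for the import caps later parallelism.
        System.out.println("splittable = "
            + (codec instanceof SplittableCompressionCodec));
      }
    }

(bzip2 is the usual exception: BZip2Codec is splittable, at the cost of much
slower compression.)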
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr