On Sep 6, 2011, at 6:58am, Kate Ting wrote:

> Hi Ken, you make some good points, to which I've added comments individually.
>
> re: the degree of parallelism during the next step of processing is
> constrained by the number of mappers used during sqooping: does
> https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you
> might want to add your comments there.
Thanks for the ref, and yes that would help.

> re: winding up with unsplittable files and heavily skewed sizes: you
> can file separate JIRAs for those if desired.

That's not an issue for Sqoop - rather just how Hadoop works.

> re: partitioning isn't great: for some databases such as Oracle, the
> problem of heavily skewed sizes can be overcome using row-ids, you can
> file a JIRA for that if you feel it's needed.

Again, not really a Sqoop issue. Things are fine with OraOop. When we fall
back to regular Sqoop, we don't have a good column to use for partitioning,
so the results wind up being heavily skewed. But I don't think there's
anything Sqoop could do to easily solve that problem.

Regards,

-- Ken

> On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>>
>> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>>
>>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kkrugler_li...@transpac.com>
>>> wrote:
>>>> Hi there,
>>>>
>>>> The current documentation says:
>>>>
>>>> By default, data is not compressed. You can compress your data by using the
>>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>>> Hadoop compression codec using the --compression-codec argument. This
>>>> applies to both SequenceFiles or text files.
>>>>
>>>> But I think this is a bit misleading.
>>>>
>>>> Currently if output compression is enabled in a cluster, then the Sqooped
>>>> data is always compressed, regardless of the setting of this flag.
>>>>
>>>> It seems better to actually make compression controllable via --compress,
>>>> which means changing ImportJobBase.configureOutputFormat():
>>>>
>>>> if (options.shouldUseCompression()) {
>>>>   FileOutputFormat.setCompressOutput(job, true);
>>>>   FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>>   SequenceFileOutputFormat.setOutputCompressionType(job,
>>>>       CompressionType.BLOCK);
>>>> }
>>>> // new stuff
>>>> else {
>>>>   FileOutputFormat.setCompressOutput(job, false);
>>>> }
>>>>
>>>> Thoughts?
>>>
>>> This is a good point Ken. However, IMO it is better left as is since
>>> there may be a wider cluster management policy in effect that requires
>>> compression for all output files. One way to look at it is that for
>>> normal use, there is a predefined compression scheme configured
>>> cluster wide, and occasionally when required, Sqoop users can use a
>>> different scheme where necessary.
>>
>> The problem is that when you use text files as Sqoop output, these get
>> compressed at the file level by (typically) deflate, gzip or lzo.
>>
>> So you wind up with unsplittable files, which means that the degree of
>> parallelism during the next step of processing is constrained by the
>> number of mappers used during sqooping. But you typically set the number
>> of mappers based on DB load & size of the data set.
>>
>> And if partitioning isn't great, then you also wind up with heavily skewed
>> sizes for these unsplittable files, which makes things even worse.
>>
>> The current work-around is to use binary or Avro output instead of text,
>> but that's an odd requirement to be able to avoid the above problem.
>>
>> If the argument is to avoid implicitly changing the cluster's default
>> compression policy, then I'd suggest supporting a -nocompression flag.
>>
>> Regards,
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
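For reference, here is the change proposed in the quoted thread fleshed out
as a complete method. This is a minimal sketch only: it assumes Sqoop's
existing options.shouldUseCompression() accessor and the stock Hadoop
output-format APIs, and omits the rest of the surrounding ImportJobBase class.

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    protected void configureOutputFormat(Job job) {
      if (options.shouldUseCompression()) {
        // --compress / -z was given: gzip the output, block-compressed
        // in the case of SequenceFiles.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
            CompressionType.BLOCK);
      } else {
        // The proposed addition: explicitly disable compression rather than
        // silently inheriting the cluster-wide output compression default.
        FileOutputFormat.setCompressOutput(job, false);
      }
    }

The whole disagreement in the thread comes down to that else branch: whether
Sqoop should override a cluster-wide compression policy when the user didn't
ask for compression.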
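To make the splittability point concrete, here is a small standalone check,
purely illustrative: the file name is hypothetical, and the
SplittableCompressionCodec interface requires a reasonably recent Hadoop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplitCheck {
      public static void main(String[] args) {
        CompressionCodecFactory factory =
            new CompressionCodecFactory(new Configuration());
        // A typical Sqoop text output file name once file-level gzip
        // compression kicks in.
        CompressionCodec codec = factory.getCodec(new Path("part-m-00000.gz"));
        // GzipCodec does not implement SplittableCompressionCodec, so this
        // prints false: each such file feeds exactly one downstream map task,
        // and the mapper count chosen for the import caps later parallelism.
        System.out.println("splittable = "
            + (codec instanceof SplittableCompressionCodec));
      }
    }

(bzip2 is the usual exception: BZip2Codec is splittable, at the cost of much
slower compression.)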
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr