Hi Zack,

Am I correct in understanding that the files are under a structure like
x/.deflate/csv_file.csv?

In that case, I believe everything under the .deflate directory will
simply be ignored, since directories whose names start with a period
are treated as "hidden" and skipped by the input format.

However, assuming the data under those directories is compressed with
a codec supported on your cluster (e.g. gzip, snappy, deflate), there
shouldn't be a problem using it as input for the CSV import. In other
words, the compression probably isn't the issue, but the directory
naming probably is.
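
On the compression side, the bulk load job reads the files through the
standard Hadoop text input path, which resolves a decompression codec
from the file extension and decompresses on the fly. A rough
illustration (the HDFS path here is just a made-up example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecCheck {
        public static void main(String[] args) {
            // The codec factory matches the file extension (.deflate, .gz,
            // etc.) against the codecs registered on the cluster.
            Configuration conf = new Configuration();
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec =
                factory.getCodec(new Path("hdfs:///x/part-00000.deflate"));
            System.out.println(codec == null ? "no codec registered"
                                             : codec.getClass().getName());
        }
    }

If that prints a codec class for your files' extension, the compressed
data itself should import fine once it isn't sitting under a hidden
directory.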

- Gabriel

On Thu, Sep 29, 2016 at 7:14 PM, Riesland, Zack
<zack.riesl...@sensus.com> wrote:
> For a very long time, we’ve had a workflow that looks like this:
>
> Export data from a compressed ORC Hive table to another Hive table that is
> “external stored as text file”. No compression specified.
>
> Then, we point to the folder “x” behind that new table and use CsvBulkInsert
> to get the data into HBase.
>
> Today, I noticed that the data has not been getting into HBase since late
> August.
>
> After some clicking around, it looks like this is happening because we have
> hive.exec.compress.output set to true, so the data in folder “x” is
> compressed in “.deflate” folders.
>
> However, it looks like someone changed this setting to true 4 months ago.
>
> So we should either be missing 4 months of data, or this should work.
>
> Thus my question: does CSV bulk insert work with compressed output like
> this?
