The version of Hive here is 0.13.1. Timestamps are working fine in Hive with Parquet, though. It looks like the fixes have been backported.
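If you want to double-check the backport on your own cluster, a minimal smoke test along these lines should do it (table and column names are invented, and the INSERT selects from an existing table because Hive 0.13 has no INSERT ... VALUES):

    -- Round-trip a TIMESTAMP through a Parquet-backed table; with the
    -- backported fixes the value should come back unchanged.
    CREATE TABLE ts_check (id INT, ts TIMESTAMP) STORED AS PARQUET;
    INSERT INTO TABLE ts_check
    SELECT 1, CAST('2015-02-04 03:51:00' AS TIMESTAMP)
    FROM some_existing_table LIMIT 1;
    SELECT id, ts FROM ts_check;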
Release notes: http://archive.cloudera.com/cdh5/cdh/5/hive-0.13.1-cdh5.3.0.releasenotes.html

On Wed, Feb 4, 2015 at 3:51 AM, Raviprasad N Pentakota <[email protected]> wrote:

> Hi,
> There were known issues with previous versions of Hive with the Parquet
> and timestamp combination. Please check which version of Hive you are
> using in your cluster.
>
> Regards,
> Ravi Prasad Pentakota
> India Software Lab, IBM Software Group
> Phone: +9180-43328520 Mobile: 919620959477
> e-mail: [email protected]
>
> From: Joshua Baxter <[email protected]>
> To: [email protected]
> Date: 02/03/2015 09:05 PM
> Subject: Re: --as-parquet-file, Oraoop and Decimal and Timestamp types
>
> I've had a little more luck with this after upgrading to CDH 5.3. The
> Oracle direct connector seems to be working well with HCatalog integration
> and the various output file formats. However, it seems that Parquet doesn't
> work with HCatalog integration. When using "stored as parquet" as the
> --hcatalog-storage-stanza, all the mappers are erroring with the below.
>
> 15/02/02 17:17:03 INFO mapreduce.Job: Task Id : attempt_1422914679712_0003_m_000042_1, Status : FAILED
> Error: java.lang.RuntimeException: Should never be used
>     at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:79)
>     at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:103)
>     at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> Has anyone had any luck sqooping directly to Parquet with Decimal and
> Timestamp types?
>
> On Tue, Dec 2, 2014 at 6:17 PM, Joshua Baxter <[email protected]> wrote:
>
> I'm using Sqoop, Oraoop and the --as-parquet-file switch to pull down
> partitions of a large fact table and getting some great speed. There are
> not any columns I can evenly split by with the default connector, but with
> Oraoop I can get evenly sized Parquet files that I can use directly in
> Impala and Hive without incurring remote reads. A couple of things I have
> noticed, though:
>
> - Decimal fields are getting exported as strings. SQOOP-1445 refers to
>   this, but it sounds like a fix isn't planned due to the HCatalog
>   support. Unfortunately the direct connectors, apart from Netezza, are
>   not currently supported.
> - You need to use the option -Doraoop.timestamp.string=false, otherwise
>   you get a Not in union ["long","null"]: 2014-07-24 00:00:00 exception
>   due to the intermediary file format. However, the resulting Parquet
>   column is a double rather than a Hive- or Impala-compatible timestamp
>   (a casting sketch follows below).
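For what it's worth, one way to make such files queryable in the meantime is a view that casts the landed types back. This is only a sketch with invented names, and it assumes the double written with -Doraoop.timestamp.string=false holds seconds since the epoch; verify that against a known row first, and divide by 1000 before the cast if it turns out to be milliseconds.

    -- Sketch only: fact_parquet and its columns are stand-ins for the real
    -- table, and DECIMAL(38,10) is a guessed precision/scale.
    CREATE VIEW fact_v AS
    SELECT
      col1, col2, col3, col4,
      CAST(col5 AS DECIMAL(38,10)) AS col5,  -- decimal landed as string (SQOOP-1445)
      CAST(col6 AS TIMESTAMP)      AS col6   -- Hive reads a double as epoch seconds here
    FROM fact_parquet;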
>
> Here is what I am running now:
>
> sqoop import -Doraoop.chunk.method=ROWID \
>   -Doraoop.timestamp.string=false \
>   -Doraoop.import.partitions=${PARTITION} \
>   --direct \
>   --connect jdbc:oracle:thin:@//${DATABASE} \
>   --table "${TABLE}" \
>   --columns COL1,COL2,COL3,COL4,COL5,COL6 \
>   --map-column-java COL1=Long,COL2=Long,COL3=Long,COL4=Long \
>   --m 48 \
>   --target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \
>   --delete-target-dir
>
> COL1-4 are stored as NUMBER(38,0) but don't hold anything more than the
> size of a long, so I've remapped those to save space. COL5 is a Decimal
> and COL6 is a DATE. Is there any way I can remap these as well so that
> they are written into the Parquet file as DECIMAL and timestamp-compatible
> types respectively, so there isn't a need to redefine these columns?
>
> Many Thanks
>
> Josh
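Until Sqoop and Oraoop can write Parquet DECIMAL and TIMESTAMP natively, a possible stopgap is to declare an external table whose types match what actually lands on disk and put the casting view sketched earlier on top of it. Everything below is illustrative: the table name, the guessed column types, and the partition value in the path.

    -- Sketch only: types mirror what the import actually writes, not what
    -- the Oracle table declares. Point LOCATION at a real partition dir.
    CREATE EXTERNAL TABLE fact_parquet (
      col1 BIGINT, col2 BIGINT, col3 BIGINT, col4 BIGINT,
      col5 STRING,  -- Decimal exported as string (SQOOP-1445)
      col6 DOUBLE   -- DATE exported as double via -Doraoop.timestamp.string=false
    )
    STORED AS PARQUET
    LOCATION '/user/joshba/LANDING_PAD/TABLE-20141201/';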
