Hi,

There were known issues in previous versions of Hive with the Parquet and timestamp combination. Please check which version of Hive you are using in your cluster.

Regards,
Ravi Prasad Pentakota
India Software Lab, IBM Software Group
Phone: +9180-43328520  Mobile: 919620959477
e-mail: [email protected]
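(If it helps, a quick way to confirm the Hive version on a node, assuming the Hive CLI is on the path there, is simply:

    hive --version

which prints the Hive release the cluster is running.)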
From: Joshua Baxter <[email protected]>
To: [email protected]
Date: 02/03/2015 09:05 PM
Subject: Re: --as-parquet-file, Oraoop and Decimal and Timestamp types

I've had a little more luck with this after upgrading to CDH 5.3. The Oracle direct connector seems to be working well with HCatalog integration and the various output file formats. However, it seems that Parquet doesn't work with HCatalog integration. When using "stored as parquet" as the --hcatalog-storage-stanza, all the mappers fail with the error below.

15/02/02 17:17:03 INFO mapreduce.Job: Task Id : attempt_1422914679712_0003_m_000042_1, Status : FAILED
Error: java.lang.RuntimeException: Should never be used
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:79)
    at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:103)
    at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Anyone had any luck sqooping directly to Parquet with Decimal and Timestamp types?

On Tue, Dec 2, 2014 at 6:17 PM, Joshua Baxter <[email protected]> wrote:

I'm using Sqoop, Oraoop and the --as-parquet-file switch to pull down partitions of a large fact table and getting some great speed. There aren't any columns I can evenly split by with the default connector, but with Oraoop I can get evenly sized Parquet files that I can use directly in Impala and Hive without incurring remote reads. A couple of things I have noticed, though.

Decimal fields are getting exported as strings. SQOOP-1445 refers to this, but it sounds like a fix isn't planned due to the HCatalog support. Unfortunately the direct connectors, apart from Netezza, are not currently supported there.

You need to use the option -Doraoop.timestamp.string=false, otherwise you get a Not in union ["long","null"]: 2014-07-24 00:00:00 exception due to the intermediary file format. However, the value in the resulting Parquet file is a double rather than a Hive- or Impala-compatible timestamp.

Here is what I am running now:

sqoop import -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false -Doraoop.import.partitions=${PARTITION} \
  --direct \
  --connect jdbc:oracle:thin:@//${DATABASE} \
  --table "${TABLE}" \
  --columns COL1,COL2,COL3,COL4,COL5,COL6 \
  --map-column-java COL1=Long,COL2=Long,COL3=Long,COL4=Long \
  --m 48 \
  --target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \
  --delete-target-dir

COL1-4 are stored as NUMBER(38,0) but don't hold anything larger than a long, so I've remapped those to save space. COL5 is a Decimal and COL6 is a DATE. Is there any way I can remap these as well so that they are written into the Parquet file as DECIMAL and timestamp-compatible types respectively, so there isn't a need to redefine these columns?

Many Thanks

Josh
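(For completeness, a minimal sketch of the HCatalog variant mentioned above that hits the "Should never be used" error; the Hive database and table names here are placeholders, not the ones from the original job:

    sqoop import \
      --connect jdbc:oracle:thin:@//${DATABASE} \
      --table "${TABLE}" \
      --hcatalog-database default \
      --hcatalog-table fact_table_parquet \
      --create-hcatalog-table \
      --hcatalog-storage-stanza "stored as parquet" \
      --m 48

The same command with a text or ORC storage stanza was the form that worked for the other file formats; only the "stored as parquet" case fails with the stack trace above.)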
