Hi,

There were known issues in previous versions of Hive with the Parquet and timestamp combination. Please check which version of Hive you are using in your cluster.

Regards,
Ravi Prasad Pentakota
India Software Lab, IBM Software Group
Phone: +9180-43328520  Mobile: 919620959477
e-mail: [email protected]
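(If it helps, a quick way to confirm the Hive version on a node, assuming the Hive CLI is on the path there, is simply:

    hive --version

which prints the Hive release the cluster is running.)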
From: Joshua Baxter <[email protected]>
To: [email protected]
Date: 02/03/2015 09:05 PM
Subject: Re: --as-parquet-file, Oraoop and Decimal and Timestamp types

I've had a little more luck with this after upgrading to CDH 5.3. The Oracle direct connector seems to be working well with HCatalog integration and the various output file formats. However, it seems that Parquet doesn't work with HCatalog integration. When using "stored as parquet" as the --hcatalog-storage-stanza, all the mappers fail with the error below.

15/02/02 17:17:03 INFO mapreduce.Job: Task Id : attempt_1422914679712_0003_m_000042_1, Status : FAILED
Error: java.lang.RuntimeException: Should never be used
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:79)
    at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:103)
    at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Anyone had any luck sqooping directly to Parquet with Decimal and Timestamp types?

On Tue, Dec 2, 2014 at 6:17 PM, Joshua Baxter <[email protected]> wrote:

I'm using Sqoop, Oraoop and the --as-parquet-file switch to pull down partitions of a large fact table and getting some great speed. There aren't any columns I can evenly split by with the default connector, but with Oraoop I can get evenly sized Parquet files that I can use directly in Impala and Hive without incurring remote reads. A couple of things I have noticed, though.

Decimal fields are getting exported as strings. SQOOP-1445 refers to this, but it sounds like a fix isn't planned due to the HCatalog support. Unfortunately the direct connectors, apart from Netezza, are not currently supported there.

You need to use the option -Doraoop.timestamp.string=false, otherwise you get a Not in union ["long","null"]: 2014-07-24 00:00:00 exception due to the intermediary file format. However, the value in the resulting Parquet file is a double rather than a Hive- or Impala-compatible timestamp.

Here is what I am running now:

sqoop import -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false -Doraoop.import.partitions=${PARTITION} \
  --direct \
  --connect jdbc:oracle:thin:@//${DATABASE} \
  --table "${TABLE}" \
  --columns COL1,COL2,COL3,COL4,COL5,COL6 \
  --map-column-java COL1=Long,COL2=Long,COL3=Long,COL4=Long \
  --m 48 \
  --target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \
  --delete-target-dir

COL1-4 are stored as NUMBER(38,0) but don't hold anything larger than a long, so I've remapped those to save space. COL5 is a Decimal and COL6 is a DATE. Is there any way I can remap these as well so that they are written into the Parquet file as DECIMAL and timestamp-compatible types respectively, so there isn't a need to redefine these columns?

Many Thanks

Josh
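(For completeness, a minimal sketch of the HCatalog variant mentioned above that hits the "Should never be used" error; the Hive database and table names here are placeholders, not the ones from the original job:

    sqoop import \
      --connect jdbc:oracle:thin:@//${DATABASE} \
      --table "${TABLE}" \
      --hcatalog-database default \
      --hcatalog-table fact_table_parquet \
      --create-hcatalog-table \
      --hcatalog-storage-stanza "stored as parquet" \
      --m 48

The same command with a text or ORC storage stanza was the form that worked for the other file formats; only the "stored as parquet" case fails with the stack trace above.)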
