I'm about to investigate the following situation, but I'd appreciate any
insight that can be given.
We have an external table that is composed of 3 HDFS files.
We then run an INSERT OVERWRITE which is just a SELECT * from the external
table.
The table being overwritten has N buckets.
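For concreteness, the setup can be sketched like this (table names, column, and bucket count are all made up for illustration):

```sql
-- External table over the 3 HDFS files (names are hypothetical)
CREATE EXTERNAL TABLE src_logs (line STRING)
LOCATION '/data/src_logs';

-- Bucketed destination table with N buckets (here N = 32)
CREATE TABLE dst_logs (line STRING)
CLUSTERED BY (line) INTO 32 BUCKETS;

-- Bucketing must be enforced so the overwrite populates all N buckets
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE dst_logs
SELECT * FROM src_logs;
```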
The issue
This can be accomplished with a custom input format.
Here's a snippet of the relevant code in the custom RecordReader:
compressionCodecs = new CompressionCodecFactory(jobConf);
Path file = split.getPath();
// completed from the standard Hadoop LineRecordReader pattern
final CompressionCodec codec = compressionCodecs.getCodec(file);
To be clear, you would then create the table with the clause:
STORED AS
INPUTFORMAT 'your.custom.input.format'
If you make an external table, you'll then be able to point to a directory
(or file) that contains gzipped files, or uncompressed files.
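To make that concrete, the full clause looks roughly like the following (the input format class name is a placeholder; note that Hive expects an OUTPUTFORMAT alongside an explicit INPUTFORMAT):

```sql
CREATE EXTERNAL TABLE mixed_logs (line STRING)
STORED AS
  INPUTFORMAT 'your.custom.input.format'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/mixed_logs';
```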
On Fri, Jan 28, 2011 at 4:52 PM, phil
I found the source code very helpful for this.
There's a custom SerDe in the source, with a test case you can review, which
really speeds up development of your own SerDe:
org.apache.hadoop.hive.contrib.serde2.TestRegexSerDe
One thing to watch out for, though, is that the framework will ...
Sorry to bother you, and thanks a bunch for the help! Forcing me to go read
more about InputFormats is a long-term help anyway.
Pat
*From:* phil young [mailto:phil.wills.yo...@gmail.com]
*Sent:* Friday, January 28, 2011 1:54 PM
*To:* user@hive.apache.org
*Subject:* Re: Custom SerDe
I'm wondering if my configuration/stack is wrong, or if I'm trying to do
something that is not supported in Hive.
My goal is to choose a compression scheme for Hadoop/Hive, and while
comparing configurations I'm finding that I can't get BZip2 or Gzip to work
with the RCFile format.
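For reference, here is roughly what I'm trying (hedged sketch; RCFile's internal compression is driven by these job-level properties rather than by file extensions, and the source table name is hypothetical):

```sql
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;

CREATE TABLE rc_logs (line STRING) STORED AS RCFILE;
INSERT OVERWRITE TABLE rc_logs SELECT * FROM src_logs;
```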
Is that
If you trace the source code, you'll find it's not too hard to change it to
let a user specify a UDF. But that's changing the code...
Ed Capriolo posted a more useful response a while back, on the general Hive
mailing list:
You have the option now to run HQL by creating a hiverc file
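As a sketch, a .hiverc file is just a list of HQL statements the Hive CLI runs at startup; the jar path and class names below are hypothetical:

```sql
-- contents of $HOME/.hiverc (executed by the Hive CLI on startup)
ADD JAR /path/to/my-custom-serde.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUdf';
SET hive.cli.print.header = true;
```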