Thanks John. When you say "hexed" data, do you mean binary encoded to ASCII
hex? That would remove the raw newline characters.

We considered Base64 encoding our data, a similar idea, which would also remove 
raw newlines. But my preference is to put real binary data into Hive, and find 
a way to make this work.
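To illustrate the point being made here, a minimal Python sketch (not Hive itself) of why either encoding sidesteps the newline problem. Both hex and Base64 map arbitrary bytes onto a safe ASCII alphabet, so the encoded form can never contain a raw newline:

```python
import base64
import binascii

# Raw binary payload that happens to contain newline bytes,
# which would break any newline-delimited text format.
raw = b"\x00\x01\nhello\n\xff"

# Hex encoding maps every byte to two ASCII characters [0-9a-f],
# so the encoded form contains no raw newlines (2x size overhead).
encoded = binascii.hexlify(raw)
assert b"\n" not in encoded

# Base64 gives the same guarantee with ~33% overhead instead of 100%.
b64 = base64.b64encode(raw)
assert b"\n" not in b64

# Both round-trip losslessly back to the original bytes.
assert binascii.unhexlify(encoded) == raw
assert base64.b64decode(b64) == raw
```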

Chuck

________________________________
From: John Omernik [[email protected]]
Sent: Saturday, December 01, 2012 4:22 PM
To: [email protected]
Subject: Re: BINARY column type

Hi Chuck -

I've used binary columns with newlines in the data. I used RCFile as my
storage format, and it has worked great so far. Whether or not this is "the"
way to get data in, I use hexed data (my transform script outputs hex-encoded
values), and the final insert into the table applies unhex(sourcedata). It
seems a bit hackish, but it has never been a problem for me and works well.
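The transform side of the workflow John describes might look like the hypothetical sketch below (the function name and sample records are illustrative, not from the thread): the script emits one hex string per logical record, so the intermediate text stream is newline-safe, and Hive's unhex() restores the raw bytes on insert.

```python
import binascii
import sys


def to_hex_field(raw: bytes) -> str:
    """Hex-encode one binary field so the transform output is a single
    safe text line containing no raw newlines or tabs."""
    return binascii.hexlify(raw).decode("ascii")


# Hypothetical transform script: emit one hex-encoded record per line.
# On the Hive side, the matching load would then apply unhex(), e.g.
#   INSERT INTO target SELECT unhex(sourcedata) FROM staging;
if __name__ == "__main__":
    for raw in [b"\x89PNG\r\n\x1a\n", b"plain"]:  # stand-in binary records
        sys.stdout.write(to_hex_field(raw) + "\n")
```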

On Sat, Dec 1, 2012 at 10:50 AM, Connell, Chuck
<[email protected]> wrote:

I am trying to use BINARY columns and believe I have the perfect use-case for 
it, but I am missing something. Has anyone used this for true binary data 
(which may contain newlines)?


Here is the background... I have some files that each contain just one logical 
field, which is a binary object. (The files are Google Protobuf format.) I want 
to put these binary files into a larger file, where each protobuf is a logical 
record. Then I want to define a Hive table that stores each protobuf as one 
row, with the entire protobuf object in one BINARY column. Then I will use a 
custom UDF to select/query the binary object.


This is about as simple as can be for putting binary data into Hive.


What file format should I use to package the binary rows? What should the Hive
table definition be? Which SerDe option should I use (LazySimpleBinary?)? I
cannot use TEXTFILE, since the binary may contain newlines. Many of my
attempts have choked on the newlines.
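A tiny sketch (plain Python, not Hive itself) of why a newline-delimited text format mangles binary records: the reader splits on every \n byte, including ones that are part of the data, so record boundaries no longer line up.

```python
# Two logical records, the first containing an embedded newline byte.
records = [b"\x01\x02\n\x03", b"\x04\x05"]

# A TEXTFILE-style format separates records with \n...
serialized = b"\n".join(records)

# ...but on read, the embedded \n is indistinguishable from a separator.
parsed = serialized.split(b"\n")
assert len(parsed) == 3  # 2 records in, 3 "records" out: data corrupted
```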


Thank you,

Chuck Connell

Nuance

Burlington, MA

