The hex idea is clever. But does this mean that the files you brought into Hive (with a LOAD statement) were essentially ascii (hexed), not raw binary?
________________________________
From: John Omernik [[email protected]]
Sent: Saturday, December 01, 2012 11:58 PM
To: [email protected]
Subject: Re: BINARY column type

No, I didn't remove any newline characters. The newline became 0A. By using perl or python in a transform, if I had "Hi how are you\n" it would become 486920686f772061726520796f75200A. From there it would pass that to the unhex() function in Hive in the insert statement. That allowed me to move the data with newlines around easily, but on the final step (on insert) it would unhex it and put it in as actual binary. No bytes were harmed in the hexing (or unhexing) of my data.

On Sat, Dec 1, 2012 at 4:11 PM, Connell, Chuck <[email protected]> wrote:

Thanks John. When you say "hexed" data, do you mean binary encoded to ASCII hex? This would remove the raw newline characters. We considered Base64 encoding our data, a similar idea, which would also remove raw newlines. But my preference is to put real binary data into Hive, and find a way to make this work.

Chuck

________________________________
From: John Omernik [[email protected]]
Sent: Saturday, December 01, 2012 4:22 PM
To: [email protected]
Subject: Re: BINARY column type

Hi Chuck -

I've used binary columns with newlines in the data. I used RCFile format for my storage method. Works great so far. Whether or not this is "the" way to get data in, I use hexed data (my transform script outputs hex-encoded) and the final insert into the table gets an unhex(sourcedata). That's never been a problem for me; it seems a bit hackish, but it works well.

On Sat, Dec 1, 2012 at 10:50 AM, Connell, Chuck <[email protected]> wrote:

I am trying to use BINARY columns and believe I have the perfect use-case for them, but I am missing something. Has anyone used this for true binary data (which may contain newlines)? Here is the background...
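[Editor's note: John's transform-then-unhex flow can be sketched in Python. This is an illustrative sketch, not his actual script; the function names are made up, and Python's hexlify emits lowercase hex, whereas Hive's unhex() accepts either case.]

```python
import binascii

def transform(record: bytes) -> str:
    """Hex-encode a binary record so it contains no raw newlines.

    The transform script emits one hex string per record; Hive's
    unhex() reverses this on the final INSERT.
    """
    return binascii.hexlify(record).decode("ascii")

def unhex(hexed: str) -> bytes:
    """Python equivalent of Hive's unhex(): recover the original bytes."""
    return binascii.unhexlify(hexed)

original = b"Hi how are you\n"   # contains a raw newline (0x0A)
hexed = transform(original)      # safe to move through newline-delimited text
assert "\n" not in hexed         # no raw newline survives the encoding
assert unhex(hexed) == original  # no bytes harmed in the round trip
```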
I have some files that each contain just one logical field, which is a binary object. (The files are Google Protobuf format.) I want to put these binary files into a larger file, where each protobuf is a logical record. Then I want to define a Hive table that stores each protobuf as one row, with the entire protobuf object in one BINARY column. Then I will use a custom UDF to select/query the binary object. This is about as simple a case as there is for putting binary data into Hive.

What file format should I use to package the binary rows? What should the Hive table definition be? Which SerDe option should I use (LazySimpleBinary?)? I cannot use TEXTFILE, since the binary may contain newlines. Many of my attempts have choked on the newlines.

Thank you,
Chuck Connell
Nuance
Burlington, MA
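[Editor's note: the newline collision Chuck describes can be demonstrated in a few lines of Python. The byte values below are made up for illustration, not taken from a real protobuf.]

```python
# Toy illustration (not Hive code) of why TEXTFILE chokes on binary:
# newline-delimited storage splits records on every 0x0A byte,
# including 0x0A bytes that are part of the binary payload.

record1 = b"\x08\x96\x0a\x01"   # hypothetical protobuf-like bytes; 0x0a is data
record2 = b"\x12\x03abc"

stored = record1 + b"\n" + record2   # one "row" per line, TEXTFILE-style
rows = stored.split(b"\n")           # how a line-oriented reader sees the file

assert len(rows) == 3                # two records came back as three rows
assert rows[0] != record1            # record1 was truncated at its 0x0a byte
```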
