I found the problem. Basically, BytesColumnVector.stringifyValue is broken. I'll update ORC-115.
On Tue, Dec 6, 2016 at 9:31 AM, Owen O'Malley <[email protected]> wrote:

> It looks like your writer is correct. Maybe the
> VectorizedRowBatch.toString is wonky. Can you try printing the output
> using the standard dumper:
>
> % java -jar tools/target/orc-tools-1.2.2-uber.jar data my-file.orc
>
> Thanks,
> Owen
>
> On Tue, Dec 6, 2016 at 8:48 AM, Scott Wells <[email protected]> wrote:
>
>> Thanks, Owen. I'd tried using references but it didn't resolve the
>> issue. Here's the code:
>>
>> ========================================================
>> new File("my-file.orc").delete();
>>
>> Configuration conf = new Configuration();
>> TypeDescription schema = TypeDescription.fromString("struct<x:int,str:string>");
>> Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
>>     OrcFile.writerOptions(conf)
>>         .setSchema(schema));
>>
>> VectorizedRowBatch writeBatch = schema.createRowBatch();
>> LongColumnVector x = (LongColumnVector) writeBatch.cols[0];
>> BytesColumnVector str = (BytesColumnVector) writeBatch.cols[1];
>> for (int r = 0; r < 10; ++r)
>> {
>>     int row = writeBatch.size++;
>>     x.vector[row] = r;
>>     byte[] lastNameBytes = ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>>     str.setRef(row, lastNameBytes, 0, lastNameBytes.length);
>>
>>     // If the batch is full, write it out and start over.
>>     if (writeBatch.size == writeBatch.getMaxSize())
>>     {
>>         writer.addRowBatch(writeBatch);
>>         writeBatch.reset();
>>     }
>> }
>> if (writeBatch.size > 0)
>> {
>>     writer.addRowBatch(writeBatch);
>> }
>> writer.close();
>>
>> Reader reader = OrcFile.createReader(new Path("my-file.orc"),
>>     OrcFile.readerOptions(conf));
>>
>> RecordReader rows = reader.rows();
>> VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
>> while (rows.nextBatch(readBatch))
>> {
>>     System.out.println(readBatch);
>> }
>> rows.close();
>> ========================================================
>>
>> and here's the result of running it:
>>
>> [0, " "]
>> [1, " "]
>> [2, " "]
>> [3, " "]
>> [4, " "]
>> [5, " "]
>> [6, " "]
>> [7, " "]
>> [8, " "]
>> [9, " "]
>>
>> Any idea why the strings are coming back empty? Am I missing something
>> on the reader? For what it's worth, I've tried to put this ORC file into
>> S3 for access via Hive/PrestoDB (using AWS' new Athena service) and it
>> also doesn't like it.
>>
>> Thanks again!
>> Scott
>>
>> On Tue, Dec 6, 2016 at 10:41 AM, Owen O'Malley <[email protected]> wrote:
>>
>>> As an example of why having the code be executable is a good idea, I
>>> noticed that I was dropping the last batch and needed to add:
>>>
>>> if (batch.size != 0) {
>>>     writer.addRowBatch(batch);
>>> }
>>>
>>> before the close.
>>>
>>> .. Owen
>>>
>>> On Tue, Dec 6, 2016 at 8:35 AM, Owen O'Malley <[email protected]> wrote:
>>>
>>>> You need to call setRef on the BytesColumnVectors. The relevant part is:
>>>>
>>>> byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>>>> y.setRef(row, buffer, 0, buffer.length);
>>>>
>>>> I've created a gist with the example modified to do one int and one
>>>> string, here:
>>>>
>>>> https://gist.github.com/omalley/75093e104381ab9d157313993afcbbdf
>>>>
>>>> I realized that we should include the example code in the code base and
>>>> created ORC-116.
>>>>
>>>> .. Owen
>>>>
>>>> On Tue, Dec 6, 2016 at 6:52 AM, Scott Wells <[email protected]> wrote:
>>>>
>>>>> I'm trying to create a little utility to convert CSV files into ORC
>>>>> files. I've noticed that the resulting ORC files don't seem quite
>>>>> correct, though. In an effort to create a simple reproducible test
>>>>> case, I just changed the "Writing/Reading ORC Files" examples here:
>>>>>
>>>>> https://orc.apache.org/docs/core-java.html
>>>>>
>>>>> to create a file based on a pair of strings instead of integers. The
>>>>> first issue I hit is that TypeDescription.fromString() isn't available
>>>>> in 2.1.0, but instead I did the following:
>>>>>
>>>>> TypeDescription schema = TypeDescription.createStruct()
>>>>>     .addField("first", TypeDescription.createString())
>>>>>     .addField("last", TypeDescription.createString());
>>>>>
>>>>> Then I changed the loop as follows:
>>>>>
>>>>> BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
>>>>> BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
>>>>> for (int r = 0; r < 10; ++r)
>>>>> {
>>>>>     String firstName = ("First-" + r).intern();
>>>>>     String lastName = ("Last-" + (r * 3)).intern();
>>>>>     ...
>>>>> }
>>>>>
>>>>> The file writes without errors, and if I write it with no compression,
>>>>> I can see the data using "strings my-file.orc". However, when I then
>>>>> try to read the data back from the file and print out the resulting
>>>>> batches to the console, I get the following:
>>>>>
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>>
>>>>> Any insights about what I may be doing wrong here would be greatly
>>>>> appreciated!
>>>>>
>>>>> Regards,
>>>>> Scott
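[Editor's note: the setRef advice in the thread hinges on reference-vs-copy semantics: setRef keeps a pointer to the caller's buffer, so each row needs a fresh buffer (as Scott's loop correctly allocates), while a copying setter tolerates buffer reuse. A toy, stdlib-only sketch of that distinction; ToyBytesColumn and its methods are hypothetical stand-ins, not the real ORC/Hive classes:]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the reference-vs-copy distinction behind
// BytesColumnVector.setRef; this is NOT the ORC API.
public class ToyBytesColumn {
    private final byte[][] vector = new byte[1024][];

    // Stores a reference: later mutations of buf show through to this row.
    public void setRef(int row, byte[] buf) {
        vector[row] = buf;
    }

    // Stores a copy: buf may safely be reused after this call.
    public void setVal(int row, byte[] buf) {
        vector[row] = Arrays.copyOf(buf, buf.length);
    }

    public String get(int row) {
        return new String(vector[row], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ToyBytesColumn col = new ToyBytesColumn();
        byte[] buffer = "Row-0".getBytes(StandardCharsets.UTF_8);

        col.setRef(0, buffer);
        col.setVal(1, buffer);

        // Reuse the buffer for the "next" row, as a naive loop might:
        buffer[4] = '1';

        System.out.println(col.get(0)); // prints "Row-1" -- row 0 was silently overwritten
        System.out.println(col.get(1)); // prints "Row-0" -- the copy is unaffected
    }
}
```

This is why Scott's writer works: a new byte[] is created on every iteration, so the references stored by setRef never alias each other before addRowBatch runs.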
