It looks like your writer is correct. Maybe the VectorizedRowBatch.toString is wonky. Can you try printing the output using the standard dumper:
% java -jar tools/target/orc-tools-1.2.2-uber.jar data my-file.orc

Thanks,
   Owen

On Tue, Dec 6, 2016 at 8:48 AM, Scott Wells <[email protected]> wrote:

> Thanks, Owen. I'd tried using references but it didn't resolve the
> issue. Here's the code:
>
> ========================================================
> new File("my-file.orc").delete();
>
> Configuration conf = new Configuration();
> TypeDescription schema = TypeDescription.fromString("struct<x:int,str:string>");
> Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
>                                      OrcFile.writerOptions(conf)
>                                             .setSchema(schema));
>
> VectorizedRowBatch writeBatch = schema.createRowBatch();
> LongColumnVector x = (LongColumnVector) writeBatch.cols[0];
> BytesColumnVector str = (BytesColumnVector) writeBatch.cols[1];
> for (int r = 0; r < 10; ++r)
> {
>     int row = writeBatch.size++;
>     x.vector[row] = r;
>     byte[] lastNameBytes = ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>     str.setRef(row, lastNameBytes, 0, lastNameBytes.length);
>
>     // If the batch is full, write it out and start over.
>     if (writeBatch.size == writeBatch.getMaxSize())
>     {
>         writer.addRowBatch(writeBatch);
>         writeBatch.reset();
>     }
> }
> if (writeBatch.size > 0)
> {
>     writer.addRowBatch(writeBatch);
> }
> writer.close();
>
> Reader reader = OrcFile.createReader(new Path("my-file.orc"),
>                                      OrcFile.readerOptions(conf));
>
> RecordReader rows = reader.rows();
> VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
> while (rows.nextBatch(readBatch))
> {
>     System.out.println(readBatch);
> }
> rows.close();
> ========================================================
>
> and here's the result of running it:
>
> [0, " "]
> [1, " "]
> [2, " "]
> [3, " "]
> [4, " "]
> [5, " "]
> [6, " "]
> [7, " "]
> [8, " "]
> [9, " "]
>
> Any idea why the strings are coming back empty? Am I missing something on
> the reader? For what it's worth, I've tried to put this ORC file into S3
> for access via Hive/PrestoDB (using AWS' new Athena service) and it also
> doesn't like it.
>
> Thanks again!
> Scott
>
> On Tue, Dec 6, 2016 at 10:41 AM, Owen O'Malley <[email protected]> wrote:
>
>> As an example of why having the code be executable is a good idea, I
>> noticed that I was dropping the last batch and needed to add:
>>
>>     if (batch.size != 0) {
>>         writer.addRowBatch(batch);
>>     }
>>
>> before the close.
>>
>> .. Owen
>>
>> On Tue, Dec 6, 2016 at 8:35 AM, Owen O'Malley <[email protected]> wrote:
>>
>>> You need to call setRef on the BytesColumnVectors. The relevant part is:
>>>
>>>     byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>>>     y.setRef(row, buffer, 0, buffer.length);
>>>
>>> I've created a gist with the example modified to do one int and one
>>> string, here:
>>>
>>> https://gist.github.com/omalley/75093e104381ab9d157313993afcbbdf
>>>
>>> I realized that we should include the example code in the code base and
>>> created ORC-116.
>>>
>>> .. Owen
>>>
>>> On Tue, Dec 6, 2016 at 6:52 AM, Scott Wells <[email protected]> wrote:
>>>
>>>> I'm trying to create a little utility to convert CSV files into ORC
>>>> files. I've noticed that the resulting ORC files don't seem quite
>>>> correct, though. In an effort to create a simple reproducible test
>>>> case, I just changed the "Writing/Reading ORC Files" examples here:
>>>>
>>>> https://orc.apache.org/docs/core-java.html
>>>>
>>>> to create a file based on a pair of strings instead of integers. The
>>>> first issue I hit is that TypeDescription.fromString() isn't available
>>>> in 2.1.0, but instead I did the following:
>>>>
>>>>     TypeDescription schema = TypeDescription.createStruct()
>>>>         .addField("first", TypeDescription.createString())
>>>>         .addField("last", TypeDescription.createString());
>>>>
>>>> Then I changed the loop as follows:
>>>>
>>>>     BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
>>>>     BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
>>>>     for (int r = 0; r < 10; ++r)
>>>>     {
>>>>         String firstName = ("First-" + r).intern();
>>>>         String lastName = ("Last-" + (r * 3)).intern();
>>>>         ...
>>>>     }
>>>>
>>>> The file writes without errors, and if I write it with no compression,
>>>> I can see the data using "strings my-file.orc". However, when I then
>>>> try to read the data back from the file and print out the resulting
>>>> batches to the console, I get the following:
>>>>
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>>
>>>> Any insights about what I may be doing wrong here would be greatly
>>>> appreciated!
>>>>
>>>> Regards,
>>>> Scott
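
If VectorizedRowBatch.toString() really is the culprit, another way to check the data (besides the orc-tools dump above) is to print the values straight out of the column vectors. A rough sketch along the lines of the reader code quoted above, reusing the same conf and file name and assuming the simple case (no null and no repeating values):

    Reader reader = OrcFile.createReader(new Path("my-file.orc"),
                                         OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows();
    VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
    LongColumnVector xCol = (LongColumnVector) readBatch.cols[0];
    BytesColumnVector strCol = (BytesColumnVector) readBatch.cols[1];
    while (rows.nextBatch(readBatch))
    {
        for (int row = 0; row < readBatch.size; ++row)
        {
            // BytesColumnVector stores each value as a (buffer, start, length)
            // reference, so rebuild the String from those public fields rather
            // than relying on VectorizedRowBatch.toString().
            String value = new String(strCol.vector[row], strCol.start[row],
                                      strCol.length[row], StandardCharsets.UTF_8);
            System.out.println(xCol.vector[row] + ", " + value);
        }
    }
    rows.close();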

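On the write side, the point of Owen's earlier fix is that a BytesColumnVector never holds the String itself; each row has to be handed bytes through setRef() (which stores a reference to the caller's buffer) or setVal() (which copies them). A minimal sketch of a hypothetical setString helper, not part of the ORC API, just to illustrate the copying variant:

    // Hypothetical helper: store a String in a BytesColumnVector row by copying.
    static void setString(BytesColumnVector col, int row, String value)
    {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        // setVal copies the bytes into the vector's own buffer, so the byte[]
        // can be reused afterwards (older storage-api versions may need an
        // explicit col.initBuffer() before the first setVal call). setRef, as
        // used in the gist, only records a reference, so that buffer must stay
        // untouched until the batch has been written out.
        col.setVal(row, bytes, 0, bytes.length);
    }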