It looks like your writer is correct. Maybe the VectorizedRowBatch.toString is wonky. Can you try printing the output using the standard dumper:
% java -jar tools/target/orc-tools-1.2.2-uber.jar data my-file.orc

Thanks,
   Owen

On Tue, Dec 6, 2016 at 8:48 AM, Scott Wells <[email protected]> wrote:

> Thanks, Owen. I'd tried using references but it didn't resolve the
> issue. Here's the code:
>
> ========================================================
> new File("my-file.orc").delete();
>
> Configuration conf = new Configuration();
> TypeDescription schema = TypeDescription.fromString("struct<x:int,str:string>");
> Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
>                                      OrcFile.writerOptions(conf)
>                                             .setSchema(schema));
>
> VectorizedRowBatch writeBatch = schema.createRowBatch();
> LongColumnVector x = (LongColumnVector) writeBatch.cols[0];
> BytesColumnVector str = (BytesColumnVector) writeBatch.cols[1];
> for (int r = 0; r < 10; ++r)
> {
>     int row = writeBatch.size++;
>     x.vector[row] = r;
>     byte[] lastNameBytes = ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>     str.setRef(row, lastNameBytes, 0, lastNameBytes.length);
>
>     // If the batch is full, write it out and start over.
>     if (writeBatch.size == writeBatch.getMaxSize())
>     {
>         writer.addRowBatch(writeBatch);
>         writeBatch.reset();
>     }
> }
> if (writeBatch.size > 0)
> {
>     writer.addRowBatch(writeBatch);
> }
> writer.close();
>
> Reader reader = OrcFile.createReader(new Path("my-file.orc"),
>                                      OrcFile.readerOptions(conf));
>
> RecordReader rows = reader.rows();
> VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
> while (rows.nextBatch(readBatch))
> {
>     System.out.println(readBatch);
> }
> rows.close();
> ========================================================
>
> and here's the result of running it:
>
> [0, " "]
> [1, " "]
> [2, " "]
> [3, " "]
> [4, " "]
> [5, " "]
> [6, " "]
> [7, " "]
> [8, " "]
> [9, " "]
>
> Any idea why the strings are coming back empty? Am I missing something on
> the reader? For what it's worth, I've tried to put this ORC file into S3
> for access via Hive/PrestoDB (using AWS' new Athena service) and it also
> doesn't like it.
>
> Thanks again!
> Scott
>
> On Tue, Dec 6, 2016 at 10:41 AM, Owen O'Malley <[email protected]> wrote:
>
>> As an example of why having the code be executable is a good idea, I
>> noticed that I was dropping the last batch and needed to add:
>>
>>     if (batch.size != 0) {
>>         writer.addRowBatch(batch);
>>     }
>>
>> before the close.
>>
>> .. Owen
>>
>> On Tue, Dec 6, 2016 at 8:35 AM, Owen O'Malley <[email protected]> wrote:
>>
>>> You need to call setRef on the BytesColumnVectors. The relevant part is:
>>>
>>>     byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>>>     y.setRef(row, buffer, 0, buffer.length);
>>>
>>> I've created a gist with the example modified to do one int and one
>>> string, here:
>>>
>>> https://gist.github.com/omalley/75093e104381ab9d157313993afcbbdf
>>>
>>> I realized that we should include the example code in the code base and
>>> created ORC-116.
>>>
>>> .. Owen
>>>
>>> On Tue, Dec 6, 2016 at 6:52 AM, Scott Wells <[email protected]> wrote:
>>>
>>>> I'm trying to create a little utility to convert CSV files into ORC
>>>> files. I've noticed that the resulting ORC files don't seem quite
>>>> correct, though. In an effort to create a simple reproducible test
>>>> case, I just changed the "Writing/Reading ORC Files" examples here:
>>>>
>>>> https://orc.apache.org/docs/core-java.html
>>>>
>>>> to create a file based on a pair of strings instead of integers. The
>>>> first issue I hit is that TypeDescription.fromString() isn't available
>>>> in 2.1.0, but instead I did the following:
>>>>
>>>>     TypeDescription schema = TypeDescription.createStruct()
>>>>         .addField("first", TypeDescription.createString())
>>>>         .addField("last", TypeDescription.createString());
>>>>
>>>> Then I changed the loop as follows:
>>>>
>>>>     BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
>>>>     BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
>>>>     for (int r = 0; r < 10; ++r)
>>>>     {
>>>>         String firstName = ("First-" + r).intern();
>>>>         String lastName = ("Last-" + (r * 3)).intern();
>>>>         ...
>>>>     }
>>>>
>>>> The file writes without errors, and if I write it with no compression,
>>>> I can see the data using "strings my-file.orc". However, when I then
>>>> try to read the data back from the file and print out the resulting
>>>> batches to the console, I get the following:
>>>>
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>> [" ", " "]
>>>>
>>>> Any insights about what I may be doing wrong here would be greatly
>>>> appreciated!
>>>>
>>>> Regards,
>>>> Scott
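
If VectorizedRowBatch.toString() really is the culprit, another way to check the data (besides the orc-tools dump above) is to print the values straight out of the column vectors. A rough sketch along the lines of the reader code quoted above, reusing the same conf and file name and assuming the simple case (no null and no repeating values):

    Reader reader = OrcFile.createReader(new Path("my-file.orc"),
                                         OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows();
    VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
    LongColumnVector xCol = (LongColumnVector) readBatch.cols[0];
    BytesColumnVector strCol = (BytesColumnVector) readBatch.cols[1];
    while (rows.nextBatch(readBatch))
    {
        for (int row = 0; row < readBatch.size; ++row)
        {
            // BytesColumnVector stores each value as a (buffer, start, length)
            // reference, so rebuild the String from those public fields rather
            // than relying on VectorizedRowBatch.toString().
            String value = new String(strCol.vector[row], strCol.start[row],
                                      strCol.length[row], StandardCharsets.UTF_8);
            System.out.println(xCol.vector[row] + ", " + value);
        }
    }
    rows.close();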

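On the write side, the point of Owen's earlier fix is that a BytesColumnVector never holds the String itself; each row has to be handed bytes through setRef() (which stores a reference to the caller's buffer) or setVal() (which copies them). A minimal sketch of a hypothetical setString helper, not part of the ORC API, just to illustrate the copying variant:

    // Hypothetical helper: store a String in a BytesColumnVector row by copying.
    static void setString(BytesColumnVector col, int row, String value)
    {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        // setVal copies the bytes into the vector's own buffer, so the byte[]
        // can be reused afterwards (older storage-api versions may need an
        // explicit col.initBuffer() before the first setVal call). setRef, as
        // used in the gist, only records a reference, so that buffer must stay
        // untouched until the batch has been written out.
        col.setVal(row, bytes, 0, bytes.length);
    }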