I found the problem. Basically, BytesColumnVector.stringifyValue is broken. I'll update ORC-115.
On Tue, Dec 6, 2016 at 9:31 AM, Owen O'Malley <[email protected]> wrote:

> It looks like your writer is correct. Maybe the
> VectorizedRowBatch.toString is wonky. Can you try printing the output
> using the standard dumper:
>
> % java -jar tools/target/orc-tools-1.2.2-uber.jar data my-file.orc
>
> Thanks,
> Owen
>
> On Tue, Dec 6, 2016 at 8:48 AM, Scott Wells <[email protected]> wrote:
>
>> Thanks, Owen. I'd tried using references but it didn't resolve the
>> issue. Here's the code:
>>
>> ========================================================
>> new File("my-file.orc").delete();
>>
>> Configuration conf = new Configuration();
>> TypeDescription schema = TypeDescription.fromString("struct<x:int,str:string>");
>> Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
>>     OrcFile.writerOptions(conf)
>>         .setSchema(schema));
>>
>> VectorizedRowBatch writeBatch = schema.createRowBatch();
>> LongColumnVector x = (LongColumnVector) writeBatch.cols[0];
>> BytesColumnVector str = (BytesColumnVector) writeBatch.cols[1];
>> for (int r = 0; r < 10; ++r)
>> {
>>     int row = writeBatch.size++;
>>     x.vector[row] = r;
>>     byte[] lastNameBytes = ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>>     str.setRef(row, lastNameBytes, 0, lastNameBytes.length);
>>
>>     // If the batch is full, write it out and start over.
>>     if (writeBatch.size == writeBatch.getMaxSize())
>>     {
>>         writer.addRowBatch(writeBatch);
>>         writeBatch.reset();
>>     }
>> }
>> if (writeBatch.size > 0)
>> {
>>     writer.addRowBatch(writeBatch);
>> }
>> writer.close();
>>
>> Reader reader = OrcFile.createReader(new Path("my-file.orc"),
>>     OrcFile.readerOptions(conf));
>>
>> RecordReader rows = reader.rows();
>> VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
>> while (rows.nextBatch(readBatch))
>> {
>>     System.out.println(readBatch);
>> }
>> rows.close();
>> ========================================================
>>
>> and here's the result of running it:
>>
>> [0, " "]
>> [1, " "]
>> [2, " "]
>> [3, " "]
>> [4, " "]
>> [5, " "]
>> [6, " "]
>> [7, " "]
>> [8, " "]
>> [9, " "]
>>
>> Any idea why the strings are coming back empty? Am I missing something
>> on the reader? For what it's worth, I've tried to put this ORC file into
>> S3 for access via Hive/PrestoDB (using AWS' new Athena service) and it
>> also doesn't like it.
>>
>> Thanks again!
>> Scott
>>
>> On Tue, Dec 6, 2016 at 10:41 AM, Owen O'Malley <[email protected]> wrote:
>>
>>> As an example of why having the code be executable is a good idea, I
>>> noticed that I was dropping the last batch and needed to add:
>>>
>>> if (batch.size != 0) {
>>>     writer.addRowBatch(batch);
>>> }
>>>
>>> before the close.
>>>
>>> .. Owen
>>>
>>> On Tue, Dec 6, 2016 at 8:35 AM, Owen O'Malley <[email protected]> wrote:
>>>
>>>> You need to call setRef on the BytesColumnVectors. The relevant part is:
>>>>
>>>> byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>>>> y.setRef(row, buffer, 0, buffer.length);
>>>>
>>>> I've created a gist with the example modified to do one int and one
>>>> string, here:
>>>>
>>>> https://gist.github.com/omalley/75093e104381ab9d157313993afcbbdf
>>>>
>>>> I realized that we should include the example code in the code base and
>>>> created ORC-116.
>>>>
>>>> .. Owen
>>>>
>>>> On Tue, Dec 6, 2016 at 6:52 AM, Scott Wells <[email protected]> wrote:
>>>>
>>>>> I'm trying to create a little utility to convert CSV files into ORC
>>>>> files. I've noticed that the resulting ORC files don't seem quite
>>>>> correct, though. In an effort to create a simple reproducible test
>>>>> case, I just changed the "Writing/Reading ORC Files" examples here:
>>>>>
>>>>> https://orc.apache.org/docs/core-java.html
>>>>>
>>>>> to create a file based on a pair of strings instead of integers. The
>>>>> first issue I hit is that TypeDescription.fromString() isn't available
>>>>> in 2.1.0, but instead I did the following:
>>>>>
>>>>> TypeDescription schema = TypeDescription.createStruct()
>>>>>     .addField("first", TypeDescription.createString())
>>>>>     .addField("last", TypeDescription.createString());
>>>>>
>>>>> Then I changed the loop as follows:
>>>>>
>>>>> BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
>>>>> BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
>>>>> for (int r = 0; r < 10; ++r)
>>>>> {
>>>>>     String firstName = ("First-" + r).intern();
>>>>>     String lastName = ("Last-" + (r * 3)).intern();
>>>>>     ...
>>>>> }
>>>>>
>>>>> The file writes without errors, and if I write it with no compression,
>>>>> I can see the data using "strings my-file.orc". However, when I then
>>>>> try to read the data back from the file and print out the resulting
>>>>> batches to the console, I get the following:
>>>>>
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>> [" ", " "]
>>>>>
>>>>> Any insights about what I may be doing wrong here would be greatly
>>>>> appreciated!
>>>>>
>>>>> Regards,
>>>>> Scott
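[Editor's note: the setRef advice in the thread hinges on reference-vs-copy semantics: setRef keeps a pointer to the caller's buffer, so each row needs a fresh buffer (as Scott's loop correctly allocates), while a copying setter tolerates buffer reuse. A toy, stdlib-only sketch of that distinction; ToyBytesColumn and its methods are hypothetical stand-ins, not the real ORC/Hive classes:]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the reference-vs-copy distinction behind
// BytesColumnVector.setRef; this is NOT the ORC API.
public class ToyBytesColumn {
    private final byte[][] vector = new byte[1024][];

    // Stores a reference: later mutations of buf show through to this row.
    public void setRef(int row, byte[] buf) {
        vector[row] = buf;
    }

    // Stores a copy: buf may safely be reused after this call.
    public void setVal(int row, byte[] buf) {
        vector[row] = Arrays.copyOf(buf, buf.length);
    }

    public String get(int row) {
        return new String(vector[row], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ToyBytesColumn col = new ToyBytesColumn();
        byte[] buffer = "Row-0".getBytes(StandardCharsets.UTF_8);

        col.setRef(0, buffer);
        col.setVal(1, buffer);

        // Reuse the buffer for the "next" row, as a naive loop might:
        buffer[4] = '1';

        System.out.println(col.get(0)); // prints "Row-1" -- row 0 was silently overwritten
        System.out.println(col.get(1)); // prints "Row-0" -- the copy is unaffected
    }
}
```

This is why Scott's writer works: a new byte[] is created on every iteration, so the references stored by setRef never alias each other before addRowBatch runs.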
