Re: Unable to write string data into ORC file (or at least read it back)

Scott Wells Tue, 06 Dec 2016 08:49:54 -0800

Thanks, Owen.  I'd already tried using references (setRef), but that didn't
resolve the issue.  Here's the code:


========================================================
new File("my-file.orc").delete();

Configuration conf = new Configuration();
TypeDescription schema =
    TypeDescription.fromString("struct<x:int,str:string>");
Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema));

VectorizedRowBatch writeBatch = schema.createRowBatch();
LongColumnVector x = (LongColumnVector) writeBatch.cols[0];
BytesColumnVector str = (BytesColumnVector) writeBatch.cols[1];
for (int r = 0; r < 10; ++r)
{
    int row = writeBatch.size++;
    x.vector[row] = r;
    byte[] lastNameBytes =
        ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
    str.setRef(row, lastNameBytes, 0, lastNameBytes.length);

    // If the batch is full, write it out and start over.
    if (writeBatch.size == writeBatch.getMaxSize())
    {
        writer.addRowBatch(writeBatch);
        writeBatch.reset();
    }
}
if (writeBatch.size > 0)
{
    writer.addRowBatch(writeBatch);
}
writer.close();

Reader reader = OrcFile.createReader(new Path("my-file.orc"),
    OrcFile.readerOptions(conf));

RecordReader rows = reader.rows();
VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
while (rows.nextBatch(readBatch))
{
    System.out.println(readBatch);
}
rows.close();
========================================================

and here's the result of running it:

[0, "        "]
[1, "        "]
[2, "        "]
[3, "        "]
[4, "         "]
[5, "         "]
[6, "         "]
[7, "         "]
[8, "         "]
[9, "         "]

Any idea why the strings are coming back empty?  Am I missing something on
the reader?  For what it's worth, I also put this ORC file into S3 for
access via Hive/PrestoDB (using AWS's new Athena service), and that doesn't
like it either.
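
One way to narrow this down is to decode each string from its own backing
array, offset, and length rather than relying on how the whole batch is
printed.  The following is a minimal sketch of that idea, not ORC code: the
Column class is a hypothetical stand-in that mirrors the public vector, start,
and length arrays of BytesColumnVector so that the example runs without ORC on
the classpath, and the names BytesColumnSketch and decode are made up for
illustration.

```java
import java.nio.charset.StandardCharsets;

public class BytesColumnSketch {
    // Hypothetical stand-in for BytesColumnVector: three parallel arrays
    // holding, per row, the backing byte array, the offset, and the length.
    static final class Column {
        final byte[][] vector;
        final int[] start;
        final int[] length;

        Column(int size) {
            vector = new byte[size][];
            start = new int[size];
            length = new int[size];
        }

        // Like setRef: keeps a reference to the caller's array, no copy.
        void setRef(int row, byte[] source, int offset, int len) {
            vector[row] = source;
            start[row] = offset;
            length[row] = len;
        }
    }

    // Decode one row using that row's own array, offset, and length.
    static String decode(Column col, int row) {
        return new String(col.vector[row], col.start[row], col.length[row],
            StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Column str = new Column(10);
        for (int r = 0; r < 10; ++r) {
            byte[] bytes = ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
            str.setRef(r, bytes, 0, bytes.length);
        }
        for (int r = 0; r < 10; ++r) {
            System.out.println("[" + r + ", \"" + decode(str, r) + "\"]");
        }
    }
}
```

Against the real reader, the equivalent check would presumably cast
readBatch.cols[1] to BytesColumnVector inside the nextBatch loop and decode
rows 0 through readBatch.size - 1 the same way, using row 0 when isRepeating
is set and skipping rows where isNull[row] is true.  If the strings decode
correctly that way, the data is in the file and the blank output would point
at the printing path instead.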

Thanks again!
Scott

On Tue, Dec 6, 2016 at 10:41 AM, Owen O'Malley wrote:

> As an example of why having the code be executable is a good idea, I
> noticed that I was dropping the last batch and needed to add:
>
> if (batch.size != 0) {
>   writer.addRowBatch(batch);
> }
>
> before the close.
>
> .. Owen
>
> On Tue, Dec 6, 2016 at 8:35 AM, Owen O'Malley wrote:
>
>> You need to call setRef on the BytesColumnVectors. The relevant part is:
>>
>> byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
>> y.setRef(row, buffer, 0, buffer.length);
>>
>> I've created a gist with the example modified to do one int and one
>> string, here:
>>
>> https://gist.github.com/omalley/75093e104381ab9d157313993afcbbdf
>>
>> I realized that we should include the example code in the code base and
>> created ORC-116.
>>
>> .. Owen
>>
>> On Tue, Dec 6, 2016 at 6:52 AM, Scott Wells wrote:
>>
>>> I'm trying to create a little utility to convert CSV files into ORC
>>> files.  I've noticed that the resulting ORC files don't seem quite correct,
>>> though.  In an effort to create a simple reproducible test case, I just
>>> changed the "Writing/Reading ORC Files" examples here:
>>>
>>> https://orc.apache.org/docs/core-java.html
>>>
>>> to create a file based on a pair of strings instead of integers.  The
>>> first issue I hit is that TypeDescription.fromString() isn't available in
>>> 2.1.0, so instead I did the following:
>>>
>>>         TypeDescription schema = TypeDescription.createStruct()
>>>             .addField("first", TypeDescription.createString())
>>>             .addField("last", TypeDescription.createString());
>>>
>>> Then I changed the loop as follows:
>>>
>>>         BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
>>>         BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
>>>         for (int r = 0; r < 10; ++r)
>>>         {
>>>             String firstName = ("First-" + r).intern();
>>>             String lastName = ("Last-" + (r * 3)).intern();
>>>             ...
>>>         }
>>>
>>> The file writes without errors, and if I write it with no compression, I
>>> can see the data using "strings my-file.orc".  However, when I then try to
>>> read the data back from the file and print out the resulting batches to the
>>> console, I get the following:
>>>
>>> ["       ", "      "]
>>> ["       ", "      "]
>>> ["       ", "      "]
>>> ["       ", "      "]
>>> ["       ", "       "]
>>> ["       ", "       "]
>>> ["       ", "       "]
>>> ["       ", "       "]
>>> ["       ", "       "]
>>> ["       ", "       "]
>>>
>>> Any insights about what I may be doing wrong here would be greatly
>>> appreciated!
>>>
>>> Regards,
>>> Scott
>>>
>>
>>
>
