Hi Andrey,

This is older code; you can adjust it to a more up-to-date version. As
far as I know, ORC does not have anything like what you are looking for,
where Hadoop is separated from ORC. Maybe the C++ version has that. I
tried to get the Hadoop dependency out of ORC years ago but never
finished the project. Generally speaking, it is the non-Java
implementations of these columnar formats that are Hadoop/HDFS-free: I
use Parquet from C#, which does exactly that. Not sure about ORC.


package com.streambright.orcdemo;

import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.CompressionKind;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.UUID;


public class App {

    private static final Configuration conf = new Configuration();
    private static final OrcFile.WriterOptions orcOptions = OrcFile.writerOptions(conf);
    private static final OrcFile.Version version = OrcFile.Version.V_0_12;
    private static final CompressionKind compression = CompressionKind.ZLIB;

    // Plain Java bean; the reflection ObjectInspector derives the ORC schema
    // (one int column, two string columns) from its fields.
    public static class OrcRow {
        Integer col1;
        String col2;
        String col3;

        OrcRow(int a, String b, String c) {
            this.col1 = a;
            this.col2 = b;
            this.col3 = c;
        }
    }

    public static void main(String[] args) {

        String path = "/tmp/orcfile.orc";

        try {
            FileSystem fs = FileSystem.getLocal(conf); // only needed for reading back

            ObjectInspector objInspector = ObjectInspectorFactory.getReflectionObjectInspector(
                    OrcRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

            Path localPath = new Path(path);

            orcOptions.inspector(objInspector)
                    .stripeSize(8388608)   // 8 MiB stripes
                    .bufferSize(8388608)
                    .blockPadding(true)
                    .bloomFilterColumns("col1")
                    .compress(compression)
                    .version(version);

            Writer writer = OrcFile.createWriter(localPath, orcOptions);
            // Reader reader = OrcFile.createReader(fs, localPath);

            for (int i = 1; i < 1100000; i++) {
                writer.addRow(new OrcRow(i, UUID.randomUUID().toString(), "orcFile"));
            }
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
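
On the byte-buffer question specifically: if I remember correctly, newer
orc-core releases (1.4+, under the org.apache.orc package rather than the
Hive one) let you plug in your own PhysicalWriter through
OrcFile.WriterOptions, which is how some engines write ORC to memory, so
it is worth checking the version you are on. Below is a rough, untested
sketch of the same program against that newer batch-based API (class and
method names are from memory, so double-check them against your orc-core
version):

```java
package com.streambright.orcdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class NewApiApp {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Schema is declared explicitly instead of reflected from a bean.
        TypeDescription schema = TypeDescription.fromString(
                "struct<col1:int,col2:string,col3:string>");

        Writer writer = OrcFile.createWriter(
                new Path("/tmp/orcfile-new.orc"),
                OrcFile.writerOptions(conf)
                        .setSchema(schema)
                        .compress(CompressionKind.ZLIB)
                        .bloomFilterColumns("col1"));

        // Rows are buffered into a VectorizedRowBatch and flushed in chunks.
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector col1 = (LongColumnVector) batch.cols[0];
        BytesColumnVector col2 = (BytesColumnVector) batch.cols[1];
        BytesColumnVector col3 = (BytesColumnVector) batch.cols[2];

        for (int i = 1; i < 1100000; i++) {
            int row = batch.size++;
            col1.vector[row] = i;
            byte[] uuid = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
            col2.setRef(row, uuid, 0, uuid.length);
            byte[] tag = "orcFile".getBytes(StandardCharsets.UTF_8);
            col3.setRef(row, tag, 0, tag.length);
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        if (batch.size != 0) {
            writer.addRowBatch(batch); // flush the final partial batch
        }
        writer.close();
    }
}
```

The batch API avoids per-row object churn, which matters at a million
rows; the reflection-inspector style above is simpler but slower.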

Regards,
Istvan


On Tue, Jan 19, 2021 at 8:20 PM Andrey Elenskiy <andrey.elens...@arista.com>
wrote:

> Hello, currently there's only a single implementation of PhysicalWriter
> that I was able to find -- PhysicalFSWriter, which only gives the option
> to write to HDFS.
>
> I'd like to reuse the ORC file format for my own purposes without the
> destination being HDFS, but just some byte buffer where I can decide myself
> where the bytes end up being saved.
>
> I've started implementing PhysicalWriter, but it seems like a lot of it
> just ends up being copied over from PhysicalFSWriter, which seems redundant.
> So, I'm wondering if maybe something already exists to achieve my goal of
> just writing resulting columns to DataOutputStream (maybe there's some
> unofficial Java library or I'm missing some obvious official API).
>
> Thanks,
> Andrey
>


-- 
the sun shines for all
