Hello,

Beam does not yet support reading ORC files directly. You can track the progress of this issue here:
https://issues.apache.org/jira/browse/BEAM-1861
You are better off using HCatalogIO than JdbcIO (its input splitting should be better).

On Mon, Dec 18, 2017 at 4:17 AM, Allan Wilson <[email protected]> wrote:
> Hi,
>
> Is there anyway to read ORC files from HDFS directly using Apache Beam?
>
> I’m looking at loading up Kafka with data stored in ORC files backing Hive
> tables.
>
> After doing some research it doesn’t look possible, but I thought I ask to
> make sure.
>
> It may be possible to use jdbc or hcatalog to query the data out, but I’d
> rather scale out by pulling the data straight from the datanodes.
>
> The runner I’m using is Spark 1.6.3 on the HDP 2.6.2 distro.
