Re: Reading from ORC Files in HDFS

2017-12-19 Thread Allan Wilson
Had a feeling that would be the answer, but being new to Beam I wanted to make
sure I wasn’t missing something. :)


Thanks Ismael



On 12/18/17, 3:07 AM, "Ismaël Mejía"  wrote:

>Hello,
>
>There is no support yet for reading ORC files directly in Beam. You can
>track the progress of this issue here:
>https://issues.apache.org/jira/browse/BEAM-1861
>
>You are better off using HCatalogIO than JdbcIO (it should split the read better).
>
>
>
>
>On Mon, Dec 18, 2017 at 4:17 AM, Allan Wilson  wrote:
>> Hi,
>>
>> Is there any way to read ORC files from HDFS directly using Apache Beam?
>>
>> I’m looking at loading up Kafka with data stored in ORC files backing Hive
>> tables.
>>
>> After doing some research it doesn’t look possible, but I thought I’d ask to
>> make sure.
>>
>> It may be possible to use jdbc or hcatalog to query the data out, but I’d
>> rather scale out by pulling the data straight from the datanodes.
>>
>> The runner I’m using is Spark 1.6.3 on the HDP 2.6.2 distro.
>>
>>
>>
>>
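
The HCatalogIO route suggested above could look roughly like the sketch
below: read the Hive table through the metastore, flatten each record to a
line of text, and write the lines to Kafka. This is only an illustration,
not something from the thread. The metastore URI, database, table, broker
address, and topic are hypothetical placeholders, and the helper names
metastoreConfig and toLine are made up for the example. It assumes the
beam-sdks-java-io-hcatalog and beam-sdks-java-io-kafka modules are on the
classpath and has not been checked against Spark 1.6.3 on HDP 2.6.2.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.kafka.common.serialization.StringSerializer;

class OrcTableToKafka {

  // Hypothetical helper: builds the config map HCatalogIO uses to reach the metastore.
  static Map<String, String> metastoreConfig(String metastoreUri) {
    Map<String, String> config = new HashMap<>();
    config.put("hive.metastore.uris", metastoreUri);
    return config;
  }

  // Hypothetical helper: joins a record's field values into one tab-separated line.
  static String toLine(List<Object> fields) {
    return fields.stream()
        .map(f -> String.valueOf(f))
        .collect(Collectors.joining("\t"));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadHiveTable", HCatalogIO.read()
            // Placeholder metastore address; replace with the cluster's own.
            .withConfigProperties(metastoreConfig("thrift://metastore-host:9083"))
            .withDatabase("default")
            .withTable("my_orc_table")) // placeholder table name
        .apply("ToLine", MapElements.into(TypeDescriptors.strings())
            .via((HCatRecord r) -> toLine(r.getAll())))
        .apply("WriteToKafka", KafkaIO.<Void, String>write()
            .withBootstrapServers("broker-1:9092") // placeholder broker
            .withTopic("my_topic")                 // placeholder topic
            .withValueSerializer(StringSerializer.class)
            .values()); // write values only, no keys

    p.run().waitUntilFinish();
  }
}
```

The runner is picked from the command-line options (for example
--runner=SparkRunner), so the same pipeline code should not need to change
between a local test and the cluster. A tab-separated line is just the
simplest payload to show; a real job would more likely serialize records
as Avro or JSON.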


Reading from ORC Files in HDFS

2017-12-17 Thread Allan Wilson
Hi,

Is there any way to read ORC files from HDFS directly using Apache Beam?

I’m looking at loading up Kafka with data stored in ORC files backing Hive 
tables.

After doing some research it doesn’t look possible, but I thought I’d ask to make
sure.

It may be possible to use jdbc or hcatalog to query the data out, but I’d 
rather scale out by pulling the data straight from the datanodes.

The runner I’m using is Spark 1.6.3 on the HDP 2.6.2 distro.