If you are using HDFS, we could add an API to allow you to pass in a
FSDataInputStream
<https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FSDataInputStream.html>,
which is a subclass of InputStream. That class is returned from Hadoop's
fs.open(path)
<https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#open-org.apache.hadoop.fs.Path->
and it does allow the reader to do positioned reads in the stream. I have
an additional concern: there is a set of users who would really like to
have an ORC reader without a dependence on Hadoop, so I'm hesitant to add
yet another Hadoop class to the API.
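
In the meantime, the temporary-file option you mentioned can be sketched
with only the JDK. This is a rough illustration, not a proposed ORC API:
the class name SpooledPositionedReader and its methods are hypothetical.
It copies the InputStream to a local temp file once and then serves
positioned reads into ByteBuffers through a FileChannel, which is the
kind of access the reader needs.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper; not part of ORC or Hadoop. */
final class SpooledPositionedReader implements AutoCloseable {
  private final Path tempFile;
  private final FileChannel channel;

  private SpooledPositionedReader(Path tempFile, FileChannel channel) {
    this.tempFile = tempFile;
    this.channel = channel;
  }

  /** Copies the whole stream to a temp file and opens it for reading. */
  static SpooledPositionedReader spool(InputStream in) throws IOException {
    Path tmp = Files.createTempFile("orc-spool-", ".orc");
    try {
      // createTempFile already created the file, so allow overwriting it.
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ);
      return new SpooledPositionedReader(tmp, ch);
    } catch (IOException | RuntimeException e) {
      Files.deleteIfExists(tmp);
      throw e;
    }
  }

  /** Total number of bytes that were spooled. */
  long length() throws IOException {
    return channel.size();
  }

  /**
   * Fills dst starting at the given absolute position. The channel's own
   * position is not moved, so calls are independent of each other.
   */
  void readFully(long position, ByteBuffer dst) throws IOException {
    long pos = position;
    while (dst.hasRemaining()) {
      int n = channel.read(dst, pos);
      if (n < 0) {
        throw new EOFException("Reached end of file at offset " + pos);
      }
      pos += n;
    }
  }

  /** Closes the channel and removes the temp file. */
  @Override
  public void close() throws IOException {
    try {
      channel.close();
    } finally {
      Files.deleteIfExists(tempFile);
    }
  }
}
```

The length() method matters because an ORC file keeps its footer and
postscript at the end, so even a forward-only pass over the records
starts with a read near the tail of the file. The cost of this approach
is one full copy of the stream to local disk before the first record is
available.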

Let me think about this a bit and come up with a proposal for the new API.

.. Owen

On Thu, Feb 20, 2020 at 7:21 PM Matamoros, Ronald <
ronald.matamo...@accenture.com> wrote:

> Hi Owen,
>
> We have a custom connector that pulls all different sorts of files from a
> remote Hadoop/HDFS.
> One of the types we have to support is Orc, among others.
> Each record from the Orc file will be processed at a later stage
> individually. So, I am implementing the record extractor (in the middle).
> At that later stage - it could be a record from Parquet, Orc, CSV, RDBMS,
> etc.
>
> The connector already does the work of referencing the path and reading
> the file into a Java InputStream.
> By the time my record extractor gets the file it is already an InputStream
> instance.
> First thought was, since the InputStream is already available might as
> well use it.
> Of course there are performance and memory-usage considerations.
> I can always go with the option of writing the stream temporarily to local
> disk to traverse the records (especially since these files can be large).
>
> Appreciate any insights and if this approach is completely wrong please
> let me know.
>
> Regards
> Ronald Matamoros
>
> -----Original Message-----
> From: Owen O'Malley <owen.omal...@gmail.com>
> Sent: Thursday, February 20, 2020 8:12 PM
> To: Matamoros, Ronald <ronald.matamo...@accenture.com>
> Cc: user@orc.apache.org; Sobrado Barquero, H. <
> h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny <
> ronny.ortega.uga...@accenture.com>
> Subject: Re: [External] Re: Creating a Reader from a Java InputStream
>
> What is the use case that you are working on that only provides you with
> an InputStream?
>
> .. Owen
>
> > On Feb 20, 2020, at 13:09, Matamoros, Ronald <
> ronald.matamo...@accenture.com> wrote:
> >
> > Hi Owen, thanks for the feedback and recommendations.
> >
> > In the current requirement it is a one-shot deal: capture all records in
> the ORC file to be consumed individually by another phase down the
> solution's pipeline (read once).
> > I guess the seek/position is required even if the read operation is just
> going forward over the records?
> >
> > Will try making the wrapper and watching out for your PositionedReadable
> extension.
> >
> > Regards,
> > Ronald Matamoros
> >
> > From: Owen O'Malley <owen.omal...@gmail.com>
> > Sent: Thursday, February 20, 2020 2:18 PM
> > To: Matamoros, Ronald <ronald.matamo...@accenture.com>
> > Cc: user@orc.apache.org; Sobrado Barquero, H. <
> h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny <
> ronny.ortega.uga...@accenture.com>
> > Subject: [External] Re: Creating a Reader from a Java InputStream
> >
> > Just to be a little more clear, Java’s InputStream doesn’t provide the
> primitive methods that we need. We’d always need a sub-interface that
> provides positioned reads and there hasn’t been any consensus about which
> extension to use.
> >
> > Effectively what we need is Hadoop’s PositionedReadable with
> ByteBuffers. I’m actually currently defining the extension to
> PositionedReadable to add an async read method with ByteBuffers.
> >
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PositionedReadable.html
> >
> > .. Owen
> >
> >
> > On Feb 20, 2020, at 11:01, Matamoros, Ronald <
> ronald.matamo...@accenture.com> wrote:
> >
>
