If you are using HDFS, we could add an API that lets you pass in an FSDataInputStream <https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FSDataInputStream.html>, which is a subclass of InputStream. That class is what Hadoop's fs.open(path) <https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#open-org.apache.hadoop.fs.Path-> returns, and it does allow the reader to do positioned reads in the stream. My additional concern is that there is a set of users who would really like an ORC reader without a dependence on Hadoop, so I'm hesitant to add yet another Hadoop class to the API.
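To make the positioned-read requirement concrete, here is a minimal Hadoop-free sketch of the fallback discussed in this thread: spool the InputStream to a temporary local file, then read from arbitrary offsets. It matters because ORC keeps its file metadata at the end of the file, so a reader has to look at the tail before it can read any rows. The class name SpooledInput and its methods are hypothetical, made up for illustration; they are not part of the ORC or Hadoop APIs, and this is not the proposed new API.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; not an ORC or Hadoop class. */
final class SpooledInput implements AutoCloseable {
    private final Path tempFile;
    private final FileChannel channel;

    /** Copies the whole stream to a temporary file so it can be read at any offset. */
    SpooledInput(InputStream in) throws IOException {
        tempFile = Files.createTempFile("spooled-", ".tmp");
        // createTempFile already created the file, so the copy must replace it.
        Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
        channel = FileChannel.open(tempFile, StandardOpenOption.READ);
    }

    /** Total number of bytes that were spooled. */
    long length() throws IOException {
        return channel.size();
    }

    /** Positioned read: fills dst starting at the given offset, independent of any stream cursor. */
    void readFully(long position, ByteBuffer dst) throws IOException {
        while (dst.hasRemaining()) {
            int n = channel.read(dst, position);
            if (n < 0) {
                throw new EOFException("reached end of file at offset " + position);
            }
            position += n;
        }
    }

    /** Closes the channel and removes the temporary file. */
    @Override
    public void close() throws IOException {
        channel.close();
        Files.deleteIfExists(tempFile);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        try (SpooledInput input = new SpooledInput(new ByteArrayInputStream(data))) {
            // Read the last three bytes first, the way a reader would start from the file tail.
            ByteBuffer tail = ByteBuffer.allocate(3);
            input.readFully(input.length() - 3, tail);
            System.out.println(new String(tail.array(), StandardCharsets.US_ASCII));
        }
    }
}
```

The trade-off is the one already raised in the thread: the full file is written to local disk before the first record can be read, which costs time and space for large files.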
Let me think about this a bit and come up with a proposal for the new API.

.. Owen

On Thu, Feb 20, 2020 at 7:21 PM Matamoros, Ronald <ronald.matamo...@accenture.com> wrote:

> Hi Owen,
>
> We have a custom connector that pulls all different sorts of files from a remote Hadoop/HDFS.
> One of the types we have to support is ORC, among others.
> Each record from the ORC file will be processed individually at a later stage. So, I am implementing the record extractor (in the middle).
> At that later stage, it could be a record from Parquet, ORC, CSV, RDBMS, etc.
>
> The connector already does the work of referencing the path and reading the file into a Java InputStream.
> By the time my record extractor gets the file, it is already an InputStream instance.
> My first thought was that, since the InputStream is already available, I might as well use it.
> Of course there are performance and memory-usage considerations.
> I can always go with the option of writing the stream temporarily to local disk to traverse the records (especially since these files can be large).
>
> I appreciate any insights, and if this approach is completely wrong, please let me know.
>
> Regards
> Ronald Matamoros
>
> -----Original Message-----
> From: Owen O'Malley <owen.omal...@gmail.com>
> Sent: Thursday, February 20, 2020 8:12 PM
> To: Matamoros, Ronald <ronald.matamo...@accenture.com>
> Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny <ronny.ortega.uga...@accenture.com>
> Subject: Re: [External] Re: Creating a Reader from a Java InputStream
>
> What is the use case that you are working on that only provides you with an InputStream?
>
> .. Owen
>
> On Feb 20, 2020, at 13:09, Matamoros, Ronald <ronald.matamo...@accenture.com> wrote:
> >
> > Hi Owen, thanks for the feedback and recommendations.
> >
> > In the current requirement it is a one-shot deal: capture all records in the ORC file to be consumed individually by another phase down the solution's pipeline (read once).
> > I guess the seek/position is required even if the read operation is just going forward over the records?
> >
> > I will try making the wrapper and will watch out for your PositionedReadable extension.
> >
> > Regards,
> > Ronald Matamoros
> >
> > From: Owen O'Malley <owen.omal...@gmail.com>
> > Sent: Thursday, February 20, 2020 2:18 PM
> > To: Matamoros, Ronald <ronald.matamo...@accenture.com>
> > Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny <ronny.ortega.uga...@accenture.com>
> > Subject: [External] Re: Creating a Reader from a Java InputStream
> >
> > Just to be a little more clear, Java’s InputStream doesn’t provide the primitive methods that we need. We’d always need a sub-interface that provides positioned reads, and there hasn’t been any consensus about which extension to use.
> >
> > Effectively what we need is Hadoop’s PositionedReadable with ByteBuffers. I’m actually currently defining the extension to PositionedReadable to add an async read method with ByteBuffers.
> >
> > https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PositionedReadable.html
> >
> > .. Owen
> >
> > On Feb 20, 2020, at 11:01, Matamoros, Ronald <mailto:ronald.matamo...@accenture.com> wrote: