OK, for the short term, I'd propose a wrapper that creates a FileSystem that knows only about the stream you want to read as an ORC file. Take a look at https://github.com/apache/orc/pull/486.
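Some background on why a bare InputStream is not enough, and why the file size has to be known up front: an ORC file keeps its metadata at the end, and the very last byte holds the length of the postscript, so a reader has to start from the tail even when the rows are then consumed front to back. The sketch below illustrates this with the JDK only, assuming that spooling the stream to a local temporary file is acceptable. The class and method names (TailReadSketch, spoolToTempFile, readLastByte) are made up for illustration and are not part of ORC or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of ORC or Hadoop.
final class TailReadSketch {

    // Copies the whole stream to a temporary file so that it can be
    // read at arbitrary offsets afterwards.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".orc");
        // createTempFile already created the file, so allow overwriting it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Reads the final byte of the file with a positioned read.
    // In an ORC file that byte is the postscript length.
    static int readLastByte(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) {
                throw new IOException("empty file: " + file);
            }
            ByteBuffer buf = ByteBuffer.allocate(1);
            // Absolute read at (size - 1); the channel position is not moved.
            int n = ch.read(buf, size - 1);
            if (n != 1) {
                throw new IOException("short read at end of " + file);
            }
            return buf.get(0) & 0xFF;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes, not a real ORC file; only the last byte matters here.
        byte[] fake = {0x4f, 0x52, 0x43, 0x17};
        Path tmp = spoolToTempFile(new ByteArrayInputStream(fake));
        try {
            System.out.println("last byte = " + readLastByte(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

A forward-only InputStream cannot serve the read at (size - 1) without first consuming everything before it, which is why some form of positioned read is needed. The wrapper in the pull request takes a different route and exposes the stream through a FileSystem.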
The usage looks like:

    FileSystem fs = new StreamWrapperFileSystem(stream, new Path("foo"), fileSize, conf);
    try (Reader reader = OrcFile.createReader(new Path("foo"),
            OrcFile.readerOptions(conf).filesystem(fs))) {
      ...
    }

Please comment on the pull request if this matches what you need.

.. Owen

On Thu, Feb 20, 2020 at 8:16 PM Owen O'Malley <owen.omal...@gmail.com> wrote:

> If you are using HDFS, we could add an API to allow you to pass in a
> FSDataInputStream
> <https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FSDataInputStream.html>,
> which is a subclass of InputStream. That class is returned from Hadoop's
> fs.open(path)
> <https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#open-org.apache.hadoop.fs.Path->
> and it does allow the reader to do positioned reads in the stream. I have
> an additional concern: there is a set of users who would really like
> an ORC reader without a dependence on Hadoop, so I'm
> hesitant to add yet another Hadoop class to the API.
>
> Let me think about this a bit and come up with a proposal for the new API.
>
> .. Owen
>
> On Thu, Feb 20, 2020 at 7:21 PM Matamoros, Ronald <ronald.matamo...@accenture.com> wrote:
>
>> Hi Owen,
>>
>> We have a custom connector that pulls all different sorts of files from a
>> remote Hadoop/HDFS.
>> One of the types we have to support is ORC, among others.
>> Each record from the ORC file will be processed at a later stage
>> individually. So, I am implementing the record extractor (in the middle).
>> At that later stage, it could be a record from Parquet, ORC, CSV, RDBMS,
>> etc.
>>
>> The connector already does the work of referencing the path and reading
>> the file into a Java InputStream.
>> By the time my record extractor gets the file it is already an
>> InputStream instance.
>> First thought was, since the InputStream is already available, might as
>> well use it.
>> Of course there are performance and memory-usage considerations.
>> I can always go with the option of writing the stream temporarily to
>> local disk to traverse the records (especially since these files can be
>> large).
>>
>> Appreciate any insights, and if this approach is completely wrong please
>> let me know.
>>
>> Regards
>> Ronald Matamoros
>>
>> -----Original Message-----
>> From: Owen O'Malley <owen.omal...@gmail.com>
>> Sent: Thursday, February 20, 2020 8:12 PM
>> To: Matamoros, Ronald <ronald.matamo...@accenture.com>
>> Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny <ronny.ortega.uga...@accenture.com>
>> Subject: Re: [External] Re: Creating a Reader from a Java InputStream
>>
>> What is the use case that you are working on that only provides you with
>> an InputStream?
>>
>> .. Owen
>>
>> > On Feb 20, 2020, at 13:09, Matamoros, Ronald <ronald.matamo...@accenture.com> wrote:
>> >
>> > Hi Owen, thanks for the feedback and recommendations.
>> >
>> > In the current requirement it is a one-shot deal: capture all records in the ORC file to be consumed individually by another phase down the solution's pipeline (read once).
>> > I guess the seek/position is required even if the read operation is just going forward over the records?
>> >
>> > Will try making the wrapper and watching out for your PositionedReadable extension.
>> >
>> > Regards,
>> > Ronald Matamoros
>> >
>> > From: Owen O'Malley <owen.omal...@gmail.com>
>> > Sent: Thursday, February 20, 2020 2:18 PM
>> > To: Matamoros, Ronald <ronald.matamo...@accenture.com>
>> > Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny <ronny.ortega.uga...@accenture.com>
>> > Subject: [External] Re: Creating a Reader from a Java InputStream
>> >
>> > Just to be a little more clear, Java’s InputStream doesn’t provide the primitive methods that we need. We’d always need a subinterface that provides positioned reads, and there hasn’t been any consensus about which extension to use.
>> >
>> > Effectively what we need is Hadoop’s PositionedReadable with ByteBuffers. I’m actually currently defining the extension to PositionedReadable to add an async read method with ByteBuffers.
>> >
>> > https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PositionedReadable.html
>> >
>> > .. Owen
>> >
>> > On Feb 20, 2020, at 11:01, Matamoros, Ronald <ronald.matamo...@accenture.com> wrote: