RE: [External] Re: Creating a Reader from a Java InputStream

Matamoros, Ronald Wed, 26 Feb 2020 12:19:30 -0800

Hi Owen, 

Thanks a lot for coming up with a solution so quickly; the approach would work 
for me. Before commenting on the pull request, I wanted to make sure I 
understand a couple of details correctly:

- It is crucial to know the file size beforehand, hence the fileSize parameter. 
This is due to the underlying seek/position implementation, correct?
- The 'new Path("foo")' is just a placeholder to satisfy the underlying method 
signatures. Looking at the code, it is not needed for anything else, correct?
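
On the first point, some background that is a property of the ORC file format 
rather than of the wrapper: an ORC file keeps its postscript and footer at the 
end of the file, so a reader typically starts from the file length and works 
backward. A minimal sketch of such a tail read with plain java.nio follows; 
TailReader and readTail are hypothetical names used only for illustration and 
are not part of the ORC API.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of ORC: reads the last bytes of a file.
final class TailReader {
    /** Returns up to tailLength bytes that end at the end of the file. */
    static byte[] readTail(Path file, int tailLength) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The start offset can only be computed once the size is known.
            long fileSize = channel.size();
            int toRead = (int) Math.min(tailLength, fileSize);
            long start = fileSize - toRead;
            ByteBuffer buffer = ByteBuffer.allocate(toRead);
            while (buffer.hasRemaining()) {
                // Positioned read: does not move the channel's own position.
                int n = channel.read(buffer, start + buffer.position());
                if (n < 0) {
                    throw new EOFException("file shrank while reading its tail");
                }
            }
            return buffer.array();
        }
    }
}
```

A forward-only InputStream offers neither the size nor the jump to the start 
offset, which is why both have to come from somewhere else.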

Regards 
Ronald Matamoros

From: Owen O'Malley <owen.omal...@gmail.com> 
Sent: Monday, February 24, 2020 4:50 PM
To: Matamoros, Ronald <ronald.matamo...@accenture.com>
Cc: user@orc.apache.org; Sobrado Barquero, H. 
<h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny 
<ronny.ortega.uga...@accenture.com>
Subject: Re: [External] Re: Creating a Reader from a Java InputStream

Ok, for the short term, what I'd propose is a wrapper that creates a FileSystem 
that only knows about the stream that you want to read as an ORC file. Take a 
look at https://github.com/apache/orc/pull/486.

The usage looks like:

FileSystem fs = new StreamWrapperFileSystem(stream, new Path("foo"), fileSize,
                                            conf);
try (Reader reader = OrcFile.createReader(new Path("foo"),
         OrcFile.readerOptions(conf).filesystem(fs))) {
   ...
}

Please comment on the pull request, if this matches what you need.

.. Owen

On Thu, Feb 20, 2020 at 8:16 PM Owen O'Malley <mailto:owen.omal...@gmail.com> 
wrote:
If you are using HDFS, we could add an API to allow you to pass in an 
FSDataInputStream 
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FSDataInputStream.html), 
which is a subclass of InputStream. That class is returned from Hadoop's 
FileSystem.open(Path) 
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#open-org.apache.hadoop.fs.Path-) 
and it does allow the reader to do positioned reads in the stream. I have an 
additional concern that there is a set of users who would really like an ORC 
reader without a dependence on Hadoop, so I'm hesitant to add yet another 
Hadoop class to the API.
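
As a rough illustration of what "positioned reads" means without any Hadoop 
types, the sketch below defines a small interface and backs it by buffering an 
ordinary InputStream in memory. PositionedSource and BufferedStreamSource are 
hypothetical names invented for this example; this is not the API proposed for 
ORC, and it makes no claim about what that API will look like.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical interface for illustration; not part of ORC or Hadoop. */
interface PositionedSource {
    /** Total number of bytes available. */
    long length();

    /** Copies bytes starting at an absolute offset; returns -1 past the end. */
    int read(long position, ByteBuffer dst) throws IOException;
}

/** Drains a forward-only stream into memory so any offset can be served. */
final class BufferedStreamSource implements PositionedSource {
    private final byte[] data;

    BufferedStreamSource(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        this.data = out.toByteArray();
    }

    @Override
    public long length() {
        return data.length;
    }

    @Override
    public int read(long position, ByteBuffer dst) {
        if (position < 0) {
            throw new IllegalArgumentException("negative position");
        }
        if (position >= data.length) {
            return -1;
        }
        // Copy as much as fits in dst without running past the end.
        int count = (int) Math.min(dst.remaining(), data.length - position);
        dst.put(data, (int) position, count);
        return count;
    }
}
```

Holding the whole stream in a byte array caps the input at about 2 GB and costs 
that much heap, so this shape only suits small files.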

Let me think about this a bit and come up with a proposal for the new API.

.. Owen

On Thu, Feb 20, 2020 at 7:21 PM Matamoros, Ronald 
<mailto:ronald.matamo...@accenture.com> wrote:
Hi Owen,

We have a custom connector that pulls all different sorts of files from a 
remote Hadoop/HDFS.
One of the types we have to support is Orc, among others. 
Each record from the Orc file will be processed at a later stage individually. 
So, I am implementing the record extractor (in the middle).
At that later stage - it could be a record from Parquet, Orc, CSV, RDBMS, etc.

The connector already does the work of referencing the path and reading the 
file into a Java InputStream.
By the time my record extractor gets the file it is already an InputStream 
instance. 
First thought was, since the InputStream is already available might as well use 
it.
Of course there are performance and memory-usage considerations. 
I can always go with the option of writing the stream temporarily to local disk 
to traverse the records (especially since these files can be large).
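
The temporary-file option can be sketched with the standard library alone. 
This is a minimal illustration under stated assumptions: StreamSpooler and 
spoolToTempFile are hypothetical names, and the file prefix and suffix are 
arbitrary choices.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies a forward-only stream to a local temp file.
final class StreamSpooler {
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path temp = Files.createTempFile("orc-spool-", ".orc");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(temp);
            throw e;
        }
        return temp;
    }
}
```

Once the copy finishes, Files.size(temp) gives the file length, and the local 
file supports seeking; the caller should delete it after the records have been 
read.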

Appreciate any insights and if this approach is completely wrong please let me 
know. 

Regards
Ronald Matamoros

-----Original Message-----
From: Owen O'Malley <mailto:owen.omal...@gmail.com> 
Sent: Thursday, February 20, 2020 8:12 PM
To: Matamoros, Ronald <mailto:ronald.matamo...@accenture.com>
Cc: mailto:user@orc.apache.org; Sobrado Barquero, H. 
<mailto:h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny 
<mailto:ronny.ortega.uga...@accenture.com>
Subject: Re: [External] Re: Creating a Reader from a Java InputStream

What is the use case that you are working on that only provides you with an 
InputStream?

.. Owen

> On Feb 20, 2020, at 13:09, Matamoros, Ronald 
> <mailto:ronald.matamo...@accenture.com> wrote:
> 
> Hi Owen, thanks for the feedback and recommendations.
> 
> In the current requirement it is a one shot deal, capture all records in the 
> ORC file to be consumed individually by another phase down the solution's 
> pipeline (read once).
> I guess the seek/position is required even if the read operation is just 
> going forward over the records?
> 
> Will try making the wrapper and watching out for your PositionedReadable 
> extension.
> 
> Regards,
> Ronald Matamoros
> 
> From: Owen O'Malley <mailto:owen.omal...@gmail.com>
> Sent: Thursday, February 20, 2020 2:18 PM
> To: Matamoros, Ronald <mailto:ronald.matamo...@accenture.com>
> Cc: mailto:user@orc.apache.org; Sobrado Barquero, H. 
> <mailto:h.sobrado.barqu...@accenture.com>; Ortega Ugalde, Ronny 
> <mailto:ronny.ortega.uga...@accenture.com>
> Subject: [External] Re: Creating a Reader from a Java InputStream
> 
> Just to be a little more clear, Java’s InputStream doesn’t provide the 
> primitive methods that we need. We’d always need a subinterface that 
> provides positioned reads, and there hasn’t been any consensus about which 
> extension to use.
> 
> Effectively what we need is Hadoop’s PositionedReadable with ByteBuffers. 
> I’m actually currently defining the extension to PositionedReadable to add 
> an async read method with ByteBuffers.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PositionedReadable.html
> 
> .. Owen
> 
> 
> On Feb 20, 2020, at 11:01, Matamoros, Ronald 
> <mailto:ronald.matamo...@accenture.com> wrote:
