Dweep,
The data I am moving into S3 already arrives as fairly large sets of files, since it is a bulk export from a SaaS application, so the number of files being PUT to S3 was not a huge consideration. However, since the Parquet files are to be consumed by Redshift Spectrum, I had an interest in consolidating flow files containing like objects into a single flow file prior to Parquet conversion. I used the MergeRecord processor [1] to do this. So, to expand on the flow, it really looks more like this:

(Get stuff in JSON format) --> ConvertRecord --> MergeRecord --> ConvertAvroToParquet --> PutS3Object

This is not really a "real-time streaming flow"; it is more batch-oriented. There is a delay in the flow (which is acceptable to us) while the MergeRecord processor collects and merges possibly several flow files into a bigger flow file.

[1] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.MergeRecord/index.html

Warm regards,

Jim Williams | Principal Database Developer
O: +1 713.341.7812 | C: +1 919.523.8767 | jwilli...@alertlogic.com | alertlogic.com

From: Dweep Sharma <dweep.sha...@redbus.com>
Sent: Sunday, July 14, 2019 1:07 AM
To: users@nifi.apache.org
Subject: Re: Kafka to parquet to s3

Thanks Jim for the insights on the advantages; this worked for me as well. Any thoughts on partitioning and file size so the S3 PUT costs are not too high? I do not see options on ConvertAvroToParquet for this.

-Dweep

On Mon, Jul 8, 2019 at 6:09 PM Williams, Jim <jwilli...@alertlogic.com> wrote:

Dweep,

I have been working on a project where Parquet files are being written to S3. I've had the liberty to use the most up-to-date version of NiFi, so I have implemented this on 1.9.2.
The approach I have taken is something like this:

(Get stuff in JSON format) --> ConvertRecord --> ConvertAvroToParquet --> PutS3Object

The ConvertRecord [1] processor changes the flow files from JSON to Avro. Although it is possible to use schema inference with this processor, it is something we have not leveraged yet. ConvertAvroToParquet [2] converts the flow file in place, but does not write it out to a local or HDFS file system the way the PutParquet [3] processor would. Implementing the flow in this way gives a couple of advantages:

1. We do not need to use the PutParquet processor.
   a. Extra configuration on cluster nodes for writing directly to S3 with this processor is avoided.
   b. Writing to a local or HDFS filesystem and then copying to S3 is avoided.
2. We can use the native authentication methods which come with the S3 processors.
   a. Roles associated with EC2 instances are leveraged, which makes cluster deployment much simpler.

We have been happy using this pattern for the past couple of months. I am watching progress on NIFI-6089 [4] for a Parquet record reader/writer with interest.
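(A rough, stdlib-only Python sketch of the merging idea discussed above, for readers unfamiliar with MergeRecord: many small single-record flow files are binned by a grouping key and emitted as fewer, larger batches, which is what keeps the S3 PUT count down. The "type" field, bin size, and record shapes here are hypothetical illustrations, not NiFi's actual implementation.)

```python
import json
from collections import defaultdict

# Hypothetical stand-in for MergeRecord binning: accumulate records of the
# same type and emit them as merged batches of up to BATCH_SIZE records,
# so fewer (larger) objects are PUT to S3 downstream.
BATCH_SIZE = 3

def merge_record(flowfiles):
    """Group single-record JSON flow files by record type; yield merged batches."""
    bins = defaultdict(list)
    for ff in flowfiles:
        record = json.loads(ff)
        bins[record["type"]].append(record)
    for rtype, records in bins.items():
        # Emit full bins; a real flow would also flush partial bins on a max bin age.
        for i in range(0, len(records), BATCH_SIZE):
            yield json.dumps(records[i:i + BATCH_SIZE])

inputs = [json.dumps({"type": "user", "id": n}) for n in range(6)]
batches = list(merge_record(inputs))
print(len(batches))  # 6 single-record inputs merged into 2 batches of 3
```

In NiFi itself this trade-off is tuned with MergeRecord's bin properties (minimum/maximum record counts and a maximum bin age), which is where the acceptable batching delay mentioned above comes from.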
[1] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ConvertRecord/index.html
[2] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.9.2/org.apache.nifi.processors.parquet.ConvertAvroToParquet/index.html
[3] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.9.2/org.apache.nifi.processors.parquet.PutParquet/index.html
[4] - https://issues.apache.org/jira/browse/NIFI-6089
Warm regards,

Jim Williams

From: Bryan Bende <bbe...@gmail.com>
Sent: Saturday, July 6, 2019 11:49 AM
To: users@nifi.apache.org
Subject: Re: Kafka to parquet to s3

Currently PutParquet and FetchParquet are tied to the Hadoop API, so they need the config files. As mentioned, you can create a core-site.xml with a local file system, and then use another part of the flow to pick up the file with ListFile -> FetchFile -> PutS3Object.

There is a way to write to S3 directly from PutParquet and PutHDFS, but it requires additional jars and config, and is honestly harder to set up than just using the above approach.

There is also a JIRA to implement a Parquet record reader and writer, which would then let you use ConvertRecord to go from JSON to Parquet, and then on to PutS3Object.
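(A hedged illustration of the "can't redefine" error raised in this thread: when a schema is inferred from JSON, each nested object becomes a named Avro record, and Avro does not allow the same record name to be defined twice with different shapes. The "address" payload below is hypothetical, not the actual data from the thread; the sketch only detects the conflicting shapes with the standard library rather than running real Avro inference.)

```python
import json

# Hypothetical payload: "address" names a nested object at two different
# levels with two different shapes, so naive Avro schema inference would
# try to define the record "address" twice -> "can't redefine" error.
doc = json.loads("""
{
  "address": {"city": "Austin"},
  "billing": {"address": {"street": "Main", "zip": "77002"}}
}
""")

def nested_shapes(obj, found=None):
    """Map each nested-object field name to the set of shapes (key tuples) seen."""
    if found is None:
        found = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            found.setdefault(key, set()).add(tuple(sorted(value)))
            nested_shapes(value, found)
    return found

conflicts = {k for k, shapes in nested_shapes(doc).items() if len(shapes) > 1}
print(conflicts)  # {'address'} -- one record name, two conflicting definitions
```

Renaming one of the colliding fields, or supplying an explicit schema instead of inferring one, is the usual way around this.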
I think the error mentioned means you have the same field name at different levels in your JSON, and that is not allowed in an Avro schema.

On Sat, Jul 6, 2019 at 9:21 AM Dweep Sharma <dweep.sha...@redbus.com> wrote:

Thanks Shanker. But I do not see options in PutParquet for an S3 bucket or credentials; I am assuming I need to push this to a local store and then add the PutS3Object processor on top of that?

Also, as the record reader in PutParquet I am using the JsonTreeReader with defaults (Infer Schema), and I get the error "Failed to write due to can't redefine: org.apache.nifi.addresstype". Some files, however, do get written. Are the default settings good, or am I missing something?

-Dweep

On Fri, Jul 5, 2019 at 11:43 PM Andrew Grande <apere...@gmail.com> wrote:

Interestingly enough, the ORC processor in NiFi can just use defaults if Hadoop configs aren't provided; no additional config steps are required. Is that something which could be improved for PutParquet, maybe?

Andrew

On Fri, Jul 5, 2019, 4:18 AM Shanker Sneh <shanker.s...@zoomcar.com> wrote:

Hello Dweep,

In the PutParquet processor you can set the property 'Hadoop Configuration Resources' to a core-site.xml file whose content can be somewhat like below:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///reservoir-dl</value>
  </property>
</configuration>

Here file:///reservoir-dl could be your path where in-transit Parquet files are written before being pushed to S3. More importantly, you do not need Hadoop to be installed. You can just place the core-site.xml file on your NiFi nodes and get started.

On Fri, Jul 5, 2019 at 1:54 PM Dweep Sharma <dweep.sha...@redbus.com> wrote:

Hi,

I have been trying to move some JSON data to S3 in Parquet format.
From Kafka to S3 is straightforward, but I cannot seem to find the right processor to convert JSON to Parquet and move it to S3. PutParquet does not take an S3 bucket or credentials, and requires Hadoop to be installed.

Can someone please share a blog or the steps to achieve this? Thanks in advance.

-Dweep

::DISCLAIMER::
The contents of this e-mail and any attachments are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e-mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or redBus.com. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of redBus.com. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this message without the prior written consent of an authorized representative of redbus.com is strictly prohibited. If you have received this email in error, please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.
--
Best,
Sneh