Dweep,
The data I am moving into S3 already arrives as fairly large sets of files, since it is a bulk export from a SaaS application, so the number of files being PUT to S3 was not a huge consideration. However, since the Parquet files are to be consumed by Redshift Spectrum, I had an interest in consolidating flow files containing like objects into a single flow file prior to Parquet conversion. I used the MergeRecord processor [1] to do this. So, to expand on the flow, it really looks more like this:

(Get stuff in JSON format) --> ConvertRecord --> MergeRecord --> ConvertAvroToParquet --> PutS3Object

This is not really a "real-time streaming flow"; it is more batch-oriented. There is a delay in the flow (which is acceptable to us) while the MergeRecord processor collects and merges possibly several flow files into a bigger flow file.

[1] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.MergeRecord/index.html

Warm regards,

Jim Williams | Principal Database Developer
O: +1 713.341.7812 | C: +1 919.523.8767 | jwilli...@alertlogic.com | alertlogic.com

From: Dweep Sharma <dweep.sha...@redbus.com>
Sent: Sunday, July 14, 2019 1:07 AM
To: users@nifi.apache.org
Subject: Re: Kafka to parquet to s3

Thanks Jim for the insights on the advantages; this worked for me as well. Any thoughts on partitioning and file size so the S3 PUT costs are not too high? I do not see options on ConvertAvroToParquet for this.

-Dweep

On Mon, Jul 8, 2019 at 6:09 PM Williams, Jim <jwilli...@alertlogic.com> wrote:

Dweep,

I have been working on a project where Parquet files are being written to S3. I've had the liberty to use the most up-to-date version of NiFi, so I have implemented this on 1.9.2.
The approach I have taken is something like this:

(Get stuff in JSON format) --> ConvertRecord --> ConvertAvroToParquet --> PutS3Object

The ConvertRecord [1] processor changes the flow files from JSON to Avro. Although it is possible to use schema inference with this processor, it is something we have not leveraged yet. ConvertAvroToParquet [2] converts the flow file in place, but does not write it out to a local or HDFS file system the way the PutParquet [3] processor would. Implementing the flow in this way gives a couple of advantages:

1. We do not need to use the PutParquet processor.
   a. Extra configuration on cluster nodes for writing directly to S3 with this processor is avoided.
   b. Writing to a local or HDFS filesystem and then copying to S3 is avoided.
2. We can use the native authentication methods which come with the S3 processors.
   a. Roles associated with EC2 instances are leveraged, which makes cluster deployment much simpler.

We have been happy using this pattern for the past couple of months. I am watching progress on NIFI-6089 [4] for a Parquet record reader/writer with interest.
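(A rough, stdlib-only Python sketch of the merging idea discussed above, for readers unfamiliar with MergeRecord: many small single-record flow files are binned by a grouping key and emitted as fewer, larger batches, which is what keeps the S3 PUT count down. The "type" field, bin size, and record shapes here are hypothetical illustrations, not NiFi's actual implementation.)

```python
import json
from collections import defaultdict

# Hypothetical stand-in for MergeRecord binning: accumulate records of the
# same type and emit them as merged batches of up to BATCH_SIZE records,
# so fewer (larger) objects are PUT to S3 downstream.
BATCH_SIZE = 3

def merge_record(flowfiles):
    """Group single-record JSON flow files by record type; yield merged batches."""
    bins = defaultdict(list)
    for ff in flowfiles:
        record = json.loads(ff)
        bins[record["type"]].append(record)
    for rtype, records in bins.items():
        # Emit full bins; a real flow would also flush partial bins on a max bin age.
        for i in range(0, len(records), BATCH_SIZE):
            yield json.dumps(records[i:i + BATCH_SIZE])

inputs = [json.dumps({"type": "user", "id": n}) for n in range(6)]
batches = list(merge_record(inputs))
print(len(batches))  # 6 single-record inputs merged into 2 batches of 3
```

In NiFi itself this trade-off is tuned with MergeRecord's bin properties (minimum/maximum record counts and a maximum bin age), which is where the acceptable batching delay mentioned above comes from.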
[1] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ConvertRecord/index.html
[2] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.9.2/org.apache.nifi.processors.parquet.ConvertAvroToParquet/index.html
[3] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.9.2/org.apache.nifi.processors.parquet.PutParquet/index.html
[4] - https://issues.apache.org/jira/browse/NIFI-6089
Warm regards,

Jim Williams

From: Bryan Bende <bbe...@gmail.com>
Sent: Saturday, July 6, 2019 11:49 AM
To: users@nifi.apache.org
Subject: Re: Kafka to parquet to s3

Currently PutParquet and FetchParquet are tied to the Hadoop API, so they need the config files. As mentioned, you can create a core-site.xml with a local file system, and then use another part of the flow to pick up the file with ListFile -> FetchFile -> PutS3Object.

There is a way to write to S3 directly from PutParquet and PutHDFS, but it requires additional jars and config, and is honestly harder to set up than just using the above approach.

There is also a JIRA to implement a Parquet record reader and writer, which would then let you use ConvertRecord to go from JSON to Parquet, and then on to PutS3Object.
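(A hedged illustration of the "can't redefine" error raised in this thread: when a schema is inferred from JSON, each nested object becomes a named Avro record, and Avro does not allow the same record name to be defined twice with different shapes. The "address" payload below is hypothetical, not the actual data from the thread; the sketch only detects the conflicting shapes with the standard library rather than running real Avro inference.)

```python
import json

# Hypothetical payload: "address" names a nested object at two different
# levels with two different shapes, so naive Avro schema inference would
# try to define the record "address" twice -> "can't redefine" error.
doc = json.loads("""
{
  "address": {"city": "Austin"},
  "billing": {"address": {"street": "Main", "zip": "77002"}}
}
""")

def nested_shapes(obj, found=None):
    """Map each nested-object field name to the set of shapes (key tuples) seen."""
    if found is None:
        found = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            found.setdefault(key, set()).add(tuple(sorted(value)))
            nested_shapes(value, found)
    return found

conflicts = {k for k, shapes in nested_shapes(doc).items() if len(shapes) > 1}
print(conflicts)  # {'address'} -- one record name, two conflicting definitions
```

Renaming one of the colliding fields, or supplying an explicit schema instead of inferring one, is the usual way around this.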
I think the error mentioned means you have the same field name at different levels in your JSON, and that is not allowed in an Avro schema.

On Sat, Jul 6, 2019 at 9:21 AM Dweep Sharma <dweep.sha...@redbus.com> wrote:

Thanks Shanker. But I do not see options in PutParquet for an S3 bucket or credentials; I am assuming I need to push this to a local store and then add the PutS3Object processor on top of that?

Also, as the record reader in PutParquet I am using the JsonTreeReader with defaults (Infer Schema), and I get the error "Failed to write due to can't redefine: org.apache.nifi.addresstype". Some files, however, do get written. Are the default settings good, or am I missing something?

-Dweep

On Fri, Jul 5, 2019 at 11:43 PM Andrew Grande <apere...@gmail.com> wrote:

Interestingly enough, the ORC processor in NiFi can just use defaults if Hadoop configs aren't provided; no additional config steps are required. Is that something which could be improved for PutParquet, maybe?

Andrew

On Fri, Jul 5, 2019, 4:18 AM Shanker Sneh <shanker.s...@zoomcar.com> wrote:

Hello Dweep,

In the PutParquet processor you can set the property 'Hadoop Configuration Resources' to a core-site.xml file whose content can be somewhat like below:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///reservoir-dl</value>
  </property>
</configuration>

Here file:///reservoir-dl could be your path where in-transit Parquet files are written before being pushed to S3. More importantly, you do not need Hadoop to be installed. You can just place the core-site.xml file on your NiFi nodes and get started.

On Fri, Jul 5, 2019 at 1:54 PM Dweep Sharma <dweep.sha...@redbus.com> wrote:

Hi,

I have been trying to move some JSON data to S3 in Parquet format.
From Kafka to S3 is straightforward, but I cannot seem to find the right processor to convert JSON to Parquet and move it to S3. PutParquet does not take an S3 bucket or credentials, and requires Hadoop to be installed.

Can someone please share a blog or the steps to achieve this? Thanks in advance.

-Dweep

::DISCLAIMER::
The contents of this e-mail and any attachments are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e-mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or redBus.com. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of redBus.com. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this message without the prior written consent of an authorized representative of redbus.com is strictly prohibited. If you have received this email in error, please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.
--
Best,
Sneh