Currently, PutParquet and FetchParquet are tied to the Hadoop API, so they need
the config files. As mentioned, you can create a core-site.xml that points at
the local file system, and then use another part of the flow to pick up the
file with ListFile -> FetchFile -> PutS3Object.

There is a way to write to S3 directly from PutParquet and PutHDFS, but it
requires additional JARs and config, and is honestly harder to set up than
just using the above approach.
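For reference, the direct-to-S3 route would look roughly like the core-site.xml
fragment below. This is a sketch, not a tested config: it assumes the
hadoop-aws JAR and a matching aws-java-sdk JAR have been made available to the
processor (e.g. via Additional Classpath Resources), and the bucket name and
keys are placeholders.

```xml
<configuration>
    <!-- Point the default filesystem at an S3 bucket via the s3a connector -->
    <property>
        <name>fs.defaultFS</name>
        <value>s3a://your-bucket</value>
    </property>
    <!-- Static credentials; an instance profile or another credentials
         provider can be used instead of embedding keys here -->
    <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
    </property>
</configuration>
```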

There is also a JIRA to implement a Parquet record reader and writer, which
would then let you use ConvertRecord to go from JSON to Parquet, and then on
to PutS3Object.

I think the error you mentioned means you have the same field name at
different levels in your JSON; when the schema is inferred, each nested object
becomes a named Avro record, and the same record name cannot be defined twice
in an Avro schema.
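As a hypothetical illustration (these field names are made up, not taken from
your data), JSON shaped like the following would trip schema inference,
because an object field named addresstype appears at two different levels and
each occurrence would be inferred as an Avro record with the same name:

```json
{
  "addresstype": { "kind": "home" },
  "contact": {
    "addresstype": { "kind": "work" }
  }
}
```

Renaming one of the fields, or flattening the structure before PutParquet,
should avoid the name collision.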

On Sat, Jul 6, 2019 at 9:21 AM Dweep Sharma <[email protected]> wrote:

> Thanks Shanker,
>
> But I do not see options in PutParquet for an S3 bucket/credentials. I am
> assuming I need to push this to a local store and then add the PutS3Object
> processor on top of that?
>
> Also, as the record reader in PutParquet I am using the JsonTreeReader with
> defaults (Infer Schema), and I get the error "Failed to write due to can't
> redefine: org.apache.nifi.addresstype".
>
> Some files, however, do get written. Are the default settings good, or am I
> missing something?
>
>
> -Dweep
>
>
>
> On Fri, Jul 5, 2019 at 11:43 PM Andrew Grande <[email protected]> wrote:
>
>> Interestingly enough, the ORC processor in NiFi can just use defaults if
>> Hadoop configs aren't provided, with no additional config steps required. Is
>> this something that could be improved for PutParquet, maybe?
>>
>> Andrew
>>
>>
>>
>> On Fri, Jul 5, 2019, 4:18 AM Shanker Sneh <[email protected]>
>> wrote:
>>
>>> Hello Dweep,
>>>
>>> In the PutParquet processor you can set the property '*Hadoop
>>> Configuration Resources*' to a *core-site.xml* file whose content can
>>> look somewhat like the below:
>>>
>>> <configuration>
>>>     <property>
>>>         <name>fs.defaultFS</name>
>>>         <value>file:///reservoir-dl</value>
>>>     </property>
>>> </configuration>
>>>
>>> Here, file:///reservoir-dl could be the path where
>>> in-transit Parquet files have to be written -- before being pushed to S3.
>>> More importantly, you *do not* need Hadoop to be installed. You can
>>> just place the core-site.xml file on your NiFi nodes and get started.
>>>
>>>
>>> On Fri, Jul 5, 2019 at 1:54 PM Dweep Sharma <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have been trying to move some JSON data to S3 in Parquet format.
>>>>
>>>> From Kafka to S3 is straightforward, but I cannot seem to find the
>>>> right processor to convert JSON to Parquet and move it to S3.
>>>>
>>>> PutParquet does not take an S3 bucket or credentials, and requires Hadoop
>>>> to be installed.
>>>>
>>>> Can someone please share a blog or the steps to achieve this? Thanks in
>>>> advance.
>>>>
>>>> -Dweep
>>>>
>>>>
>>>>
>>>>
>>>> *::DISCLAIMER::----------------------------------------------------------------------------------------------------------------------------------------------------The
>>>> contents of this e-mail and any attachments are confidential and intended
>>>> for the named recipient(s) only.E-mail transmission is not guaranteed to be
>>>> secure or error-free as information could be intercepted, corrupted,lost,
>>>> destroyed, arrive late or incomplete, or may contain viruses in
>>>> transmission. The e mail and its contents(with or without referred errors)
>>>> shall therefore not attach any liability on the originator or redBus.com.
>>>> Views or opinions, if any, presented in this email are solely those of the
>>>> author and may not necessarily reflect the views or opinions of redBus.com.
>>>> Any form of reproduction, dissemination, copying, disclosure,
>>>> modification,distribution and / or publication of this message without the
>>>> prior written consent of authorized representative of redbus.
>>>> <http://redbus.in/>com is strictly prohibited. If you have received this
>>>> email in error please delete it and notify the sender immediately.Before
>>>> opening any email and/or attachments, please check them for viruses and
>>>> other defects.*
>>>
>>>
>>>
>>> --
>>> Best,
>>> Sneh
>>>
>>
>
>
>

-- 
Sent from Gmail Mobile
