I'll caveat this by saying that up until 10 mins ago I had never looked at Parquet, so I could be completely wrong, but...
The Parquet API seems heavily geared towards HDFS. For example, take the AvroParquetWriter:

https://github.com/Parquet/parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroParquetWriter.java

You have to give it a Hadoop Path object to write the data to, so this wouldn't really work in the middle of a NiFi flow if you wanted to have a processor like ConvertXyzToParquet: the processor needs to write its output to an OutputStream backed by NiFi's internal repositories, not HDFS.

It could make sense at the end of a flow when writing to HDFS, so you could probably implement a custom processor, similar to PutHDFS, that uses the Parquet libraries to write the data to HDFS as Parquet (assuming you merged together a bunch of data before this). This is probably what the Kite processors are already doing, but I'm not sure.

-Bryan

On Tue, Feb 14, 2017 at 5:12 PM, Carlos Paradis <[email protected]> wrote:
> Hi James,
>
> Thank you for pointing the issue out! :-) I wanted to point out another
> alternative solution to Kite that I observed, to hear your insight on
> this approach too, if you don't mind.
>
> When I saw a presentation of NiFi and Parquet being used in a guest
> project, although not many implementation details were discussed, it was
> mentioned that they also used Apache Spark, (apparently only) leaving a
> port from NiFi to read in the data. Someone at Hortonworks posted a
> tutorial on it (GitHub) in Jan 2016 that seems to head in that direction.
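[Editor's note: the Hadoop Path coupling Bryan describes above shows up directly in the writer's API. A minimal sketch, assuming parquet-avro and the Hadoop client on the classpath; the schema, record, and output location are hypothetical:]

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class ParquetPathExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical one-field Avro schema for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"}]}");

        // The writer is constructed from a Hadoop Path, not an OutputStream --
        // this is the mismatch with NiFi's repository-backed streams.
        Path target = new Path("hdfs://namenode:8020/data/events.parquet");
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(target, schema);

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        writer.write(record);
        writer.close();
    }
}
```

A NiFi processor callback, by contrast, hands you an OutputStream, which is why a hypothetical ConvertXyzToParquet would have nowhere to point this Path.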
>
> The configuration looked as follows, according to the tutorial's image:
>
> https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
>
> The group presentation also used Spark, but I am not sure if they used
> the same port approach; this is all I have:
>
> PackageToParquetRunner <-> getFilePaths() <-> datalake [RDD<String, String>]
>
> PackageToParquetRunner -> FileProcessorClass -> RDD filter -> RDD flatMap ->
> RDD map -> RDD<Row> -> PackageToParquetRunner -> Create DataFrame
> (SQLContext) -> Write Parquet (DataFrame).
>
> When you say,
>
>> then running periodic jobs to build Parquet data sets.
>
> would such a Spark setup count as periodic jobs? I am only minimally
> acquainted with how Spark goes about MapReduce using RDDs, and I am not
> certain to what extent it would support the NiFi pipeline for this purpose
> (not to mention that, the way it appears, it leaves a hole in the NiFi
> diagram as a port, which makes it impossible to monitor for data
> provenance).
>
> ---
>
> Do you think these details and the Kite details would be worth mentioning
> as a comment on the JIRA issue you pointed out?
>
> Thanks!
>
> On Tue, Feb 14, 2017 at 11:46 AM, James Wing <[email protected]> wrote:
>>
>> Carlos,
>>
>> Welcome to NiFi! I believe the Kite dataset is currently the most direct,
>> built-in solution for writing Parquet files from NiFi.
>>
>> I'm not an expert on Parquet, but I understand that columnar formats like
>> Parquet and ORC are not easily written in the incremental, streaming
>> fashion that NiFi excels at (I hope writing this will prompt expert
>> correction). Other alternatives typically involve NiFi writing to more
>> stream-friendly data stores or formats directly, then running periodic
>> jobs to build Parquet data sets. Hive, Drill, and similar tools can do
>> this.
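[Editor's note: the RDD filter -> flatMap -> map -> DataFrame -> Parquet pipeline Carlos outlines maps roughly onto the sketch below, written against the Spark 1.6-era Java API (SQLContext/DataFrame) that matches the tutorial's timeframe. The class name, HDFS paths, input format, and schema are all hypothetical:]

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PackageToParquetSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("package-to-parquet"));
        SQLContext sql = new SQLContext(sc);

        // Read raw lines from the landing directory NiFi delivers to, then
        // filter / flatMap / map down to Rows, as in the outlined pipeline.
        JavaRDD<Row> rows = sc.textFile("hdfs:///datalake/raw/")
            .filter(line -> !line.isEmpty())
            .flatMap(line -> Arrays.asList(line.split(",")))
            .map(field -> RowFactory.create(field));

        // A trivial single-column schema, purely for illustration.
        StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("value", DataTypes.StringType, true)));

        // Create the DataFrame via the SQLContext and write it out as Parquet.
        DataFrame df = sql.createDataFrame(rows, schema);
        df.write().parquet("hdfs:///datalake/parquet/packages");
    }
}
```

Run on a schedule (cron, Oozie, etc.), a job like this would be one concrete form of the "periodic jobs to build Parquet data sets" James describes.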
>>
>> You are certainly not alone in wanting better Parquet support; there is
>> at least one JIRA ticket for it as well:
>>
>> Add processors for Google Cloud Storage Fetch/Put/Delete
>> https://issues.apache.org/jira/browse/NIFI-2725
>>
>> You might want to chime in with some details of your use case, or create
>> a new ticket if that's not a fit for you.
>>
>> Thanks,
>>
>> James
>>
>> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Our group has recently started prototyping a setup of
>>> Hadoop + Spark + NiFi + Parquet, and I have been having trouble finding
>>> any documentation other than a scarce discussion on using Kite as a
>>> workaround to integrate NiFi and Parquet.
>>>
>>> Are there any future plans for this integration from NiFi, or would
>>> anyone be able to give me some insight into which scenarios this
>>> workaround would (not) be worthwhile in, and alternatives?
>>>
>>> The most recent discussion I found on this list dates from May 11, 2016.
>>> I also saw some interest in doing this on Stack Overflow, here and here.
>>>
>>> Thanks,
>>>
>>> --
>>> Carlos Paradis
>>> http://carlosparadis.com
>
> --
> Carlos Paradis
> http://carlosparadis.com
