Hi James,

Thank you for pointing the issue out! :-) I wanted to mention another
alternative to Kite that I came across, in case you have any insight on
this approach as well.

I recently saw a presentation of a guest project that used NiFi and
Parquet. Although few implementation details were discussed, they
mentioned also using Apache Spark, apparently just leaving an output port
in NiFi from which Spark reads the data. Someone at Hortonworks posted a
tutorial
<https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html>
 (github
<https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>)
in Jan 2016 that seems to head in that direction.

According to the tutorial's image, the configuration looked as follows:

https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png


The group presentation also used Spark, but I am not sure whether they
used the same output-port approach; this is all I have:


*PackageToParquetRunner* <-> getFilePaths() <-> datalake [RDD<String, String>]

*PackageToParquetRunner* -> FileProcessorClass -> RDD filter -> RDD flatMap
-> RDD map -> RDD<Row> -> *PackageToParquetRunner* -> create DataFrame
(SQLContext) -> write Parquet (DataFrame).
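If I understood the chain correctly, here is a minimal sketch of what I
think their transformation does, written in plain Python so each step is
visible (the file-name filter rule and the comma-separated record format
are my own assumptions, not from the presentation; in real Spark these
would be rdd.filter / rdd.flatMap / rdd.map followed by
sqlContext.createDataFrame(rows).write.parquet(path)):

```python
# Hypothetical stand-in for the RDD pipeline above, in plain Python.
# datalake: (path, contents) pairs, like an RDD[(String, String)].
datalake = [
    ("pkg/a.txt", "alpha,1\nbeta,2"),
    ("pkg/b.log", "ignored"),          # dropped by the filter (assumed rule)
    ("pkg/c.txt", "gamma,3"),
]

# RDD filter: keep only the files we care about (assumption: .txt packages).
kept = [(path, contents) for (path, contents) in datalake
        if path.endswith(".txt")]

# RDD flatMap: one file expands into many lines.
lines = [line for (_, contents) in kept for line in contents.split("\n")]

# RDD map: one line becomes one structured row; the (name, value) tuple
# stands in for Spark's Row.
rows = [(name, int(value))
        for (name, value) in (line.split(",") for line in lines)]

# At this point the real pipeline would build a DataFrame from `rows`
# and call .write.parquet(...) on it.
print(rows)  # [('alpha', 1), ('beta', 2), ('gamma', 3)]
```

This is just how I pictured it; corrections welcome if the real
FileProcessorClass does something different.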

When you say,

then running periodic jobs to build Parquet data sets.


Would such a Spark setup count as one of those periodic jobs? I am only
minimally acquainted with how Spark performs MapReduce-style processing
with RDDs, so I am not certain to what extent it would support the NiFi
pipeline for this purpose (not to mention that, as it appears, the output
port leaves a hole in the NiFi diagram, so the data can no longer be
monitored for provenance).

---

Do you think these details, along with the Kite details, would be worth
adding as a comment on the JIRA issue you pointed out?

Thanks!


On Tue, Feb 14, 2017 at 11:46 AM, James Wing <[email protected]> wrote:

> Carlos,
>
> Welcome to NiFi!  I believe the Kite dataset is currently the most direct,
> built-in solution for writing Parquet files from NiFi.
>
> I'm not an expert on Parquet, but I understand columnar formats like
> Parquet and ORC are not easily written to in the incremental, streaming
> fashion that NiFi excels at (I hope writing this will prompt expert
> correction).  Other alternatives typically involve NiFi writing to more
> stream-friendly data stores or formats directly, then running periodic jobs
> to build Parquet data sets.  Hive, Drill, and similar tools can do this.
>
> You are certainly not alone in wanting better Parquet support; there is at
> least one JIRA ticket for it as well:
>
> Add processors for Google Cloud Storage Fetch/Put/Delete
> https://issues.apache.org/jira/browse/NIFI-2725
>
> You might want to chime in with some details of your use case, or create a
> new ticket if that's not a fit for you.
>
> Thanks,
>
> James
>
> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <[email protected]> wrote:
>
>> Hi,
>>
>> Our group has recently started trying to prototype a setup of
>> Hadoop+Spark+NiFi+Parquet and I have been having trouble finding any
>> documentation other than a scarce discussion on using Kite as a workaround
>> to integrate NiFi and Parquet.
>>
>> Are there any future plans for this integration from NiFi or anyone would
>> be able to give me some insight in which scenario this workaround would
>> (not) be worthwhile and alternatives?
>>
>> The most recent discussion
>> <http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
>> I found in this list dates from May 11, 2016. I also saw some interest in
>> doing this on Stackoverflow here
>> <http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
>> and here
>> <http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>
>> .
>>
>> Thanks,
>>
>> --
>> Carlos Paradis
>> http://carlosparadis.com <http://carlosandrade.co>
>>
>
>


-- 
Carlos Paradis
http://carlosparadis.com <http://carlosandrade.co>
