Maybe I am too naive here, but conversion to text-based formats could be done using a template engine.
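A minimal sketch of what I mean, using Python's stdlib string.Template just for illustration (the template string and column names are made up; this is not an existing NiFi feature):

```python
from string import Template
import csv
import io

# Hypothetical per-format output template a user could supply in the
# processor configuration; $name and $age are column placeholders.
row_template = Template('{"name": "$name", "age": $age}')

csv_data = "name,age\nalice,30\nbob,25\n"
rows = csv.DictReader(io.StringIO(csv_data))
json_lines = [row_template.substitute(row) for row in rows]
print("\n".join(json_lines))
```

One template per target text format would then cover CSV, JSON, XML, and similar, without a dedicated processor for each pair.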
Matt is right about the user experience - but only, I think, if the complexity of this one processor does not get too high. Personally - with my user hat on - I don't like dialogs with tons of configuration options for this or that; I am more for a clean, slim design. I'd rather try different processors that are quick to understand than have one processor and spend two hours figuring out which of the many options to select under which circumstances. In the latter case the documentation is typically hard to write and hard to understand.
But maybe there is a middle way: create two (from/to) processors for common/similar formats, and add separate ones for formats that are more complicated or otherwise not a good fit for the other group(s).
Anyway, I think the important thing is: if I want to develop something in NiFi, I want to quickly find the right processor - and ideally others with a similar scope as well. The tags help a lot here - they provide a sort of grouping.
Regards,
Uwe
Sent: Wednesday, 23 March 2016 at 03:25
Von: "Matt Burgess" <[email protected]>
An: [email protected]
Betreff: Re: CSV/delimited to Parquet conversion via Nifi
I am +1 for the ConvertFormat processor, the user experience is so much enhanced by the hands-off conversion. Such a capability might be contingent on the "dependent properties" concept (in Jira somewhere).
Also this guy could get pretty big in terms of footprint, I'd imagine the forthcoming Registry might be a good place for it.
In general a format translator would probably make for a great Apache project :) Martin Fowler has blogged about some ideas like this (w.r.t. abstracting translation logic), Tika has done some of this but AFAIK its focus is on extraction not transformation. In any case, we could certainly capture the idea in NiFi.
Regards,
Matt
Good point. I just think that Parquet and ORC are important targets, just as relational/JDBC stores are.
On Tuesday, March 22, 2016, Tony Kurc <[email protected]> wrote:

Interesting question. A couple of discussion points: if we start doing a processor for each of these conversions, it may become unwieldy (P(x,2) processors, where x is the number of data formats?). I'd say maybe a more general ConvertFormat processor may be appropriate, but then configuration and code complexity may suffer. If there is a canonical internal data form and a bunch (2*x) of convertXtocanonical and convertcanonicaltoX processors, the flow could get complex and the extra transform could be expensive.
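A quick sanity check on the counts above, as a small Python sketch (the function name is mine, just to make the arithmetic concrete):

```python
from math import perm

def processor_counts(x: int) -> tuple[int, int]:
    """Return (pairwise, canonical) processor counts for x formats:
    ordered format pairs vs. x to-canonical plus x from-canonical."""
    return perm(x, 2), 2 * x

# With 10 formats: 90 pairwise processors vs. 20 canonical-form ones.
print(processor_counts(10))
```

So the canonical-form approach scales linearly while pairwise converters grow quadratically, at the cost of the extra transform Tony mentions.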
On Mar 21, 2016 9:39 PM, "Dmitry Goldenberg" <[email protected]> wrote:

Since NiFi has ConvertJsonToAvro and ConvertCsvToAvro processors, would it make sense to add a feature request for a ConvertJsonToParquet processor and a ConvertCsvToParquet processor?

- Dmitry

On Mon, Mar 21, 2016 at 9:23 PM, Matt Burgess <[email protected]> wrote:

Edmon,

NIFI-1663 [1] was created to add ORC support to NiFi. If you have a target dataset that has been created with Parquet format, I think you can use ConvertCSVtoAvro then StoreInKiteDataset to get flow files in Parquet format into Hive, HDFS, etc. Others in the community know a lot more about the StoreInKiteDataset processor than I do.

Regards,
Matt

On Mon, Mar 21, 2016 at 8:25 PM, Edmon Begoli <[email protected]> wrote:
Is there a way to do straight CSV(PSV) to Parquet or ORC conversion via Nifi, or do I always need to push the data through some of the "data engines" - Drill, Spark, Hive, etc.?
