It seems to me that for starters it's great to have the processors which
convert from various input formats to FlowFile, and from FlowFile to
various output formats.  That covers all the cases and gives users a
chance to run some extra processors in between, which is often handy and
sometimes necessary.

ConvertFormat sounds cool but I'd agree that it may grow to be "hairy" with
the number of conversions, each with its own set of configuration options.
From that perspective, might it be easier to deal with 2 * N specific
converters, and keep adding them as needed, rather than try to maintain a
large "Swiss knife"?
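To put rough numbers on the trade-off (the format count of ten below is
purely hypothetical, just to make the arithmetic concrete):

```python
formats = 10  # hypothetical number of supported data formats

# Direct pairwise converters: one processor per ordered (from, to) pair
pairwise = formats * (formats - 1)

# Hub-and-spoke via FlowFile: one ConvertFrom plus one ConvertTo per format
hub_and_spoke = 2 * formats

print(pairwise, hub_and_spoke)  # → 90 20
```

The gap only widens as formats are added: pairwise grows quadratically,
hub-and-spoke linearly.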

Would ConvertFormat really be able to avoid having to use some kind of
intermediary in-memory format as the conversion is going on?  If not, why
not let this intermediary format be FlowFile, and if it is FlowFile, then
why not just roll with the ConvertFrom / ConvertTo processors?  That way,
implementing a direct converter is simply a matter of dropping the two
converters next to each other into your dataflow (plus a few in-between
transformations, if necessary).

Furthermore, a ConvertFrom followed by a ConvertTo could be saved as a
template for reuse, tailored to the user's specific use-cases.

I just wrote a Dataflow which converts some input XML to Avro, and I
suspect that making such a converter work through a common ConvertFormat
would take quite a few options.  Between the start and the finish, I ended
up with: SplitXml, EvaluateXPath, UpdateAttributes, AttributesToJSON,
ConvertJSONToAvro, MergeContent (after that I have a SetAvroFileExtension
and WriteToHdfs).  Too many options to expose for the XML-to-Avro use-case
alone, IMHO, through a common ConvertFormat, even if my Dataflow could
perhaps be optimized to avoid a step or two.
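For anyone curious what the middle of that flow is doing conceptually,
here's a rough standalone sketch in plain Python (the record structure and
field names are made up for illustration; the real NiFi processors operate
on FlowFile content and attributes rather than local variables):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical single record, as if split out of a larger XML file by SplitXml
xml_record = "<record><id>42</id><name>widget</name></record>"

# Roughly what EvaluateXPath does: pull selected values out of the XML
root = ET.fromstring(xml_record)
attributes = {
    "id": root.findtext("id"),
    "name": root.findtext("name"),
}

# Roughly what AttributesToJSON does: emit the extracted values as a JSON
# document, which ConvertJSONToAvro would then serialize against an Avro schema
json_doc = json.dumps(attributes)
print(json_doc)  # → {"id": "42", "name": "widget"}
```

Each of those steps carries its own configuration (XPath expressions,
attribute lists, the Avro schema), which is the kind of surface area a
single ConvertFormat would have to absorb.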

Regards,
- Dmitry



On Tue, Mar 22, 2016 at 10:25 PM, Matt Burgess <[email protected]> wrote:

> I am +1 for the ConvertFormat processor; the user experience is so much
> enhanced by the hands-off conversion. Such a capability might be contingent
> on the "dependent properties" concept (in Jira somewhere).
>
> Also, this guy could get pretty big in terms of footprint; I'd imagine the
> forthcoming Registry might be a good place for it.
>
> In general a format translator would probably make for a great Apache
> project :) Martin Fowler has blogged about some ideas like this (w.r.t.
> abstracting translation logic), Tika has done some of this but AFAIK its
> focus is on extraction not transformation. In any case, we could certainly
> capture the idea in NiFi.
>
> Regards,
> Matt
>
> On Mar 22, 2016, at 9:52 PM, Edmon Begoli <[email protected]> wrote:
>
> Good point.
>
> I just think that Parquet and ORC are important targets, just as
> relational/JDBC stores are.
>
> On Tuesday, March 22, 2016, Tony Kurc <[email protected]> wrote:
>
>> Interesting question. A couple discussion points: If we start doing a
>> processor for each of these conversions, it may become unwieldy (P(x,2)
>> processors, where x is the number of data formats?) I'd say maybe a more
>> general ConvertFormat processor may be appropriate, but then configuration
>> and code complexity may suffer. If there is a canonical internal data form
>> and a bunch (2*x) of convertXtocanonical, and convertcanonicaltoX
>> processors, the flow could get complex and the extra transform could be
>> expensive.
>> On Mar 21, 2016 9:39 PM, "Dmitry Goldenberg" <[email protected]>
>> wrote:
>>
>>> Since NiFi has ConvertJsonToAvro and ConvertCsvToAvro processors, would
>>> it make sense to add a feature request for a ConvertJsonToParquet processor
>>> and a ConvertCsvToParquet processor?
>>>
>>> - Dmitry
>>>
>>> On Mon, Mar 21, 2016 at 9:23 PM, Matt Burgess <[email protected]>
>>> wrote:
>>>
>>>> Edmon,
>>>>
>>>> NIFI-1663 [1] was created to add ORC support to NiFi. If you have a
>>>> target dataset that has been created with Parquet format, I think you can
>>>> use ConvertCSVtoAvro then StoreInKiteDataset to get flow files in Parquet
>>>> format into Hive, HDFS, etc. Others in the community know a lot more about
>>>> the StoreInKiteDataset processor than I do.
>>>>
>>>> Regards,
>>>> Matt
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-1663
>>>>
>>>> On Mon, Mar 21, 2016 at 8:25 PM, Edmon Begoli <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> Is there a way to do straight CSV(PSV) to Parquet or ORC conversion
>>>>> via NiFi, or do I always need to push the data through some of the
>>>>> "data engines" - Drill, Spark, Hive, etc.?