Agreed, but probably not the case with XML to Avro. Perhaps ConvertFormat would be for a set of the more straightforward conversions.
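For what it's worth, the streaming CSV-to-flat-JSON conversion Tony describes below really can be done with only one record buffered at a time. A minimal Python illustration (not actual NiFi processor code; the function name is mine):

```python
import csv
import io
import json

def csv_to_json_lines(reader, writer):
    # csv.DictReader pulls one row at a time, so only a single record
    # (never the whole file) needs to be held in memory.
    for row in csv.DictReader(reader):
        writer.write(json.dumps(row) + "\n")

src = io.StringIO("name,qty\nwidget,3\ngadget,5\n")
out = io.StringIO()
csv_to_json_lines(src, out)
print(out.getvalue(), end="")
# {"name": "widget", "qty": "3"}
# {"name": "gadget", "qty": "5"}
```

An XML-to-Avro conversion, by contrast, generally needs the parsed document (or at least a subtree) and a target schema in hand before any output can be written, which is why the same trick doesn't apply.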
> On Mar 22, 2016, at 11:30 PM, Tony Kurc <[email protected]> wrote:
>
> On the intermediate representation: not necessarily needed, and likely a performance hindrance to do so. Consider converting from a CSV to a flat JSON object. This can be done by streaming through the values, and likely only needing a single input character in memory at a time.
>
> On Mar 22, 2016 11:07 PM, "Dmitry Goldenberg" <[email protected]> wrote:
>> It seems to me that for starters it's great to have the processors which convert from various input formats to FlowFile, and from FlowFile to various output formats. That covers all the cases and it gives the users a chance to run some extra processors in between, which is often handy and sometimes necessary.
>>
>> ConvertFormat sounds cool but I'd agree that it may grow to be "hairy" with the number of conversions, each with its own set of configuration options. From that perspective, might it be easier to deal with 2 * N specific converters, and keep adding them as needed, rather than try to maintain a large "Swiss knife"?
>>
>> Would ConvertFormat really be able to avoid having to use some kind of intermediary in-memory format as the conversion is going on? If not, why not let this intermediary format be FlowFile, and if it is FlowFile, then why not just roll with the ConvertFrom / ConvertTo processors? That way, implementing a direct converter is simply a matter of dropping the two converters next to each other into your dataflow (plus a few in-between transformations, if necessary).
>>
>> Furthermore, a combination of a ConvertFrom and a subsequent ConvertTo could be saved as a sub-template for reuse, left as an exercise for the user, driven by the user's specific use-cases.
>>
>> I just wrote a Dataflow which converts some input XML to Avro, and I suspect that making such a converter work through a common ConvertFormat would take quite a few options. Between the start and the finish, I ended up with: SplitXml, EvaluateXPath, UpdateAttributes, AttributesToJSON, ConvertJSONToAvro, MergeContent (after that I have a SetAvroFileExtension and WriteToHdfs). Too many options to expose for the XML-to-Avro use case, IMHO, for the common ConvertFormat, even if perhaps my Dataflow can be optimized to avoid a step or two.
>>
>> Regards,
>> - Dmitry
>>
>>> On Tue, Mar 22, 2016 at 10:25 PM, Matt Burgess <[email protected]> wrote:
>>> I am +1 for the ConvertFormat processor; the user experience is so much enhanced by the hands-off conversion. Such a capability might be contingent on the "dependent properties" concept (in Jira somewhere).
>>>
>>> Also this guy could get pretty big in terms of footprint; I'd imagine the forthcoming Registry might be a good place for it.
>>>
>>> In general a format translator would probably make for a great Apache project :) Martin Fowler has blogged about some ideas like this (w.r.t. abstracting translation logic), and Tika has done some of this, but AFAIK its focus is on extraction, not transformation. In any case, we could certainly capture the idea in NiFi.
>>>
>>> Regards,
>>> Matt
>>>
>>>> On Mar 22, 2016, at 9:52 PM, Edmon Begoli <[email protected]> wrote:
>>>>
>>>> Good point.
>>>>
>>>> I just think that Parquet and ORC are important targets, just as relational/JDBC stores are.
>>>>
>>>>> On Tuesday, March 22, 2016, Tony Kurc <[email protected]> wrote:
>>>>> Interesting question. A couple of discussion points: if we start doing a processor for each of these conversions, it may become unwieldy (P(x, 2) processors, where x is the number of data formats?). I'd say maybe a more general ConvertFormat processor may be appropriate, but then configuration and code complexity may suffer. If there is a canonical internal data form and a bunch (2 * x) of convertXtocanonical and convertcanonicaltoX processors, the flow could get complex and the extra transform could be expensive.
>>>>>
>>>>>> On Mar 21, 2016 9:39 PM, "Dmitry Goldenberg" <[email protected]> wrote:
>>>>>> Since NiFi has ConvertJsonToAvro and ConvertCsvToAvro processors, would it make sense to add a feature request for a ConvertJsonToParquet processor and a ConvertCsvToParquet processor?
>>>>>>
>>>>>> - Dmitry
>>>>>>
>>>>>>> On Mon, Mar 21, 2016 at 9:23 PM, Matt Burgess <[email protected]> wrote:
>>>>>>> Edmon,
>>>>>>>
>>>>>>> NIFI-1663 [1] was created to add ORC support to NiFi. If you have a target dataset that has been created with Parquet format, I think you can use ConvertCSVtoAvro then StoreInKiteDataset to get flow files in Parquet format into Hive, HDFS, etc. Others in the community know a lot more about the StoreInKiteDataset processor than I do.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Matt
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/NIFI-1663
>>>>>>>
>>>>>>>> On Mon, Mar 21, 2016 at 8:25 PM, Edmon Begoli <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Is there a way to do straight CSV (PSV) to Parquet or ORC conversion via NiFi, or do I always need to push the data through some of the "data engines" - Drill, Spark, Hive, etc.?
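The processor-count tradeoff Tony raises above can be made concrete with a quick back-of-the-envelope calculation (a sketch only; the function names are mine, not NiFi's):

```python
def pairwise_converters(x: int) -> int:
    # One dedicated processor per ordered (source, target) pair: P(x, 2).
    return x * (x - 1)

def canonical_converters(x: int) -> int:
    # One ConvertXToCanonical plus one ConvertCanonicalToX per format: 2 * x.
    return 2 * x

for x in (5, 10, 20):
    print(x, pairwise_converters(x), canonical_converters(x))
# 5 20 10
# 10 90 20
# 20 380 40
```

So at 10 formats the pairwise approach already needs 90 processors versus 20 for the canonical-form approach, which is the quadratic-versus-linear growth behind the "unwieldy" concern, at the cost of the extra intermediate transform on every flow.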
