[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user Wesley-Lawrence commented on the issue: https://github.com/apache/nifi/pull/2003 @mattyb149 Looks like a lot has changed in the last 9 months. The code I added here leveraged classes related to Schema registries, but those classes have been moved into a more Avro specific package (`nifi-avro-record-utils`), where the CSV stuff is still under a standard package (`nifi-standard-record-utils`). Looks like it'll take some work to abstract the schema-registry specific stuff away from Avro, so the CSV reader/writers can leverage it. Sadly, I don't have the time to get back deep in NiFi right now, so I'm OK with closing this PR so a more updated solution can be worked on. ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user Wesley-Lawrence commented on the issue: https://github.com/apache/nifi/pull/2003 While I think @markap14 and I agree on the correct long-term solution of adding a new "String Column Names" schema registry, I've never got around to taking on that task. Shifting priorities and all. However, if you'd like to take the work in as-is @mattyb149 (maybe as a stepping stone to an eventual new schema registry), I'll update the PR for you. ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user mattyb149 commented on the issue: https://github.com/apache/nifi/pull/2003 @Wesley-Lawrence is this ready for "final" review? If so, do you mind rebasing this PR against the latest master? There are some conflicts listed above. Please and thanks! ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2003 @Wesley-Lawrence I definitely think that's a reasonable approach. Sorry, just going through old PR's to make sure that I've followed up on everything. I like what's been outlined above. ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user Wesley-Lawrence commented on the issue: https://github.com/apache/nifi/pull/2003 Yea, I think we're on the same page. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2003 @Wesley-Lawrence sorry - i must have missed the email that you'd commented on the PR. Sorry I didn't notice until now. So I think that what you're proposing here is that we should have a property named "Schema Text Format" on the readers/writers, in addition to "Schema Text", and the Schema Text Format would tell the service how to parse the "Schema Text." For example, there would be two options: "Avro" and "String Column Names". In addition to this, we could potentially also introduce a new Schema Registry as I described above. This means that for one-off types of schemas, such as when we have a CSV file and we only need to use the schema once, we don't have to go to the trouble of creating a Schema Registry and adding entries to it - we'd just set the column names (optionally using Expression Language) in the Reader. And we would add this capability to JSON, etc. as well. Does that all sound accurate? Just want to make sure that we are on the same page here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user Wesley-Lawrence commented on the issue: https://github.com/apache/nifi/pull/2003 @markap14 @mattyb149 Is there any other feedback you'd two like me to address? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user Wesley-Lawrence commented on the issue: https://github.com/apache/nifi/pull/2003 @markap14 Good points. I actually considered adding a schema registry at first, but thought it might be too much for just making CSV in/out easier. I think using schemas falls into two categories. Case A schemas are ones that someone uses all the time. Having a schema registry is great then, because it's consistently defined in a single place. Case B schemas are one-off scheams. Input or output is in some format a person doesn't typically use, and someone is just converting to-or-from a Case A often-used schema. In this case, the "Schema Text" property becomes really useful. Rather than cluttering your registry, you can just define the Case B on-off in a processor/service, and never think about it again. To your point, I've only solved Case B, for CSV in/out. By adding a "ExplicitFieldSchemaRegistry" or "ColumnSchemaRegistry" (I think whatever this gets named, it's going to be ugly =P) we can tackle Case A for any type. I think to do this completely properly, we should solve this for Case A and B, for any type. So I personally think we should add a new schema registry for Case A, but I think we could also add some "Schema Format" property (defaulting to Avro) to `SchemaRegistryService` that informs processors how to interpret "Schema Text". That way, it's also easy to define Case B one-off schemas of any type. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #2003: NIFI-4181 CSVReader and CSVRecordSetWriter can be used by ...
Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2003 @Wesley-Lawrence I definitely agree that having to build up an entire Avro schema can be a pain -- especially if you're not already familiar with Avro schemas. And I absolutely love that you saw something that can be improved and jumped in to improve the user experience! However, I am a bit concerned with the approach taken here because while it does scratch an itch, it does so only for CSV data and as a result isn't really consistent. I would like to start a discussion on how we could perhaps handle this in a more generic way. My first thought is to actually provide an alternative implementation of the Schema Registry. Call it ExplicitStringFieldSchemaRegistry for lack of a better name (i'm suggesting calling it this for the sake of the discussion, not necessarily creating an implementation with this name). The idea here, though, is that there would be a new implementation of SchemaRegistry. In this new implementation you would add properties just like AvroSchemaRegistry. The name of the property would be the name of a schema. But the value, instead of an Avro Schema, would be a comma-separated list of column names, and all would be assumed to be Strings. This approach gives us a few different benefits. First, it allows this same approach to be taken with other data formats (flat JSON, for example). Also, it keeps things consistent in terms of how we access the schema (we can still use the Schema Name property along with the Schema Registry). It also makes the CSV Reader more re-usable because we can use Expression Language to access the name of the schema, so a single CSV Reader can be used by many different processors spread throughout the flow. What I would love to see at some point is a more powerful SchemaRegistry service that provides a Custom/Advanced UI for actually building schemas. I think this would be extremely powerful and useful and far easier to use. But I promise that I am the last person that you (or any user) wants building a UI :) Any thoughts here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---