Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on PR #8011: URL: https://github.com/apache/nifi/pull/8011#issuecomment-1813151208 The backport was more straightforward than I've expected. I've opened #8031. I'd highly appreciate if you'd take a look at it @exceptionfactory, as you're familiar with this original pull-request already. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on PR #8011: URL: https://github.com/apache/nifi/pull/8011#issuecomment-1812907291 Thank you for your reviews and the fast merge @exceptionfactory. I'll take a look at what's required for the backport in the next days presumably and open a new pull-request if the work required is manageable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
exceptionfactory closed pull request #8011: NIFI-4550 Detect charset for FlowFile with MIME type text/* URL: https://github.com/apache/nifi/pull/8011 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393217615 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -89,14 +89,17 @@ @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", "identify"}) @CapabilityDescription("Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, " + "an attribute with the name 'mime.type' is added with the value being the MIME Type. If the MIME Type cannot be determined, " -+ "the value will be set to 'application/octet-stream'. In addition, the attribute mime.extension will be set if a common file " -+ "extension for the MIME Type is known.") ++ "the value will be set to 'application/octet-stream'. In addition, the attribute 'mime.extension' will be set if a common file " ++ "extension for the MIME Type is known. If the MIME Type detected is of type text/*, attempts to identify the charset used " + +"and an attribute with the name 'mime.charset' is added with the value being the charset.") @WritesAttributes({ -@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " -+ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), -@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " -+ "extension associated with the detected MIME Type. " -+ "If there is no correlated extension, the attribute's value will be empty") +@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " ++ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), +@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " ++ "extension associated with the detected MIME Type. " ++ "If there is no correlated extension, the attribute's value will be empty"), +@WritesAttribute(attribute = "mime.charset", description = "This Processor sets the FlowFile's mime.charset attribute to the detected charset. " ++ "If unable to detect the charset or the detected MIME type is not of type text/*, the attribute's value will be empty") Review Comment: Perfect. Yes, that makes sense. Thank you for your quick responses. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
exceptionfactory commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393209952 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -89,14 +89,17 @@ @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", "identify"}) @CapabilityDescription("Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, " + "an attribute with the name 'mime.type' is added with the value being the MIME Type. If the MIME Type cannot be determined, " -+ "the value will be set to 'application/octet-stream'. In addition, the attribute mime.extension will be set if a common file " -+ "extension for the MIME Type is known.") ++ "the value will be set to 'application/octet-stream'. In addition, the attribute 'mime.extension' will be set if a common file " ++ "extension for the MIME Type is known. If the MIME Type detected is of type text/*, attempts to identify the charset used " + +"and an attribute with the name 'mime.charset' is added with the value being the charset.") @WritesAttributes({ -@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " -+ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), -@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " -+ "extension associated with the detected MIME Type. " -+ "If there is no correlated extension, the attribute's value will be empty") +@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " ++ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), +@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " ++ "extension associated with the detected MIME Type. " ++ "If there is no correlated extension, the attribute's value will be empty"), +@WritesAttribute(attribute = "mime.charset", description = "This Processor sets the FlowFile's mime.charset attribute to the detected charset. " ++ "If unable to detect the charset or the detected MIME type is not of type text/*, the attribute's value will be empty") Review Comment: Thanks for making the change to the new `mime.charset` approach. I think the existing approach for `mime.extension` should be left unchanged to avoid unexpected changes for existing flows. Since this is a new attribute that only applies for text types, it seems better to set it only when it is not null. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393174739 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -89,14 +89,17 @@ @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", "identify"}) @CapabilityDescription("Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, " + "an attribute with the name 'mime.type' is added with the value being the MIME Type. If the MIME Type cannot be determined, " -+ "the value will be set to 'application/octet-stream'. In addition, the attribute mime.extension will be set if a common file " -+ "extension for the MIME Type is known.") ++ "the value will be set to 'application/octet-stream'. In addition, the attribute 'mime.extension' will be set if a common file " ++ "extension for the MIME Type is known. If the MIME Type detected is of type text/*, attempts to identify the charset used " + +"and an attribute with the name 'mime.charset' is added with the value being the charset.") @WritesAttributes({ -@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " -+ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), -@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " -+ "extension associated with the detected MIME Type. " -+ "If there is no correlated extension, the attribute's value will be empty") +@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " ++ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), +@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " ++ "extension associated with the detected MIME Type. " ++ "If there is no correlated extension, the attribute's value will be empty"), +@WritesAttribute(attribute = "mime.charset", description = "This Processor sets the FlowFile's mime.charset attribute to the detected charset. " ++ "If unable to detect the charset or the detected MIME type is not of type text/*, the attribute's value will be empty") Review Comment: Thank you for the feedback. I adjusted the implementation and documentation. In the original commit, I followed the implementation of `mime.extension`, which is set to an empty string, when it cannot be determined. Should its implementation stay as is or be changed as well, such that it is not set at all, like `mime.charset`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393174739 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -89,14 +89,17 @@ @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", "identify"}) @CapabilityDescription("Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, " + "an attribute with the name 'mime.type' is added with the value being the MIME Type. If the MIME Type cannot be determined, " -+ "the value will be set to 'application/octet-stream'. In addition, the attribute mime.extension will be set if a common file " -+ "extension for the MIME Type is known.") ++ "the value will be set to 'application/octet-stream'. In addition, the attribute 'mime.extension' will be set if a common file " ++ "extension for the MIME Type is known. If the MIME Type detected is of type text/*, attempts to identify the charset used " + +"and an attribute with the name 'mime.charset' is added with the value being the charset.") @WritesAttributes({ -@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " -+ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), -@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " -+ "extension associated with the detected MIME Type. " -+ "If there is no correlated extension, the attribute's value will be empty") +@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " ++ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), +@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " ++ "extension associated with the detected MIME Type. " ++ "If there is no correlated extension, the attribute's value will be empty"), +@WritesAttribute(attribute = "mime.charset", description = "This Processor sets the FlowFile's mime.charset attribute to the detected charset. " ++ "If unable to detect the charset or the detected MIME type is not of type text/*, the attribute's value will be empty") Review Comment: Thank you for the feedback. I adjusted the implementation and documentation. In the original commit, I followed the implementation of `mime.extension`, which is set to an empty string, when it cannot be determined. Should its implementation stay as is or be changed as well, such that it does not be set like `mime.charset`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393176002 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final ProcessSession session } final ComponentLog logger = getLogger(); -final AtomicReference mimeTypeRef = new AtomicReference<>(null); -final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); - -session.read(flowFile, new InputStreamCallback() { -@Override -public void process(final InputStream stream) throws IOException { -try (final InputStream in = new BufferedInputStream(stream); - final TikaInputStream tikaStream = TikaInputStream.get(in)) { -Metadata metadata = new Metadata(); - -if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { -metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); -} -// Get mime type -MediaType mediatype = detector.detect(tikaStream, metadata); -mimeTypeRef.set(mediatype.toString()); -} + +final String mediaTypeString; +final String extension; +final Charset charset; + +try (final InputStream flowFileStream = session.read(flowFile); + final TikaInputStream tikaStream = TikaInputStream.get(flowFileStream)) { +final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); + +Metadata metadata = new Metadata(); +if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { +metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); } -}); -String mimeType = mimeTypeRef.get(); +final MediaType mediaType = detector.detect(tikaStream, metadata); +mediaTypeString = mediaType.getBaseType().toString(); +extension = lookupExtension(mediaTypeString, logger); +charset = identifyCharset(tikaStream, metadata, mediaType); +} catch (IOException e) { +throw new ProcessException("IOException thrown identifying mime-type of FlowFile content", e); +} + +flowFile = session.putAttribute(flowFile, CoreAttributes.MIME_TYPE.key(), mediaTypeString); +flowFile = session.putAttribute(flowFile, "mime.extension", extension); +flowFile = session.putAttribute(flowFile, "mime.charset", charset == null ? "" : charset.name()); Review Comment: I added a commit, no longer setting the attribute at all, when the charset cannot be detected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393175162 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final ProcessSession session } final ComponentLog logger = getLogger(); -final AtomicReference mimeTypeRef = new AtomicReference<>(null); -final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); - -session.read(flowFile, new InputStreamCallback() { -@Override -public void process(final InputStream stream) throws IOException { -try (final InputStream in = new BufferedInputStream(stream); - final TikaInputStream tikaStream = TikaInputStream.get(in)) { -Metadata metadata = new Metadata(); - -if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { -metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); -} -// Get mime type -MediaType mediatype = detector.detect(tikaStream, metadata); -mimeTypeRef.set(mediatype.toString()); -} + +final String mediaTypeString; +final String extension; +final Charset charset; + +try (final InputStream flowFileStream = session.read(flowFile); + final TikaInputStream tikaStream = TikaInputStream.get(flowFileStream)) { +final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); + +Metadata metadata = new Metadata(); +if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { +metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); } -}); -String mimeType = mimeTypeRef.get(); +final MediaType mediaType = detector.detect(tikaStream, metadata); +mediaTypeString = mediaType.getBaseType().toString(); +extension = lookupExtension(mediaTypeString, logger); +charset = identifyCharset(tikaStream, metadata, mediaType); +} catch (IOException e) { +throw new ProcessException("IOException thrown identifying mime-type of FlowFile content", e); Review Comment: I adjusted the message as suggested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393174739 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -89,14 +89,17 @@ @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", "identify"}) @CapabilityDescription("Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, " + "an attribute with the name 'mime.type' is added with the value being the MIME Type. If the MIME Type cannot be determined, " -+ "the value will be set to 'application/octet-stream'. In addition, the attribute mime.extension will be set if a common file " -+ "extension for the MIME Type is known.") ++ "the value will be set to 'application/octet-stream'. In addition, the attribute 'mime.extension' will be set if a common file " ++ "extension for the MIME Type is known. If the MIME Type detected is of type text/*, attempts to identify the charset used " + +"and an attribute with the name 'mime.charset' is added with the value being the charset.") @WritesAttributes({ -@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " -+ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), -@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " -+ "extension associated with the detected MIME Type. " -+ "If there is no correlated extension, the attribute's value will be empty") +@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " ++ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), +@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " ++ "extension associated with the detected MIME Type. " ++ "If there is no correlated extension, the attribute's value will be empty"), +@WritesAttribute(attribute = "mime.charset", description = "This Processor sets the FlowFile's mime.charset attribute to the detected charset. " ++ "If unable to detect the charset or the detected MIME type is not of type text/*, the attribute's value will be empty") Review Comment: Thank you for the feedback. I adjusted the implementation and documentation. In the original commit, I followed the implementation of `mime.extension`, which is set to an empty string, when it cannot be determined. Should its implementation as is or don't be set like `mime.charset` as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
exceptionfactory commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393154825 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final ProcessSession session } final ComponentLog logger = getLogger(); -final AtomicReference mimeTypeRef = new AtomicReference<>(null); -final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); - -session.read(flowFile, new InputStreamCallback() { -@Override -public void process(final InputStream stream) throws IOException { -try (final InputStream in = new BufferedInputStream(stream); - final TikaInputStream tikaStream = TikaInputStream.get(in)) { -Metadata metadata = new Metadata(); - -if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { -metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); -} -// Get mime type -MediaType mediatype = detector.detect(tikaStream, metadata); -mimeTypeRef.set(mediatype.toString()); -} + +final String mediaTypeString; +final String extension; +final Charset charset; + +try (final InputStream flowFileStream = session.read(flowFile); + final TikaInputStream tikaStream = TikaInputStream.get(flowFileStream)) { +final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); + +Metadata metadata = new Metadata(); +if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { +metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); } -}); -String mimeType = mimeTypeRef.get(); +final MediaType mediaType = detector.detect(tikaStream, metadata); +mediaTypeString = mediaType.getBaseType().toString(); +extension = lookupExtension(mediaTypeString, logger); +charset = identifyCharset(tikaStream, metadata, mediaType); +} catch (IOException e) { +throw new ProcessException("IOException thrown identifying mime-type of FlowFile content", e); +} + +flowFile = session.putAttribute(flowFile, CoreAttributes.MIME_TYPE.key(), mediaTypeString); +flowFile = session.putAttribute(flowFile, "mime.extension", extension); +flowFile = session.putAttribute(flowFile, "mime.charset", charset == null ? "" : charset.name()); Review Comment: It seems better to avoid setting this attribute at all, as opposed to setting an empty value when the detected character set is null. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
exceptionfactory commented on code in PR #8011: URL: https://github.com/apache/nifi/pull/8011#discussion_r1393158889 ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -89,14 +89,17 @@ @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", "identify"}) @CapabilityDescription("Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, " + "an attribute with the name 'mime.type' is added with the value being the MIME Type. If the MIME Type cannot be determined, " -+ "the value will be set to 'application/octet-stream'. In addition, the attribute mime.extension will be set if a common file " -+ "extension for the MIME Type is known.") ++ "the value will be set to 'application/octet-stream'. In addition, the attribute 'mime.extension' will be set if a common file " ++ "extension for the MIME Type is known. If the MIME Type detected is of type text/*, attempts to identify the charset used " + +"and an attribute with the name 'mime.charset' is added with the value being the charset.") @WritesAttributes({ -@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " -+ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), -@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " -+ "extension associated with the detected MIME Type. " -+ "If there is no correlated extension, the attribute's value will be empty") +@WritesAttribute(attribute = "mime.type", description = "This Processor sets the FlowFile's mime.type attribute to the detected MIME Type. " ++ "If unable to detect the MIME Type, the attribute's value will be set to application/octet-stream"), +@WritesAttribute(attribute = "mime.extension", description = "This Processor sets the FlowFile's mime.extension attribute to the file " ++ "extension associated with the detected MIME Type. " ++ "If there is no correlated extension, the attribute's value will be empty"), +@WritesAttribute(attribute = "mime.charset", description = "This Processor sets the FlowFile's mime.charset attribute to the detected charset. " ++ "If unable to detect the charset or the detected MIME type is not of type text/*, the attribute's value will be empty") Review Comment: As noted below, it seems better to avoid setting the attribute instead of setting an empty value for null. ## nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java: ## @@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final ProcessSession session } final ComponentLog logger = getLogger(); -final AtomicReference mimeTypeRef = new AtomicReference<>(null); -final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); - -session.read(flowFile, new InputStreamCallback() { -@Override -public void process(final InputStream stream) throws IOException { -try (final InputStream in = new BufferedInputStream(stream); - final TikaInputStream tikaStream = TikaInputStream.get(in)) { -Metadata metadata = new Metadata(); - -if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { -metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); -} -// Get mime type -MediaType mediatype = detector.detect(tikaStream, metadata); -mimeTypeRef.set(mediatype.toString()); -} + +final String mediaTypeString; +final String extension; +final Charset charset; + +try (final InputStream flowFileStream = session.read(flowFile); + final TikaInputStream tikaStream = TikaInputStream.get(flowFileStream)) { +final String filename = flowFile.getAttribute(CoreAttributes.FILENAME.key()); + +Metadata metadata = new Metadata(); +if (filename != null && context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) { +metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename); } -}); -String mimeType = mimeTypeRef.get(); +final MediaType mediaType = detector.detect(tikaStream, metadata); +mediaTypeString = mediaType.getBaseType().toString(); +extension = lookupExtension(mediaTypeString, logger); +charset =
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on PR #8011: URL: https://github.com/apache/nifi/pull/8011#issuecomment-1811025944 @exceptionfactory Is there any chance to include this into 1.24.0, now that the RC1 is called off? This way, we wouldn't need to wait for the following release which will be some weeks in the future, I assume. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins commented on PR #8011: URL: https://github.com/apache/nifi/pull/8011#issuecomment-1808890691 Thank you for taking a look at this @exceptionfactory, I highly appreciate it. I've made some changes regarding the stream re-use as you've pointed out. I replaced the use of `ProcessSession.read` that receives a `InputStreamCallback` with the version that returns the `InputStream` instead, to be able to use final fields instead of AtomicReferences. While doing so, I've noticed that some code paths were impossible to reach (especially the mimeType being null) and removed them. This further streamlines the processors implementation. Please let me know what you think about these changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]
EndzeitBegins opened a new pull request, #8011: URL: https://github.com/apache/nifi/pull/8011 # Summary Identifies `charset`as part of `IdentifyMimeType` processor, when the detected MIME type is `text/*`. [NIFI-4550](https://issues.apache.org/jira/browse/NIFI-4550) [NIFI-1874](https://issues.apache.org/jira/browse/NIFI-1874) # Tracking ### Issue Tracking - [x] [Apache NiFi Jira](https://issues.apache.org/jira/browse/NIFI) issue created ### Pull Request Tracking - [x] Pull Request title starts with Apache NiFi Jira issue number, such as `NIFI-0` - [x] Pull Request commit message starts with Apache NiFi Jira issue number, as such `NIFI-0` ### Pull Request Formatting - [x] Pull Request based on current revision of the `main` branch - [x] Pull Request refers to a feature branch with one commit containing changes # Verification Please indicate the verification steps performed prior to pull request creation. ### Build - [x] Build completed using `mvn clean install -P contrib-check` - [x] JDK 21 ### Licensing - [] New dependencies are compatible with the [Apache License 2.0](https://apache.org/licenses/LICENSE-2.0) according to the [License Policy](https://www.apache.org/legal/resolved.html) - [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` files ### Documentation - [] Documentation formatting appears as expected in rendered files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org