Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-15 Thread via GitHub


EndzeitBegins commented on PR #8011:
URL: https://github.com/apache/nifi/pull/8011#issuecomment-1813151208

   The backport was more straightforward than I've expected. I've opened #8031.
   I'd highly appreciate if you'd take a look at it @exceptionfactory, as 
you're familiar with this original pull-request already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-15 Thread via GitHub


EndzeitBegins commented on PR #8011:
URL: https://github.com/apache/nifi/pull/8011#issuecomment-1812907291

   Thank you for your reviews and the fast merge @exceptionfactory.
   
   I'll take a look at what's required for the backport in the next days 
presumably and open a new pull-request if the work required is manageable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-15 Thread via GitHub


exceptionfactory closed pull request #8011: NIFI-4550 Detect charset for 
FlowFile with MIME type text/*
URL: https://github.com/apache/nifi/pull/8011


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


EndzeitBegins commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393217615


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -89,14 +89,17 @@
 @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", 
"identify"})
 @CapabilityDescription("Attempts to identify the MIME Type used for a 
FlowFile. If the MIME Type can be identified, "
 + "an attribute with the name 'mime.type' is added with the value 
being the MIME Type. If the MIME Type cannot be determined, "
-+ "the value will be set to 'application/octet-stream'. In addition, 
the attribute mime.extension will be set if a common file "
-+ "extension for the MIME Type is known.")
++ "the value will be set to 'application/octet-stream'. In addition, 
the attribute 'mime.extension' will be set if a common file "
++ "extension for the MIME Type is known. If the MIME Type detected is 
of type text/*, attempts to identify the charset used " +
+"and an attribute with the name 'mime.charset' is added with the value 
being the charset.")
 @WritesAttributes({
-@WritesAttribute(attribute = "mime.type", description = "This Processor sets 
the FlowFile's mime.type attribute to the detected MIME Type. "
-+ "If unable to detect the MIME Type, the attribute's value will be 
set to application/octet-stream"),
-@WritesAttribute(attribute = "mime.extension", description = "This Processor 
sets the FlowFile's mime.extension attribute to the file "
-+ "extension associated with the detected MIME Type. "
-+ "If there is no correlated extension, the attribute's value will be 
empty")
+@WritesAttribute(attribute = "mime.type", description = "This 
Processor sets the FlowFile's mime.type attribute to the detected MIME Type. "
++ "If unable to detect the MIME Type, the attribute's value 
will be set to application/octet-stream"),
+@WritesAttribute(attribute = "mime.extension", description = "This 
Processor sets the FlowFile's mime.extension attribute to the file "
++ "extension associated with the detected MIME Type. "
++ "If there is no correlated extension, the attribute's value 
will be empty"),
+@WritesAttribute(attribute = "mime.charset", description = "This 
Processor sets the FlowFile's mime.charset attribute to the detected charset. "
++ "If unable to detect the charset or the detected MIME type 
is not of type text/*, the attribute's value will be empty")

Review Comment:
   Perfect. Yes, that makes sense. Thank you for your quick responses.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


exceptionfactory commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393209952


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -89,14 +89,17 @@
 @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", 
"identify"})
 @CapabilityDescription("Attempts to identify the MIME Type used for a 
FlowFile. If the MIME Type can be identified, "
 + "an attribute with the name 'mime.type' is added with the value 
being the MIME Type. If the MIME Type cannot be determined, "
-+ "the value will be set to 'application/octet-stream'. In addition, 
the attribute mime.extension will be set if a common file "
-+ "extension for the MIME Type is known.")
++ "the value will be set to 'application/octet-stream'. In addition, 
the attribute 'mime.extension' will be set if a common file "
++ "extension for the MIME Type is known. If the MIME Type detected is 
of type text/*, attempts to identify the charset used " +
+"and an attribute with the name 'mime.charset' is added with the value 
being the charset.")
 @WritesAttributes({
-@WritesAttribute(attribute = "mime.type", description = "This Processor sets 
the FlowFile's mime.type attribute to the detected MIME Type. "
-+ "If unable to detect the MIME Type, the attribute's value will be 
set to application/octet-stream"),
-@WritesAttribute(attribute = "mime.extension", description = "This Processor 
sets the FlowFile's mime.extension attribute to the file "
-+ "extension associated with the detected MIME Type. "
-+ "If there is no correlated extension, the attribute's value will be 
empty")
+@WritesAttribute(attribute = "mime.type", description = "This 
Processor sets the FlowFile's mime.type attribute to the detected MIME Type. "
++ "If unable to detect the MIME Type, the attribute's value 
will be set to application/octet-stream"),
+@WritesAttribute(attribute = "mime.extension", description = "This 
Processor sets the FlowFile's mime.extension attribute to the file "
++ "extension associated with the detected MIME Type. "
++ "If there is no correlated extension, the attribute's value 
will be empty"),
+@WritesAttribute(attribute = "mime.charset", description = "This 
Processor sets the FlowFile's mime.charset attribute to the detected charset. "
++ "If unable to detect the charset or the detected MIME type 
is not of type text/*, the attribute's value will be empty")

Review Comment:
   Thanks for making the change to the new `mime.charset` approach. I think the 
existing approach for `mime.extension` should be left unchanged to avoid 
unexpected changes for existing flows. Since this is a new attribute that only 
applies for text types, it seems better to set it only when it is not null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


EndzeitBegins commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393174739


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -89,14 +89,17 @@
 @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", 
"identify"})
 @CapabilityDescription("Attempts to identify the MIME Type used for a 
FlowFile. If the MIME Type can be identified, "
 + "an attribute with the name 'mime.type' is added with the value 
being the MIME Type. If the MIME Type cannot be determined, "
-+ "the value will be set to 'application/octet-stream'. In addition, 
the attribute mime.extension will be set if a common file "
-+ "extension for the MIME Type is known.")
++ "the value will be set to 'application/octet-stream'. In addition, 
the attribute 'mime.extension' will be set if a common file "
++ "extension for the MIME Type is known. If the MIME Type detected is 
of type text/*, attempts to identify the charset used " +
+"and an attribute with the name 'mime.charset' is added with the value 
being the charset.")
 @WritesAttributes({
-@WritesAttribute(attribute = "mime.type", description = "This Processor sets 
the FlowFile's mime.type attribute to the detected MIME Type. "
-+ "If unable to detect the MIME Type, the attribute's value will be 
set to application/octet-stream"),
-@WritesAttribute(attribute = "mime.extension", description = "This Processor 
sets the FlowFile's mime.extension attribute to the file "
-+ "extension associated with the detected MIME Type. "
-+ "If there is no correlated extension, the attribute's value will be 
empty")
+@WritesAttribute(attribute = "mime.type", description = "This 
Processor sets the FlowFile's mime.type attribute to the detected MIME Type. "
++ "If unable to detect the MIME Type, the attribute's value 
will be set to application/octet-stream"),
+@WritesAttribute(attribute = "mime.extension", description = "This 
Processor sets the FlowFile's mime.extension attribute to the file "
++ "extension associated with the detected MIME Type. "
++ "If there is no correlated extension, the attribute's value 
will be empty"),
+@WritesAttribute(attribute = "mime.charset", description = "This 
Processor sets the FlowFile's mime.charset attribute to the detected charset. "
++ "If unable to detect the charset or the detected MIME type 
is not of type text/*, the attribute's value will be empty")

Review Comment:
   Thank you for the feedback. 
   I adjusted the implementation and documentation. In the original commit, I 
followed the implementation of `mime.extension`, which is set to an empty 
string, when it cannot be determined. Should its implementation stay as is or 
be changed as well, such that it is not set at all, like `mime.charset`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


EndzeitBegins commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393174739


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -89,14 +89,17 @@
 @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", 
"identify"})
 @CapabilityDescription("Attempts to identify the MIME Type used for a 
FlowFile. If the MIME Type can be identified, "
 + "an attribute with the name 'mime.type' is added with the value 
being the MIME Type. If the MIME Type cannot be determined, "
-+ "the value will be set to 'application/octet-stream'. In addition, 
the attribute mime.extension will be set if a common file "
-+ "extension for the MIME Type is known.")
++ "the value will be set to 'application/octet-stream'. In addition, 
the attribute 'mime.extension' will be set if a common file "
++ "extension for the MIME Type is known. If the MIME Type detected is 
of type text/*, attempts to identify the charset used " +
+"and an attribute with the name 'mime.charset' is added with the value 
being the charset.")
 @WritesAttributes({
-@WritesAttribute(attribute = "mime.type", description = "This Processor sets 
the FlowFile's mime.type attribute to the detected MIME Type. "
-+ "If unable to detect the MIME Type, the attribute's value will be 
set to application/octet-stream"),
-@WritesAttribute(attribute = "mime.extension", description = "This Processor 
sets the FlowFile's mime.extension attribute to the file "
-+ "extension associated with the detected MIME Type. "
-+ "If there is no correlated extension, the attribute's value will be 
empty")
+@WritesAttribute(attribute = "mime.type", description = "This 
Processor sets the FlowFile's mime.type attribute to the detected MIME Type. "
++ "If unable to detect the MIME Type, the attribute's value 
will be set to application/octet-stream"),
+@WritesAttribute(attribute = "mime.extension", description = "This 
Processor sets the FlowFile's mime.extension attribute to the file "
++ "extension associated with the detected MIME Type. "
++ "If there is no correlated extension, the attribute's value 
will be empty"),
+@WritesAttribute(attribute = "mime.charset", description = "This 
Processor sets the FlowFile's mime.charset attribute to the detected charset. "
++ "If unable to detect the charset or the detected MIME type 
is not of type text/*, the attribute's value will be empty")

Review Comment:
   Thank you for the feedback. 
   I adjusted the implementation and documentation. In the original commit, I 
followed the implementation of `mime.extension`, which is set to an empty 
string, when it cannot be determined. Should its implementation stay as is or 
be changed as well, such that it does not be set like `mime.charset`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


EndzeitBegins commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393176002


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final 
ProcessSession session
 }
 
 final ComponentLog logger = getLogger();
-final AtomicReference mimeTypeRef = new 
AtomicReference<>(null);
-final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
-
-session.read(flowFile, new InputStreamCallback() {
-@Override
-public void process(final InputStream stream) throws IOException {
-try (final InputStream in = new BufferedInputStream(stream);
- final TikaInputStream tikaStream = 
TikaInputStream.get(in)) {
-Metadata metadata = new Metadata();
-
-if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
-metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, 
filename);
-}
-// Get mime type
-MediaType mediatype = detector.detect(tikaStream, 
metadata);
-mimeTypeRef.set(mediatype.toString());
-}
+
+final String mediaTypeString;
+final String extension;
+final Charset charset;
+
+try (final InputStream flowFileStream = session.read(flowFile);
+ final TikaInputStream tikaStream = 
TikaInputStream.get(flowFileStream)) {
+final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
+
+Metadata metadata = new Metadata();
+if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
+metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename);
 }
-});
 
-String mimeType = mimeTypeRef.get();
+final MediaType mediaType = detector.detect(tikaStream, metadata);
+mediaTypeString = mediaType.getBaseType().toString();
+extension = lookupExtension(mediaTypeString, logger);
+charset = identifyCharset(tikaStream, metadata, mediaType);
+} catch (IOException e) {
+throw new ProcessException("IOException thrown identifying 
mime-type of FlowFile content", e);
+}
+
+flowFile = session.putAttribute(flowFile, 
CoreAttributes.MIME_TYPE.key(), mediaTypeString);
+flowFile = session.putAttribute(flowFile, "mime.extension", extension);
+flowFile = session.putAttribute(flowFile, "mime.charset", charset == 
null ? "" : charset.name());

Review Comment:
   I added a commit, no longer setting the attribute at all, when the charset 
cannot be detected.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


EndzeitBegins commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393175162


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final 
ProcessSession session
 }
 
 final ComponentLog logger = getLogger();
-final AtomicReference mimeTypeRef = new 
AtomicReference<>(null);
-final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
-
-session.read(flowFile, new InputStreamCallback() {
-@Override
-public void process(final InputStream stream) throws IOException {
-try (final InputStream in = new BufferedInputStream(stream);
- final TikaInputStream tikaStream = 
TikaInputStream.get(in)) {
-Metadata metadata = new Metadata();
-
-if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
-metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, 
filename);
-}
-// Get mime type
-MediaType mediatype = detector.detect(tikaStream, 
metadata);
-mimeTypeRef.set(mediatype.toString());
-}
+
+final String mediaTypeString;
+final String extension;
+final Charset charset;
+
+try (final InputStream flowFileStream = session.read(flowFile);
+ final TikaInputStream tikaStream = 
TikaInputStream.get(flowFileStream)) {
+final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
+
+Metadata metadata = new Metadata();
+if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
+metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename);
 }
-});
 
-String mimeType = mimeTypeRef.get();
+final MediaType mediaType = detector.detect(tikaStream, metadata);
+mediaTypeString = mediaType.getBaseType().toString();
+extension = lookupExtension(mediaTypeString, logger);
+charset = identifyCharset(tikaStream, metadata, mediaType);
+} catch (IOException e) {
+throw new ProcessException("IOException thrown identifying 
mime-type of FlowFile content", e);

Review Comment:
   I adjusted the message as suggested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


EndzeitBegins commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393174739


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -89,14 +89,17 @@
 @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", 
"identify"})
 @CapabilityDescription("Attempts to identify the MIME Type used for a 
FlowFile. If the MIME Type can be identified, "
 + "an attribute with the name 'mime.type' is added with the value 
being the MIME Type. If the MIME Type cannot be determined, "
-+ "the value will be set to 'application/octet-stream'. In addition, 
the attribute mime.extension will be set if a common file "
-+ "extension for the MIME Type is known.")
++ "the value will be set to 'application/octet-stream'. In addition, 
the attribute 'mime.extension' will be set if a common file "
++ "extension for the MIME Type is known. If the MIME Type detected is 
of type text/*, attempts to identify the charset used " +
+"and an attribute with the name 'mime.charset' is added with the value 
being the charset.")
 @WritesAttributes({
-@WritesAttribute(attribute = "mime.type", description = "This Processor sets 
the FlowFile's mime.type attribute to the detected MIME Type. "
-+ "If unable to detect the MIME Type, the attribute's value will be 
set to application/octet-stream"),
-@WritesAttribute(attribute = "mime.extension", description = "This Processor 
sets the FlowFile's mime.extension attribute to the file "
-+ "extension associated with the detected MIME Type. "
-+ "If there is no correlated extension, the attribute's value will be 
empty")
+@WritesAttribute(attribute = "mime.type", description = "This 
Processor sets the FlowFile's mime.type attribute to the detected MIME Type. "
++ "If unable to detect the MIME Type, the attribute's value 
will be set to application/octet-stream"),
+@WritesAttribute(attribute = "mime.extension", description = "This 
Processor sets the FlowFile's mime.extension attribute to the file "
++ "extension associated with the detected MIME Type. "
++ "If there is no correlated extension, the attribute's value 
will be empty"),
+@WritesAttribute(attribute = "mime.charset", description = "This 
Processor sets the FlowFile's mime.charset attribute to the detected charset. "
++ "If unable to detect the charset or the detected MIME type 
is not of type text/*, the attribute's value will be empty")

Review Comment:
   Thank you for the feedback. 
   I adjusted the implementation and documentation. In the original commit, I 
followed the implementation of `mime.extension`, which is set to an empty 
string, when it cannot be determined. Should its implementation as is or don't 
be set like `mime.charset` as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


exceptionfactory commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393154825


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final 
ProcessSession session
 }
 
 final ComponentLog logger = getLogger();
-final AtomicReference mimeTypeRef = new 
AtomicReference<>(null);
-final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
-
-session.read(flowFile, new InputStreamCallback() {
-@Override
-public void process(final InputStream stream) throws IOException {
-try (final InputStream in = new BufferedInputStream(stream);
- final TikaInputStream tikaStream = 
TikaInputStream.get(in)) {
-Metadata metadata = new Metadata();
-
-if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
-metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, 
filename);
-}
-// Get mime type
-MediaType mediatype = detector.detect(tikaStream, 
metadata);
-mimeTypeRef.set(mediatype.toString());
-}
+
+final String mediaTypeString;
+final String extension;
+final Charset charset;
+
+try (final InputStream flowFileStream = session.read(flowFile);
+ final TikaInputStream tikaStream = 
TikaInputStream.get(flowFileStream)) {
+final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
+
+Metadata metadata = new Metadata();
+if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
+metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename);
 }
-});
 
-String mimeType = mimeTypeRef.get();
+final MediaType mediaType = detector.detect(tikaStream, metadata);
+mediaTypeString = mediaType.getBaseType().toString();
+extension = lookupExtension(mediaTypeString, logger);
+charset = identifyCharset(tikaStream, metadata, mediaType);
+} catch (IOException e) {
+throw new ProcessException("IOException thrown identifying 
mime-type of FlowFile content", e);
+}
+
+flowFile = session.putAttribute(flowFile, 
CoreAttributes.MIME_TYPE.key(), mediaTypeString);
+flowFile = session.putAttribute(flowFile, "mime.extension", extension);
+flowFile = session.putAttribute(flowFile, "mime.charset", charset == 
null ? "" : charset.name());

Review Comment:
   It seems better to avoid setting this attribute at all, as opposed to 
setting an empty value when the detected character set is null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


exceptionfactory commented on code in PR #8011:
URL: https://github.com/apache/nifi/pull/8011#discussion_r1393158889


##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -89,14 +89,17 @@
 @Tags({"compression", "gzip", "bzip2", "zip", "MIME", "mime.type", "file", 
"identify"})
 @CapabilityDescription("Attempts to identify the MIME Type used for a 
FlowFile. If the MIME Type can be identified, "
 + "an attribute with the name 'mime.type' is added with the value 
being the MIME Type. If the MIME Type cannot be determined, "
-+ "the value will be set to 'application/octet-stream'. In addition, 
the attribute mime.extension will be set if a common file "
-+ "extension for the MIME Type is known.")
++ "the value will be set to 'application/octet-stream'. In addition, 
the attribute 'mime.extension' will be set if a common file "
++ "extension for the MIME Type is known. If the MIME Type detected is 
of type text/*, attempts to identify the charset used " +
+"and an attribute with the name 'mime.charset' is added with the value 
being the charset.")
 @WritesAttributes({
-@WritesAttribute(attribute = "mime.type", description = "This Processor sets 
the FlowFile's mime.type attribute to the detected MIME Type. "
-+ "If unable to detect the MIME Type, the attribute's value will be 
set to application/octet-stream"),
-@WritesAttribute(attribute = "mime.extension", description = "This Processor 
sets the FlowFile's mime.extension attribute to the file "
-+ "extension associated with the detected MIME Type. "
-+ "If there is no correlated extension, the attribute's value will be 
empty")
+@WritesAttribute(attribute = "mime.type", description = "This 
Processor sets the FlowFile's mime.type attribute to the detected MIME Type. "
++ "If unable to detect the MIME Type, the attribute's value 
will be set to application/octet-stream"),
+@WritesAttribute(attribute = "mime.extension", description = "This 
Processor sets the FlowFile's mime.extension attribute to the file "
++ "extension associated with the detected MIME Type. "
++ "If there is no correlated extension, the attribute's value 
will be empty"),
+@WritesAttribute(attribute = "mime.charset", description = "This 
Processor sets the FlowFile's mime.charset attribute to the detected charset. "
++ "If unable to detect the charset or the detected MIME type 
is not of type text/*, the attribute's value will be empty")

Review Comment:
   As noted below, it seems better to avoid setting the attribute instead of 
setting an empty value for null.



##
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java:
##
@@ -233,53 +239,62 @@ public void onTrigger(final ProcessContext context, final 
ProcessSession session
 }
 
 final ComponentLog logger = getLogger();
-final AtomicReference mimeTypeRef = new 
AtomicReference<>(null);
-final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
-
-session.read(flowFile, new InputStreamCallback() {
-@Override
-public void process(final InputStream stream) throws IOException {
-try (final InputStream in = new BufferedInputStream(stream);
- final TikaInputStream tikaStream = 
TikaInputStream.get(in)) {
-Metadata metadata = new Metadata();
-
-if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
-metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, 
filename);
-}
-// Get mime type
-MediaType mediatype = detector.detect(tikaStream, 
metadata);
-mimeTypeRef.set(mediatype.toString());
-}
+
+final String mediaTypeString;
+final String extension;
+final Charset charset;
+
+try (final InputStream flowFileStream = session.read(flowFile);
+ final TikaInputStream tikaStream = 
TikaInputStream.get(flowFileStream)) {
+final String filename = 
flowFile.getAttribute(CoreAttributes.FILENAME.key());
+
+Metadata metadata = new Metadata();
+if (filename != null && 
context.getProperty(USE_FILENAME_IN_DETECTION).asBoolean()) {
+metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filename);
 }
-});
 
-String mimeType = mimeTypeRef.get();
+final MediaType mediaType = detector.detect(tikaStream, metadata);
+mediaTypeString = mediaType.getBaseType().toString();
+extension = lookupExtension(mediaTypeString, logger);
+charset = 

Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-14 Thread via GitHub


EndzeitBegins commented on PR #8011:
URL: https://github.com/apache/nifi/pull/8011#issuecomment-1811025944

   @exceptionfactory Is there any chance to include this into 1.24.0, now that 
the RC1 is called off? This way, we wouldn't need to wait for the following 
release which will be some weeks in the future, I assume.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-13 Thread via GitHub


EndzeitBegins commented on PR #8011:
URL: https://github.com/apache/nifi/pull/8011#issuecomment-1808890691

   Thank you for taking a look at this @exceptionfactory, I highly appreciate 
it.
   
   I've made some changes regarding the stream re-use as you've pointed out.
   I replaced the use of `ProcessSession.read` that receives a 
`InputStreamCallback` with the version that returns the `InputStream` instead, 
to be able to use final fields instead of AtomicReferences. 
   While doing so, I've noticed that some code paths were impossible to reach 
(especially the mimeType being null) and removed them. This further streamlines 
the processors implementation.
   
   Please let me know what you think about these changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] NIFI-4550 Detect charset for FlowFile with MIME type text/* [nifi]

2023-11-12 Thread via GitHub


EndzeitBegins opened a new pull request, #8011:
URL: https://github.com/apache/nifi/pull/8011

   
   
   
   
   
   
   
   
   
   
   
   
   
   # Summary
   
   Identifies `charset`as part of `IdentifyMimeType` processor, when the 
detected MIME type is `text/*`.
   
   [NIFI-4550](https://issues.apache.org/jira/browse/NIFI-4550)
   [NIFI-1874](https://issues.apache.org/jira/browse/NIFI-1874)
   
   # Tracking
   
   ### Issue Tracking
   
   - [x] [Apache NiFi Jira](https://issues.apache.org/jira/browse/NIFI) issue 
created
   
   ### Pull Request Tracking
   
   - [x] Pull Request title starts with Apache NiFi Jira issue number, such as 
`NIFI-0`
   - [x] Pull Request commit message starts with Apache NiFi Jira issue number, 
as such `NIFI-0`
   
   ### Pull Request Formatting
   
   - [x] Pull Request based on current revision of the `main` branch
   - [x] Pull Request refers to a feature branch with one commit containing 
changes
   
   # Verification
   
   Please indicate the verification steps performed prior to pull request 
creation.
   
   ### Build
   
   - [x] Build completed using `mvn clean install -P contrib-check`
 - [x] JDK 21
   
   ### Licensing
   
   - [] New dependencies are compatible with the [Apache License 
2.0](https://apache.org/licenses/LICENSE-2.0) according to the [License 
Policy](https://www.apache.org/legal/resolved.html)
   - [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` 
files
   
   ### Documentation
   
   - [] Documentation formatting appears as expected in rendered files
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org