Josh,

Thank you for sharing an example file with me offline. That helped clarify what's going on.

1) Yes, this is coming from our JpegParser calling Drew Noakes' metadata-extractor, and Tika is adding the "File Name", "File Size" and "File Modified Date" in our CopyUnknownFieldsHandler (https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-image-module/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L277)

2) The recent changes do not affect the file name in this case, so metadata-extractor will still extract the tmp file name. :(

3) I couldn't find a better name for the image in the original file. :(

In Tika 3.x, I'd like to prefix all metadata keys so that we know the sources of the information. We can start dual-labeling keys in 2.x as we did in 1.x, where we keep the original key but also add the new key...effectively duplicating data, but allowing for a smoother transition to 3.x.

We use "dcterms:created" for the actual metadata about when a file alleges it was created -- see for example in the parent file: "dcterms:created": "2010-01-14T20:50:34Z".

And, as you note, there is no "resourceName" for the embedded file because neither Tika nor I could find an actual name for the embedded file.

Two things are weird to me:

a) There's no Content-Length on the embedded file. Given that we spooled the file to a temp file, we should be adding that.

b) The image's embedded type is "ATTACHMENT" and it should be "INLINE" because the image is intended to be rendered as part of the email.

So, what do we do now? Is there better behavior that won't break backwards compatibility?

Best,

Tim

On Tue, Jan 17, 2023 at 11:39 AM Tim Allison <[email protected]> wrote:
>
> Josh,
> I'm sorry for my delay. Without looking more deeply...
>
> 1) It totally isn't. My _guess_ is that this comes from the
> metadata-extractor project that we use in the JPEG parser. I can look
> into this, and we might consider turning it off in these cases.
> 2) We made recent, unreleased changes in how Tika handles tmp file
> names, and I _think_ we've fixed this.
> 3) Yes. Is there an actual file name encoded somewhere in the html?
> We might be able to associate it with this embedded file.
>
> If you'd prefer different behavior, please open an issue on our JIRA
> for discussion/work. If you're able to share the file even if
> privately, I can take a look and confirm 1 and 2.
>
> On Fri, Jan 6, 2023 at 4:04 PM Josh Burchard <[email protected]> wrote:
> >
> > I'm a bit confused with the parsed output from an eml file. Below is the
> > result of parsing an embedded image.
> >
> > Questions:
> > - How is it useful to have "File Modified Date" be today's date? This
> > email was from 2010 so I'd think any dates would correspond with that.
> > - In the same vein, why print out the name of the .tmp file Tika extracted
> > the embedded image to?
> > - Why is there an "X-TIKA:embedded_resource_path" but no corresponding
> > "resourceName"? Is it because no name exists in the embedded context so
> > Tika generates one? If so, then why have a X-TIKA:embedded_resource_path
> > meta item at all?
> >
> >
> > {
> >     "Component 1": "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
> >     "Component 2": "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
> >     "Component 3": "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
> >     "Compression Type": "Baseline",
> >     "Content-Type": "image/jpeg",
> >     "Data Precision": "8 bits",
> >     "File Modified Date": "Fri Jan 06 14:23:00 -05:00 2023",
> >     "File Name": "apache-tika-2005517581185548340.tmp",
> >     "File Size": "45006 bytes",
> >     "Image Height": "125 pixels",
> >     "Image Width": "894 pixels",
> >     "Multipart-Boundary": "=_related 00719D7F852576AB_=",
> >     "Multipart-Subtype": "related",
> >     "Number of Components": "3",
> >     "Number of Tables": "4 Huffman tables",
> >     "Resolution Units": "none",
> >     "Thumbnail Height Pixels": "0",
> >     "Thumbnail Width Pixels": "0",
> >     "Version": "1.1",
> >     "X Resolution": "1 dot",
> >     "X-TIKA:Parsed-By": [
> >         "org.apache.tika.parser.DefaultParser",
> >         "org.apache.tika.parser.image.JpegParser"
> >     ],
> >     "X-TIKA:embedded_depth": "1",
> >     "X-TIKA:embedded_resource_path": "/embedded-1",
> >     "X-TIKA:parse_time_millis": "16",
> >     "Y Resolution": "1 dot",
> >     "embeddedResourceType": "ATTACHMENT",
> >     "tiff:BitsPerSample": "8",
> >     "tiff:ImageLength": "125",
> >     "tiff:ImageWidth": "894"
> > }
> >
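
For illustration only, the dual-labeling idea discussed in this thread (keep the original key, and also add a copy under a source-prefixed key) might look roughly like the Python sketch below. This is not Tika code: the function name `dual_label` and the prefix `example-source:` are hypothetical, and the thread does not say which prefix Tika 3.x would actually use.

```python
def dual_label(metadata, source_keys, prefix):
    """Return a copy of `metadata` in which every key listed in `source_keys`
    is kept as-is and also duplicated under `prefix + key`.

    `prefix` is a hypothetical source label, not an actual Tika namespace.
    """
    labeled = dict(metadata)  # keep all original keys for backwards compatibility
    for key in source_keys:
        if key in metadata:
            labeled[prefix + key] = metadata[key]  # duplicated, source-labeled copy
    return labeled


# A few values taken from the parsed output quoted in this thread.
embedded = {
    "File Name": "apache-tika-2005517581185548340.tmp",
    "File Size": "45006 bytes",
    "Content-Type": "image/jpeg",
}

# Only the keys known to come from the file-system fields get a labeled copy.
result = dual_label(
    embedded,
    ["File Name", "File Size", "File Modified Date"],
    "example-source:",
)
```

Old consumers keep reading "File Name" unchanged, while new consumers can tell from the prefixed copy that the value describes the temp file rather than the original attachment.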
