Josh,

I'm sorry for my delay. Without looking more deeply...

1) It totally isn't useful. My _guess_ is that this comes from the metadata-extractor project that we use in the JPEG parser. I can look into this, and we might consider turning it off in these cases.
2) We made recent, unreleased changes in how Tika handles tmp file names, and I _think_ we've fixed this.
3) Yes. Is there an actual file name encoded somewhere in the HTML? We might be able to associate it with this embedded file.
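
In the meantime, if those two keys get in the way, you could drop them on the client side. A rough Python sketch, assuming you consume the JSON output quoted below, and assuming "File Name" and "File Modified Date" are the only keys that describe the temp file rather than the image (I haven't confirmed that). The function name is made up for illustration; it is not part of Tika.

```python
# Keys assumed (not confirmed) to describe Tika's temp file rather than
# the embedded image itself. Adjust this set to whatever you observe.
TEMP_FILE_KEYS = {"File Name", "File Modified Date"}


def drop_temp_file_keys(metadata, keys=TEMP_FILE_KEYS):
    """Return a copy of one metadata object without the given keys."""
    return {k: v for k, v in metadata.items() if k not in keys}


# A cut-down version of the metadata object from the original message.
sample = {
    "Content-Type": "image/jpeg",
    "File Modified Date": "Fri Jan 06 14:23:00 -05:00 2023",
    "File Name": "apache-tika-2005517581185548340.tmp",
    "File Size": "45006 bytes",
    "X-TIKA:embedded_resource_path": "/embedded-1",
}

cleaned = drop_temp_file_keys(sample)
print(sorted(cleaned))
# → ['Content-Type', 'File Size', 'X-TIKA:embedded_resource_path']
```

This only hides the keys after the fact; it doesn't change what the parser emits.
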
If you'd prefer different behavior, please open an issue on our JIRA for discussion/work. If you're able to share the file even if privately, I can take a look and confirm 1 and 2.

On Fri, Jan 6, 2023 at 4:04 PM Josh Burchard <[email protected]> wrote:
>
> I'm a bit confused with the parsed output from an eml file. Below is the
> result of parsing an embedded image.
>
> Questions:
> - How is it useful to have "File Modified Date" be today's date? This email
> was from 2010 so I'd think any dates would correspond with that.
> - In the same vein, why print out the name of the .tmp file Tika extracted
> the embedded image to?
> - Why is there an "X-TIKA:embedded_resource_path" but no corresponding
> "resourceName"? Is it because no name exists in the embedded context so Tika
> generates one? If so, then why have a X-TIKA:embedded_resource_path meta item
> at all?
>
>
> {
>   "Component 1": "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
>   "Component 2": "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
>   "Component 3": "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
>   "Compression Type": "Baseline",
>   "Content-Type": "image/jpeg",
>   "Data Precision": "8 bits",
>   "File Modified Date": "Fri Jan 06 14:23:00 -05:00 2023",
>   "File Name": "apache-tika-2005517581185548340.tmp",
>   "File Size": "45006 bytes",
>   "Image Height": "125 pixels",
>   "Image Width": "894 pixels",
>   "Multipart-Boundary": "=_related 00719D7F852576AB_=",
>   "Multipart-Subtype": "related",
>   "Number of Components": "3",
>   "Number of Tables": "4 Huffman tables",
>   "Resolution Units": "none",
>   "Thumbnail Height Pixels": "0",
>   "Thumbnail Width Pixels": "0",
>   "Version": "1.1",
>   "X Resolution": "1 dot",
>   "X-TIKA:Parsed-By": [
>     "org.apache.tika.parser.DefaultParser",
>     "org.apache.tika.parser.image.JpegParser"
>   ],
>   "X-TIKA:embedded_depth": "1",
>   "X-TIKA:embedded_resource_path": "/embedded-1",
>   "X-TIKA:parse_time_millis": "16",
>   "Y Resolution": "1 dot",
>   "embeddedResourceType": "ATTACHMENT",
>   "tiff:BitsPerSample": "8",
>   "tiff:ImageLength": "125",
>   "tiff:ImageWidth": "894"
> }
>
