[jira] [Created] (TIKA-2478) MBOX import includes redundant copies of the text
Robert Letzler created TIKA-2478:

Summary: MBOX import includes redundant copies of the text
Key: TIKA-2478
URL: https://issues.apache.org/jira/browse/TIKA-2478
Project: Tika
Issue Type: Bug
Affects Versions: 1.16
Reporter: Robert Letzler
Priority: Minor

MBOX messages often get parsed into four documents:
a. The mbox file, the outer container: "/"
b. The actual email: "/embedded-1"
c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not create separate "attached" documents for the HTML body and the text body.

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.

Thanks!

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
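The "first non-null body" behavior requested above can be pictured with a minimal sketch. This is not Tika code: the class and method names are hypothetical, and treating a blank body as absent is an added assumption beyond the "non-null" rule the report describes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Tika: choose one body and skip the rest. */
class FirstBodySelector {

    /**
     * Returns the first candidate that is non-null and not blank, or null if
     * there is none. Candidates are assumed to be listed in order of preference.
     */
    static String firstNonEmpty(List<String> candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The HTML body is present, so the plain-text alternative is skipped.
        List<String> bodies = Arrays.asList(null, "<p>Hello</p>", "Hello");
        System.out.println(firstNonEmpty(bodies));
    }
}
```

With a rule like this, only one body per message would be handed on for parsing, so entries c and d would not both appear as embedded documents.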
[jira] [Commented] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
[ https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206460#comment-16206460 ]

ASF GitHub Bot commented on TIKA-1788:

tballison commented on issue #211: [TIKA-1788] RFC822Parser: provide email attachment filenames when available
URL: https://github.com/apache/tika/pull/211#issuecomment-337012343

@AarjavP, I very much appreciate this PR. I regret that I haven't been able to review it carefully yet, but I look forward to doing so over the next week or so (I hope). Thank you!

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
>
> Key: TIKA-1788
> URL: https://issues.apache.org/jira/browse/TIKA-1788
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.11
> Reporter: Sergey Tsalkov
> Assignee: Tim Allison
> Attachments: grep_content_disposition.zip
>
> rfc822 email files can contain attachments as subparts, and they'll generally specify the filename of the attachment in a manner like this:
>
> Content-Disposition: attachment;
>   filename*=utf-8''image001.jpg
>
> Tika doesn't seem to be grabbing that information at all!
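The `filename*=utf-8''image001.jpg` form in the quoted report is the extended parameter syntax (charset, optional language, then percent-encoded bytes, separated by single quotes). As a rough illustration of what reading it involves, here is a standalone sketch using only the JDK. It is not the code from PR #211, and the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: read a filename* parameter from a Content-Disposition value. */
class ExtendedFilenameDecoder {

    // Captures the raw value of filename*=... up to the next ';' or whitespace.
    private static final Pattern FILENAME_STAR =
            Pattern.compile("(?i)filename\\*\\s*=\\s*([^;\\s]+)");

    /** Returns the decoded filename, or null if there is no usable filename* parameter. */
    static String filenameFrom(String contentDisposition) {
        Matcher m = FILENAME_STAR.matcher(contentDisposition);
        return m.find() ? decode(m.group(1)) : null;
    }

    /** Decodes charset'language'percent-encoded-bytes; returns null on malformed input. */
    static String decode(String extValue) {
        int first = extValue.indexOf('\'');
        int second = first < 0 ? -1 : extValue.indexOf('\'', first + 1);
        if (first <= 0 || second < 0) {
            return null; // missing charset or missing quote separators
        }
        Charset charset;
        try {
            charset = Charset.forName(extValue.substring(0, first));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
        String encoded = extValue.substring(second + 1);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c > 0x7F) {
                return null; // raw non-ASCII is not allowed in this syntax
            }
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    return null; // truncated escape
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    return null; // not a hex escape
                }
                bytes.write((hi << 4) | lo);
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // The folded header value from the report; prints image001.jpg
        System.out.println(filenameFrom("attachment;\r\n filename*=utf-8''image001.jpg"));
    }
}
```

The percent-decoding is done by hand rather than with `java.net.URLDecoder`, because `URLDecoder` also turns `+` into a space, which is form-encoding behavior and not part of this parameter syntax.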
[jira] [Comment Edited] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers
[ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206418#comment-16206418 ]

Tim Allison edited comment on TIKA-2471 at 10/16/17 7:30 PM:

That looks totally hosed. Thank you for opening this and supplying an example triggering file.

bq. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?

TIKA-1244 brought that behavior in. Before that, emails weren't treated as embedded files, if I understand correctly.

bq. why does the parser force Windows-1252 as the charset?

Again, no idea, -but I suspect that was because of the rfc822 method of encoding-. I simply have no idea. Are you able to share an example where this corrupts the content?

was (Author: talli...@mitre.org):

That looks totally hosed. Thank you for opening this and supplying an example triggering file.

bq. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?

TIKA-1244 brought that behavior in. Before that, emails weren't treated as embedded files, if I understand correctly.

bq. why does the parser force Windows-1252 as the charset?

Again, no idea, but I suspect that was because of the rfc822 method of encoding. Are you able to share an example where this corrupts the content?

> Tab-prefixed message body lines in Mbox interpreted as headers
>
> Key: TIKA-2471
> URL: https://issues.apache.org/jira/browse/TIKA-2471
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Labels: message, rfc822
> Attachments: mbox
>
> The mbox parser code is overly optimistic. It parses the entire message looking for anything that matches a header pattern, wherever it occurs in a line!
>
> It looks to me like the parsing logic is in desperate need of a refactor. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?
>
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
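To make the reported problem concrete, here is a small sketch of a stricter approach: a header must start at column 0, a line starting with a space or tab only continues the previous header, and nothing after the first blank line is inspected. This is an illustration under those assumptions, not the MboxParser code or a proposed patch, and the class name is hypothetical. It also keeps only the last value of a repeated header, which is a simplification.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: collect headers from the header block only. */
class HeaderBlockParser {

    // Field name: printable ASCII except ':', starting at the first column.
    private static final Pattern HEADER =
            Pattern.compile("^([\\x21-\\x39\\x3B-\\x7E]+):[ \\t]*(.*)$");

    static Map<String, String> parseHeaders(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        String current = null;
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // the first blank line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation: append to the previous header, if any.
                if (current != null) {
                    headers.put(current, headers.get(current) + " " + line.trim());
                }
                continue;
            }
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                current = m.group(1);
                headers.put(current, m.group(2));
            } else {
                current = null; // e.g. the mbox "From " separator line
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        List<String> message = Arrays.asList(
                "From sender@example.org Mon Oct 16 19:30:00 2017",
                "Subject: hello",
                "\tworld",
                "To: reader@example.org",
                "",
                "\tX-Fake: tab-prefixed body line, never inspected",
                "Date: this is body text too");
        System.out.println(parseHeaders(message));
    }
}
```

In this sketch the two body lines that look like headers are ignored, because the loop stops at the blank line that separates headers from the body.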
[jira] [Commented] (TIKA-2477) Tika : Content of XLSX file extraction is not working after poi library upgrade
[ https://issues.apache.org/jira/browse/TIKA-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206374#comment-16206374 ]

Tim Allison commented on TIKA-2477:

Try a more recent version of Tika -- 1.16 -- available here: http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.16.jar and let us know if you still have the same problem. Thank you!

> Tika : Content of XLSX file extraction is not working after poi library upgrade
>
> Key: TIKA-2477
> URL: https://issues.apache.org/jira/browse/TIKA-2477
> Project: Tika
> Issue Type: Bug
> Components: core
> Reporter: Ramchandran
>
> Hi Team,
>
> I wrote a program to extract the content of a simple xlsx file. The program works fine with the poi-3.11 library, but I have now upgraded my POI library to poi-3.16. The program still runs, but the content is not displayed (after the upgrade, only the sheet name is displayed).
>
> Class File
> ===
> package MSExcelParse;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
>
> public class MSExcelParser {
>     public static void main(final String[] args) throws IOException, TikaException, SAXException {
>         // Handler that collects the body text, plus metadata and parse context
>         BodyContentHandler handler = new BodyContentHandler();
>         Metadata metadata = new Metadata();
>         ParseContext pcontext = new ParseContext();
>
>         // AutoDetectParser detects the file type and delegates to the matching parser
>         Parser parser = new AutoDetectParser();
>
>         // try-with-resources closes the stream even if parsing fails
>         try (FileInputStream inputstream = new FileInputStream(new File("C:\\JavaTest\\Student.xlsx"))) {
>             parser.parse(inputstream, handler, metadata, pcontext);
>         }
>
>         System.out.println("Contents of the document:" + handler.toString());
>     }
> }
>
> .classpath file
>
> [...] path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-1.7"/>
> [...] path="C:/JavaTest/commons-collections4-4.1.jar"/>
> [...] path="C:/JavaTest/commons-compress-1.8.1.jar"/>
> [...]
> [...] path="C:/JavaTest/poi-ooxml-schemas-3.16.jar"/>
> [...]
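When jars are collected by hand, as in the .classpath above, it can help to see which jar each class is actually loaded from before and after an upgrade. The following is a generic JDK-only diagnostic sketch. It is not something suggested in the thread, the class name is hypothetical, and whether a jar mismatch is the cause of this particular report is an assumption.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic: report where a class was loaded from and its jar version. */
class ClasspathProbe {

    static String describe(Class<?> type) {
        // Classes from the JDK itself usually have no code source.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(bootstrap or unknown)"
                : source.getLocation().toString();

        // Implementation-Version comes from the jar manifest and may be absent.
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();

        return type.getName() + " <- " + location
                + " (version: " + (version == null ? "unknown" : version) + ")";
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass fully qualified class names as arguments; defaults to a JDK class.
        String[] names = args.length > 0 ? args : new String[] {"java.lang.String"};
        for (String name : names) {
            System.out.println(describe(Class.forName(name)));
        }
    }
}
```

Run on the same classpath as the failing program with, for example, `org.apache.tika.parser.AutoDetectParser` and `org.apache.poi.xssf.usermodel.XSSFWorkbook` as arguments, this would show which Tika and POI jars are in use.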
[jira] [Created] (TIKA-2477) Tika : Content of XLSX file extraction is not working after poi library upgrade
Ramchandran created TIKA-2477:

Summary: Tika : Content of XLSX file extraction is not working after poi library upgrade
Key: TIKA-2477
URL: https://issues.apache.org/jira/browse/TIKA-2477
Project: Tika
Issue Type: Bug
Components: core
Reporter: Ramchandran

Hi Team,

I wrote a program to extract the content of a simple xlsx file. The program works fine with the poi-3.11 library, but I have now upgraded my POI library to poi-3.16. The program still runs, but the content is not displayed (after the upgrade, only the sheet name is displayed).

Class File
===
package MSExcelParse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MSExcelParser {
    public static void main(final String[] args) throws IOException, TikaException, SAXException {
        // Handler that collects the body text, plus metadata and parse context
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // AutoDetectParser detects the file type and delegates to the matching parser
        Parser parser = new AutoDetectParser();

        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream inputstream = new FileInputStream(new File("C:\\JavaTest\\Student.xlsx"))) {
            parser.parse(inputstream, handler, metadata, pcontext);
        }

        System.out.println("Contents of the document:" + handler.toString());
    }
}

.classpath file
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205529#comment-16205529 ]

ASF GitHub Bot commented on TIKA-2400:

ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#issuecomment-336805333

The new URLs are:
https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagenet_lsvrc_2015_synsets.txt
https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagenet_metadata.txt

> Standardizing current Object Recognition REST parsers
>
> Key: TIKA-2400
> URL: https://issues.apache.org/jira/browse/TIKA-2400
> Project: Tika
> Issue Type: Sub-task
> Components: parser
> Reporter: Thejan Wijesinghe
> Priority: Minor
> Fix For: 1.17
>
> # This involves adding apiBaseUris and refactoring current Object Recognition REST parsers,
> # Refactoring dockerfiles related to those parsers.
> # Moving the logic related to checking minimum confidence into servers
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205509#comment-16205509 ]

ASF GitHub Bot commented on TIKA-2400:

ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#issuecomment-336800656

I was getting the same error. Nothing is wrong with your Docker setup. The problem was with the download URLs of **imagenet_lsvrc_2015_synsets.txt** and **imagenet_metadata.txt**. Apparently the TensorFlow maintainers have moved these meta files and models to another repo, https://github.com/tensorflow/serving. If you request

https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_lsvrc_2015_synsets.txt
https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_metadata.txt

you will get a 404. I'll update with the new URLs.