[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files
[ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320026#comment-15320026 ]

Zoltan Toth commented on TIKA-1817:
---
Anyone listening? Is this issue still being worked on?

> Extracts entire file content for ASCII DXF files
>
> Key: TIKA-1817
> URL: https://issues.apache.org/jira/browse/TIKA-1817
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.11
> Reporter: Zoltan Toth
> Attachments: SMA-Controller.dxf, house design.dxf, jcsample-screendump.jpg, jcsample.dxf
>
> By definition, ASCII DXF files are encoded in plain text. However, the vast
> majority of their content is not intended to be human readable (see
> https://en.wikipedia.org/wiki/AutoCAD_DXF). Unfortunately, for these files
> Tika simply "extracts" the entire content of the file instead of only the
> human-readable portions (i.e. comments etc.) that a CAD tool would render.
> This returns massive amounts of rubbish data, with dire consequences for
> applications that rely on the extracted text.
> It would be nice if only the human-readable text fields were extracted.
> Failing this, it would still be nice if no text were extracted from these
> files at all.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
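To make the request above concrete, here is a minimal sketch of what "extract only the human-readable fields" could look like. It is not Tika's parser and not a proposed patch. It assumes the usual ASCII DXF layout of line pairs (a numeric group code followed by its value) and, as a simplification, treats only group 999 comments, group 1 values of TEXT and MTEXT entities, and group 3 continuation chunks of MTEXT as readable text. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep only text-bearing values from an ASCII DXF document.
class DxfTextSketch {

    // Assumes the input is a sequence of (group code, value) line pairs.
    static List<String> extractText(String dxf) {
        List<String> text = new ArrayList<>();
        String[] lines = dxf.split("\\R");
        String entity = ""; // type of the entity currently being read (set by group 0)
        for (int i = 0; i + 1 < lines.length; i += 2) {
            int code;
            try {
                code = Integer.parseInt(lines[i].trim());
            } catch (NumberFormatException e) {
                continue; // malformed pair; a real parser would have to resynchronize
            }
            String value = lines[i + 1].trim();
            if (code == 0) {
                entity = value; // start of a new entity or section marker
            } else if (code == 999) {
                text.add(value); // comment line
            } else if (code == 1 && (entity.equals("TEXT") || entity.equals("MTEXT"))) {
                text.add(value); // primary text value
            } else if (code == 3 && entity.equals("MTEXT")) {
                text.add(value); // additional MTEXT chunk
            }
        }
        return text;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "999", "sample comment",
                "0", "SECTION", "2", "ENTITIES",
                "0", "LINE", "10", "0.0",
                "0", "TEXT", "1", "Hello DXF",
                "0", "ENDSEC", "0", "EOF");
        System.out.println(extractText(sample));
    }
}
```

Everything else in the file (coordinates, layer tables, handles) is skipped, which is the behaviour the report asks for. Other text-bearing entities, such as attributes and dimension overrides, are deliberately left out of this sketch.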
[jira] [Commented] (TIKA-1999) org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
[ https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319759#comment-15319759 ]

Tim Allison commented on TIKA-1999:
---
Thank you for opening this and sharing a triggering file. If you use pdfbox-app's ExtractText (that would be PDFBox 2.0.1), do you run into the same issue? I will take a look in the next few days.

> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
> --
>
> Key: TIKA-1999
> URL: https://issues.apache.org/jira/browse/TIKA-1999
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Ubuntu 16.04 (64 bit)
>              Oracle Java 1.8.0_91-b14 (64 bit)
> Reporter: Egbert
> Assignee: Tim Allison
>
> When trying to read the following PDF document:
> http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf
> Tika crashes for me with a java.lang.StackOverflowError, caused by a large
> number of recursive calls in:
> {noformat}
>   at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
> {noformat}
> For some reason, the Tika App doesn't exhibit this behavior, but the
> following MWE exposes the issue for me:
> {noformat}
> import java.io.ByteArrayOutputStream;
> import java.io.File;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.ToHTMLContentHandler;
>
> public class test
> {
>     public static void main(String [] args) throws Exception {
>         // Local copy of the PDF linked above
>         String p = "/home/eggie/faulty_pdf_document.pdf";
>
>         FileInputStream input = new FileInputStream(new File(p));
>         AutoDetectParser tk = new AutoDetectParser();
>         ByteArrayOutputStream os = new ByteArrayOutputStream();
>         // HTML output is written to the in-memory stream as UTF-8
>         ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
>         ParseContext pc = new ParseContext();
>         System.out.println("Parsing");
>         tk.parse(input, handler, new Metadata(), pc);
>     }
> }
> {noformat}
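The stack trace in the report points at a lookup that recurses once per enclosing element. The following sketch illustrates that failure mode and one way around it. It assumes (the report does not confirm this) that the handler resolves a namespace prefix by asking each element record's parent in turn. `PrefixLookupSketch` and `Scope` are hypothetical names, not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a parent-chain prefix lookup; not Tika's ElementInfo.
class PrefixLookupSketch {

    static final class Scope {
        final Scope parent;
        final Map<String, String> prefixes = new HashMap<>();

        Scope(Scope parent) {
            this.parent = parent;
        }

        // One stack frame per ancestor: deep nesting can exhaust the thread stack.
        String getPrefixRecursive(String uri) {
            String prefix = prefixes.get(uri);
            if (prefix != null) {
                return prefix;
            }
            return parent != null ? parent.getPrefixRecursive(uri) : null;
        }

        // Same lookup as a loop: constant stack depth regardless of nesting.
        String getPrefixIterative(String uri) {
            for (Scope s = this; s != null; s = s.parent) {
                String prefix = s.prefixes.get(uri);
                if (prefix != null) {
                    return prefix;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        String uri = "http://www.w3.org/1999/xhtml";
        Scope root = new Scope(null);
        root.prefixes.put(uri, "xhtml");

        // Simulate a document whose elements are never closed, so scopes pile up.
        Scope leaf = root;
        for (int i = 0; i < 200_000; i++) {
            leaf = new Scope(leaf);
        }

        System.out.println("iterative: " + leaf.getPrefixIterative(uri));
        try {
            System.out.println("recursive: " + leaf.getPrefixRecursive(uri));
        } catch (StackOverflowError e) {
            // Whether this triggers depends on the JVM's thread stack size.
            System.out.println("recursive lookup overflowed the stack");
        }
    }
}
```

A loop only removes the crash in the lookup itself. If the depth comes from start tags that are never matched by end tags, the underlying nesting problem would still need to be found in whatever emits the SAX events.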
[jira] [Assigned] (TIKA-1999) org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
[ https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned TIKA-1999:
-
    Assignee: Tim Allison
Re: Profiler for OpenNLP
+1 that sounds quite interesting.

Regards,
Tommaso

On Tue, 7 Jun 2016 at 20:03, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:

> We would love to have this as part of Apache Tika. You can take a look
> at the existing NER/NLP integrations, such as GeoTopicParser, as
> an example, and yes, please file a JIRA issue:
>
> http://issues.apache.org/jira/browse/TIKA
>
> I would be happy to work with you to make it happen.
>
> See http://github.com/apache/tika/#contributing-via-github for guidance.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
> On 6/7/16, 9:36 AM, "Anthony Beylerian" wrote:
>
> >Hello,
> >
> >We are currently working on an experimental author profiler that we think
> >could be added to the toolkit.
> >
> >The profiler aims to detect the gender and age range of an author.
> >Later we hope to add personality aspects such as:
> >[extroverted, stable, agreeable, conscientious]
> >
> >We would like the team's opinion on the matter.
> >An initial code drop can be found here [1]. If someone is willing to
> >contribute to it or collaborate on it with us, please let us know.
> >
> >Thanks!
> >
> >[1] https://github.com/beylerian/profiler
[jira] [Updated] (TIKA-1999) org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
[ https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Egbert updated TIKA-1999:
-
    Description:
When trying to read the following PDF document:
http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf
Tika crashes for me with a java.lang.StackOverflowError, caused by a large number of recursive calls in:
{noformat}
  at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
{noformat}
For some reason, the Tika App doesn't exhibit this behavior, but the following MWE exposes the issue for me:
{noformat}
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

public class test
{
    public static void main(String [] args) throws Exception {
        // Local copy of the PDF linked above
        String p = "/home/eggie/faulty_pdf_document.pdf";

        FileInputStream input = new FileInputStream(new File(p));
        AutoDetectParser tk = new AutoDetectParser();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // HTML output is written to the in-memory stream as UTF-8
        ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
        ParseContext pc = new ParseContext();
        System.out.println("Parsing");
        tk.parse(input, handler, new Metadata(), pc);
    }
}
{noformat}

  was:
When trying to read the following PDF document:
http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf
Tika crashes for me with a java.lang.StackOverflowError, caused by a large number of recursive calls in:
  at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
For some reason, the Tika App doesn't exhibit this behavior, but the following MWE exposes the issue for me:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

public class test
{
    public static void main(String [] args) throws Exception {
        // Local copy of the PDF linked above
        String p = "/home/eggie/faulty_pdf_document.pdf";

        FileInputStream input = new FileInputStream(new File(p));
        AutoDetectParser tk = new AutoDetectParser();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // HTML output is written to the in-memory stream as UTF-8
        ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
        ParseContext pc = new ParseContext();
        System.out.println("Parsing");
        tk.parse(input, handler, new Metadata(), pc);
    }
}
[jira] [Closed] (TIKA-1998) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Gratzl closed TIKA-1998.
---
    Resolution: Invalid

> jhighlight license concerns
> ---
>
> Key: TIKA-1998
> URL: https://issues.apache.org/jira/browse/TIKA-1998
> Project: Tika
> Issue Type: Wish
> Components: parser
> Affects Versions: 1.12
> Reporter: Daniel Gratzl
> Labels: tika-parsers
>
> While the issue with jhighlight's license appears to have been resolved for
> Tika itself, the problem still seems to exist for tika-parsers:
> http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/pom.xml
> Is this something that might be addressed in the near future?
[jira] [Created] (TIKA-1998) jhighlight license concerns
Daniel Gratzl created TIKA-1998:
---
    Summary: jhighlight license concerns
    Key: TIKA-1998
    URL: https://issues.apache.org/jira/browse/TIKA-1998
    Project: Tika
    Issue Type: Wish
    Components: parser
    Affects Versions: 1.12
    Reporter: Daniel Gratzl

While the issue with jhighlight's license appears to have been resolved for Tika itself, the problem still seems to exist for tika-parsers:
http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/pom.xml
Is this something that might be addressed in the near future?
[jira] [Commented] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES
[ https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318100#comment-15318100 ]

Michele Andreano commented on TIKA-1997:
---
An example file is attached to this issue.

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
> Issue Type: Sub-task
> Components: detector
> Affects Versions: 1.13
> Environment: JDK 1.7
> Reporter: Michele Andreano
> Fix For: 1.13
> Attachments: test.xml.p7m
>
> When I submit an XML file signed in P7M format to Tika, I expect it to
> return the MIME type application/pkcs7-mime, but it returns
> application/pkcs7-signature instead.
> How is this possible?
[jira] [Updated] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES
[ https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michele Andreano updated TIKA-1997:
---
    Attachment: test.xml.p7m