RE: Very slow parsing of a few PDF files

2017-11-27 Thread Allison, Timothy B.
The ForkParser does have the ability to kill and restart on permanent hangs. We don't have the RecursiveParserWrapper integrated into the ForkParser currently...patches are welcomed. At the Tika level, we generally don't check for a Thread.interrupted() because our dependencies don't do it.
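A minimal sketch of why an in-JVM timeout alone does not rescue a hung parse. It uses only the JDK, not Tika's API; `ParseTimeoutSketch` and `callWithTimeout` are hypothetical names. The caller gets control back after the timeout, but `Future.cancel(true)` only sets the worker's interrupt flag, so code that never checks `Thread.interrupted()` keeps running. That is the case for killing and restarting a separate process, as the ForkParser does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika.
class ParseTimeoutSketch {

    // Runs the task on a daemon worker thread and returns the fallback
    // if the task fails or does not finish within timeoutMillis.
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Only sets the interrupt flag; a task that ignores
            // interruption continues to consume CPU and memory.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The daemon flag is a design choice for the sketch: it lets the JVM exit even if a worker never returns, but it does not reclaim the resources the worker holds while the JVM is still running.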

RE: Very slow parsing of a few PDF files

2017-11-28 Thread Allison, Timothy B.
>As the HTML parser in Tika does not produce SAX events in the correct order - >the parser is great but does not support serialization - etc. Oh, please open a ticket with examples, or point me to one I've forgotten about... ☹ Thank you!

RE: Very slow parsing of a few PDF files

2017-11-29 Thread Allison, Timothy B.
>I am going to have to write my own application specific solution Ugh. I'm sorry. If there's anything shareable, please do share. > ForkParser tries to serialize every class it thinks will be needed across the > connection and a lot of third party classes are not serializable. I think > that

RE: Very slow parsing of a few PDF files

2017-11-30 Thread Allison, Timothy B.
alternative. I have time scheduled next week for an in-house solution but I will first look properly at ForkParser and see if I could make something akin to that in generic and configurable fashion. If so, I will submit the code. Jim > -Original Message- > From: Allison, Tim

RE: How can I get the page number of a word document?

2017-12-07 Thread Allison, Timothy B.
MSWord calculates pages dynamically. Unlike PDFs, MSWord documents are not “page based”. The only way you can do it with Java is through a COM bridge to the MSWord application or maybe via OpenOffice UNO, etc. If you have VBA, you could also programmatically get it out via the MSWord applicat

RE: How can I get the page number of a word document?

2017-12-08 Thread Allison, Timothy B.
And one other thing, because MSWord calculates pages dynamically, it often does not store the correct page count within the file, so that information is often misleading. Beware. From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, December 7, 2017 8:43 AM To: user

RE: [VOTE] Release Apache Tika 1.17 Candidate #2

2017-12-12 Thread Allison, Timothy B.
Thank you, Luis! One more vote, and we can release… From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] Sent: Tuesday, December 12, 2017 8:43 AM To: user@tika.apache.org Cc: d...@tika.apache.org Subject: Re: [VOTE] Release Apache Tika 1.17 Candidate #2 All seems ok after integrating in our sys

RE: Parse file without creating tmp file

2018-01-11 Thread Allison, Timothy B.
I'm not aware of such a list. Part of the challenge is that we don't know when our dependencies might choose to create a temp file. Sorry! -Original Message- From: Van Tassell, Kristian [mailto:kristian.vantass...@siemens.com] Sent: Thursday, January 11, 2018 1:42 PM To: user@tika.apac

RE: How to implement an InputStream that dynamically guesses the extension of a file that is streamed using Apache Tika?

2018-01-11 Thread Allison, Timothy B.
Hi Martin, I’m sorry for my delay. As a first pass at an answer…We have roughly four mechanisms for file id: 1. mime patterns (magic mime) 2. package detection 3. parse-time sub-type detection 4. file name extension (completely useless for your purposes) 1. You should be able
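A rough illustration of the first mechanism in that list, magic-byte detection on the head of a stream. This is a JDK-only sketch, not Tika's detector; `MagicSniffSketch` and `sniff` are hypothetical names, and only three well-known signatures are checked.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class MagicSniffSketch {

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Maps the first bytes of a stream to a media type.
    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, "{\\rtf".getBytes(StandardCharsets.US_ASCII))) {
            return "application/rtf";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            // docx, xlsx, jar and many others share this zip signature, so
            // magic bytes alone cannot tell them apart. Looking inside the
            // container is what the package-detection step is for.
            return "application/zip";
        }
        return "application/octet-stream";
    }
}
```

Because only the head of the stream is needed, this kind of check works on streamed input where no file name is available.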

RE: Tika-parsers using cat-x json.org dep and is geoapis ok?

2018-01-23 Thread Allison, Timothy B.
Fixed via TIKA-2535 in both 1.18 and 2.0. Thank you, Joe and Chris! -Original Message- From: Joe Witt [mailto:joe.w...@gmail.com] Sent: Tuesday, January 23, 2018 10:46 AM To: user@tika.apache.org Subject: Re: Tika-parsers using cat-x json.org dep and is geoapis ok? Here is the legal JIR

RE: Long time with OCR

2018-02-20 Thread Allison, Timothy B.
> These pages are hard because they have different fonts and maybe other > complications. +1 … As a side note, a colleague and I did an image degradation study, and we noticed that tesseract took far longer on the degraded images than on the originals. Your intuition is correct. This won’t he

RE: Malware RTF is not detected as RTF

2018-03-01 Thread Allison, Timothy B.
Yes. Please do open a ticket, and y, I have a need to read anything from decalage…he does some amazing work. 😊 I trust you wouldn’t, but please don’t post an actual malware file for us to use in our unit tests. 😉 From: Jim Idle [mailto:ji...@proofpoint.com] Sent: Thursday, March 1, 2018 12:32

RE: XBRL documents.

2018-03-14 Thread Allison, Timothy B.
Tika's default handling of xml is to scrape out the text and ignore the entities and attributes, IIRC. So, if that's the behavior you want, and your XBRLs are well-formed XML, you'll be good to go. If they're non-standard XML or if you want the node names and attributes, you may have to add yo

RE: Subfile Extraction

2018-03-27 Thread Allison, Timothy B.
+1 to Nick's links and advice. To use the RecursiveParserWrapper with tika-app, use the -J option; or if you're using tika-server, use the /rmeta endpoint. The ecology of embedded docs is rich and understudied (IMHO), let us know what you find! Cheers, Tim -Original Mes

RE: Tika Server: Disable OCR / Tesseract by HTTP parameter?

2018-04-11 Thread Allison, Timothy B.
Others may be more familiar with tika-server and OCR, but I notice that we do process X-Tika-OCR prefixed headers to configure TesseractOCRConfig . If you set "tesseractPath" to something bogus, that may turn off OCR...give something like this a try: --header "X-Tika-OCRTesseractPath:/bogosity

FW: Default Tika extraction of docx 5X slower than XWPFWordExtractor?

2012-01-20 Thread Allison, Timothy B.
I'm just getting started with Tika, and I tried the basic AutoDetectParser and the basic ParsingReader on a batch of a few thousand docx files (tika-app v1.0). On my laptop, I was able to extract text at a rate of 200 docs per minute. When I ran XWPFWordExtractor (poi 3.8) on the same docs,

pdf acroform and tika

2012-02-23 Thread Allison, Timothy B.
Not sure if this is an issue for PDFBox or Tika, but I noticed that PDFBox's textstripper is not extracting information from the form fields in a batch of pdf documents I'm processing. Is anyone else having this problem? I regret that I'm unable to send an example document. Inelegant solution wi

BodyContentHandler and a docx embedded within a PDF

2013-05-22 Thread Allison, Timothy B.
I have a PDF document with a docx attachment. I wasn't having luck getting the contents of the docx with tika.parseToString(file). I dug around a bit in the PDFExtractor and found that when I changed this line: embeddedExtractor.parseEmbedded( stream, new Embedde

RE: Html Parser autodetect charset

2013-06-21 Thread Allison, Timothy B.
In the tika-app.jar, go to META-INF/services; there's a file that specifies the order of the application of the encoding detectors (org.apache.tika.detect.EncodingDetector). The AutoDetectReader applies these in order and stops as soon as one of the detectors thinks that it detects an encoding.
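The "apply in order, stop at the first answer" behavior described above can be sketched with plain JDK types. This is not Tika's EncodingDetector interface; `DetectorChainSketch` and its two toy byte-order-mark detectors are hypothetical stand-ins.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an ordered detector chain, not Tika's API.
class DetectorChainSketch {

    // Returns the first non-null answer; later detectors are never asked.
    static Charset detect(List<Function<byte[], Charset>> detectors,
                          byte[] data, Charset fallback) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset guess = detector.apply(data);
            if (guess != null) {
                return guess;
            }
        }
        return fallback;
    }

    // Toy detector: recognizes a UTF-8 byte order mark, else null.
    static Charset utf8Bom(byte[] d) {
        boolean bom = d.length >= 3 && (d[0] & 0xFF) == 0xEF
                && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF;
        return bom ? StandardCharsets.UTF_8 : null;
    }

    // Toy detector: recognizes a UTF-16 byte order mark, else null.
    static Charset utf16Bom(byte[] d) {
        if (d.length < 2) {
            return null;
        }
        int b0 = d[0] & 0xFF;
        int b1 = d[1] & 0xFF;
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // The list order is the priority order, like the order in a services file.
    static List<Function<byte[], Charset>> defaultChain() {
        List<Function<byte[], Charset>> chain = new ArrayList<>();
        chain.add(DetectorChainSketch::utf8Bom);
        chain.add(DetectorChainSketch::utf16Bom);
        return chain;
    }
}
```

One consequence of this design is that an over-eager detector early in the list hides every detector after it, so reordering the list changes the results.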

RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This looks like an area for a new feature in both Tika and POI. I've only looked very briefly into the POI libraries, and I may have missed how to extract text from autoshapes. I'll open an issue in both projects. -Original Message- From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.

RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
bean work. I've opened https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix. There's some work going on on XSSFTextCell in POI that might make this more straightforward. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, J

RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
- Is this a wrong direction? If you know which class I should fix, please let me know. -Original Message- From: Allison, Timothy B. Sent: Monday, July 22, 2013 10:27 PM To: user@tika.apache.org Subject: RE: How to extract autoshape text in Excel 2007+ This is one way

RE: How to extract autoshape text in Excel 2007+

2013-09-26 Thread Allison, Timothy B.
g/bugzilla/show_bug.cgi?id=55292 It would be great if you could give me a patch. Thanks, Hiroshi Tatsumi -Original Message- From: Allison, Timothy B. Sent: Tuesday, July 23, 2013 5:10 AM To: user@tika.apache.org Subject: RE: How to extract autoshape text in Excel 2007+ Hiroshi, To fix th

tika server jax-rs and recursive file processing

2014-04-30 Thread Allison, Timothy B.
All, As always, apologies for the cluelessness the following reveals... I'm starting to move from embedded Tika to a server option for greater robustness. Is the jax-rs server intended not to handle embedded files recursively? If so, how are users currently handling multiply embedded documen

RE: Question re installing Tika

2014-06-26 Thread Allison, Timothy B.
My plan is to add a tika-batch package as part of TIKA-1330. One of the primary use cases will be input directory -> output directory. There will be hooks for people to add db -> db, and maybe someone with Hadoop skills would be willing to contribute a tika-batch-hadoop package. That should b

RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
DefaultHandler is effectively a NullHandler; it doesn't store or do anything. Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler. If you want to write out each embedded file as a binary, try subclassing EmbeddedResourceHandler. QUOTE:
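A small JDK-only sketch of the point about DefaultHandler: its callbacks are all no-ops, so nothing is kept unless a subclass stores it. `TextCollectingHandlerSketch` is a hypothetical class that runs against the JDK's SAX parser rather than a Tika parser; Tika's own content handlers do this job (and more) in real use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps the character data it is given.
class TextCollectingHandlerSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // DefaultHandler ignores this event; here the text is stored.
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Feeds the XML to the handler and returns handler.toString().
    static String parseWith(DefaultHandler handler, String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.toString();
    }
}
```

Passing a plain `new DefaultHandler()` to `parseWith` returns only the default `Object.toString()` value, since none of the parsed text was stored.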

RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
ring returned from handler.toString() how can i map a fileName to its content. thanks, yeshwanth On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. <talli...@mitre.org> wrote: DefaultHandler is effectively a NullHandler; it doesn't store or do anything. Try BodyCo

RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
Or use the ToXMLHandler and parse the XML? From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, June 30, 2014 3:55 PM To: yeshwanth kumar Cc: user@tika.apache.org Subject: RE: Stack Overflow Question Might want to look into RecursiveMetadata Parser http://wiki.apache.org/tika

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Did you try the ToXMLHandler? From: yeshwanth kumar [mailto:yeshwant...@gmail.com] Sent: Monday, June 30, 2014 4:50 PM To: Allison, Timothy B. Subject: Re: Stack Overflow Question hi tim, i tried in all possible ways, instead of reading entire zip file i parsed individual zipentries, but even

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
kumar [mailto:yeshwant...@gmail.com] Sent: Tuesday, July 01, 2014 9:00 AM To: Allison, Timothy B. Subject: Re: Stack Overflow Question output is same even with ToXMLHandler On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. <talli...@mitre.org> wrote: Did you try the ToXMLHandler?

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Good to hear. Let us know if you have any other questions or when you run into surprises. From: yeshwanth kumar [mailto:yeshwant...@gmail.com] Sent: Tuesday, July 01, 2014 10:23 AM To: Allison, Timothy B. Subject: Re: Stack Overflow Question hi tim, i forgot to change the BodyContentHandler

RE: How to index the parsed content effectively

2014-07-02 Thread Allison, Timothy B.
Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued

RE: How to index the parsed content effectively

2014-07-14 Thread Allison, Timothy B.
rs. Best, Tim -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, July 11, 2014 1:38 PM To: user@tika.apache.org Subject: Re: How to index the parsed content effectively Hi Tim, All. On 02/07/14 14:32, Allison, Timothy B. wrote: > Hi Sergey,

RE: Avoiding Out of Memory Errors

2014-07-18 Thread Allison, Timothy B.
I'm working on adding a daemon to Tika Server so that it will restart when it hits an OOM or other big problem (infinite hangs). That won't be available until Tika 1.7. To amplify Nick's recommendations: ForkParser or Server are your best options for now. Are there specific files/file types

RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Allison, Timothy B.
+1 Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7 Windows 7, Java 1.7 I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs. There were several improvements in text extraction for

RE: Tika - Outlook msg file with another Outlook msg as an attachment - OutlookExtractor passes empty stream

2014-07-31 Thread Allison, Timothy B.
AarKay, We have a unit test for an MSG embedded within an MSG in POIContainerExtractionTest. I also just tried a newly created msg within an msg file, and I can extract the embedded content with TikaTest.RecursiveMetaParser. This suggests that the issue is not within the OutlookParser.

RE: TIKA - how to read chunks at a time from a very large file?

2014-08-28 Thread Allison, Timothy B.
Probably a better question for the user list. Extending a ContentHandler and using that in ContentHandlerDecorator is pretty straightforward. Would it be easy enough to write to file by passing in an OutputStream to WriteOutContentHandler? -Original Message- From: ruby [mailto:rshoss...@

FW: How to exclude a mimetype in tika?

2014-09-18 Thread Allison, Timothy B.
Tika Colleagues (Tika'ers, Tikis?), Is this the right answer: Drop the relevant parsers from the tika.config file and make sure to point solr to this file in your solr request handler definition: /my/path/to/tika.config? I only have experience as a programmatic user of Tika and would use a D

RE: Apache Tika - JSON?

2014-09-26 Thread Allison, Timothy B.
The current json output option in the app and server only dump metadata…as you probably know. I plan to add a json version of the RecursiveParserWrapper (list of Metadata objects with one entry for content) to the app shortly. Would that be of any use? Are you using the app, the server, or ca

RE: Apache Tika - JSON?

2014-09-26 Thread Allison, Timothy B.
library to serialize/deserialize Metadata objects in tika-serialization. From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, September 26, 2014 6:54 AM To: user@tika.apache.org Subject: RE: Apache Tika - JSON? The current json output option in the app and server only dump metadata…as

RE: Problem with content extraction

2014-10-07 Thread Allison, Timothy B.
I’ve seen this before on a few documents. You might experiment with setting PDFParserConfig’s suppressDuplicateOverlappingText to true. If that doesn’t work, I’d recommend running the pure PDFBox app’s ExtractText on the document. If you get the same doubling of letters, ask over on u...@pdf

RE: Customizing Metadata Keys

2014-10-09 Thread Allison, Timothy B.
I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to put in a plug for the RecursiveParserWrapper, which may be of use for you. I’ve been intending to add that to the app commandline and to server…how are you handling embedded document metadata? Would the wrapper be

internal vs external property?

2014-11-20 Thread Allison, Timothy B.
All, What is the difference between an internal and an external Property? I'm not (quickly) seeing how Metadata is using that Boolean. Are there other pieces of code that make use of the distinction? Thank you. Best, Tim

RE: Encrypted PDF issues & build issues

2014-12-11 Thread Allison, Timothy B.
Y, sorry. As you point out, that should be fixed in PDFBox 1.8.8. A vote was just taken for that, so that will be out very soon. Last I looked at integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test (I think?) in Tika…which is why you’re getting a failed build. Your

RE: Encrypted PDF issues & build issues

2014-12-15 Thread Allison, Timothy B.
Upgrade just made in Tika trunk. The integration required more than changing the one test…Sorry about that! Let us know if there are any surprises with the upgrade. From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, December 11, 2014 2:41 PM To: user@tika.apache.org Subject

RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Do you have any luck if you call /metadata instead of /meta? That should trigger MetadataEP which will return Json, no? I'm not sure why we have both handlers, but we do... -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Thursday, December 18, 2014 9:56 AM

Tika 2.0???

2014-12-18 Thread Allison, Timothy B.
ult (which unfortunately would break back compat, but in my mind would make a lot more sense) Chris Mattmann chris.mattm...@gmail.com -Original Message- From: "Allison, Timothy B." Reply-To: Date: Thursday, December 18, 2014 at 7:20 AM To: "us

RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
ably combine them..and make JSON the default (which unfortunately would break back compat, but in my mind would make a lot more sense) Chris Mattmann chris.mattm...@gmail.com -Original Message- From: "Allison, T

RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
2014 at 15:20, Allison, Timothy B. <talli...@mitre.org> wrote: Do you have any luck if you call /metadata instead of /meta? I have no luck with that: Dec 18, 2014 3:55:21 PM org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod WARNING: No operation matching request path "/metadat

Re: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Peter, I'm waiting on feedback on TIKA-1497, but rmeta should get you what you want via TIKA-1498. Let us know if there are any surprises. Best, Tim -Original Message- From: Tim Allison (JIRA) [mailto:j...@apache.org] Sent: Thursday, December 18, 20

RE: Outputting JSON from tika-server/meta

2014-12-19 Thread Allison, Timothy B.
All, With many thanks to Sergey, I added JSON and XMP to “/meta” and I folded MetadataEP into MetadataResource so that users can request specific metadata value(s). (TIKA-1497, TIKA-1499) I also added a new endpoint “/rmeta” that is equivalent to tika-app’s -J (TIKA-1498) – JSONified view

RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Peter, I don’t have any immediate solutions, but there are two options in the pipeline (probably Tika 1.8): 1) Lewis John McGibbney on TIKA-894 is going to add a war/webapp. 2) I plan to open an issue related to TIKA-1330 that will make our current jax-rs tika-server more robust to

RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Doh! My answer focused on my interests rather than your question. Sorry. By restart, I now assume you mean system restart… TIKA-894 should help with that if you configure your server container (tomcat?) to automatically start/restart. From: Allison, Timothy B. [mailto:talli...@mitre.org

JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Allison, Timothy B.
All, I recently noticed that I'm getting this message logged when there is an exception during parsing: SEVERE: Problem with writing the data, class org.apache.tika.server.TikaResource$5, ContentType: text/html We didn't get this message with Tika 1.6, but we are getting this with Tika 1.7 an

RE: JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Allison, Timothy B.
ou think it should not be reported/logged ? This can be easily done, if the parser throws the exception then this exception can be propagated (wrapped if it is not RuntimeException) and caught with a custom exception mapper and the logging being blocked... Cheers, Sergey On 27/02/15 15:05, Allison

RE: Config for Tika Windows Service with Apache Commons Daemon

2015-03-04 Thread Allison, Timothy B.
Somewhere on my todo list is to add the ability to stop tika-server on the commandline. I probably won't get to this for a few months, though. I agree with Nick's recommendation to contribute to the war, if at all possible. -Original Message- From: Nick Burch [mailto:apa...@gagravarr.

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
tools. Thanks & Regards Vijay On 16 April 2015 at 12:33, Allison, Timothy B. wrote: > I entirely agree with Erick -- it is best to isolate Tika in its own jvm > if you can -- bad things can happen if you don't [1] [2]. > > Erick's blog on SolrJ is fantastic. If you wan

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
& Regards > Vijay > > > On 16 April 2015 at 12:54, Allison, Timothy B. wrote: > >> This sounds like a Tika issue, let's move discussion to that list. >> >> If you are still having problems after you upgrade to Tika 1.8, please at >> least submit the stack t

FW: TIKA OCR not working

2015-04-27 Thread Allison, Timothy B.
Trung, I haven't experimented with our OCR parser yet, but this should give a good start: https://wiki.apache.org/tika/TikaOCR . Have you installed tesseract? Tika colleagues, Any other tips? What else has to be configured and how? -Original Message- From: trung.ht [mailto:trung...@

RE: Odp.: solr issue with pdf forms

2015-04-29 Thread Allison, Timothy B.
I completely agree with Erick about the utility of the TermsComponent to see what is actually being indexed. If you find problems there and if you haven't done so already, you might also investigate further down the stack. It might make sense to run the tika-app.jar (whichever version you are

RE: Odp.: solr issue with pdf forms

2015-04-30 Thread Allison, Timothy B.
et d pdf form: Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz Best Steve -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, April 29, 2015 14:16 To: solr-u...@lucene.apache.org Cc: user@tika.apache.org Subject:

RE: extracting text from an "encrypted" pdf

2015-05-12 Thread Allison, Timothy B.
PDF encryption and access permissions are tricky (see, e.g., the discussion and links here: https://issues.apache.org/jira/browse/TIKA-1489 ). There are potentially two passwords for a PDF document, the owner password and the user password. Often, the user password is set to the empty string.

RE: Embedded images in PDF - detect, extract and/or OCR

2015-05-13 Thread Allison, Timothy B.
By default, Tika is configured not to extract embedded images from PDFs because in some edge cases, there can be thousands of images in some small PDF files (see https://issues.apache.org/jira/browse/TIKA-1294). Our choice to have the default be “don’t extract” was based on the concern that if

RE: Embedded images in PDF - detect, extract and/or OCR

2015-05-13 Thread Allison, Timothy B.
whether embedded images exist. (2) the -z option is effectively disabled for PDFs? (3) is there a way to enable detection and/or extraction from the command line, as opposed to editing the source? On Wed, May 13, 2015 at 12:18 PM, Allison, Timothy B. <talli...@mitre.org> wrote: By d

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
Hi Mouthgalya, We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week. As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts for Tika 1.7 (may have been 1.8), but

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
on? From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com] Sent: Thursday, June 04, 2015 10:20 AM To: Allison, Timothy B.; talli...@apache.org Cc: user@tika.apache.org; Sauparna Sarkar Subject: RE: Memory issues with PDF parser Hi Timothy, Thanks for the prompt reply. 1.)Wouldn't f

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com] Sent: Thursday, June 04, 2015 2:55 PM To: Allison, Timothy B. Cc: user@tika.apache.org; Sauparna Sarkar Subject: RE: Memory issues with PDF parser Thanks for the update Timothy, I see that Tika 1.9-SNAPSHOT is available in the Maven repo. I am going to try

xml vs html parser

2015-06-16 Thread Allison, Timothy B.
All, On govdocs1, the xml parser's exceptions accounted for nearly a quarter of all thrown exceptions at one point (Tika 1.7ish). Typically, a file was mis-identified as xml when in fact it was sgml or some other text based file with some markup that wasn't meant to be xml. For kicks, I s

RE: CSV Parser in Tika

2015-06-19 Thread Allison, Timothy B.
Y, that’s my belief. As of now, we’re treating them as text files, which can lead to some really long (bogus) tokens in Lucene/Solr with analyzers that don’t split on commas. ☹ Detection without filename would be difficult. From: lewis john mcgibbney [mailto:lewi...@apache.org] Sent: Friday

RE: xml vs html parser

2015-06-19 Thread Allison, Timothy B.
Jukka Zitting [mailto:jukka.zitt...@gmail.com] Sent: Tuesday, June 16, 2015 10:26 AM To: Tika Users Subject: Re: xml vs html parser Hi, 2015-06-16 9:28 GMT-04:00 Allison, Timothy B. : > So, is there a way to make the XMLParser more lenient? I don't think so. XML is draconian by design. > O

RE: Extract PDF inline images

2015-07-06 Thread Allison, Timothy B.
Hi Andrea, The RecursiveParserWrapper, as you found, is only for extracted content and metadata. It was designed to cache metadata and content from embedded documents so that you can easily keep those two things together for each embedded document. To extract the raw bytes from embedded

RE: Extract PDF inline images

2015-07-07 Thread Allison, Timothy B.
"UTF-8"); parser.parse(is, handler, metadata, context); 2015-07-06 12:59 GMT+02:00 Allison, Timothy B. <talli...@mitre.org>: Hi Andrea, The RecursiveParserWrapper, as you found, is only for extracted content and metadata. It was designed to cache metadata and content from e

RE: Inconsistent (buggy) behavior when using tika-server

2015-07-14 Thread Allison, Timothy B.
That looks like a bug in TikaUtils. For whatever reason, when is.available() returns 0, we are then assuming that fileUrl is not null. We need to check to make sure that fileUrl is not null before trying to open the file. if(is.available() == 0 && !"".equals(fileUrl)){ ... return TikaInputStr
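A tiny sketch of why the quoted condition lets a null through. The snippet above is cut off, so this does not reproduce TikaUtils; `FileUrlGuardSketch` and both method names are hypothetical, and only the `available == 0` / `fileUrl` comparison comes from the message.

```java
// Hypothetical illustration of the null check, not the TikaUtils code.
class FileUrlGuardSketch {

    // Mirrors the quoted condition: "".equals(null) is false, so the
    // negation is true and a null fileUrl is treated as usable.
    static boolean buggyCheck(int available, String fileUrl) {
        return available == 0 && !"".equals(fileUrl);
    }

    // Adds the explicit null check described in the message.
    static boolean guardedCheck(int available, String fileUrl) {
        return available == 0 && fileUrl != null && !fileUrl.isEmpty();
    }
}
```

With an empty stream and no file URL, the first check still says "open the file" while the second one does not.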

robust Tika and Hadoop

2015-07-15 Thread Allison, Timothy B.
All, I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will def

RE: robust Tika and Hadoop

2015-07-20 Thread Allison, Timothy B.
against things like NoSuchMethodErrors that can be thrown by Tika if the mime-type detection code tries to use a parser that we exclude, in order to keep the Hadoop job jar size to something reasonable. -- Ken From: Allison, Timothy B. Sent: July 15, 2015 4:38:56am

RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
er to keep the Hadoop job jar size to something reasonable. -- Ken From: Allison, Timothy B. Sent: July 15, 2015 4:38:56am PDT To: user@tika.apache.org Subject: robust Tika and Hadoop All, I'd like to fill out our Wi

RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Thank you, Ken! From: Ken Krugler [mailto:kkrugler_li...@transpac.com] Sent: Tuesday, July 21, 2015 10:23 AM To: user@tika.apache.org Subject: RE: robust Tika and Hadoop Hi Tim, Responses inline below. -- Ken From: Allison, Timothy B. Sent: July 21, 2015 5

FW: error Unsupported Media Type : while implementing ContentStreamUpdateRequestExample from the link http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

2015-07-22 Thread Allison, Timothy B.
What happens when you run straight tika-app against that pdf file? java -jar tika-app.jar Sample.pdf (grab tika-app from: http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.9.jar) Do you have all of the tika jars on your classpath/properly configured within your Solr setup? -Original Mes

RE: Charset Encoding

2015-07-30 Thread Allison, Timothy B.
The AutoDetectReader (within TXTParser) runs the encoding detectors in the order specified in tika-parsers...resources/META-INF/services/o.a.t.detect.EncodingDetector. The AutoDetectReader picks the first non-null response to detect. The current order is: org.apache.tika.parser.html.HtmlEncodingDe

RE: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-03 Thread Allison, Timothy B.
+1, built Windows and Linux. Relying on previous tests for performance/comparison results. Thank you, Dave! -Original Message- From: David Meikle [mailto:loo...@gmail.com] Sent: Sunday, August 02, 2015 3:15 AM To: d...@tika.apache.org; user@tika.apache.org Subject: [VOTE] Apache Tika

RE: TikaConfig with constructor args

2015-08-27 Thread Allison, Timothy B.
That’s on my todo list (TIKA-1508). Unfortunately, that doesn’t exist yet. I’d recommend for now following the pattern of the PDFParser or the TesseractOCRParser. The config is driven by a properties file. As soon as my dev laptop becomes unbricked, I’m going to turn to TIKA-1508. Given my

RE: Does tika support "HWP"?

2015-09-02 Thread Allison, Timothy B.
Great. In the meantime, if you could open a JIRA issue and attach some example files (including the different versions), it might be helpful for the community to take a look. Thank you! -Original Message- From: Mungeol Heo [mailto:mungeol@gmail.com] Sent: Tuesday, September 01, 20

RE: tesseract issue

2015-09-09 Thread Allison, Timothy B.
You can build from source if you have an interest (and the bandwidth, time and disk space) or pull a nightly build if you don’t want to wait for 1.11, for example: https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/849/org.apache.tika$tika-app/ Thank you, Christian! Best, Tim

RE: RecursiveParser returning ContentHandler

2015-09-22 Thread Allison, Timothy B.
Y, that should be easy enough. Instead of the metadata list, we can store a list of Metadata+Handler pairs, the current “getMetadata()” can be syntactic sugar around the new getMetadataAndHandlers(). Please open a ticket and we can discuss there. Thank you. Best, Tim From: Andr
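One possible shape for that idea, sketched with generic placeholder types instead of Tika's Metadata and ContentHandler. The class name and the `Pair` type are hypothetical; only `getMetadata()` and `getMetadataAndHandlers()` come from the message, and the actual change would be worked out on the ticket.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: M stands in for Metadata, H for a content handler.
class MetadataAndHandlerSketch<M, H> {

    // One entry per parsed document, container or embedded.
    static final class Pair<A, B> {
        final A metadata;
        final B handler;

        Pair(A metadata, B handler) {
            this.metadata = metadata;
            this.handler = handler;
        }
    }

    private final List<Pair<M, H>> pairs = new ArrayList<>();

    void add(M metadata, H handler) {
        pairs.add(new Pair<>(metadata, handler));
    }

    // New accessor: metadata together with the handler that saw the content.
    List<Pair<M, H>> getMetadataAndHandlers() {
        return Collections.unmodifiableList(pairs);
    }

    // Existing accessor kept as a thin view over the pairs.
    List<M> getMetadata() {
        List<M> result = new ArrayList<>();
        for (Pair<M, H> pair : pairs) {
            result.add(pair.metadata);
        }
        return result;
    }
}
```

Keeping `getMetadata()` as a derived view means existing callers see the same list as before, while new callers can reach the handler for each document.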

RE: Maximizing performance when parsing a lot of files

2015-09-25 Thread Allison, Timothy B.
It's best to keep Tika in its own jvm. If you are working filesystem to filesystem... The simplest thing to do would be to call tika-batch via the commandline of tika-app every so often. By default, tika-batch will skip files that it has already processed if you run it again, but you will pay

RE: Tika unable to extract PDF Text

2015-10-14 Thread Allison, Timothy B.
File works with Tika trunk. What's on your classpath: tika-app or just tika-core? Is there a chance that you don't have tika-parsers on your cp? -Original Message- From: Adam Retter [mailto:adam.ret...@googlemail.com] Sent: Wednesday, October 14, 2015 12:14 PM To: user@tika.apache.org

RE: Questions about using the Tika

2015-10-21 Thread Allison, Timothy B.
Bouncing to user@tika... If the PDFs have fixed fields (AcroForm), then that should be easy enough to parse out of the xhtml that Tika produces, or you could go with straight PDFBox. If (as I suspect), these are free text resumes, then Tika can help pull out the text, but then you're on your ow

RE: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-21 Thread Allison, Timothy B.
+0 (some regressions in ppt content) I just finished the batch comparison run on ~1.8 million files in our govdocs1 and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1. As a caveat, the eval code is still in development and there may be bugs in the reports. Results are here: https://gith

RE: Issues with extraction content of PDF files

2015-12-18 Thread Allison, Timothy B.
Hi Edwin, Thank you for reaching out to Tika. As I mentioned [0], the issue appears to be that the pdf file doesn’t contain Unicode mappings for the characters in the document. This means that PDFBox has no way of converting character codes within the PDF into anything useful. I checked wit

RE: Questions about using AutoDetect and DigestParser

2016-01-05 Thread Allison, Timothy B.
>> Question1) Shouldn't this be more specific? Like PdfParser, OpenDocumentParser and so on. Y, make sure to call metadata.getValues("X-Parsed-By"), which returns an array of values, and then iterate through that array to see the parsers that actually processed your doc. If you call metadata.get(

RE: Questions about using AutoDetect and DigestParser

2016-01-08 Thread Allison, Timothy B.
it seems it is related to my use of Scala. If I find the time I will try it again with Java to further pinpoint the problem. In the meantime I think I'll stick to java.security.MessageDigest. Kind regards -Original Message- Sent: Thursday, 07 January 2016 um 18:49:09 Uhr From: "
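
For reference, a minimal sketch of the plain java.security.MessageDigest route mentioned above, assuming the file is available as an InputStream. The class name StreamDigestSketch and the method hexDigest are hypothetical names for illustration; this is not the poster's code and it does not involve Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: computes a hex digest of a stream using only the JDK. */
final class StreamDigestSketch {

    /** Reads the stream to the end and returns the lowercase hex digest for the given algorithm. */
    static String hexDigest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        int read;
        // Feed the stream to the digest in chunks so large files are never held in memory.
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input chosen for illustration only.
        InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(hexDigest(in, "SHA-256"));
    }
}
```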

Re: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)

2016-01-25 Thread Allison, Timothy B.
Dunno where you are on this...I'm still snowed in. It would be great if we could upgrade to PDFBox 1.8.11 if we haven't done so yet. TIKA-1830. Last I tried, we have to remove some "exceptional" handling in the unit test comparing the sequential to the non-sequential parser because the tests

RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
The problem (I think) is that tika-parsers.jar includes just the Tika parsers (wrappers) around a boatload of actual parsers/dependencies (POI, PDFBox, etc). If you are using jars, I’d recommend the tika-app.jar which includes all dependencies. From: Steven White [mailto:swhite4...@gmail.com] S

RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
I’m not sure why you’d want to append document contents across documents into one handler. Typically, you’d use a new ContentHandler and new Metadata object for each parse. Calling “toString()” does not clear the content handler, and you should have 20 copies of the extracted content on your f

RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
on, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. <talli...@mitre.org> wrote: I’m not sure why you’d want to append document contents across documents into one handler. Typically, you’d use a new ContentHandler and new Metadata object for each parse. Calling “toString()” does not

RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
reason why I'm reusing a single instance is to cut down on overhead (I have yet to time this). Steve On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. <talli...@mitre.org> wrote: In your actual code, are you using one BodyContentHandler for all of your files? Or a

RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
ka per min. Steve On Tue, Feb 9, 2016 at 12:07 PM, Allison, Timothy B. <talli...@mitre.org> wrote: Same parser is ok to reuse…should even be ok in multithreaded applications. Do not reuse ContentHandler or Metadata objects. As a side note, if you are handling a bunch of files fr

RE: Using tika-app-1.11.jar

2016-02-11 Thread Allison, Timothy B.
Plan C: if you’re willing to store a mirror set of directories with the text versions of the files, just run tika-app.jar on your “input” directory and run your SolrJ loader on the “text/export” directory: java -jar tika-app.jar And, if you’re feeling jsonic: java -jar tika-app.jar -J -t -i

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
control the document corpus, > you have to build something far more tolerant as per Tim's comments. > > FWIW, > Erick > > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. > > wrote: > > I completely agree on the impulse, and for the vast majority of the >

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
this watch-dog monitoring and thus I have to implement my own. Can you confirm? Thanks Steve On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. wrote: > x-post to Tika user's > > Y and n. If you run tika app as: > > java -jar tika-app.jar > > It runs t
