pdf acroform and tika

2012-02-23 Thread Allison, Timothy B.
Not sure if this is an issue for PDFBox or Tika, but I noticed that PDFBox's textstripper is not extracting information from the form fields in a batch of pdf documents I'm processing. Is anyone else having this problem? I regret that I'm unable to send an example document. Inelegant solution

BodyContentHandler and a docx embedded within a PDF

2013-05-22 Thread Allison, Timothy B.
I have a PDF document with a docx attachment. I wasn't having luck getting the contents of the docx with tika.parseToString(file). I dug around a bit in the PDFExtractor and found that when I changed this line: embeddedExtractor.parseEmbedded( stream, new

RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This looks like an area for a new feature in both Tika and POI. I've only looked very briefly into the POI libraries, and I may have missed how to extract text from autoshapes. I'll open an issue in both projects. -Original Message- From: Hiroshi Tatsumi

RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
. I've opened https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix. There's some work going on on XSSFTextCell in POI that might make this more straightforward. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, July 22, 2013 8:50 AM

RE: How to extract autoshape text in Excel 2007+

2013-09-26 Thread Allison, Timothy B.
/show_bug.cgi?id=55292 It would be great if you could give me a patch. Thanks, Hiroshi Tatsumi -Original Message- From: Allison, Timothy B. Sent: Tuesday, July 23, 2013 5:10 AM To: user@tika.apache.org Subject: RE: How to extract autoshape text in Excel 2007+ Hiroshi, To fix this on your

tika server jax-rs and recursive file processing

2014-04-30 Thread Allison, Timothy B.
All, As always, apologies for the cluelessness the following reveals... I'm starting to move from embedded Tika to a server option for greater robustness. Is the jax-rs server intended not to handle embedded files recursively? If so, how are users currently handling multiply embedded

RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
DefaultHandler is effectively a NullHandler; it doesn't store or do anything. Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler. If you want to write out each embedded file as a binary, try subclassing EmbeddedResourceHandler. QUOTE: 0down
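The point about DefaultHandler can be sketched with the JDK's own SAX classes, without Tika on the classpath. This is only an illustration: `HandlerDemo` and `TextCollector` are hypothetical names, and `TextCollector` is a bare-bones stand-in for what a text-collecting handler does, not Tika's BodyContentHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerDemo {

    /** Hypothetical handler that keeps the character events DefaultHandler ignores. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the XML string to the handler through the JDK's SAX parser. */
    static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><p>hello</p></doc>";

        // DefaultHandler receives every event but stores none of them,
        // so there is nothing to read back afterwards.
        parse(xml, new DefaultHandler());

        // The collector receives the same events and keeps the text.
        TextCollector collector = new TextCollector();
        parse(xml, collector);
        System.out.println(collector);
    }
}
```

The same reasoning applies when the events come from a Tika parser instead of an XML parser: what you get back depends entirely on which handler you pass in.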

RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
returned from handler.tostring() how can i map a fileName to its content. thanks, yeshwanth On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. talli...@mitre.org wrote: DefaultHandler is effectively a NullHandler; it doesn't store or do anything. Try BodyContentHandler

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Did you try the ToXMLHandler? From: yeshwanth kumar [mailto:yeshwant...@gmail.com] Sent: Monday, June 30, 2014 4:50 PM To: Allison, Timothy B. Subject: Re: Stack Overflow Question hi tim, i tried in all possible ways, instead of reading entire zip file i parsed individual zipentries, but even

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Good to hear. Let us know if you have any other questions or when you run into surprises. From: yeshwanth kumar [mailto:yeshwant...@gmail.com] Sent: Tuesday, July 01, 2014 10:23 AM To: Allison, Timothy B. Subject: Re: Stack Overflow Question hi tim, i forgot to change the BodyContentHandler

RE: How to index the parsed content effectively

2014-07-02 Thread Allison, Timothy B.
Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a

RE: How to index the parsed content effectively

2014-07-14 Thread Allison, Timothy B.
...@gmail.com] Sent: Friday, July 11, 2014 1:38 PM To: user@tika.apache.org Subject: Re: How to index the parsed content effectively Hi Tim, All. On 02/07/14 14:32, Allison, Timothy B. wrote: Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store

RE: Avoiding Out of Memory Errors

2014-07-18 Thread Allison, Timothy B.
I'm working on adding a daemon to Tika Server so that it will restart when it hits an OOM or other big problem (infinite hangs). That won't be available until Tika 1.7. To amplify Nick's recommendations: ForkParser or Server are your best options for now. Are there specific files/file

RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Allison, Timothy B.
+1 Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7 Windows 7, Java 1.7 I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs. There were several improvements in text extraction

RE: Tika - Outlook msg file with another Outlook msg as an attachment - OutlookExtractor passes empty stream

2014-07-31 Thread Allison, Timothy B.
AarKay, We have a unit test for an MSG embedded within an MSG in POIContainerExtractionTest. I also just tried a newly created msg within an msg file, and I can extract the embedded content with TikaTest.RecursiveMetaParser. This suggests that the issue is not within the OutlookParser.

RE: Apache Tika - JSON?

2014-09-26 Thread Allison, Timothy B.
library to serialize/deserialize Metadata objects in tika-serialization. From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, September 26, 2014 6:54 AM To: user@tika.apache.org Subject: RE: Apache Tika - JSON? The current json output option in the app and server only dump metadata

RE: Problem with content extraction

2014-10-07 Thread Allison, Timothy B.
I’ve seen this before on a few documents. You might experiment with setting PDFParserConfig’s suppressDuplicateOverlappingText to true. If that doesn’t work, I’d recommend running the pure PDFBox app’s ExtractText on the document. If you get the same doubling of letters, ask over on

RE: Customizing Metadata Keys

2014-10-09 Thread Allison, Timothy B.
I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to put in a plug for the RecursiveParserWrapper, which may be of use for you. I’ve been intending to add that to the app commandline and to server…how are you handling embedded document metadata? Would the wrapper be

internal vs external property?

2014-11-20 Thread Allison, Timothy B.
All, What is the difference between an internal and an external Property? I'm not (quickly) seeing how Metadata is using that Boolean. Are there other pieces of code that make use of the distinction? Thank you. Best, Tim

RE: Encrypted PDF issues build issues

2014-12-11 Thread Allison, Timothy B.
Y, sorry. As you point out, that should be fixed in PDFBox 1.8.8. A vote was just taken for that, so that will be out very soon. Last I looked at integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test (I think?) in Tika…which is why you’re getting a failed build. Your

RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Do you have any luck if you call /metadata instead of /meta? That should trigger MetadataEP which will return Json, no? I'm not sure why we have both handlers, but we do... -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Thursday, December 18, 2014 9:56

Tika 2.0???

2014-12-18 Thread Allison, Timothy B.
(which unfortunately would break back compat, but in my mind would make a lot more sense) Chris Mattmann chris.mattm...@gmail.com -Original Message- From: Allison, Timothy B. talli...@mitre.org Reply-To: user@tika.apache.org Date: Thursday, December 18, 2014 at 7

RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
2014 at 15:20, Allison, Timothy B. talli...@mitre.org wrote: Do you have any luck if you call /metadata instead of /meta? I have no luck with that: Dec 18, 2014 3:55:21 PM org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod WARNING: No operation matching request path

RE: Outputting JSON from tika-server/meta

2014-12-19 Thread Allison, Timothy B.
All, With many thanks to Sergey, I added JSON and XMP to “/meta” and I folded MetadataEP into MetadataResource so that users can request specific metadata value(s). (TIKA-1497, TIKA-1499) I also added a new endpoint “/rmeta” that is equivalent to tika-app’s -J (TIKA-1498) – JSONified

RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Peter, I don’t have any immediate solutions, but there are two options in the pipeline (probably Tika 1.8): 1) Lewis John McGibbney on TIKA-894 is going to add a war/webapp. 2) I plan to open an issue related to TIKA-1330 that will make our current jax-rs tika-server more robust

RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Doh! My answer focused on my interests rather than your question. Sorry. By restart, I now assume you mean system restart… TIKA-894 should help with that if you configure your server container (tomcat?) to automatically start/restart. From: Allison, Timothy B. [mailto:talli...@mitre.org

JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Allison, Timothy B.
All, I recently noticed that I'm getting this message logged when there is an exception during parsing: SEVERE: Problem with writing the data, class org.apache.tika.server.TikaResource$5, ContentType: text/html We didn't get this message with Tika 1.6, but we are getting this with Tika 1.7

RE: Odp.: solr issue with pdf forms

2015-04-29 Thread Allison, Timothy B.
I completely agree with Erick about the utility of the TermsComponent to see what is actually being indexed. If you find problems there and if you haven't done so already, you might also investigate further down the stack. It might make sense to run the tika-app.jar (whichever version you are

RE: Odp.: solr issue with pdf forms

2015-04-30 Thread Allison, Timothy B.
: Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz Best Steve -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, 29 April 2015 14:16 To: solr-u...@lucene.apache.org Cc: user@tika.apache.org Subject: RE: Odp.: solr

FW: TIKA OCR not working

2015-04-27 Thread Allison, Timothy B.
Trung, I haven't experimented with our OCR parser yet, but this should give a good start: https://wiki.apache.org/tika/TikaOCR . Have you installed tesseract? Tika colleagues, Any other tips? What else has to be configured and how? -Original Message- From: trung.ht

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
. Thanks Regards Vijay On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote: I entirely agree with Erick -- it is best to isolate Tika in its own jvm if you can -- bad things can happen if you don't [1] [2]. Erick's blog on SolrJ is fantastic. If you want to have Tika

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
normally open in Adobe Reader and MS Office tools. Thanks Regards Vijay On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote: I entirely agree with Erick -- it is best to isolate Tika in its own jvm if you can -- bad things can happen if you don't [1] [2]. Erick's

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
[mailto:mouthgalya.ganapa...@fitchratings.com] Sent: Thursday, June 04, 2015 10:20 AM To: Allison, Timothy B.; talli...@apache.org Cc: user@tika.apache.org; Sauparna Sarkar Subject: RE: Memory issues with PDF parser Hi Timothy, Thanks for the prompt reply. 1.)Wouldn't fixing the null pointer exception in turn

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
[mailto:mouthgalya.ganapa...@fitchratings.com] Sent: Thursday, June 04, 2015 2:55 PM To: Allison, Timothy B. Cc: user@tika.apache.org; Sauparna Sarkar Subject: RE: Memory issues with PDF parser Thanks for the update Timothy, I see that Tika 1.9-SNAPSHOT is available in maven repo. I am going to try

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
Hi Mouthgalya, We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week. As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts for Tika 1.7 (may have been 1.8), but

RE: CSV Parser in Tika

2015-06-19 Thread Allison, Timothy B.
Y, that’s my belief. As of now, we’re treating them as text files, which can lead to some really long = bogus tokens in Lucene/Solr with analyzers that don’t split on commas. ☹ Detection without filename would be difficult. From: lewis john mcgibbney [mailto:lewi...@apache.org] Sent:

xml vs html parser

2015-06-16 Thread Allison, Timothy B.
All, On govdocs1, the xml parser's exceptions accounted for nearly a quarter of all thrown exceptions at one point (Tika 1.7ish). Typically, a file was mis-identified as xml when in fact it was sgml or some other text based file with some markup that wasn't meant to be xml. For kicks, I

RE: Extract PDF inline images

2015-07-06 Thread Allison, Timothy B.
Hi Andrea, The RecursiveParserWrapper, as you found, is only for extracted content and metadata. It was designed to cache metadata and content from embedded documents so that you can easily keep those two things together for each embedded document. To extract the raw bytes from embedded

RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Thank you, Ken! From: Ken Krugler [mailto:kkrugler_li...@transpac.com] Sent: Tuesday, July 21, 2015 10:23 AM To: user@tika.apache.org Subject: RE: robust Tika and Hadoop Hi Tim, Responses inline below. -- Ken From: Allison, Timothy B. Sent: July 21, 2015 5

FW: error Unsupported Media Type : while implementing ContentStreamUpdateRequestExample from the link http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

2015-07-22 Thread Allison, Timothy B.
What happens when you run straight tika-app against that pdf file? java -jar tika-app.jar Sample.pdf (grab tika-app from: http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.9.jar) Do you have all of the tika jars on your classpath/properly configured within your Solr setup? -Original

RE: robust Tika and Hadoop

2015-07-20 Thread Allison, Timothy B.
against things like NoSuchMethodErrors that can be thrown by Tika if the mime-type detection code tries to use a parser that we exclude, in order to keep the Hadoop job jar size to something reasonable. -- Ken From: Allison, Timothy B. Sent: July 15, 2015 4:38:56am

RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
to keep the Hadoop job jar size to something reasonable. -- Ken From: Allison, Timothy B. Sent: July 15, 2015 4:38:56am PDT To: user@tika.apache.org Subject: robust Tika and Hadoop All, I'd like to fill out our Wiki a bit more

robust Tika and Hadoop

2015-07-15 Thread Allison, Timothy B.
All, I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will

RE: Inconsistent (buggy) behavior when using tika-server

2015-07-14 Thread Allison, Timothy B.
That looks like a bug in TikaUtils. For whatever reason, when is.available() returns 0, we are then assuming that fileUrl is not null. We need to check to make sure that fileUrl is not null before trying to open the file. if (is.available() == 0 && !"".equals(fileUrl)) { ... return

RE: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-21 Thread Allison, Timothy B.
+0 (some regressions in ppt content) I just finished the batch comparison run on ~1.8 million files in our govdocs1 and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1. As a caveat, the eval code is still in development and there may be bugs in the reports. Results are here:

RE: Tika unable to extract PDF Text

2015-10-14 Thread Allison, Timothy B.
File works with Tika trunk. What's on your classpath: tika-app or just tika-core? Is there a chance that you don't have tika-parsers on your cp? -Original Message- From: Adam Retter [mailto:adam.ret...@googlemail.com] Sent: Wednesday, October 14, 2015 12:14 PM To:

RE: Extract PDF inline images

2015-07-07 Thread Allison, Timothy B.
); 2015-07-06 12:59 GMT+02:00 Allison, Timothy B. talli...@mitre.org: Hi Andrea, The RecursiveParserWrapper, as you found, is only for extracted content and metadata. It was designed to cache metadata and content from embedded documents so that you can easily keep

RE: TikaConfig with constructor args

2015-08-27 Thread Allison, Timothy B.
That’s on my todo list (TIKA-1508). Unfortunately, that doesn’t exist yet. I’d recommend for now following the pattern of the PDFParser or the TesseractOCRParser. The config is driven by a properties file. As soon as my dev laptop becomes unbricked, I’m going to turn to TIKA-1508. Given my

RE: tesseract issue

2015-09-09 Thread Allison, Timothy B.
You can build from source if you have an interest (and the bandwidth, time and disk space) or pull a nightly build if you don’t want to wait for 1.11, for example: https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/849/org.apache.tika$tika-app/ Thank you, Christian! Best, Tim

RE: RecursiveParser returning ContentHandler

2015-09-22 Thread Allison, Timothy B.
Y, that should be easy enough. Instead of the metadata list, we can store a list of Metadata+Handler pairs, the current “getMetadata()” can be syntactic sugar around the new getMetadataAndHandlers(). Please open a ticket and we can discuss there. Thank you. Best, Tim From:

RE: Maximizing performance when parsing a lot of files

2015-09-25 Thread Allison, Timothy B.
It's best to keep Tika in its own jvm. If you are working filesystem to filesystem... The simplest thing to do would be to call tika-batch via the commandline of tika-app every so often. By default, tika-batch will skip files that it has already processed if you run it again, but you will pay

RE: Questions about using AutoDetect and DigestParser

2016-01-05 Thread Allison, Timothy B.
>>Question1) Shouldn't this be more specific? Like PdfParser, >>OpenDocumentParser and so on. Y, make sure to call metadata.getValues("X-Parsed-By") which returns an array of values and then iterate through that array to see the parsers that actually processed your doc. If you call

RE: Questions about using AutoDetect and DigestParser

2016-01-08 Thread Allison, Timothy B.
it is related to my use of Scala. If I find the time I will try it again with Java to further pinpoint the problem. In the meantime I think I'll stick to java.security.MessageDigest. Kind regards -Original Message- Sent: Thursday, 07 January 2016 at 18:49:09 From: "Allison, Timothy B."

RE: Bypassing ExtractingRequestHandler

2016-06-14 Thread Allison, Timothy B.
text oriented. I have also thought about using DelimitedPayloadTokenFilter, which will increase the index size I imagine (how much, though?) and require more customization of Solr internals. I don't know which is the better approach. On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <ta

RE: Weird spacing in words

2016-05-31 Thread Allison, Timothy B.
for the help. Best regards, Augusto > On 31 May 2016, at 14:35, Allison, Timothy B. <talli...@mitre.org> wrote: > > PDFs don't necessarily include spaces. In some (many?) cases, code has to do > the calculation of character widths and locations on the page to determine > whether or

RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
2016 at 3:07 PM, Allison, Timothy B. <talli...@mitre.org> wrote: I’m not sure why you’d want to append document contents across documents into one handler. Typically, you’d use a new ContentHandler and new Metadata object for each parse. Calling “toSt

RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
I’m not sure why you’d want to append document contents across documents into one handler. Typically, you’d use a new ContentHandler and new Metadata object for each parse. Calling “toString()” does not clear the content handler, and you should have 20 copies of the extracted content on your
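The accumulation described above can be reproduced with a plain SAX handler from the JDK. This is a sketch under stated assumptions, not Tika code: `HandlerReuseDemo` and `TextCollector` are hypothetical names, and the JDK's XML parser stands in for a Tika parser. It contrasts one shared handler with a fresh handler per document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerReuseDemo {

    /** Hypothetical text collector; toString() reads the buffer without clearing it. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }

    /** One handler for every document: each result also contains all earlier text. */
    static List<String> parseWithSharedHandler(List<String> docs) {
        TextCollector shared = new TextCollector();
        List<String> results = new ArrayList<>();
        for (String doc : docs) {
            parse(doc, shared);
            results.add(shared.toString());
        }
        return results;
    }

    /** A new handler per document: each result holds only that document's text. */
    static List<String> parseWithFreshHandlers(List<String> docs) {
        List<String> results = new ArrayList<>();
        for (String doc : docs) {
            TextCollector fresh = new TextCollector();
            parse(doc, fresh);
            results.add(fresh.toString());
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("<d>a</d>");
        docs.add("<d>b</d>");
        docs.add("<d>c</d>");
        System.out.println(parseWithSharedHandler(docs));
        System.out.println(parseWithFreshHandlers(docs));
    }
}
```

With the shared handler, memory use grows with every document, because each stored string repeats everything parsed before it.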

RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
I'm reusing a single instance is to cut down on overhead (I have yet to time this). Steve On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. <talli...@mitre.org> wrote: In your actual code, are you using one BodyContentHandler for all of your files?

RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
. Steve On Tue, Feb 9, 2016 at 12:07 PM, Allison, Timothy B. <talli...@mitre.org> wrote: Same parser is ok to reuse…should even be ok in multithreaded applications. Do not reuse ContentHandler or Metadata objects. As a side note, if you are handling a bunch o

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
and _especially_ where you don't control the document corpus, > you have to build something far more tolerant as per Tim's comments. > > FWIW, > Erick > > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. > <talli...@mitre.org> > wrote: > > I completely agree o

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
and thus I have to implement my own. Can you confirm? Thanks Steve On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > x-post to Tika user's > > Y and n. If you run tika app as: > > java -jar tika-app.jar > > It runs tika-batch under

RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
The problem (I think) is that tika-parsers.jar includes just the Tika parsers (wrappers) around a boatload of actual parsers/dependencies (POI, PDFBox, etc). If you are using jars, I’d recommend the tika-app.jar which includes all dependencies. From: Steven White [mailto:swhite4...@gmail.com]

RE: tika is unable to extract outlook messages

2016-02-16 Thread Allison, Timothy B.
See my response to your question on the Solr users’ list here: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3CCY1PR09MB0795E8DBA7B2B6603A45820EC7A80%40CY1PR09MB0795.namprd09.prod.outlook.com%3E I don’t think this is a Tika problem. This is the standard way that Solr’s

RE: Using tika-app-1.11.jar

2016-02-11 Thread Allison, Timothy B.
Plan C: if you’re willing to store a mirror set of directories with the text versions of the files, just run tika-app.jar on your “input” directory and run your SolrJ loader on the “text/export” directory: java -jar tika-app.jar And, if you’re feeling jsonic: java -jar tika-app.jar -J -t -i

RE: script tags in LinkContentHandler

2016-04-06 Thread Allison, Timothy B.
On #2, I'd prefer not skipping elements. I definitely understand the use case to extract what a human can see, but I suspect if your email address ends in 'forensics.com', you'd probably like to see everything as well. -Original Message- From: Joseph Naegele

RE: Jempbox runtime error

2016-04-22 Thread Allison, Timothy B.
Hi Chris, Good to hear from you. We do still use Jempbox in 1.12 for the PDFParser and the JempboxExtractor. The RTF must have an embedded PDF or Jpeg or another image file. Is there any chance Maven is not smiling upon you with transitive dependencies? When you bundle your app are you

RE: Jempbox runtime error

2016-04-22 Thread Allison, Timothy B.

RE: Jempbox runtime error

2016-04-22 Thread Allison, Timothy B.
On 22 Apr 2016, at

RE: [VOTE] Release Apache Tika 1.13 Candidate #1

2016-05-11 Thread Allison, Timothy B.
+1 Built on Windows and Linux. I'm relying on earlier pre-release tests for no surprises. :) Thank you, Dave! -Original Message- From: David Meikle [mailto:loo...@gmail.com] On Behalf Of David Meikle Sent: Monday, May 9, 2016 3:35 PM To: d...@tika.apache.org; user@tika.apache.org

RE: Need Help

2016-05-11 Thread Allison, Timothy B.
Haven’t gotten around to this yet. Sorry. Anyone else have any input? From: harsh kumar [mailto:kumarhars...@gmail.com] Sent: Friday, May 6, 2016 8:48 AM To: Allison, Timothy B. <talli...@mitre.org> Subject: Re: Need Help Hey Timothy, Can you please help me with your findings of the T

RE: My "What's new with Apache Tika 2.0" talk slides

2016-05-11 Thread Allison, Timothy B.
Great slides. Thank you, Nick. Wish I could be there... Any feedback/guidance from the audience? -Original Message- From: Nick Burch [mailto:n...@apache.org] Sent: Wednesday, May 11, 2016 5:09 PM To: user@tika.apache.org Cc: d...@tika.apache.org Subject: My "What's new with Apache

RE: Tika response encoding problem

2016-05-16 Thread Allison, Timothy B.
Our AutoDetectReader does correctly identify the encoding in this case. Do we want to add logic that checks for ??, and if that doesn’t exist then use our AutoDetectReader? From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, May 16, 2016 11:15 AM To: user@tika.apache.org Subject

RE: Tika response encoding problem

2016-05-16 Thread Allison, Timothy B.
to fix this. From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, May 16, 2016 8:04 AM To: user@tika.apache.org Subject: RE: Tika response encoding problem >>I also tried to use tika-app, since I saw in --help that I can pass the >>--encoding parameter. So I ran: To clarif

RE: Tika response encoding problem

2016-05-16 Thread Allison, Timothy B.
>>I also tried to use tika-app, since I saw in --help that I can pass the >>--encoding parameter. So I ran: To clarify (you may already understand this, sorry)…the encoding parameter specifies the output encoding; it is not a hint to Tika in encoding detection. With trunk and 1.12 in Tika

RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-05-02 Thread Allison, Timothy B.
>> While PDFBox is a part of TIKA and the two projects are kindof "best friends >> forever" Thank you, Tilman! :) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Saturday, April 30, 2016 5:24 PM To: us...@pdfbox.apache.org Subject: Re: is it possible to

RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-05-02 Thread Allison, Timothy B.
The commandline I gave you outputs JSON files. If you open them in a text/JSON editor, you should see valid data. If they're corrupt, please let us know! If you're able to process JSON files, you should be good to go. Otherwise, the recommendation to use Java's ZipFile API and do the
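As a rough sketch of the ZipFile route mentioned above, the following lists every PDF entry in an archive, however deeply its folder is nested. It uses only `java.util.zip` from the JDK. `ZipPdfLister` and `listPdfEntries` are hypothetical names, and the text extraction step itself is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipPdfLister {

    /** Returns the names of all non-directory entries ending in ".pdf" (case-insensitive). */
    static List<String> listPdfEntries(Path zipPath) {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (!entry.isDirectory()
                        && name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    // zip.getInputStream(entry) would give the bytes to hand to a parser.
                    names.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Expects the path to a zip file as the first argument.
        listPdfEntries(Paths.get(args[0])).forEach(System.out::println);
    }
}
```

Zip entries carry their full path in the entry name, so no recursive directory walk is needed inside the archive.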

RE: Apache Tika wikipedia page

2016-04-15 Thread Allison, Timothy B.
Fantastic. Thank you! Have a great weekend! -Original Message- From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Friday, April 15, 2016 7:22 PM To: d...@tika.apache.org Cc: user@tika.apache.org Subject: Apache Tika wikipedia page Hi All, I made a Wikipedia

RE: Need Help

2016-04-18 Thread Allison, Timothy B.
Ha. I'm in the process of comparing mimetype detection results from DROID, Tika and 'file' on our TIKA-1302 corpus. After that, I was going to compare our different encoding detectors on the corpus...I'll have a better answer in a few weeks. Others on this list probably have more info, but

RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
Charset detection _should_ be thread safe. If you can help us track down the problem (unit test?), we need to fix this. Thank you for raising this. Best, Tim -Original Message- From: c.leitin...@lirum.at [mailto:c.leitin...@lirum.at] Sent: Monday, July 25, 2016 6:01 PM To:

RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
f (val == null) { return "NULL"; } else { return val; } } } -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, July 25, 2016 9:21 PM To: user@tika.apache.org Subject: RE: Is Tika (especially CharsetDe

RE: detect corrupt file and build a list of them before indexing in solr

2016-07-15 Thread Allison, Timothy B.
Checking for 0 byte files is one option. The other option is to configure the logs to capture exceptions. I’ve attached the config files and the shell script that I use when running our large scale regression testing here:
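The first option, checking for 0 byte files, could be sketched as follows with `java.nio.file` from the JDK. `EmptyFileFinder` and `findEmptyFiles` are hypothetical names; the idea is to compare the extracted-text output directory against this list after a run. Note that a zero-byte output is only a hint that something went wrong, not proof that the source file is corrupt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class EmptyFileFinder {

    /** Walks the tree under root and returns every regular file of length zero, sorted. */
    static List<Path> findEmptyFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(EmptyFileFinder::isEmpty)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean isEmpty(Path file) {
        try {
            return Files.size(file) == 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Expects the directory to scan as the first argument.
        findEmptyFiles(Paths.get(args[0])).forEach(System.out::println);
    }
}
```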

RE: Extract Text from a TIFF image

2016-07-18 Thread Allison, Timothy B.
You'll need to set up tesseract to run Optical Character Recognition. While we have an integration with OCR, it is not bundled within the app. See https://wiki.apache.org/tika/TikaOCR For kicks, I ran this through Tika+Tesseract; this is the output you get once you've set up Tesseract:

RE: Extract Text from a TIFF image

2016-07-19 Thread Allison, Timothy B.
ideal solution. How to get the same results Timothy got? Thanks Gord From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: July 18, 2016 2:25 PM To: user@tika.apache.org Subject: RE: Extract Text from a TIFF image You'll need to set up tesseract to run Optic

RE: detect corrupt file and build a list of them before indexing in solr

2016-07-15 Thread Allison, Timothy B.
TIKA_app1.12 2016-07-15 18:20 GMT+01:00 Allison, Timothy B. <talli...@mitre.org>: Can you share the shell script/bat file you’re using? From: kostali hassan [mailto:med.has.kost...@gmail.com] Sent: Friday, July 15, 2016

RE: RE: PDFPaser generates gibberish

2016-07-01 Thread Allison, Timothy B.
Ah, ok, nothing we can do about it then. Sorry. >One more thing… That sounds like a new line issue. Notepad doesn’t understand \n, whereas WordPad and MSWord do. From: Allison A. [mailto:alliso...@gmail.com] Sent: Friday, July 1, 2016 1:07 AM To: user@tika.apache.org Subject: Re: RE: PDFPaser
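If the extracted text has to open cleanly in an editor that expects Windows line endings, one possible workaround is to rewrite the line breaks after extraction. This is a minimal sketch, not something Tika does for you; `LineEndings` and `toCrlf` are hypothetical names.

```java
final class LineEndings {

    /** Rewrites bare \n line breaks as \r\n and leaves existing \r\n pairs as they are. */
    static String toCrlf(String text) {
        return text.replaceAll("\\r?\\n", "\r\n");
    }

    public static void main(String[] args) {
        String extracted = "first line\nsecond line\n";
        // Show the escapes so the change is visible on any console.
        System.out.println(toCrlf(extracted).replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```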

RE: Rest API Documentation

2017-01-23 Thread Allison, Timothy B.
Y, our license appears to have expired. Chris/Tyler, Any chance you could re-up our license? From: ネイト・フィンドリー [mailto:nat...@zenlok.com] Sent: Saturday, January 21, 2017 6:30 PM To: user@tika.apache.org Subject: Rest API Documentation The Miredot link no longer produces documentation. Is

RE: Extracting vector graphics from pdf

2017-02-28 Thread Allison, Timothy B.
This allows you to collect the lines. However it won't output an image. Tilman On 27.02.2017 at 13:20, Allison, Timothy B. wrote: > PDFBox Colleagues, >Any recommendations? > >Best, > > Tim > > -Original Message- > From: Andisa Dewi [ma

FW: Tika calling exiftool and ffmpeg?

2016-09-01 Thread Allison, Timothy B.
From: Chris Bamford [mailto:cbamf...@mimecast.com] Sent: Thursday, September 1, 2016 7:03 AM To: Subject: Tika calling exiftool and ffmpeg? Hi I recently noticed on my linux box in the auditd logs that my JVM is making repeated attempts to

RE: Apache Tikaで、PDFの本文内の文字が連続する現象発生

2016-09-14 Thread Allison, Timothy B.
Again, relying on google translate. Y, I would think that suppressing overlapping characters should solve this problem. Try pure PDFBox, and if the problem is there, try asking on the PDFBox list. いきなりですが、表記件についてご質問させてください。 Javaで、Apache Tikaで、PDFのパース処理をしています。

RE: Correction: Apache Tika: garbled characters when reading EUC or Shift-JIS encoded HTML

2016-09-14 Thread Allison, Timothy B.
Again, relying on Google translate. The problem with these files is that they don't self identify their encoding via http metaheaders, and they contain very little content so Mozilla's UniversalChardet and ICU4J don't have enough to work with. IE, Chrome and Firefox all fail on these files,

RE: Apache Tika: the entire text is garbled when importing a protected PDF

2016-09-14 Thread Allison, Timothy B.
If a PDF requires a password (and it isn't the empty string) and you have the password, you need to send it in via the ParseContext: ParseContext context = new ParseContext(); context.set(PasswordProvider.class, new PasswordProvider() { public String getPassword(Metadata

RE: Correction: Apache Tika: garbled characters when reading EUC or Shift-JIS encoded HTML

2016-09-14 Thread Allison, Timothy B.
Sorry, can't tell what the question is? -Original Message- From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] Sent: Wednesday, September 14, 2016 11:50 AM To: Allison, Timothy B. <talli...@mitre.org> Subject: Re: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字

RE: Correction: Apache Tika: garbled characters when reading EUC or Shift-JIS encoded HTML

2016-09-14 Thread Allison, Timothy B.
ilto:question.answer...@gmail.com] Sent: Wednesday, September 14, 2016 11:06 AM To: user@tika.apache.org Cc: Allison, Timothy B. <talli...@mitre.org> Subject: Re: Correction: Apache Tika: garbled characters when reading EUC or Shift-JIS encoded HTML Thank you for your answer. I, character code of the file can not be determined EUC

RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
> I'll try to get a sample HTML yielding to this problem and attach it to Jira. Great! Tika 1.14 is around the corner...if this is an easy fix ... :) Thank you.

RE: Is creating new AutoDetectParsers expensive?

2016-09-30 Thread Allison, Timothy B.
You can reuse AutoDetectParser in a multithreaded environment. You shouldn’t have problems with performance or thread safety. If you find otherwise, please let us know! ☺ From: Haris Osmanagic [mailto:haris.osmana...@gmail.com] Sent: Friday, September 30, 2016 10:36 AM To: user@tika.apache.org

RE: Is creating new AutoDetectParsers expensive?

2016-09-30 Thread Allison, Timothy B.
6 at 4:46 PM Allison, Timothy B. <talli...@mitre.org> wrote: You can reuse AutoDetectParser in a multithreaded environment. You shouldn’t have problems with performance or thread safety. If you find otherwise, please let us know! ☺ From: Haris Osmanagic [

RE: PDF Processing

2016-11-07 Thread Allison, Timothy B.
that the best way to request enhancements is to create a JIRA entry so it can be tracked? Thanks for your help, Jim From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, November 2, 2016 19:02 To: user@tika.apache.org Subject: RE: PDF Processing It d

RE: Tika server RTF processing

2016-11-28 Thread Allison, Timothy B.
A. [mailto:alliso...@gmail.com] Sent: Thursday, November 24, 2016 10:39 PM To: user@tika.apache.org Subject: Re: Tika server RTF processing Oops, I am re-posting and attaching them. It seems Ajax calls are not passed properly. On Thu, Nov 24, 2016 at 7:06 AM, Allison, Timothy B. <talli...@mitre.

RE: Temporary Files Location

2016-11-23 Thread Allison, Timothy B.
Have you tried via java opt: -Djava.io.tmpdir=/someotherdir From: Vérène Houdebine [mailto:verene.houdeb...@orange.fr] Sent: Wednesday, November 23, 2016 8:38 AM To: user Subject: Temporary Files Location Hi! I'm using Tika on a partitioned server that doesn't have much
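One way to check whether the -Djava.io.tmpdir option suggested above has taken effect is to ask the JVM directly. The sketch below uses only the standard library: it prints the java.io.tmpdir system property and creates a scratch file with Files.createTempFile, which places files in that directory by default. The class name TmpDirCheck and the file prefix are hypothetical, and the sketch does not show which temporary files Tika itself creates.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic class, not part of Tika.
final class TmpDirCheck {

    // The directory the JVM uses for temporary files by default.
    static String configuredTmpDir() {
        return System.getProperty("java.io.tmpdir");
    }

    // Create a scratch file in the default temporary-file directory.
    static Path createScratchFile() {
        try {
            Path scratch = Files.createTempFile("tmpdir-check-", ".tmp");
            scratch.toFile().deleteOnExit(); // clean up when the JVM exits
            return scratch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.io.tmpdir = " + configuredTmpDir());
        System.out.println("scratch file   = " + createScratchFile());
    }
}
```

Running it as, for example, java -Djava.io.tmpdir=/someotherdir TmpDirCheck.java should print the overridden directory, provided that directory exists and is writable.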

RE: Tika server RTF processing

2016-11-23 Thread Allison, Timothy B.
There was a bug in some RTF files in 1.13, but that was fixed in 1.14 (TIKA-1845). We now have one rtf in our test suite for tika-server. If you turn logging on, can you share a stacktrace, or can you share the offending file? From: Allison A. [mailto:alliso...@gmail.com] Sent: Tuesday,
