Not sure if this is an issue for PDFBox or Tika, but I noticed that PDFBox's
PDFTextStripper is not extracting information from the form fields in a batch of
PDF documents I'm processing. Is anyone else having this problem?
I regret that I'm unable to send an example document.
Inelegant solution
I have a PDF document with a docx attachment. I wasn't having luck getting the
contents of the docx with tika.parseToString(file).
I dug around a bit in the PDFExtractor and found that when I changed this line:
embeddedExtractor.parseEmbedded(
stream,
new
This looks like an area for a new feature in both Tika and POI. I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from autoshapes. I'll open an issue in both projects.
-Original Message-
From: Hiroshi Tatsumi
. I've opened https://issues.apache.org/jira/browse/TIKA-1150
for the longer term fix.
There's some work going on on XSSFTextCell in POI that might make this more
straightforward.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
/show_bug.cgi?id=55292
It would be great if you could give me a patch.
Thanks,
Hiroshi Tatsumi
-Original Message-
From: Allison, Timothy B.
Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+
Hiroshi,
To fix this on your
All,
As always, apologies for the cluelessness the following reveals... I'm
starting to move from embedded Tika to a server option for greater robustness.
Is the jax-rs server intended not to handle embedded files recursively? If so,
how are users currently handling multiply embedded
DefaultHandler is effectively a NullHandler; it doesn't store or do anything.
Try BodyContentHandler, ToXMLContentHandler, or maybe WriteOutContentHandler.
If you want to write out each embedded file as a binary, try subclassing
EmbeddedResourceHandler.
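The handler advice above can be sketched as follows; a minimal example assuming tika-core and tika-parsers are on the classpath (the class name and sample text are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class HandlerExample {

    // Parse a document and return its plain text. A DefaultHandler here
    // would silently discard everything; BodyContentHandler stores it.
    static String parseToText(byte[] bytes) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream is = new ByteArrayInputStream(bytes)) {
            parser.parse(is, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseToText("hello from tika".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Swap in ToXMLContentHandler if you want the structured XHTML output instead of plain text.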
QUOTE:
returned from handler.toString().
How can I map a fileName to its content?
thanks,
yeshwanth
On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B.
talli...@mitre.orgmailto:talli...@mitre.org wrote:
DefaultHandler is effectively a NullHandler; it doesn't store or do anything.
Try BodyContentHandler
Did you try the ToXMLHandler?
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 4:50 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question
hi tim,
i tried in all possible ways,
instead of reading entire zip file i parsed individual zipentries,
but even
Good to hear. Let us know if you have any other questions or when you run into
surprises.
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question
hi tim,
i forgot to change the BodyContentHandler
Hi Sergey,
I'd take a look at what the DataImportHandler in Solr does. If you want to
store the field, you need to create the field with a String (as opposed to a
Reader); which means you have to have the whole thing in memory. Also, if
you're proposing adding a field entry in a
...@gmail.com]
Sent: Friday, July 11, 2014 1:38 PM
To: user@tika.apache.org
Subject: Re: How to index the parsed content effectively
Hi Tim, All.
On 02/07/14 14:32, Allison, Timothy B. wrote:
Hi Sergey,
I'd take a look at what the DataImportHandler in Solr does. If you want
to store
I'm working on adding a daemon to Tika Server so that it will restart when it
hits an OOM or other big problem (infinite hangs). That won't be available
until Tika 1.7.
To amplify Nick's recommendations:
ForkParser or Server are your best options for now.
Are there specific files/file
+1
Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7
I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all
formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs.
There were several improvements in text extraction
AarKay,
We have a unit test for an MSG embedded within an MSG in
POIContainerExtractionTest. I also just tried a newly created msg within an
msg file, and I can extract the embedded content with
TikaTest.RecursiveMetaParser. This suggests that the issue is not within the
OutlookParser.
library to serialize/deserialize Metadata objects
in tika-serialization.
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Friday, September 26, 2014 6:54 AM
To: user@tika.apache.org
Subject: RE: Apache Tika - JSON?
The current json output option in the app and server only dump metadata
I’ve seen this before on a few documents. You might experiment with setting
PDFParserConfig’s suppressDuplicateOverlappingText to true. If that doesn’t
work, I’d recommend running the pure PDFBox app’s ExtractText on the document.
If you get the same doubling of letters, ask over on
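The suppressDuplicateOverlappingText setting mentioned above can be wired up roughly like this (a sketch assuming tika-parsers is on the classpath; the class name is illustrative, and the returned context would be passed to parser.parse(stream, handler, metadata, context)):

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;

public class PdfConfigExample {

    // Build a ParseContext that asks the PDF parser to drop duplicate
    // overlapping characters (a common cause of doubled letters in output).
    static ParseContext suppressDuplicatesContext() {
        PDFParserConfig config = new PDFParserConfig();
        config.setSuppressDuplicateOverlappingText(true);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }

    public static void main(String[] args) {
        System.out.println(suppressDuplicatesContext().get(PDFParserConfig.class) != null);
    }
}
```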
I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to
put in a plug for the RecursiveParserWrapper, which may be of use for you.
I’ve been intending to add that to the app commandline and to server…how are
you handling embedded document metadata? Would the wrapper be
All,
What is the difference between an internal and an external Property? I'm not
(quickly) seeing how Metadata is using that Boolean. Are there other pieces of
code that make use of the distinction?
Thank you.
Best,
Tim
Y, sorry. As you point out, that should be fixed in PDFBox 1.8.8. A vote was
just taken for that, so that will be out very soon. Last I looked at
integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test
(I think?) in Tika…which is why you’re getting a failed build. Your
Do you have any luck if you call /metadata instead of /meta?
That should trigger MetadataEP, which will return JSON, no?
I'm not sure why we have both handlers, but we do...
-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, December 18, 2014 9:56
(which unfortunately would break back
compat, but in my mind would make a lot more sense)
Chris Mattmann
chris.mattm...@gmail.com
-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: user@tika.apache.org
Date: Thursday, December 18, 2014 at 7
2014 at 15:20, Allison, Timothy B.
talli...@mitre.orgmailto:talli...@mitre.org wrote:
Do you have any luck if you call /metadata instead of /meta?
I have no luck with that:
Dec 18, 2014 3:55:21 PM org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod
WARNING: No operation matching request path
All,
With many thanks to Sergey, I added JSON and XMP to “/meta” and I folded in
MetadataEP into MetadataResource so that users can request a specific metadata
value(s). (TIKA-1497, TIKA-1499)
I also added a new endpoint “/rmeta” that is equivalent to tika-app’s -J
(TIKA-1498) – JSONified
Peter,
I don’t have any immediate solutions, but there are two options in the
pipeline (probably Tika 1.8):
1) Lewis John McGibbney on TIKA-894 is going to add a war/webapp.
2) I plan to open an issue related to TIKA-1330 that will make our current
jax-rs tika-server more robust
Doh! My answer focused on my interests rather than your question. Sorry. By
restart, I now assume you mean system restart… TIKA-894 should help with that
if you configure your server container (tomcat?) to automatically start/restart.
From: Allison, Timothy B. [mailto:talli...@mitre.org
All,
I recently noticed that I'm getting this message logged when there is an
exception during parsing:
SEVERE: Problem with writing the data, class
org.apache.tika.server.TikaResource$5, ContentType: text/html
We didn't get this message with Tika 1.6, but we are getting this with Tika 1.7
I completely agree with Erick about the utility of the TermsComponent to see
what is actually being indexed. If you find problems there and if you haven't
done so already, you might also investigate further down the stack. It might
make sense to run the tika-app.jar (whichever version you are
:
Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz
Best
Steve
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, 29 April 2015 14:16
To: solr-u...@lucene.apache.org
Cc: user@tika.apache.org
Subject: RE: Odp.: solr
Trung,
I haven't experimented with our OCR parser yet, but this should give a good
start: https://wiki.apache.org/tika/TikaOCR .
Have you installed tesseract?
Tika colleagues,
Any other tips? What else has to be configured and how?
-Original Message-
From: trung.ht
.
Thanks Regards
Vijay
On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:
I entirely agree with Erick -- it is best to isolate Tika in its own jvm
if you can -- bad things can happen if you don't [1] [2].
Erick's blog on SolrJ is fantastic. If you want to have Tika
normally open in Adobe Reader and MS Office tools.
Thanks Regards
Vijay
On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org
wrote:
I entirely agree with Erick -- it is best to isolate Tika in its own jvm
if you can -- bad things can happen if you don't [1] [2].
Erick's
[mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; talli...@apache.org
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser
Hi Timothy,
Thanks for the prompt reply.
1.) Wouldn't fixing the null pointer exception in turn
[mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 2:55 PM
To: Allison, Timothy B.
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser
Thanks for the update Timothy,
I see that Tika 1.9-SNAPSHOT is available in the Maven repo. I am going to try
Hi Mouthgalya,
We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the
fix will be available in Tika 1.9, which should be out within a week.
As for memory issues, we worked around a memory leak in PDFBox with static
caching of fonts for Tika 1.7 (may have been 1.8), but
Y, that’s my belief.
As of now, we’re treating them as text files, which can lead to some really
long (and bogus) tokens in Lucene/Solr with analyzers that don’t split on commas. ☹
Detection without filename would be difficult.
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent:
All,
On govdocs1, the xml parser's exceptions accounted for nearly a quarter of
all thrown exceptions at one point (Tika 1.7ish). Typically, a file was
mis-identified as xml when in fact it was sgml or some other text based file
with some markup that wasn't meant to be xml.
For kicks, I
Hi Andrea,
The RecursiveParserWrapper, as you found, is only for extracted content and
metadata. It was designed to cache metadata and content from embedded
documents so that you can easily keep those two things together for each
embedded document.
To extract the raw bytes from embedded
Thank you, Ken!
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, July 21, 2015 10:23 AM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop
Hi Tim,
Responses inline below.
-- Ken
From: Allison, Timothy B.
Sent: July 21, 2015 5
What happens when you run straight tika-app against that pdf file?
java -jar tika-app.jar Sample.pdf
(grab tika-app from: http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.9.jar)
Do you have all of the tika jars on your classpath/properly configured within
your Solr setup?
-Original
against things like NoSuchMethodErrors that
can be thrown by Tika if the mime-type detection code tries to use a parser
that we exclude, in order to keep the Hadoop job jar size to something
reasonable.
-- Ken
From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am
to keep the Hadoop job jar size to something
reasonable.
-- Ken
From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.orgmailto:user@tika.apache.org
Subject: robust Tika and Hadoop
All,
I'd like to fill out our Wiki a bit more
All,
I'd like to fill out our Wiki a bit more on using Tika robustly within
Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't
looked carefully into these packages yet.
Does anyone have any recommendations for specific configurations/design
patterns that will
That looks like a bug in TikaUtils.
For whatever reason, when is.available() returns 0, we are then assuming that
fileUrl is not null. We need to check to make sure that fileUrl is not null
before trying to open the file.
if (is.available() == 0 && !"".equals(fileUrl)) {
...
return
+0 (some regressions in ppt content)
I just finished the batch comparison run on ~1.8 million files in our govdocs1
and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1. As a caveat, the eval
code is still in development and there may be bugs in the reports.
Results are here:
File works with Tika trunk. What's on your classpath: tika-app or just
tika-core? Is there a chance that you don't have tika-parsers on your cp?
-Original Message-
From: Adam Retter [mailto:adam.ret...@googlemail.com]
Sent: Wednesday, October 14, 2015 12:14 PM
To:
);
2015-07-06 12:59 GMT+02:00 Allison, Timothy B.
talli...@mitre.orgmailto:talli...@mitre.org:
Hi Andrea,
The RecursiveParserWrapper, as you found, is only for extracted content and
metadata. It was designed to cache metadata and content from embedded
documents so that you can easily keep
That’s on my todo list (TIKA-1508). Unfortunately, that doesn’t exist yet.
I’d recommend for now following the pattern of the PDFParser or the
TesseractOCRParser. The config is driven by a properties file.
As soon as my dev laptop becomes unbricked, I’m going to turn to TIKA-1508.
Given my
You can build from source if you have an interest (and the bandwidth, time and
disk space) or pull a nightly build if you don’t want to wait for 1.11, for
example:
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/849/org.apache.tika$tika-app/
Thank you, Christian!
Best,
Tim
Y, that should be easy enough. Instead of the metadata list, we can store a
list of Metadata+Handler pairs, the current “getMetadata()” can be syntactic
sugar around the new getMetadataAndHandlers().
Please open a ticket and we can discuss there.
Thank you.
Best,
Tim
From:
It's best to keep Tika in its own jvm.
If you are working filesystem to filesystem... The simplest thing to do would
be to call tika-batch via the commandline of tika-app every so often. By
default, tika-batch will skip files that it has already processed if you run it
again, but you will pay
>>Question1) Shouldn't this be more specific? Like PdfParser,
>>OpenDocumentParser and so on.
Y, make sure to call metadata.getValues("X-Parsed-By"), which returns an array
of values, and then iterate through that array to see the parsers that actually
processed your doc. If you call
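A rough sketch of iterating the X-Parsed-By values (assumes tika-core on the classpath; the Metadata here is populated by hand for illustration, whereas in real use the parser fills it in during parsing):

```java
import org.apache.tika.metadata.Metadata;

public class ParsedByExample {

    // Return every parser recorded in the X-Parsed-By metadata values.
    static String[] parsedBy(Metadata metadata) {
        return metadata.getValues("X-Parsed-By");
    }

    public static void main(String[] args) {
        // Simulated post-parse metadata; real values come from the parse itself.
        Metadata metadata = new Metadata();
        metadata.add("X-Parsed-By", "org.apache.tika.parser.DefaultParser");
        metadata.add("X-Parsed-By", "org.apache.tika.parser.pdf.PDFParser");
        for (String parser : parsedBy(metadata)) {
            System.out.println(parser);
        }
    }
}
```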
it is related to my use of
Scala. If I find the time I will try it again with Java to further pinpoint the
problem. In the meantime I think I'll stick to java.security.MessageDigest.
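The java.security.MessageDigest approach mentioned there can be sketched like this (the hex helper and class name are ours):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestExample {

    // Compute the SHA-256 digest of the given bytes and hex-encode it.
    static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff)); // mask to unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha256Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```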
Kind regards
-Original Message-
Sent: Thursday, 07 January 2016 um 18:49:09 Uhr
From: "Allison, Timothy B."
text oriented. I have also thought about using
DelimitedPayloadTokenFilter, which will increase the index size I imagine (how
much, though?) and require more customization of Solr internals. I don't know
which is the better approach.
On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <ta
for the help.
Best regards,
Augusto
> On 31 May 2016, at 14:35, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> PDFs don't necessarily include spaces. In some (many?) cases, code has to do
> the calculation of character widths and locations on the page to determine
> whether or
2016 at 3:07 PM, Allison, Timothy B.
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
I’m not sure why you’d want to append document contents across documents into
one handler. Typically, you’d use a new ContentHandler and new Metadata object
for each parse. Calling “toSt
I’m not sure why you’d want to append document contents across documents into
one handler. Typically, you’d use a new ContentHandler and new Metadata object
for each parse. Calling “toString()” does not clear the content handler, and
you should have 20 copies of the extracted content on your
The reason I'm reusing a single instance is to cut down on
overhead (I have yet to time this).
Steve
On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B.
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
In your actual code, are you using one BodyContentHandler for all of your
files?
.
Steve
On Tue, Feb 9, 2016 at 12:07 PM, Allison, Timothy B.
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
Same parser is ok to reuse…should even be ok in multithreaded applications.
Do not reuse ContentHandler or Metadata objects.
As a side note, if you are handling a bunch o
and _especially_ where you don't control the document corpus,
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> <talli...@mitre.org>
> wrote:
> > I completely agree o
and
thus I have to implement my own. Can you confirm?
Thanks
Steve
On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:
> x-post to Tika user's
>
> Y and n. If you run tika app as:
>
> java -jar tika-app.jar
>
> It runs tika-batch under
The problem (I think) is that tika-parsers.jar includes just the Tika parsers
(wrappers) around a boatload of actual parsers/dependencies (POI, PDFBox, etc).
If you are using jars, I’d recommend the tika-app.jar which includes all
dependencies.
From: Steven White [mailto:swhite4...@gmail.com]
See my response to your question on the Solr users’ list here:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3CCY1PR09MB0795E8DBA7B2B6603A45820EC7A80%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
I don’t think this is a Tika problem. This is the standard way that Solr’s
Plan C: if you’re willing to store a mirror set of directories with the text
versions of the files, just run tika-app.jar on your “input” directory and run
your SolrJ loader on the “text/export” directory:
java -jar tika-app.jar
And, if you’re feeling jsonic:
java -jar tika-app.jar -J -t -i
On #2, I'd prefer not skipping elements. I definitely understand the use case
to extract what a human can see, but I suspect if your email address ends in
'forensics.com', you'd probably like to see everything as well.
-Original Message-
From: Joseph Naegele
Hi Chris,
Good to hear from you. We do still use Jempbox in 1.12 for the PDFParser and
the JempboxExtractor. The RTF must have an embedded PDF or Jpeg or another
image file.
Is there any chance Maven is not smiling upon you with transitive
dependencies? When you bundle your app are you
On 22 Apr 2016, at
+1
Built on Windows and Linux. I'm relying on earlier pre-release tests for no
surprises. :)
Thank you, Dave!
-Original Message-
From: David Meikle [mailto:loo...@gmail.com] On Behalf Of David Meikle
Sent: Monday, May 9, 2016 3:35 PM
To: d...@tika.apache.org; user@tika.apache.org
Haven’t gotten around to this yet. Sorry.
Anyone else have any input?
From: harsh kumar [mailto:kumarhars...@gmail.com]
Sent: Friday, May 6, 2016 8:48 AM
To: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Need Help
Hey Timothy,
Can you please help me with your findings of the T
Great slides. Thank you, Nick. Wish I could be there...
Any feedback/guidance from the audience?
-Original Message-
From: Nick Burch [mailto:n...@apache.org]
Sent: Wednesday, May 11, 2016 5:09 PM
To: user@tika.apache.org
Cc: d...@tika.apache.org
Subject: My "What's new with Apache
Our AutoDetectReader does correctly identify the encoding in this case.
Do we want to add logic that checks for ??, and if that doesn’t exist
then use our AutoDetectReader?
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, May 16, 2016 11:15 AM
To: user@tika.apache.org
Subject
to fix this.
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, May 16, 2016 8:04 AM
To: user@tika.apache.org
Subject: RE: Tika response encoding problem
>>I also tried to use tika-app, since I saw in --help that I can pass the
>>--encoding parameter. So I ran:
To clarif
>>I also tried to use tika-app, since I saw in --help that I can pass the
>>--encoding parameter. So I ran:
To clarify (you may already understand this, sorry)…the encoding parameter
specifies the output encoding; it is not a hint to Tika in encoding detection.
With trunk and 1.12 in Tika
>> While PDFBox is a part of TIKA and the two projects are kindof "best friends
>> forever"
Thank you, Tilman! :)
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, April 30, 2016 5:24 PM
To: us...@pdfbox.apache.org
Subject: Re: is it possible to
The commandline I gave you outputs JSON files. If you open them in a text/JSON
editor, you should see valid data. If they're corrupt, please let us know!
If you're able to process JSON files, you should be good to go. Otherwise, the
recommendation to use Java's ZipFile API and do the
Fantastic. Thank you!
Have a great weekend!
-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Friday, April 15, 2016 7:22 PM
To: d...@tika.apache.org
Cc: user@tika.apache.org
Subject: Apache Tika wikipedia page
Hi All,
I made a Wikipedia
Ha. I'm in the process of comparing mimetype detection results from DROID,
Tika and 'file' on our TIKA-1302 corpus.
After that, I was going to compare our different encoding detectors on the
corpus...I'll have a better answer in a few weeks.
Others on this list probably have more info, but
Charset detection _should_ be thread safe. If you can help us track down the
problem (unit test?), we need to fix this.
Thank you for raising this.
Best,
Tim
-Original Message-
From: c.leitin...@lirum.at [mailto:c.leitin...@lirum.at]
Sent: Monday, July 25, 2016 6:01 PM
To:
if (val == null) {
    return "NULL";
} else {
    return val;
}
}
}
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 25, 2016 9:21 PM
To: user@tika.apache.org
Subject: RE: Is Tika (especially CharsetDe
Checking for 0 byte files is one option. The other option is to configure the
logs to capture exceptions. I’ve attached the config files and the shell
script that I use when running our large scale regression testing here:
You'll need to set up tesseract to run Optical Character Recognition. While we
have an integration with OCR, it is not bundled within the app.
See https://wiki.apache.org/tika/TikaOCR
For kicks, I ran this through Tika+Tesseract; this is the output you get once
you've set up Tesseract:
ideal solution. How to get the same results Timothy got?
Thanks
Gord
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: July 18, 2016 2:25 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: Extract Text from a TIFF image
You'll need to set up tesseract to run Optic
TIKA_app1.12
2016-07-15 18:20 GMT+01:00 Allison, Timothy B.
<talli...@mitre.org<mailto:talli...@mitre.org>>:
Can you share the shell script/bat file you’re using?
From: kostali hassan
[mailto:med.has.kost...@gmail.com<mailto:med.has.kost...@gmail.com>]
Sent: Friday, July 15, 2016
Ah, ok, nothing we can do about it then. Sorry.
>One more thing…
That sounds like a newline issue. Notepad doesn’t understand a bare \n, whereas
WordPad and MS Word do.
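A minimal illustration of that newline issue (a hypothetical helper that converts bare LF line endings to CRLF so Notepad renders the breaks; it leaves existing CRLF pairs alone):

```java
public class NewlineExample {

    // Normalize line endings to CRLF; idempotent on input that is
    // already CRLF, since the optional \r is consumed by the regex.
    static String toCrlf(String text) {
        return text.replaceAll("\r?\n", "\r\n");
    }

    public static void main(String[] args) {
        System.out.println(toCrlf("line one\nline two").contains("\r\n"));
    }
}
```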
From: Allison A. [mailto:alliso...@gmail.com]
Sent: Friday, July 1, 2016 1:07 AM
To: user@tika.apache.org
Subject: Re: RE: PDFParser
Y, our license appears to have expired.
Chris/Tyler,
Any chance you could re-up our license?
From: ネイト・フィンドリー [mailto:nat...@zenlok.com]
Sent: Saturday, January 21, 2017 6:30 PM
To: user@tika.apache.org
Subject: Rest API Documentation
The Miredot link no longer produces documentation. Is
This allows you to collect the lines. However, it won't output an image.
Tilman
Am 27.02.2017 um 13:20 schrieb Allison, Timothy B.:
> PDFBox Colleagues,
>Any recommendations?
>
>Best,
>
> Tim
>
> -Original Message-
> From: Andisa Dewi [ma
From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Thursday, September 1, 2016 7:03 AM
To:
Subject: Tika calling exiftool and ffmpeg?
Hi
I recently noticed on my linux box in the auditd logs that my JVM is making
repeated attempts to
Again, relying on google translate. Y, I would think that suppressing
overlapping characters should solve this problem. Try pure PDFBox, and if the
problem is there, try asking on the PDFBox list.
Sorry for the sudden question, but let me ask about the subject above.
I'm parsing PDFs with Apache Tika in Java.
Again, relying on Google translate.
The problem with these files is that they don't self identify their encoding
via http metaheaders, and they contain very little content so Mozilla's
UniversalChardet and ICU4J don't have enough to work with. IE, Chrome and
Firefox all fail on these files,
If a PDF requires a password (and it isn't the empty string) and you have the
password, you need to send it in via the ParseContext:
ParseContext context = new ParseContext();
context.set(PasswordProvider.class, new PasswordProvider() {
    public String getPassword(Metadata metadata) {
        return "the-password"; // supply the actual document password here
    }
});
Sorry, can't tell what the question is?
-Original Message-
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
Sent: Wednesday, September 14, 2016 11:50 AM
To: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Correction: garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika
ilto:question.answer...@gmail.com]
Sent: Wednesday, September 14, 2016 11:06 AM
To: user@tika.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Correction: garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika
Thank you for your answer.
The character encoding of the file cannot be determined to be EUC
> I'll try to get a sample HTML yielding to this problem and attach it to Jira.
Great! Tika 1.14 is around the corner...if this is an easy fix ... :)
Thank you.
You can reuse AutoDetectParser in a multithreaded environment. You shouldn’t
have problems with performance or thread safety.
If you find otherwise, please let us know! ☺
From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:36 AM
To: user@tika.apache.org
6 at 4:46 PM Allison, Timothy B.
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
You can reuse AutoDetectParser in a multithreaded environment. You shouldn’t
have problems with performance or thread safety.
If you find otherwise, please let us know! ☺
From: Haris Osmanagic
[
that the best way to request enhancements is to create a JIRA entry
so it can be tracked?
Thanks for your help,
Jim
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, November 2, 2016 19:02
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: PDF Processing
It d
A. [mailto:alliso...@gmail.com]
Sent: Thursday, November 24, 2016 10:39 PM
To: user@tika.apache.org
Subject: Re: Tika server RTF processing
Oops, I am re-posting and attaching them. It seems Ajax calls are not passed
properly.
On Thu, Nov 24, 2016 at 7:06 AM, Allison, Timothy B.
<talli...@mitre.
Have you tried via java opt:
-Djava.io.tmpdir=/someotherdir
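To confirm where the JVM is actually putting temporary files (Tika's temp files follow java.io.tmpdir), a small sketch with an illustrative class name:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TmpDirExample {

    // Create a temp file in the JVM's temp directory; redirect the
    // directory with -Djava.io.tmpdir=/someotherdir at JVM startup.
    static Path newTempFile() throws IOException {
        Path tmp = Files.createTempFile("tika-example-", ".tmp");
        tmp.toFile().deleteOnExit();
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        System.out.println("created: " + newTempFile());
    }
}
```

Note that java.io.tmpdir must be set on the command line; changing it at runtime is not reliably honored.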
From: Vérène Houdebine [mailto:verene.houdeb...@orange.fr]
Sent: Wednesday, November 23, 2016 8:38 AM
To: user
Subject: Temporary Files Location
Hi!
I'm using Tika on a partitioned server that doesn't have much
There was a bug in some RTF files in 1.13, but that was fixed in 1.14
(TIKA-1845). We now have one rtf in our test suite for tika-server.
If you turn logging on, can you share a stacktrace, or can you share the
offending file?
From: Allison A. [mailto:alliso...@gmail.com]
Sent: Tuesday,