On Tue, 21 Nov 2017, Jim Idle wrote:
Following up on this, I will try cancelling my thread based tasks after
a pre-set time limit. That is only going to work if Tika and the
underlying parsers behave correctly with the interrupted exception.
Anyone had any success with that? I am mainly looking
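A minimal sketch of the time-limit approach described above, using only java.util.concurrent. The class name, the helper and the sleeping task are hypothetical stand-ins for a real parse call; whether a real parser stops depends on it honouring the interrupt, which is exactly the open question.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedTaskSketch {
    /** Runs the task and returns its result, or null if it exceeds the limit. */
    static String runWithLimit(Callable<String> task, long limitMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(task);
        try {
            return future.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread. The work only stops
            // if the task actually responds to interruption.
            future.cancel(true);
            return null;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Hypothetical slow "parse": sleeps, and honours interruption.
        Callable<String> slow = () -> { Thread.sleep(5_000); return "late"; };
        Callable<String> fast = () -> "done";
        System.out.println(runWithLimit(fast, 1_000)); // completes within the limit
        System.out.println(runWithLimit(slow, 100));   // times out, so null
    }
}
```

A task that swallows InterruptedException or blocks in non-interruptible I/O would keep running after cancel(true), so the thread is not guaranteed to be reclaimed.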
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
Does Tika library provide an efficient binary file check?
How do you define "binary"?
Only things with a mimetype that starts text/ ? Or do you want to include
application/xml files? Or things that extend from XML like DIF and
FictionBook? Only
On Fri, 12 Jan 2018, Luís Filipe Nassif wrote:
I can list some of them currently needing temp files: jpeg, zip (for
detection) and derived (docx, xlsx, pptx), ole2 (for detection) and derived
(doc, xls, ppt), mdb, pst, rar, 7zip, sqlite...
I've had a go at recording this in the Wiki, along with
On Fri, 12 Jan 2018, Martin Todorov wrote:
We're working on implementing a new artifact repository manager. Most of
the files in the repositories will be binaries (usually archives such as
jar, war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily, or limited
to just these).
Unfortunately, y
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
I am not an expert on mime types and how they extend. My definition of
binary is any file that is not in human readable form. Any other file,
I'd like to index. Would that answer your question?
Some of us humans here can read a wide range of form
On Fri, 19 Jan 2018, Kudrettin Güleryüz wrote:
One more thing, regarding application/xml vs text/xml
I think I'll skip application/xml for now and just include text/xml
Assuming application/xml is compressed XML such as Open office documents
and text/xml as uncompressed XML
Nope! They're both
On Mon, 5 Feb 2018, Matteo Alessandroni wrote:
I'm using Apache Tika to detect a file Mime Type from its base64
representation. Unfortunately I don't have other info about the file
(e.g. extension).
and it gives me "text/plain" for JSON and PDF files, but I would like to
obtain a more specifi
On Mon, 19 Feb 2018, Mark Kerzner wrote:
Is that a good approach? Is the 10 seconds time normal? I am using the
latest most powerful Mac and I get similar results on an i7 processor in
Ubuntu.
Tika uses the open source Tesseract OCR engine. Tesseract is optimised for
ease of contributions and
On Thu, 1 Mar 2018, Jim Idle wrote:
Malicious RTF files take advantage of the fact that Microsoft do not
follow their own RTF spec. Specifically, Word et al only looks for the
opening sequence:
{\rt
Though the spec says it should be:
{\rtf1
I don't think that Tika can assume that all RTF us
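To make the difference concrete, here is a small sketch comparing a strict and a lenient check of the opening bytes. It assumes the two sequences are `{\rt` and `{\rtf1` (with the RTF backslash); the class and method names are made up for illustration and are not Tika's detection code.

```java
import java.nio.charset.StandardCharsets;

class RtfMagicSketch {
    private static final byte[] STRICT = "{\\rtf1".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] LENIENT = "{\\rt".getBytes(StandardCharsets.US_ASCII);

    /** True if data begins with every byte of prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Check against the opening sequence the spec asks for. */
    static boolean looksLikeRtfStrict(byte[] data) {
        return startsWith(data, STRICT);
    }

    /** Check against the shorter sequence that Word is said to accept. */
    static boolean looksLikeRtfLenient(byte[] data) {
        return startsWith(data, LENIENT);
    }

    public static void main(String[] args) {
        byte[] odd = "{\\rtxyz".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeRtfStrict(odd));  // false
        System.out.println(looksLikeRtfLenient(odd)); // true
    }
}
```

A file that passes only the lenient check is the case under discussion: a strict detector would not call it RTF, yet Word would still open it as such.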
On Sat, 3 Mar 2018, Jean-Nicolas Boulay Desjardins wrote:
I am using this command:
java -classpath /home/$USER/Projects/Lab/tika/classes/ -jar
./tika-app/target/tika-app-1.17.jar
Java ignores -classpath if you also specify -jar
In /home/$USER/Projects/Lab/tika/classes/ I have:
sqlite-jdbc-3.
On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
I am currently playing with Tika to see how it works with regards to
extraction of subfiles.
Do you mean files or resources embedded within another file?
If so... With the Tika App, you want -z to have these extracted. With the
Tika java classes,
On Wed, 18 Apr 2018, Jean-Nicolas Boulay Desjardins wrote:
I converted this RSS XML content to hex:
Then send it to Tika... Tika returns: text/plain
Base 64 encoded XML is no longer valid XML, so this is as expected.
Why am I not getting the rss mime type?
You need to send Tika the real
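A short sketch of why the encoded form is not recognised, and of decoding before detection. The `startsLikeXml` check is a crude, hypothetical stand-in for a real detector, and the sample RSS string is invented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DecodeBeforeDetectSketch {
    /** Turns the Base64 text back into the original bytes. */
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    /** Crude stand-in for a detector: does the content begin with an XML declaration? */
    static boolean startsLikeXml(byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 64), StandardCharsets.UTF_8);
        return head.trim().startsWith("<?xml");
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><rss version=\"2.0\"></rss>";
        String encoded = Base64.getEncoder()
                .encodeToString(xml.getBytes(StandardCharsets.UTF_8));

        // The encoded text is just letters and digits, so it looks like plain text.
        System.out.println(startsLikeXml(encoded.getBytes(StandardCharsets.UTF_8))); // false
        // The decoded bytes are the real document.
        System.out.println(startsLikeXml(decode(encoded)));                          // true
    }
}
```

The same applies to any other text encoding of the file: the detector needs the original bytes, not their encoded representation.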
On Thu, 19 Apr 2018, AJ Weber wrote:
But I can't find that jar anywhere in any of the download areas. (I
don't know why, but my maven isn't working properly.)
You need to use Maven / Gradle / Ivy to fetch it, and everything it
depends on
Can someone point me to the location of such a jar an
On Mon, 23 Apr 2018, lewis john mcgibbney wrote:
Using the tika-server, I am having issues parsing the attachment ENVI hdr
file at [0] with the EnviHeaderParser [1].
Is there any way I can explicitly force execution of the EnviHeaderParser?
I think not directly on a per-request basis. All the
On Tue, 4 Sep 2018, Tucker Barbour wrote:
I've exported a GMail archive in MBOX format using takeout.google.com. The
MBOX archive also includes GChat messages. However, the GChat messages do not
include a Date header. Instead the date sent is included in what appears to
be a non-conforming RFC
On Wed, 17 Oct 2018, Tim Allison wrote:
This is one of the limitations of a streaming write. As I look at
the code of the MP3Parser, I _think_ it would be trivial to write the
metadata before writing any content, and it wouldn't get in the way of
a streaming parse because the parser reads the wh
On Wed, 16 Oct 2019, Eric Pugh wrote:
I’m looking at running Tika Server mode in a Linux box (and sorry, I
don’t know the specific flavour….). Is there a nice service script to
deal with bringing Tika back up if the Linux box is restarted?
Are you using a systemd-based linux, or a different one,
On Tue, 12 Nov 2019, Katsuya Tomioka wrote:
I'm having trouble accessing encoding detectors in OSGi with Tika 1.22.
AutoDetectParser returns "Failed to detect the character encoding of a
document" for non-Latin text. We are migrating from 1.10, I'm sure many
things are different. It seems like
On Fri, 3 Jan 2020, Mike Dalrymple wrote:
I've just started using Tika to process PDFs with embedded images. I'm
getting fantastic results but I'm having to post-process the generated
XHTML to correct the value of the src attribute on the img elements.
That is expected. A simple sax handler sh
On Mon, 20 Apr 2020, Bradley Beach wrote:
I have tried every permutation of adding sqlite-jdbc-3.30.1.jar to my
classpath but still get:
java -classpath ".:sqlite-jdbc-3.30.1.jar" -jar tika-server-1.24.jar
--host=localhost --port=12345
You can't combine -classpath and -jar, you have to use on
On Wed, 22 Apr 2020, Tim Allison wrote:
Y. Agreed. Where should we document this? Where would you look for it?
The Tika Server and Tika App both get a fair bit of use from non-Java devs
Maybe we need a quickstart for non-Java folks section, and probably a
python-specific one as we get loads o
On Wed, 11 Nov 2020, nensick wrote:
I am exploring the available features and I managed also to extract
Office macros but I still don't find a way to get the links.
Imagine having a PDF or a DOCX in which you have a "click here" text as a
link pointing to a website (let's say example[.]com). Ho
On Tue, 22 Dec 2020, Peter Kronenberg wrote:
I'm trying to detect the mimetype of a file using both
Tika.detect(InputStream)
and
Tika.detect(File)
I get 2 different results. I'm testing with a Microsoft Word (.doc) file.
The InputStream one is based on just the first few kb of the file. That
On Tue, 22 Dec 2020, Peter Kronenberg wrote:
Oh, so reading the stream doesn't read the whole file?
Not for Detect, no. The assumption is that Detect is normally followed by
Parse, so you won't want the Stream consuming, so we do a mark/reset to
check the first few kb only
I know for Office
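The mark/reset pattern mentioned above can be sketched with plain java.io. This is an illustration of the general technique only, not Tika's detector code; the class name, the `peek` helper and the 5-byte limit are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PeekSketch {
    /** Reads up to limit bytes from the current position, then rewinds the stream. */
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        try {
            in.mark(limit);
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
                total += n;
            }
            in.reset(); // the caller sees the stream as if nothing had been read
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Peeks at the first 5 bytes, then shows the next read still starts at byte 0. */
    static String demo() {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream("hello world".getBytes(StandardCharsets.US_ASCII)));
        String head = new String(peek(in, 5), StandardCharsets.US_ASCII);
        try {
            return head + "|" + (char) in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints hello|h
    }
}
```

Because only the marked prefix is buffered, a check like this cannot see anything stored later in the file, which is one reason a stream-based and a file-based detect can disagree.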
On Wed, 23 Dec 2020, Peter Kronenberg wrote:
But yet, if I understand correctly, using a TikaInputStream *will* spool
the entire stream to disk so it can read everything, right? If I
re-read the stream to parse, is it making 2 passes?
TikaInputStream has logic in it to dump the stream to a temp
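As a rough picture of what spooling buys you, the sketch below copies a one-shot stream to a temp file once and then reads that file twice. It uses only java.nio and is not how TikaInputStream is implemented; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SpoolSketch {
    /** Copies the whole stream to a new temp file and returns its path. */
    static Path spool(InputStream in) {
        try {
            Path tmp = Files.createTempFile("spool-", ".tmp");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Spools once, then reads the same file for a "detect" pass and a "parse" pass. */
    static String demo() {
        Path tmp = spool(new ByteArrayInputStream(
                "same bytes".getBytes(StandardCharsets.US_ASCII)));
        try {
            String detectPass = new String(Files.readAllBytes(tmp), StandardCharsets.US_ASCII);
            String parsePass = new String(Files.readAllBytes(tmp), StandardCharsets.US_ASCII);
            return detectPass + "|" + parsePass;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            tmp.toFile().delete(); // temp files need cleaning up by whoever created them
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints same bytes|same bytes
    }
}
```

The original stream is consumed only once; every later pass goes to the file, which is the property the question is asking about.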
On Wed, 23 Dec 2020, Peter Kronenberg wrote:
Best is to wrap as a TikaInputStream, detect using all the detectors
via DefaultDetector, then parse after that.
But sometimes the detect will read the whole file, right? For example,
for Word. So is it then making 2 passes?
Nope, we stash the
On Mon, 28 Dec 2020, Peter Kronenberg wrote:
For the metadata that comes back from a parse (example below), clearly,
the fields are dependent on the file type and information available.
Are there any 'standard' fields that come back for all/any files? Such
as Author, date, x-parsed-by, etc. I
On Thu, 31 Dec 2020, Peter Kronenberg wrote:
I've got Tika working with Tesseract on PDF files, but it seems that if
I give it a PDF file that has both searchable text and images, the text
is OCRed twice.
Is this a PDF where some other tool has already done the OCR and stored
the text it foun
On Thu, 11 Feb 2021, Tim Allison wrote:
I can replicate this on my windows laptop.
The weird thing is that the image file is actually there and if I pause the
debugger at the point after imagemagick has complained that the file isn't
there but before Tika does the clean up,
Windows is funny ab
On Tue, 23 Feb 2021, Peter Kronenberg wrote:
I was re-reading some emails with Nick Burch back around Dec 22-23 and
maybe I mis-understood him, but it sounds like he was saying that
TikaInputStream was smart enough to automatically spool the stream to
disk to allow re-use.
If a parser knows
On Fri, 26 Feb 2021, Peter Kronenberg wrote:
For most audio files, using the AudioParser, the buffer is still at the
beginning. Even though there is no text extraction, I would think that
Tika still needs to read through the stream. The MP3Parser consumes the
stream, but the MP4Parser does not
On Mon, 1 Mar 2021, Peter Kronenberg wrote:
But the issue is that different parsers return the stream in different
states. Sometimes the stream is all used up (although not closed). And
other times, the stream has been re-set to the beginning where it can be
re-used. Is this expected behavior
On Mon, 1 Mar 2021, Tim Allison wrote:
detectors should return the stream reset to the beginning.
I agree - needs to be ready for the parser to then process
Parsers, IIRC, should return the stream fully(?) read but not closed.
Not always - if the parser wanted a File then it may not have to
On Sat, 6 Mar 2021, Subhajit Das wrote:
But, the fonts and packages are not available on RHEL, as those are
Debian packages.
Please suggest alternate option to setup all supported fonts and
packages on RHEL.
Without a RHEL support login I can't be sure if these help or not, but I'd
suggest
On Mon, 15 Mar 2021, Subhajit Das wrote:
It seems that TikaServer 1.25 header like “X-Tika-PDFOcrStrategy” is
case sensitive.
Yes. That's because those then get mapped onto underlying Java classes and
methods, which are case sensitive
According to:
https://stackoverflow.com/questions/525897
On Wed, 14 Apr 2021, Peter Kronenberg wrote:
Anyone have any thoughts on this?
I think both an absolute and a percentage would be good, but I don't have
enough experience to comment on your suggested numbers for those two
thresholds, sorry!
Your idea on best vs fast touches on much older di
On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote:
UNSUBSCRIBE
To unsubscribe from the Apache Tika users list, send an email to
user-unsubscr...@tika.apache.org and then reply to confirm. This info is
also included in every email
Nick
On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote:
Thanks, but that info is not in the individual e-mails...I checked for
that.
Hmm, that might be an issue with your email client. Every list message has
this in the headers
Mailing-List: contact user-h...@tika.apache.org; run by ezmlm
On Tue, 13 Apr 2021, Subhajit Das wrote:
The Tika Docker image (full) uses ‘ttf-mscorefonts-installer’. The
licence used by it is a Microsoft licence and doesn’t seem to allow
commercial use.
Can anyone please confirm if it is OK to use? Or should a customized
version be used for production?
On Sat, 17 Apr 2021, Lewis John McGibbney wrote:
Please point me to the code for the ‘ttf-mscorefonts-installer’.
The bit of the Tika docker file that pulls them in is:
https://github.com/apache/tika-docker/blob/master/full/Dockerfile#L21
I think the EULA (which we auto-accept during installat
On Thu, 27 May 2021, Cristian Zamfir wrote:
I am running some stress tests of the latest tika server docker (not
modified in any way, just pulled from the registry) and seeing that after a
few hours I see OOM in the logs. The container has a limit of 4GB set in
K8S. I am wondering if you have any
On Wed, 2 Jun 2021, Cristian Zamfir wrote:
1. Do you have a recommendation for a stress test that would allow me to
easily test OOM behavior?
Depends what kind of OOM you're interested in. If you fire a lot of
memory-hungry documents at a single server at once, you can trigger an
OOM. Alterna
On Thu, 10 Jun 2021, Cristian Zamfir wrote:
It would be nice if this was feasible via the headers of each request. I
find it more convenient to use if/else in my code than in the yaml files
used for k8s configuration. Is there such an option?
Three options, see
https://cwiki.apache.org/conflu
On Thu, 10 Jun 2021, Cristian Zamfir wrote:
Thanks Nick. Looks like the option I was looking for is the 3rd one, but
the docs say it is only available in Tika 2.x - am I right?
I've just done a grep of the codebase, and it isn't in the 1.x branch,
only main = 2.x. So, Tika 2.x only
Nick
On Thu, 10 Jun 2021, Cristian Zamfir wrote:
Got it, thanks. What are your thoughts on using Tika 2.x while still in
beta? Is it likely to be more stable than 1.26? I presume it has passed
the same extensive test suite.
Usage stability wise, it's as good as 1.x.
API stability wise things are s
On Fri, 11 Jun 2021, Cristian Zamfir wrote:
I think for most people it would be quite critical to have logs working. Do
you happen to know how I can reach out to the person maintaining the docker
images https://hub.docker.com/u/dameikle to see if they are available to
update the images? Sounds li
On Thu, 22 Jul 2021, David Pilato wrote:
TL;DR: the created date of the document changes depending on the timezone.
That does seem a bug
For example:
• Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z
• Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z
• Europe/Stockholm gives
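One way results like the two listed above can arise, offered as a guess at the mechanism rather than a diagnosis of the Tika code: a timestamp stored without a zone is interpreted in the JVM's default zone and then rendered as UTC. The local time of 10:38 on 2016-07-07 is an assumption inferred from the two quoted values, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class ZoneDriftSketch {
    /** Interprets a zone-less timestamp in the given zone and converts it to UTC. */
    static Instant interpret(LocalDateTime local, String zone) {
        return local.atZone(ZoneId.of(zone)).toInstant();
    }

    public static void main(String[] args) {
        // Assumed stored value: no zone information attached.
        LocalDateTime stored = LocalDateTime.of(2016, 7, 7, 10, 38);

        // The same stored value yields a different UTC instant per zone.
        System.out.println("Asia/Sakhalin -> " + interpret(stored, "Asia/Sakhalin"));
        System.out.println("Asia/Colombo  -> " + interpret(stored, "Asia/Colombo"));
    }
}
```

With Asia/Colombo at UTC+05:30 the second line comes out as 2016-07-07T05:08:00Z, matching the value quoted above. If this is the cause, attaching an explicit zone (or treating the value as UTC) at the point where the date is read would make the output independent of the machine's settings.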
On Fri, 27 Aug 2021, Peter Kronenberg wrote:
When Tika extracts from a Microsoft Word document, deleted text is
extracted, with no indication that it is deleted. In fact, if a word
was deleted and replaced by another word, both words just show up
side-by-side. Is there a way to get some sort
On Thu, 21 Oct 2021, nskarthik wrote:
Question : Need to extract Text / images at page level using java.
Did not find any example on www or Tika website.
For PDF, you should fetch the contents as XHTML rather than plain text.
You can then split on the page divs. This isn't available for forma
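A sketch of splitting on page divs with a plain SAX handler. It assumes each page arrives wrapped in `<div class="page">`, which should be checked against the XHTML your Tika version actually emits; the sample document, the class name and the `split` helper are invented for illustration, and the JDK's own SAX parser stands in for Tika as the event source.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PageSplitSketch extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int depth;             // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new page starts here
            depth = 1;
        } else if (current != null) {
            depth++;                       // nested div inside the page
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "div".equals(qName) && --depth == 0) {
            pages.add(current.toString().trim()); // the page div has closed
            current = null;
        }
    }

    /** Parses the XHTML and returns the text of each page div, in order. */
    static List<String> split(String xhtml) {
        try {
            PageSplitSketch handler = new PageSplitSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.pages;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>one</p></div>"
                + "<div class=\"page\"><p>two</p></div>"
                + "</body></html>";
        System.out.println(split(xhtml)); // prints [one, two]
    }
}
```

In real use the same handler could be passed to the parse call directly, so the XHTML never needs to be materialised as a string.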
On Thu, 10 Feb 2022, Willy T. Koch wrote:
As for content detection, today the content-type field with mime type is
returned. What we would need is a mime-type to file extension lookup and
it seems logical that this was also returned by Tika.
How are you calling Tika? We already have APIs for t
On Thu, 10 Feb 2022, Willy T. Koch wrote:
…and calling it as a webservice with Postman/curl.
Ah, I think we might not be exposing the full details of the mime types
via the server, only details of their parsers and the hierarchy, eg
http://localhost:9998/mime-types#audio/vorbis
(We have that
On Thu, 10 Feb 2022, Nick Burch wrote:
On Thu, 10 Feb 2022, Willy T. Koch wrote:
…and calling it as a webservice with Postman/curl.
Ah, I think we might not be exposing the full details of the mime types via
the server, only details of their parsers and the hierarchy, eg
http://localhost
On Tue, 22 Feb 2022, Tim Allison wrote:
I guess the question is how far do we want to bake this in? I could see
adding a field for the default extension in the
CompositeDetector/DefaultDetector. This would then be triggered on
embedded files, too. I can't imagine this would add much cost
co
On Thu, 24 Feb 2022, Tim Allison wrote:
A separate endpoint, then? That would be cleaner.
We already have some mime details related endpoints, would be an extension
or related endpoint to those, see earlier-thread:
https://lists.apache.org/thread/jlym8ypnrj978hmzjgvkc1fpxnc7g51h
Nick
On Fri, 18 Feb 2022, Willy T. Koch wrote:
On Thu, 17 Feb 2022 at 20:00, Nick Burch wrote:
Tika devs - any thoughts on this? It's a pretty small code change (we
already have the data on the mime type!), just need feedback on extending
the existing API vs adding a new one
By also retu
On Tue, 8 Mar 2022, Willy T. Koch wrote:
That’s fantastic, thank you!
Looking forward to testing when the Tika Docker repo is updated with
this release.
That may take a few weeks, but if you don't mind building Tika from
source, you should be able to give it a whirl now. (As far as I'm aware
On Tue, 26 Apr 2022, Stephen H wrote:
Second, there seems to be some work missing in the handling of metadata
from certain parsers when using ForkParser. For example, for
OpenDocument ODP and ODS files and Microsoft Open XML formats, while the
document text is returned there is no metadata in e
On Tue, 26 Apr 2022, Stephen H wrote:
On 26/04/2022 12:22, Nick Burch wrote:
Are you able to write a short junit unit test case which shows this issue?
We have a bunch of small test OOXML and ODF files that could be used
I've done this - if I create an issue in Jira with it would that
On Fri, 3 Jun 2022, Cihad Guzel wrote:
I want to pass the content's words through some filters while parsing in
Tika. How can I add custom filtering?
Does the content handler work for this? Is there a document about this?
A custom content handler is a pretty good way to do that. Tika just use
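A minimal sketch of filtering words inside a content handler. The handler below is an ordinary SAX DefaultHandler driven by the JDK's SAX parser as a stand-in for Tika; the class name, the blocked-word set and the sample input are hypothetical.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WordFilterSketch extends DefaultHandler {
    private final Set<String> blocked;
    private final StringBuilder pending = new StringBuilder(); // text not yet filtered
    private final StringBuilder out = new StringBuilder();     // filtered result

    WordFilterSketch(Set<String> blocked) {
        this.blocked = blocked;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one run of text in several chunks, so buffer it
        // and only split into words at element boundaries.
        pending.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    /** Moves buffered words to the output, dropping any that are blocked. */
    private void flush() {
        for (String word : pending.toString().split("\\s+")) {
            if (!word.isEmpty() && !blocked.contains(word.toLowerCase(Locale.ROOT))) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(word);
            }
        }
        pending.setLength(0);
    }

    static String filter(String xhtml, Set<String> blocked) {
        try {
            WordFilterSketch handler = new WordFilterSketch(blocked);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Set<String> blocked = new HashSet<>(Arrays.asList("the", "brown"));
        String xhtml = "<html><body><p>The quick brown fox</p></body></html>";
        System.out.println(filter(xhtml, blocked)); // prints quick fox
    }
}
```

Buffering until an element boundary is a deliberate choice here: filtering inside characters() directly is simpler but can cut a word in half when the parser splits a text run.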
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
I am currently trying to validate our Tika setup and was looking for a
set of example data I could use
If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. Main set are in
tika-pa
On Thu, 1 Sep 2022, Mark Kerzner SHMsoft, Inc. wrote:
Yes, please. If I make some changes, I will start with small ones. I will
also verify them with you.
Great, thanks in advance for your contributions!
Can you please head to https://cwiki.apache.org/confluence/display/tika/ ,
click Sign Up
On Thu, 29 Sep 2022, Peter Conrad wrote:
thanks. That's definitely an improvement. But I think it's not
sufficient.
AFAICS your code uses "aliases" as in "if it's type X then it can also
be type Y". However there's also cases where a specific instance of
type X can also be type Y but not all ins
On Wed, 26 Oct 2022, Tim Allison wrote:
I've been struggling with this too. Outside of Docker, what I've been
doing is using a bin/ directory and throwing everything in there and then
starting tika-server: java -cp "bin/*"
org.apache.tika.server.core.cli.TikaServerCli ...
If we moved to that mo
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
I am using the default configuration. I think, we could reduce my
problem to following code snippet:
Is there a reason that you aren't using one of the built-in Tika content
handlers? Generally they should be taking care of everything for you with
On Thu, 5 Jan 2023, Georg.Fischer wrote:
The tika.jar has >54 MB, and I suspect that the loading of the big jar
(under Windows) is hindering the performance. I should perhaps move to
Linux, or try the Tika server.
The Tika App jar has always been the "kitchen sink included quickstart"
option
On Mon, 6 Mar 2023, Chris Bamford via user wrote:
From both performance and thread safety points of view what is the best
approach for the use / reuse of the following objects:
Tika
ParseContext
Parser
Metadata
The Tika object and/or TikaConfig object should only be created once and
then re-
On Wed, 22 Mar 2023, Tim Allison wrote:
Thank you, Richard, for raising this. In looking at these file
formats, it looks like crw is based on ciff, cr2 is based on tiff and
cr3 is based on quicktime.
Always fun when the core of a format (or at least the container) swaps
between versions!
F
On Fri, 28 Apr 2023, שי ברק wrote:
Inside the container probably - makes more sense to me
In that case, create a custom Docker container that adds in your custom
config to your Docker image, as per Konstantin's instructions:
https://lists.apache.org/thread/l0od2b6tp6odyd661ftjqmkkf27o6hdl
Th
On Fri, 28 Apr 2023, שי ברק wrote:
I don’t know if it’s possible but I’m trying to avoid typing this ‘--config’
config’ when I start the container. I wish to have all of these settings
to be written inside the Dockerfile.
Since you're doing your own custom docker container, you could override
the E
On Tue, 20 Jun 2023, Neha Kamat via user wrote:
I am currently working on an application wherein I would like to
whitelist the filetypes supported by TIKA and discard the rest of the files
to avoid unknown behaviour/memory leaks. I am currently referring to
https://cwiki.apache.org/confluence/displ
On Thu, 3 Aug 2023, Cristian Zamfir wrote:
I am interested in trying out Tika with a different OCR engine and
wondering how Tesseract is integrated.
Largely as "just another parser", but IIRC with a bit of logic to allow
the "normal" image parsers to also have a go at the file to grab metadata
On Wed, 29 Nov 2023, Neha Kamat via user wrote:
We are currently using TIKA for parsing/extracting content from pst
files. Is there a way we can tell parsing engine to parse as list of
emails instead of string of emails?
Depends how you're calling Tika?
Tika App? Tika Server? Python Wrapper? J
On Fri, 26 Apr 2024, Mauler, David wrote:
I'm in the process of troubleshooting an issue with certain mp4 video
files and tika. After a bunch of digging, it appears to be related to
whatever ISO is set for the mp4 file. An mp4 with an ISO of
14496-12:2003 will be detected as video/quicktime but
On Wed, 23 Jun 2010, McGibbney, Lewis John wrote:
and have received the following output when running tests on Apache Tika
Parsers
Try looking in tika-parsers/target/surefire-reports/
There should be text and xml files for each test, which will describe
what's gone wrong. Hopefully they'll p
On Wed, 23 Jun 2010, Mango wrote:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 42
at
javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(Unknown
Source)
This looks like a fault in the core java rtf parser :/
Do you know how your rtf file was created?
I
On Thu, 24 Jun 2010, Mango wrote:
From what I can tell is that the rest of .rtf files are parsed without
a problem. Only those with embedded Visio diagrams create problems.
Weird thing is I just tried parsing a .doc document with embedded .vsd
and it was parsed without a problem.
One test wo
On Mon, 28 Jun 2010, Jana, Kumar Raja wrote:
We use Apache Tika in our application before sending the content to Solr
for Indexing. Some of our documents are pretty large (over 150 MB in
size with "only text" content over 30 MB).
What file formats are these in?
There are some file formats (eg
On Tue, 3 Aug 2010, Chris Bamford wrote:
BTW I now get a runtime error with POI when trying to extract text from
a Corel presentation (.shw) file - which seems to get classified as an
MS Office doc?!
We don't currently seem to have any .shw files in the test suite, so it's
possible that thing
On Tue, 3 Aug 2010, Chris Bamford wrote:
...
Any chance you could create some sample files, and upload them to jira? We
can use them as a basis for future unit tests
Yes sure - please point me at the URL
https://issues.apache.org/jira/browse/TIKA
Exception in thread "main" java.lang.NoSuch
On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
When testing content and metadata extraction with Tika, I ran into the following use cases:
- Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
MSOffice.CREATION_DATE)
Date returned as String, but format is different for different document
types. Probably you alread
On Fri, 24 Sep 2010, Mattmann, Chris A (388J) wrote:
by our very own Nick Burch! :)
See here: http://s.apache.org/JMu
Glad you like it :)
I'll hopefully do another few posts about Tika in the next week or so, but
they'll be more about fine grained control of how Tika in Alfr
On Wed, 29 Sep 2010, Grant Ingersoll wrote:
IPTC (image)
There is support for this in Staffan Olsson's git fork (see TIKA-482). I'm
hoping Staffan will submit an updated patch of this shortly which we'll
then be able to apply :)
XMP (image/video) -- yes, AFAICT
We can generate XMP metad
On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote:
Here's an open source Java implementation of a decompressor:
http://www.freeutils.net/source/jtnef/rtfcompressed.jsp
Alas that's under the GPL, so can't be used in an official distribution of
Tika. (You can use it yourself if you want though,
On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote:
We could implement the decoder without distributing tnef.jar, using
Class.forName() and simply disabling the decoder if the jar is not on
classpath? Then it is up to the user to download the jar and thereby
accept the GPL license.
I've got a
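The optional-dependency idea proposed above can be sketched as a classpath probe. The class name being probed is made up for illustration; nothing here is taken from Tika or from the jtnef library.

```java
class OptionalDecoderSketch {
    /** True if the named class can be found on the classpath, without initialising it. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, OptionalDecoderSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Jar missing (or broken): treat the feature as disabled.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical class name for a decoder shipped in a separate, optional jar.
        String decoder = "example.optional.RtfDecompressor";
        if (isAvailable(decoder)) {
            System.out.println("optional decoder found, enabling it");
        } else {
            System.out.println("optional decoder not on classpath, skipping");
        }
    }
}
```

The calling code must reach the optional class only through reflection or a separately loaded adapter; a direct reference would fail to link when the jar is absent.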
On Fri, 8 Oct 2010, Jan Høydahl / Cominvent wrote:
My question was for a very specific usecase which is easy to do by a
small source code modification but perhaps harder to do with
configuration only.
Looking at the AutoDetectParser source code, the last parser registered
for a given mime typ
On Fri, 8 Oct 2010, Jan Høydahl / Cominvent wrote:
Magic is most often great, but I generally prefer to have some way of
explicitly telling the software what to do :)
That's very much available to you! See the different constructors to the
AutoDetectParser for examples of how to control what d
On Thu, 21 Oct 2010, qubit wrote:
When translating a text file -- file.txt -- through tika and looking at the
raw output, tika is essentially inserting no markup for line breaks or
paragraphs.
Most of the logic in TXTParser is around languages and types, there's not
much on the markup
Also,
On Fri, 22 Oct 2010, qubit wrote:
Thank you for your reply -- I will look into making the patch; it will get
me immersed in the code so I understand it better.
The code you probably want to look at is TXTParser in the tika-parser
package. The parser quickstart guide at
http://tika.apache.org/0
On Fri, 5 Nov 2010, Roland Cornelissen wrote:
Caused by: java.io.IOException: Unable to read entire block; 1 byte
read; expected 512 bytes
at
org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)
This is normally caused by truncated files. However, it might be worth
trying with a
On Sat, 6 Nov 2010, Shay Banon wrote:
Just wanted to check in and see if this has progressed since I last
asked?
There has been some, I'd suggest you try a recent SVN checkout, then open
a JIRA if you spot any more cases where two parsers give different
responses for the same effective input
On Fri, 17 Dec 2010, Shaun Cutts wrote:
Caused by: java.lang.NullPointerException
at
com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString(ToStream.java:1962)
at
com.sun.org.apache.xml.internal.serializer.ToStream.processAttributes(ToStream.java:1942)
This doesn'
On Mon, 20 Dec 2010, Shaun Cutts wrote:
As you are being used for scraping purposes, however, you should
probably be able to read anything excel can write, including
inconsistent unicode. (If it is inconsistent -- I note that I don't
receive a "processingInstruction" callback to write the docum
On Mon, 20 Dec 2010, jason.holmb...@emc.com wrote:
Just starting to use Tika 0.8 in conjunction with DokuWiki, and I
noticed the dependency on Bouncy Castle through PDFBox. Is it possible
to remove this dependency, given that we're not using Tika for any
encryption purposes?
You may be surpri
On Tue, 21 Dec 2010, Shaun Cutts wrote:
ok, but when I call parse, then my ContentHandler.characters()
callback gets a char [], and this is passed as:
(Pdb) ch
array('c', '\xa9 2010 Crane Data LLC. All rights reserved.')
so when I try unicode I get an error:
(Pdb) ch.tounicode()
*** ValueE
On Wed, 22 Dec 2010, Shaun Cutts wrote:
Ok -- I have rewritten my simple output formatter in java (see below).
The unicode problems were perhaps java/python problems as you suspected.
The real problem is that tika passes an "Attributes" which has a null
value returned by getValue().
Are you a
On Wed, 9 Feb 2011, Dmitrii Dimandt wrote:
When I convert this file to a text format, I get this:
01/10/10 40,452
the 10th of January 2010 is about 40,450 days from 1st of January 1900,
which is how all excel dates actually get stored internally. If you
take a date cell and reformat it
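For reference, the serial number can be turned back into a calendar date with java.time. This sketch assumes the workbook uses Excel's 1900 date system (not the 1904 one) and a serial above 60; the class and method names are hypothetical.

```java
import java.time.LocalDate;

class ExcelSerialSketch {
    // In the 1900 date system serial 1 is 1 Jan 1900, but Excel also counts a
    // non-existent 29 Feb 1900. Starting from 30 Dec 1899 absorbs that extra
    // day for every serial after it.
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    /** Converts a whole-day serial number (above 60) to a calendar date. */
    static LocalDate toDate(long serial) {
        return EPOCH.plusDays(serial);
    }

    public static void main(String[] args) {
        System.out.println(toDate(40452)); // prints 2010-10-01
    }
}
```

So 40,452 corresponds to 1 October 2010, which is 01/10/10 when read day-first.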
On Wed, 9 Feb 2011, Dmitrii Dimandt wrote:
So I guess that the problem is probably inherent in Excel itself and I
pity Apple's developers for getting this right.
Because, on top of all things, 01/10/10 (which is October 10th, 2010
over here in Europe) is 40450 days from January 1st, 1900.
The
On Mon, 28 Mar 2011, Roberto Martelloni wrote:
I'm trying to find the list of all supported file types in Tika 0.9.
Can anyone suggest where to find it?
Easiest way is to ask tika-app:
java -jar tika-app-0.9.jar --list-parser-details
That'll give you back the list of all the avai
On Mon, 28 Mar 2011, Withanage, Dulip wrote:
We are interested in extracting the raw metadata (not formatted in XHTML
as SAX events) from the files using tika.
Generally speaking, all of the metadata that is extracted is placed into
the Metadata object you supply when parsing. The SAX events a
On Mon, 28 Mar 2011, Withanage, Dulip wrote:
thank you for your prompt help, that looks promising.
Here is my use case.
1. generate the tika app using the source.
2. integrate it into a third-party application
3. use the tika extraction functionalities for images.
4. I have attached the image, it'