On Wed, 16 Jan 2013, Cedric Meury wrote:
A) Why does Tika only support 3-gram profiles? In the code, the legacy
format is even referenced in comments (LanguageProfileBuilder):
It looks like wherever the code came from had made that change. Sadly,
there's no issue number with the commit:
On 04/01/13 20:00, Jack Park wrote:
A two-column scientific paper.
The PDF parser has a few options that can be set, to control how some
aspects of the parsing are done. Sorting text by position is one of
them, which makes the parsing take a little longer, but will often
improve accuracy on
On Mon, 7 Jan 2013, Maciej Liżewski wrote:
I am using tika with Apache Solr. What I need to achieve is to process
all images with provided external parser instead of default image/jpeg
parser. In general this is all about some external OCR software.
Your best bet is to include a parsers
On Mon, 7 Jan 2013, Maciej Liżewski wrote:
And is there some default parser to recursively index all files in archive?
You can just use AutoDetectParser, if you don't need any special handling.
I think a lot of people have a small custom parser that outputs some
special markup / flags it in
On 04/01/13 12:09, Maciej Liżewski wrote:
1. does tika recursively fetch content from archives (zip, rar, etc)?
If you ask it to. You need to attach the parser you want to use for
recursion to the ParseContext, and it'll be called for any embedded
resources. (If you want, you can give your
On Sat, 8 Dec 2012, Lewis John Mcgibbney wrote:
We use Tika 1.2 over in Nutch, I wonder what kind of support Tika has
for parsing .zip files and whether someone can comment on whether I can
work towards dropping the legacy parser for Nutch?
Tika has pretty good support for archive formats,
On Wed, 28 Nov 2012, samir pendharkar wrote:
1) When header/footer gets extracted as text, it also includes what seems
like formatting information/metadata. Example -
? DATE \@ MM/dd/yy ?09/16/12? extracted in the text
Actual document only shows 09/16/12 in the footer
Looks like some date
On Wed, 21 Nov 2012, Juha Haaga wrote:
Caused by:
org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException:
unsupported feature encryption used in entry …
Is this error caused by lack of password or lack of zip decrypting
functionality? Is it possible to provide the zip file
On 18/11/12 16:33, Jason Judge wrote:
I was using libgcj 4.4.6, which seems to be the latest for CentOS 6.3,
so far as I can see (in rpm format, at least).
I've installed openjdk:
# yum install java-1.7.0-openjdk
and that works a treat - fast too. Demo page here:
r
On Fri, 9 Nov 2012, Norman M wrote:
I am using Apache Tika to extract text from PPT/PPTX files.
Is Poi really using streaming to parse files?
Some bits. xls file processing is stream based, for ppt the whole file
gets processed and then the text parts are located and picked out.
File file
On Sat, 27 Oct 2012, goog cheng wrote:
the file is in memory, i have to save it in disk and then fetch it back
again?
Some Tika parsers only work with files, so if you don't stream it to disk
yourself, Tika will. Otherwise, you could always stream it to Tika on stdin?
Nick
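To illustrate what "streaming it to disk" amounts to, here is a minimal sketch that spools in-memory bytes to a temporary file using only the JDK. The class name, method name, and the temp-file prefix and suffix are hypothetical; this is not Tika's own code, just the general idea of handing a file-only consumer a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SpoolToTempFile {
    /** Copies everything readable from the stream into a new temporary file. */
    static Path spool(InputStream in) throws IOException {
        // Hypothetical prefix/suffix; createTempFile already creates the file,
        // so the copy has to be allowed to replace it.
        Path tmp = Files.createTempFile("spool-", ".bin");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] inMemory = "hello from memory".getBytes(StandardCharsets.UTF_8);
        Path file = spool(new ByteArrayInputStream(inMemory));
        try {
            System.out.println(file + " holds " + Files.size(file) + " bytes");
        } finally {
            Files.deleteIfExists(file); // the caller owns the cleanup
        }
    }
}
```

The caller is responsible for deleting the temporary file once the consumer is done with it.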
On 26/10/12 00:52, goog cheng wrote:
in python, an opened file object
And how are you currently calling Tika from Python?
Nick
On Fri, 26 Oct 2012, goog cheng wrote:
Tika supports an input file in the CLI. But if the input is a filestream, is
there a command to do it? Any help would be greatly appreciated!
What do you mean by a filestream, a pipe? Something else?
Nick
On Fri, 28 Sep 2012, David Patterson wrote:
I want to process a maven pom.xml with special code.
I added the following to the existing xml file of mimetypes:
<mime-type type="application/maven-pom">
  <glob pattern="pom.xml"/>
</mime-type>
You'll be much better off adding it to a custom mimetypes
On Thu, 27 Sep 2012, Vigneshwaran wrote:
I am new to Apache Tika. I want Tika to output only the names of the
files within the archive (if the input file is an archive) and the file
content as usual if the input file is not an archive. Is there a way I
can do that?
Yup. Rather than passing
On Thu, 30 Aug 2012, Alex Cougarman wrote:
Hi. Is it possible to specifically extract footer/header and body text
out of a Word document using Solr? In other words, we'd like to
index/store those items in different Solr fields.
As long as they have a suitable style applied, yes Tika will be
On 22/08/12 06:29, Ramachandran, Karthik wrote:
I'm having some trouble with the TNEFParser so I would like to prevent
the AutoDetect parser from using it.
Is there a way to override the default org.apache.tika.parser.Parser to
prevent it from using the TNEFParser?
For a long term fix,
On Fri, 17 Aug 2012, Alexander Cougarman wrote:
I'm using this C# code to call the parser directly via its URL; it
returns JSON:
var url = @"http://localhost:8983/solr/update/extract";
You might have more luck asking on the SOLR lists, as it looks like your
question is with how SOLR
On Thu, 16 Aug 2012, Alexander Cougarman wrote:
Is it possible to return just the raw text of the document extracted by
Tika? In other words, we don't want it in XML or JSON, just the text in
it.
Yes. Are you using the TikaApp jar, calling the Tika facade class, or
calling a parser directly?
On Wed, 18 Jul 2012, rodgersh wrote:
And here is my custom-mimetypes.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="image/nitf">
    <alias type="image/ntf"/>
    <glob pattern="*.nitf"/>
  </mime-type>
</mime-info>
I've no idea about OSGi, so I can't comment on what you need to
On Sun, 1 Jul 2012, Nick Burch wrote:
Tika snapshots are available in the Snapshot Repository:
http://repository.apache.org/snapshots/org/apache/tika/
The current latest tika-app snapshot is:
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.2-SNAPSHOT/tika
On Sun, 1 Jul 2012, Zabrane Mickael wrote:
So no way to avoid loading the file in RAM? That's sad.
Any other advice, guys?
You could try asking on the Apache PDFBox list for advice - Tika's PDF
extraction is all powered by PDFBox
Nick
On Sun, 1 Jul 2012, Jason Judge wrote:
Am I understanding it correctly that tika-server and tika-app are just
two examples of the way tika can be used, and are just thrown together
as a quick-start demo rather than core functionality of the main part of
the project, which is a collection of
On Sun, 1 Jul 2012, Jason Judge wrote:
So, feature requests, command line, or...learn java. It is going to be a
busy Summer :-)
Anyone up for a Tika hackathon weekend in Oxford later this summer? Jason
could hop on a direct train down from Newcastle, and we're only an hour
from Heathrow for
On Wed, 6 Jun 2012, Paulini, Matthew CTR USAF AFMC AFRL/RISA wrote:
When using tika-app-0.10.jar, I receive video/quicktime when passing
the .mp4 byte array. But, tika-app-1.1.jar is throwing an exception
(TIKA-198: Illegal IOException from
org.apache.tika.parser.mp4.MP4Parser@8491b8) from the
On Fri, 18 May 2012, Karthik Deivasigamani wrote:
I wanted to try out tika for our parsing needs. Tried downloading the
tika-app-1.0.jar 1.1 jar file and also build it locally from the src
using mvn. Both seem to give me the same error message as below :
*[karthik@karthik-linux
On Thu, 17 May 2012, Alec Swan wrote:
1. We don't know how to tell if we don't have enough heap space to
process the file and skip the file in this case. Allowing out of
memory errors to take down our process is not acceptable.
In that kind of situation, you should be looking at using something
On Wed, 16 May 2012, Alec Swan wrote:
Memory consumption stays under 90MB which is less than max heap size
(128M). No out-of-memory errors are thrown during test
There is absolutely no way that you're going to be able to parse a PDF,
DOC/DOCX or PPT/PPTX of more than about 20mb in size on a
On Thu, 3 May 2012, Ilya Zavorin wrote:
However, I am having trouble porting any code, to be able to step
through it using a simple wrapper app on Android. Specifically, I am
using Eclipse and having this issue:
On Tue, 17 Apr 2012, Taylor, Wade wrote:
Hi, I'm having trouble detecting a file as application/xml. When I
detect a URL containing XML the detection works and I get
application/xml as the media type.
Hmm, that's odd. I've taken your sample xml, popped it in a new file, and
run java -jar
On Tue, 17 Apr 2012, William Hays wrote:
I believe you answered a different question than what I asked. My
observation was specifically about the AutoDetectParser listing its
supported mediatypes, not about the HTMLParser.
The Tika App uses AutoDetectParser internally, so if it's finding the
On Tue, 17 Apr 2012, Taylor, Wade wrote:
Since I couldn't get that to work I went back to basics and tried a
simple XML string:
new Tika().detect(new ByteArrayInputStream("<?xml version=\"1.0\"
encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes()));
but this gets detected as text/plain too and I
On Mon, 16 Apr 2012, Kevin Miller wrote:
This is with plain Tika 1.1 built via Maven from source downloaded from
Apache's Tika project. I am not familiar with how to adjust what
dependencies Tika builds pull down. If you point me in a direction I'll
give it a try.
In svn, the dependency has
On Tue, 13 Mar 2012, Jon Gorrono wrote:
The tika-app jar properly identifies the misnamed file so it's either a
classpath or an implementation issue
You'll need to have the Tika Parsers jar (and associated dependencies) for
it to work properly. We do have unit tests for this, and as long as
On Fri, 9 Mar 2012, Mark Kerzner wrote:
Standard 1.0 of Tika, with whatever POI is included in it by default
It's probably worth re-testing with the Tika 1.1 release candidate, and
seeing if that fixes it (it has a newer POI version in it)
Nick
On Thu, 8 Mar 2012, Harry Simons wrote:
I tried the BFF Validator, and it is indeed failing!
If you're able to share the error log, that could be helpful
However, the file got created by MS Word only, and I doubt if it's
'corrupt'... since both MS Word and LibreOffice can load it fine
On Thu, 8 Mar 2012, Shalom Ben-Zvi wrote:
I installed tika-bundle-1.0 and tika-core-1.0 into servicemix.
I'm invoking Tika from my bundle, actually a camel route.
I don't know a lot about OSGi, but that might be your issue - you have
some bits of Tika coming from a bundle, and some bits from
On Mon, 27 Feb 2012, Lewis John Mcgibbney wrote:
After compiling I get
[javac] MimeUtil.java:165: incompatible types
[javac] found   :
java.lang.Object&java.io.Serializable&java.lang.Comparable<? extends
java.lang.Object&java.io.Serializable&java.lang.Comparable<?>>
[javac] required:
On Sun, 26 Feb 2012, Lewis John Mcgibbney wrote:
// If no mime-type header, or cannot find a corresponding registered
// mime-type, then guess a mime-type from the url pattern
type = this.mimeTypes.getMimeType(url) != null ? this.mimeTypes
.getMimeType(url) : type;
}
On Tue, 14 Feb 2012, Stephan Mühlstrasser wrote:
https://issues.apache.org/jira/browse/TIKA-527
Is there any documentation of the syntax of the configuration file
available?
You could look at the code that processes the file, but the example in that
JIRA ought to cover most use cases
The
On Wed, 8 Feb 2012, Markus Jelsma wrote:
In Nutch we have a copy of Tika-core. But with just that lib we also
have access to the Tika.parser API from the other module. How does this
all work? I ask because I have had confusing results in the past (and now).
Tika Core comes with the core of Tika,
On Tue, 7 Feb 2012, Public Network Services wrote:
Counting tags only, apparently there are 1,304 different variations of
MIME types there (!), so I would like to map them to, say, a few custom
top-level categories like Office, PDF, Audio, Video, or similar.
Assuming this is not done in Tika,
On Tue, 7 Feb 2012, Public Network Services wrote:
I know about the predefined types in the MediaType class.
I think you might have missed something - what I was referring to was
things like "video" at the start of "video/mp4" being a good way to spot
that it's a video :)
Perhaps we should get
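The idea of using the part before the slash to bucket MIME types can be sketched as below. This is a minimal JDK-only illustration: the class name and the category labels are hypothetical, and it deliberately does not use Tika's MediaType class. Office and PDF formats all sit under application/, so they would still need per-subtype rules, which this sketch lumps into "Other".

```java
import java.util.Locale;

final class MimeCategory {
    /** Returns the top-level type of a MIME string, e.g. "video" for "video/mp4". */
    static String topLevelType(String mime) {
        int slash = mime.indexOf('/');
        String top = slash < 0 ? mime : mime.substring(0, slash);
        return top.trim().toLowerCase(Locale.ROOT);
    }

    /** Maps a MIME string to a coarse, hypothetical category label. */
    static String category(String mime) {
        switch (topLevelType(mime)) {
            case "video": return "Video";
            case "audio": return "Audio";
            case "image": return "Image";
            case "text":  return "Text";
            default:      return "Other"; // application/* needs per-subtype rules
        }
    }

    public static void main(String[] args) {
        System.out.println(category("video/mp4"));
        System.out.println(category("application/pdf"));
    }
}
```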
On Fri, 27 Jan 2012, Public Network Services wrote:
I had a look at the MIME types list and there are 50 different Office
formats, including many for Microsoft Word/Excel/Powerpoint!
Yup, there are quite a few different formats (with and without macros,
normal and templates etc), and they
On Fri, 20 Jan 2012, hpvpl wrote:
I've tried to checkout the code and recompile but face pom issue when
doing the mvn clean install
That should work just fine, there shouldn't be any issues with compiling
from trunk
Nick
On Fri, 20 Jan 2012, Allison, Timothy B. wrote:
I'm just getting started with Tika, and I tried the basic
AutoDetectParser and the basic ParsingReader on a batch of a few
thousand docx files (tika-app v1.0). On my laptop, I was able to
extract text at a rate of 200 docs per minute. When I
On Thu, 19 Jan 2012, hpvpl wrote:
When I parse a Mp3 source I have a problem with the last character of the
album, Author and artist metadata. I get a ? character at the end of the
metadata.
Can you try with a recent nightly build? Only a problem like that was
fixed recently
Nick
On Thu, 5 Jan 2012, ola nowak wrote:
Should java -jar tika-app.jar -list-parsers list it?
Nope. The service loading isn't magic - it won't go and find random jars
that you haven't told it about!
You'll instead need something like:
java -classpath MyParser.jar:tika-app-1.1-SNAPSHOT.jar
On Mon, 2 Jan 2012, Albretch Mueller wrote:
How can someone know that the heading for a PDF file corresponds to the
heading of an MS Word and/or RTF file or the title on an HTML file
corresponds to the title of a media file?
They can't - both formats allow you to make something look like a
On Sat, 31 Dec 2011, Albretch Mueller wrote:
I think all major file formats should be somehow functionally
specified through something like
~
core.tika.formatHandlers.getAll[DefinedFormat]Handlers
In code:
TikaConfig config = TikaConfig.getDefaultConfig();
Set<MediaType> supported =
On Fri, 30 Dec 2011, ola nowak wrote:
I've added my parser to the list but I don't know how to explicitly tell
AutoDetectParser to use my parser.
You probably need to do two steps:
* Add a custom mimetypes entry that detects your special XML files as a
suitable (probably new) mimetype
* Have
On Thu, 8 Dec 2011, Andrzej Bialecki wrote:
I guess that could work, but it would be very messy - I would have to
keep a list of all potentially interesting mime types in my code, which
is difficult to maintain.
Or a list of interesting parsers in your other case!
It would be much better if
On 05/12/11 21:41, Albretch Mueller wrote:
If you're interested in helping ...
Yes, I can and would offer man/mind hours to including movie media
files parsing (and eventually processing) in tika
Great!
I am definitely more inclined to use ffmpeg (your third option) but I
think we
On Mon, 5 Dec 2011, Paul Pearcy wrote:
It appears that under the hood pdfbox can work with either a
RandomAccessFile
(http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html
) or a RandomAccessBuffer
On Tue, 29 Nov 2011, chethan wrote:
but there is no property called PASSWORD; m.set(Metadata.PASSWORD,
"NiceAndSecret"); is throwing an error
Ah, that key seems to be hard coded into the PDFBox parser
Short term, you can get the metadata key from PDFBoxParser
Medium term, any chance you could
On Fri, 11 Nov 2011, Swapna Vuppala wrote:
I am using Tika to index .msg files of Outlook. It has been working very
well for me but I am facing a problem while indexing some .msg files. The
indexing fails with the below Solr exception
SEVERE: org.apache.solr.common.SolrException: Invalid Date
On Wed, 28 Sep 2011, Swapna Vuppala wrote:
Am new to using Solr and Tika. Am trying to index .msg files (Outlook
mails) into Solr. For this, I need a list of metadata extracted by Tika
from emails. I would like to know what all fields from a .msg file are
extracted by Tika's outlookextractor.
On Sun, 25 Sep 2011, Mattmann, Chris A (388J) wrote:
Description Resource Path Location Type
The method getBookmarkStartList() is undefined for the type CTP
XWPFWordExtractorDecorator.java
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml
On Fri, 23 Sep 2011, Mattmann, Chris A (388J) wrote:
Weird. I am not seeing that in my Eclipse .classpath file:
[chipotle:~/tmp] mattmann% grep -R poi $HOME/src/tika/.classpath
<classpathentry kind="var"
path="M2_REPO/org/apache/poi/ooxml-schemas/1.0/ooxml-schemas-1.0.jar"/>
On Tue, 30 Aug 2011, Jukka Zitting wrote:
Yes, I think you're right. I believe the problem here is the
openContainer field within TikaInputStream where the container-aware
type detection code stores the already opened container (in this case an
NPOIFSFileSystem object) to avoid having to
On Tue, 9 Aug 2011, Charles wrote:
FYI, here is a list of apparent Tika 0.8 conversion failures when run
from Xapian's omindex on a Debian 6 Squeeze 64-bit system with 4 GB memory
It'd be interesting to know if a recent nightly snapshot build does any
better? Especially as we're gearing up
On Sun, 24 Jul 2011, Jakub Liska wrote:
currently it is only possible to getExtension from MimeType. But there
is no way of getting MimeType from already detected MediaType, to get
the file extension.
Start with your TikaConfig, and call getMimeRepository() to get the
MimeTypes. From there
On Wed, 20 Jul 2011, alexander sulz wrote:
While indexing PDFs with solr I stumbled upon one copy which threw an
Unexpected RuntimeException from
org.apache.tika.parser.pdf.PDFParser@b9b618
Should I upload that PDF somewhere? If yes, where?
This looks like an Apache PDFBox bug (Tika uses
On Sat, 16 Jul 2011, Christian Zange wrote:
My Installation always stops at the
org.apache.tika.fork.ForkParserTest.xml
I seem to recall there being some issues with a test in 0.9 on certain
platforms. It might be worth trying a recent svn checkout and see if that
fixes it
One for us - is
On Fri, 15 Jul 2011, Fernando Arreola wrote:
I actually have an AFM parser that I have been working on.
Great!
Not sure if you want to take a look and see if it is good enough. Should
I just attach a reply on this thread or is there a better way to get it
to you?
Please attach it to jira
On Thu, 14 Jul 2011, Denis Voloshin wrote:
As you asked, I send you unit test which demonstrates the problem.
I've moved this to a JIRA - https://issues.apache.org/jira/browse/TIKA-683.
Please see my comments there
Nick
On Fri, 10 Jun 2011, Andrzej Bialecki wrote:
I have a feeling that .pfa and .pfb are the fonts themselves, and the
.pfm and .afm files are metadata about them. Can anyone confirm? If so,
we should split this entry into two
The files ending with m are font metrics.
Thanks for the info. I've
On Wed, 13 Jul 2011, Denis Voloshin wrote:
I'd like to know if there are any updates regarding the question I submitted
on 26/06/2011
Wasn't this just a problem with how you were rendering your text?
If there's something else, can you try writing a small unit test that
shows up the problem, and
On Wed, 13 Jul 2011, Denis Voloshin wrote:
No need for a unit test; the problem is reproduced with the tika-application
command line tool
Any chance you could work it up as a unit test though? We use the unit
tests to both ensure a bug is fixed now, as well as to ensure no
regressions occur in
On Tue, 28 Jun 2011, Trevor Watson wrote:
When run from the executable, I get the following information (vs an MP3
file)
xmpDM:releaseDate=2009
Content-Length=4136960
xmpDM:audioChannelType=Stereo
xmpDM:album= Author=The B52's
This has come from the MP3 parser
When run from the DLL I get
On Mon, 20 Jun 2011, Troy Witthoeft wrote:
I made some changes, and brought it in line with other tika parser examples
I have seen. I've looked over IOUtils, however I'm a bit rusty on my
Java. By rusty I mean inept.
If you want, open a new jira and upload a sample small cadkey file along
with
On Wed, 15 Jun 2011, Troy Witthoeft wrote:
I don't think it should be that hard to implement. For instance, opening
prt files with a text editor shows that user inputted text fields are stored
as ASCII(?) characters.
Here's an image of a prt file open in notepad [http://i.imgur.com/CPTU0.png]
On Thu, 16 Jun 2011, Troy Witthoeft wrote:
I'm no file decoder, but I did review about a dozen prts created with
different versions of the program, and different companies. The closest
thing I can find to a common header or sequence of bytes is the
occurrence of sextuple 3's and nine 0's just
On Thu, 16 Jun 2011, Troy Witthoeft wrote:
Thanks to your pointers, I did notice that there is a common delimiter [0A
00] that follows the ASCII text.
0x0a is \n
0x00 is null
So your strings are usually terminated with a new line, but always with a
null. I'd suggest you use the \n to decide
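A minimal sketch of the termination rule described above (always a null, usually preceded by a newline) might look like this. It assumes the input byte range holds only ASCII text fields; a real prt file would contain binary data around them that this does not filter out. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class NullTerminatedStrings {
    /**
     * Splits data at each 0x00 byte and drops one trailing 0x0a (newline)
     * if present. Bytes after the last null are ignored as unterminated.
     */
    static List<String> split(byte[] data) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] != 0x00) {
                continue;
            }
            int end = i;
            if (end > start && data[end - 1] == 0x0a) {
                end--; // the newline is optional, the null is not
            }
            if (end > start) {
                out.add(new String(data, start, end - start, StandardCharsets.US_ASCII));
            }
            start = i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] sample = {'A', 'B', 0x0a, 0x00, 'C', 0x00, 'D'};
        System.out.println(split(sample));
    }
}
```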
On Thu, 9 Jun 2011, Fernando Arreola wrote:
I read through the 5 minute quick start tutorial and started following the steps
detailed there. I noticed that the tika-mimetypes.xml file already has an
entry which contains the afm and pfb file types.
<mime-type type="application/x-font-type1">
<glob
On Tue, 24 May 2011, Christanto Leonardo wrote:
What is the minimum set of jars required to get the best detection Tika can offer?
My hunch is it'd be tika-core, all the tika-core dependencies,
tika-parsers, poi, and a few bits of commons, but you'd need to do some
tests...
Currently I am using
On Tue, 10 May 2011, Shinichiro Abe wrote:
For example, I used Tika 0.8 (POI-3.7, pdfbox-1.3.1). In fact I used Solr 3.1.
But it raised a text extraction error in a special excel.
This issue seemed to be fixed at POI-3.8Beta2.
At this time, can I replace POI with POI-3.8Beta2?
You'd probably want
On Wed, 6 Apr 2011, Withanage, Dulip wrote:
1. I tried calling the --metadata option and it gives me the
metadataname:value. So this looks promising to me, if I could format the
above output as xml. What is your advice on the best way to do it?
You'll probably want to write some java code at
On Mon, 28 Mar 2011, Roberto Martelloni wrote:
I'm trying to find the list of all supported file types in tika 0.9,
can anyone suggest where to find it?
Easiest way is to ask tika-app:
java -jar tika-app-0.9.jar --list-parser-details
That'll give you back the list of all the
On Mon, 28 Mar 2011, Withanage, Dulip wrote:
We are interested in extracting the raw metadata (not formatted in XHTML
as SAX events) from the files using tika.
Generally speaking, all of the metadata that is extracted is placed into
the Metadata object you supply when parsing. The SAX events
On Fri, 22 Oct 2010, qubit wrote:
Thank you for your reply -- I will look into making the patch; it will get
me immersed in the code so I understand it better.
The code you probably want to look at is TXTParser in the tika-parser
package. The parser quickstart guide at
On Fri, 24 Sep 2010, Mattmann, Chris A (388J) wrote:
by our very own Nick Burch! :)
See here: http://s.apache.org/JMu
Glad you like it :)
I'll hopefully do another few posts about Tika in the next week or so, but
they'll be more about fine grained control of how Tika in Alfresco works
On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
When I tested content and metadata extraction by Tika, I met the following use cases:
- Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
MSOffice.CREATION_DATE)
Date returned as String, but format is different for different document
types. Probably you
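One way a caller might cope with date strings that differ by document type is to try a short list of candidate formats in turn. The sketch below is an assumption-heavy illustration using only java.time: the candidate patterns are hypothetical and not taken from Tika, and values without an offset are kept as-is rather than converted.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

final class LenientDates {
    // Hypothetical candidate formats for values that carry no offset.
    private static final DateTimeFormatter[] LOCAL_FORMATS = {
        DateTimeFormatter.ISO_LOCAL_DATE_TIME,               // 2010-09-08T10:00:00
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss"),  // 2010-09-08 10:00:00
    };

    /** Returns the parsed value, or empty if no candidate format matches. */
    static Optional<LocalDateTime> parse(String raw) {
        String s = raw.trim();
        try {
            // Values with an offset (or a trailing Z) are normalised to UTC.
            return Optional.of(OffsetDateTime.parse(s)
                    .withOffsetSameInstant(ZoneOffset.UTC)
                    .toLocalDateTime());
        } catch (DateTimeParseException ignored) {
            // fall through to the offset-free candidates
        }
        for (DateTimeFormatter f : LOCAL_FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(s, f));
            } catch (DateTimeParseException ignored) {
                // try the next candidate
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("2010-09-08T10:00:00+02:00"));
        System.out.println(parse("2010-09-08 10:00:00"));
        System.out.println(parse("not a date"));
    }
}
```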