Re: n-gram profile format

2013-01-18 Thread Nick Burch
On Wed, 16 Jan 2013, Cedric Meury wrote: A) Why does Tika only support 3-gram profiles? In the code, the legacy format is even referenced in comments (LanguageProfileBuilder): It looks like wherever the code came from had made that change. Sadly, there's no issue number with the commit:

Re: PDF parse failing to capture entire text

2013-01-10 Thread Nick Burch
On 04/01/13 20:00, Jack Park wrote: A two-column scientific paper. The PDF parser has a few options that can be set, to control how some aspects of the parsing are done. Sorting text by position is one of them, which makes the parsing take a little longer, but will often improve accuracy on

Re: Tika configuration file

2013-01-07 Thread Nick Burch
On Mon, 7 Jan 2013, Maciej Liżewski wrote: I am using tika with Apache Solr. What I need to achieve is to process all images with provided external parser instead of default image/jpeg parser. In general this is all about some external OCR software. Your best bet is to include a parsers

RE: fetching content from archives and images

2013-01-07 Thread Nick Burch
On Mon, 7 Jan 2013, Maciej Liżewski wrote: And is there some default parser to recursively index all files in archive? You can just use AutoDetectParser, if you don't need any special handling. I think a lot of people have a small custom parser that outputs some special markup / flags it in

Re: fetching content from archives and images

2013-01-04 Thread Nick Burch
On 04/01/13 12:09, Maciej Liżewski wrote: 1. does tika recursively fetch content from archives (zip, rar, etc)? If you ask it to. You need to attach the parser you want to use for recursion to the ParseContext, and it'll be called for any embedded resources. (If you want, you can give your

Re: Parsing .zip files

2012-12-09 Thread Nick Burch
On Sat, 8 Dec 2012, Lewis John Mcgibbney wrote: We use Tika 1.2 over in Nutch, I wonder what kind of support Tika has for parsing .zip files and whether someone can comment on whether I can work towards dropping the legacy parser for Nutch? Tika has pretty good support for archive formats,

Re: Header footer of RTF files are not extracted correctly

2012-11-28 Thread Nick Burch
On Wed, 28 Nov 2012, samir pendharkar wrote: 1) When header/footer gets extracted as text, it also include what seems like formatting information/metadata. Example - ? DATE \@ MM/dd/yy ?09/16/12? extracted in the text Actual document only shows 09/16/12 in the footer Looks like some date

Re: Providing password for extracting contents of zip files in email attachments

2012-11-21 Thread Nick Burch
On Wed, 21 Nov 2012, Juha Haaga wrote: Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: unsupported feature encryption used in entry … Is this error caused by lack of password or lack of zip decrypting functionality? Is it possible to provide the zip file

Re: Unable to parse the default media type registry error

2012-11-19 Thread Nick Burch
On 18/11/12 16:33, Jason Judge wrote: I was using libgcj 4.4.6, which seems to be the latest for CentOS 6.3, so far as I can see (in rpm format, at least). I've installed openjdk: # yum install java-1.7.0-openjdk and that works a treat - fast too. Demo page here: r

Re: Is Tika really using streaming to parse files?

2012-11-10 Thread Nick Burch
On Fri, 9 Nov 2012, Norman M wrote: I am using Apache Tika to extract text from PPT/PPTX files. Is Poi really using streaming to parse files? Some bits. xls file processing is stream based, for ppt the whole file gets processed and then the text parts are located and picked out. File file

Re: input filestream in command line

2012-10-27 Thread Nick Burch
On Sat, 27 Oct 2012, goog cheng wrote: the file is in memory, i have to save it in disk and then fetch it back again? Some Tika parsers only work with files, so if you don't stream it to disk, Tika will. Otherwise, you could always stream it to Tika on stdin? Nick

Re: input filestream in command line

2012-10-26 Thread Nick Burch
On 26/10/12 00:52, goog cheng wrote: in python, an opened file object And how are you currently calling Tika from Python? Nick

Re: input filestream in command line

2012-10-25 Thread Nick Burch
On Fri, 26 Oct 2012, goog cheng wrote: Tika supports input file in CLI . But if the input is filestream, is there a command to do it? Any help would be greatly appreciated! What do you mean by a filestream, a pipe? Something else? Nick

Re: Trying to create a new mime-type entry

2012-09-28 Thread Nick Burch
On Fri, 28 Sep 2012, David Patterson wrote: I want to process a maven pom.xml with special code. I added the following to the existing xml file of mimetypes: <mime-type type="application/maven-pom"> <glob pattern="pom.xml"/> </mime-type> You'll be much better off adding it to a custom mimetypes

Re: Extract only the filenames from an archive

2012-09-27 Thread Nick Burch
On Thu, 27 Sep 2012, Vigneshwaran wrote: I am new to Apache Tika. I want Tika to output only the names of the files within the archive (if the input file is an archive) and the file content as usual if the input file is not an archive. Is there a way I can do that? Yup. Rather than passing

Re: FW: Extract footer/header text out of Word docs

2012-08-30 Thread Nick Burch
On Thu, 30 Aug 2012, Alex Cougarman wrote: Hi. Is it possible to specifically extract footer/header and body text out of a Word document using Solr? In other words, we'd like to index/store those items in different Solr fields. As long as they have a suitable style applied, yes, Tika will be

Re: Preventing AutoDetect parser from using org.apache.tika.parser.microsoft.TNEFParser

2012-08-22 Thread Nick Burch
On 22/08/12 06:29, Ramachandran, Karthik wrote: I'm having some trouble with the TNEFParser so I would like to prevent the AutoDetect parser from using it. Is there a way to override the default org.apache.tika.parser.Parser to prevent it from using the TNEFParser? For a long term fix,

RE: Return raw text from document

2012-08-17 Thread Nick Burch
On Fri, 17 Aug 2012, Alexander Cougarman wrote: I'm using this C# code to call the parser directly via its URL; it returns JSON: var url = @"http://localhost:8983/solr/update/extract"; You might have more luck asking on the SOLR lists, as it looks like your question is with how SOLR

Re: Return raw text from document

2012-08-16 Thread Nick Burch
On Thu, 16 Aug 2012, Alexander Cougarman wrote: Is it possible to return just the raw text of the document extracted by Tika? In other words, we don't want it in XML or JSON, just the text in it. Yes. Are you using the TikaApp jar, calling the Tika facade class, or calling a parser directly?

Re: using tika with eclipse

2012-07-18 Thread Nick Burch
On Wed, 18 Jul 2012, rodgersh wrote: And here is my custom-mimetypes.xml file: <?xml version="1.0" encoding="UTF-8"?> <mime-info> <mime-type type="image/nitf"> <alias type="image/ntf"/> <glob pattern="*.nitf"/> </mime-type> </mime-info> I've no idea about OSGi, so I can't comment on what you need to

Re: Server mode documentation?

2012-07-01 Thread Nick Burch
On Sun, 1 Jul 2012, Nick Burch wrote: Tika snapshots are available in the Snapshot Repository: http://repository.apache.org/snapshots/org/apache/tika/ The current latest tika-app snapshot is: https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.2-SNAPSHOT/tika

Re: Text extraction from large PDF files

2012-07-01 Thread Nick Burch
On Sun, 1 Jul 2012, Zabrane Mickael wrote: So there's no way to avoid loading the file in RAM? That's sad. Any other advice, guys? You could try asking on the Apache PDFBox list for advice - Tika's PDF extraction is all powered by PDFBox Nick

Re: Server mode documentation?

2012-07-01 Thread Nick Burch
On Sun, 1 Jul 2012, Jason Judge wrote: Am I understanding it correctly that tika-server and tika-app are just two examples of the way tika can be used, and are just thrown together as a quick-start demo rather than core functionality of the main part of the project, which is a collection of

Re: Server mode documentation?

2012-07-01 Thread Nick Burch
On Sun, 1 Jul 2012, Jason Judge wrote: So, feature requests, command line, or...learn java. It is going to be a busy Summer :-) Anyone up for a Tika hackathon weekend in Oxford later this summer? Jason could hop on a direct train down from Newcastle, and we're only an hour from Heathrow for

Re: TIKA-198: Illegal IOException... MP4

2012-06-06 Thread Nick Burch
On Wed, 6 Jun 2012, Paulini, Matthew CTR USAF AFMC AFRL/RISA wrote: When using tika-app-0.10.jar, I receive video/quicktime when passing the .mp4 byte array. But, tika-app-1.1.jar is throwing an exception (TIKA-198: Illegal IOException from org.apache.tika.parser.mp4.MP4Parser@8491b8) from the

Re: Unable to read default mimetypes error message

2012-05-21 Thread Nick Burch
On Fri, 18 May 2012, Karthik Deivasigamani wrote: I wanted to try out tika for our parsing needs. Tried downloading the tika-app-1.0.jar 1.1 jar file and also build it locally from the src using mvn. Both seem to give me the same error message as below : *[karthik@karthik-linux

Re: Tika fails to extract text from very large files

2012-05-17 Thread Nick Burch
On Thu, 17 May 2012, Alec Swan wrote: 1. We don't know how to tell if we don't have enough heap space to process the file and skip the file in this case. Allowing out of memory errors to take down our process is not acceptable. In that kind of situation, you should be looking at using something

Re: Tika fails to extract text from very large files

2012-05-16 Thread Nick Burch
On Wed, 16 May 2012, Alec Swan wrote: Memory consumption stays under 90MB which is less than max heap size (128M). No out-of-memory errors are thrown during test There is absolutely no way that you're going to be able to parse a PDF, DOC/DOCX or PPT/PPTX of more than about 20mb in size on a

Re: porting Tika 1.0 to Android 4.0

2012-05-03 Thread Nick Burch
On Thu, 3 May 2012, Ilya Zavorin wrote: However, I am having trouble porting any code, to be able to step through it using a simple wrapper app on Android. Specifically, I am using Eclipse and having this issue:

Re: Problem detecting XML

2012-04-17 Thread Nick Burch
On Tue, 17 Apr 2012, Taylor, Wade wrote: Hi, I'm having trouble detecting a file as application/xml. When I detect a URL containing XML the detection works and I get application/xml as the media type. Hmm, that's odd. I've taken your sample xml, popped it in a new file, and run java -jar

Re: HTML not listed as supported type in the AutoDetectParser

2012-04-17 Thread Nick Burch
On Tue, 17 Apr 2012, William Hays wrote: I believe you answered a different question than what I asked. My observation was specifically about the AutoDetectParser listing its supported mediatypes, not about the HTMLParser. The Tika App uses AutoDetectParser internally, so if it's finding the

Re: Problem detecting XML

2012-04-17 Thread Nick Burch
On Tue, 17 Apr 2012, Taylor, Wade wrote: Since I couldn't get that to work I went back to basics and tried a simple XML string: new Tika().detect(new ByteArrayInputStream("<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes())); but this gets detected as text/plain too and I

Re: .PPT failing to parse that worked back in 0.9

2012-04-16 Thread Nick Burch
On Mon, 16 Apr 2012, Kevin Miller wrote: This is with plain Tika 1.1 built via Maven from source downloaded from Apache's Tika project. I am not familiar with how to adjust what dependencies Tika builds pull down. If you point me in a direction I'll give it a try. In svn, the dependency has

Re: 'looking' inside an OOXML container

2012-03-15 Thread Nick Burch
On Tue, 13 Mar 2012, Jon Gorrono wrote: The tika-app jar properly identifies the misnamed file so it's either a classpath or a implementation issue You'll need to have the Tika Parsers jar (and associated dependencies) for it to work properly. We do have unit tests for this, and as long as

Re: OutOfMemoryError in Tika

2012-03-09 Thread Nick Burch
On Fri, 9 Mar 2012, Mark Kerzner wrote: Standard 1.0 of Tika, with whatever POI is included in it by default It's probably worth re-testing with the Tika 1.1 release candidate, and seeing if that fixes it (it has a newer POI version in it) Nick

Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

2012-03-08 Thread Nick Burch
On Thu, 8 Mar 2012, Harry Simons wrote: I tried the BFF Validator, and it is indeed failing! If you're able to share the error log, that could be helpful However, the file got created by MS Word only, and I doubt if it's 'corrupt'... since both MS Word and LibreOffice can load it fine

Re: tika in servicemix - empty result, parsers not found.

2012-03-08 Thread Nick Burch
On Thu, 8 Mar 2012, Shalom Ben-Zvi wrote: I installed tika-bundle-1.0 and tika-core-1.0 into servicemix. I'm invoking Tika from my bundle, actually a camel route. I don't know a lot about OSGi, but that might be your issue - you have some bits of Tika coming from a bundle, and some bits from

Re: Removing use of deprecated .getMimeType(url) towards detect in Tika 0.10

2012-02-29 Thread Nick Burch
On Mon, 27 Feb 2012, Lewis John Mcgibbney wrote: After compiling I get [javac] MimeUtil.java:165: incompatible types [javac] found : java.lang.Object&java.io.Serializable&java.lang.Comparable<? extends java.lang.Object&java.io.Serializable&java.lang.Comparable<?>> [javac] required:

Re: Removing use of deprecated .getMimeType(url) towards detect in Tika 0.10

2012-02-26 Thread Nick Burch
On Sun, 26 Feb 2012, Lewis John Mcgibbney wrote: // If no mime-type header, or cannot find a corresponding registered // mime-type, then guess a mime-type from the url pattern type = this.mimeTypes.getMimeType(url) != null ? this.mimeTypes .getMimeType(url) : type; }

Re: Problem with overriding built-in parser

2012-02-16 Thread Nick Burch
On Tue, 14 Feb 2012, Stephan Mühlstrasser wrote: https://issues.apache.org/jira/browse/TIKA-527 Is there any documentation of the syntax of the configuration file available? You could look at the code that processes the file, but the example in that JIRA ought to cover most use cases The

Re: tika-core, tika-parser?

2012-02-08 Thread Nick Burch
On Wed, 8 Feb 2012, Markus Jelsma wrote: In Nutch we have a copy of Tika-core. But with just that lib we also have access to the Tika.parser API from the other module. How does this all work because i have had confusing results in the past (and now). Tika Core comes with the core of Tika,

Re: Mapping Tika MIME Types into Top-Level Categories

2012-02-07 Thread Nick Burch
On Tue, 7 Feb 2012, Public Network Services wrote: Counting tags only, apparently there are 1,304 different variations of MIME types there (!), so I would like to map them to, say, a few custom top-level categories like Office, PDF, Audio, Video, or similar. Assuming this is not done in Tika,

Re: Mapping Tika MIME Types into Top-Level Categories

2012-02-07 Thread Nick Burch
On Tue, 7 Feb 2012, Public Network Services wrote: I know about the predefined types in the MediaType class. I think you might have missed something - what I was referring to was things like "video" at the start of video/mp4 being a good way to spot that it's a video :) Perhaps we should get
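The prefix idea from this reply can be sketched in plain Java without any Tika classes. The category names, the `MimeCategories` class and the special case for `application/pdf` are assumptions made for illustration; they are not part of Tika or of the original thread.

```java
import java.util.Locale;

// Hypothetical helper: maps a MIME type string to a coarse category
// by looking at the top-level type (the part before the '/').
final class MimeCategories {

    static String categoryOf(String mimeType) {
        if (mimeType == null) {
            return "Other";
        }
        String type = mimeType.toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=UTF-8" before inspecting the type.
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            type = type.substring(0, semicolon);
        }
        type = type.trim();
        int slash = type.indexOf('/');
        if (slash <= 0) {
            return "Other";
        }
        switch (type.substring(0, slash)) {
            case "video": return "Video";
            case "audio": return "Audio";
            case "image": return "Image";
            case "text":  return "Text";
            default:      break;
        }
        // Types under application/ need explicit entries; only one is shown here.
        if (type.equals("application/pdf")) {
            return "PDF";
        }
        return "Other";
    }

    public static void main(String[] args) {
        System.out.println(categoryOf("video/mp4"));
    }
}
```

The top-level type only helps for audio, video, image and text; office formats mostly live under `application/`, so a real mapping would still need an explicit table for those.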

Re: File Content Type Detection

2012-01-28 Thread Nick Burch
On Fri, 27 Jan 2012, Public Network Services wrote: I had a look at the MIME types list and there are 50 different Office formats, including many for Microsoft Word/Excel/Powerpoint! Yup, there are quite a few different formats (with and without macros, normal and templates etc), and they

Re: trouble with last character ? whn using Mp3Parser metadata.get()

2012-01-20 Thread Nick Burch
On Fri, 20 Jan 2012, hpvpl wrote: I've tried to checkout the code and recompile but face pom issue when doing the mvn clean install That should work just fine, there shouldn't be any issues with compiling from trunk Nick

Re: FW: Default Tika extraction of docx 5X slower than XWPFWordExtractor?

2012-01-20 Thread Nick Burch
On Fri, 20 Jan 2012, Allison, Timothy B. wrote: I'm just getting started with Tika, and I tried the basic AutoDetectParser and the basic ParsingReader on a batch of a few thousand docx files (tika-app v1.0). On my laptop, I was able to extract text at a rate of 200 docs per minute. When I

Re: trouble with last character ? whn using Mp3Parser metadata.get()

2012-01-19 Thread Nick Burch
On Thu, 19 Jan 2012, hpvpl wrote: When I parse a Mp3 source I have a problem with the last character of the album, Author and artist metadata. I get a ? character at the end of the metadata. Can you try with a recent nightly build? A problem like that was fixed only recently Nick

Re: External parser in a jar file

2012-01-06 Thread Nick Burch
On Thu, 5 Jan 2012, ola nowak wrote: Should java -jar tika-app.jar -list-parsers list it? Nope. The service loading isn't magic - it won't go and find random jars that you haven't told it about! You'll instead need something like: java -classpath MyParser.jar:tika-app-1.1-SNAPSHOT.jar
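The point that service loading only sees what is on the classpath can be illustrated with the JDK's own `java.util.ServiceLoader`, which reads `META-INF/services` entries in the same spirit. This is a sketch, not Tika's loading code: `DemoParser` is a hypothetical interface invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class ServiceLookupDemo {

    // Hypothetical stand-in for a parser interface; not a Tika type.
    interface DemoParser {
        String name();
    }

    // Returns the names of every DemoParser registered on the current classpath.
    // A provider is only found if its jar is on the classpath and lists the
    // implementation class in a META-INF/services file named after the interface.
    static List<String> discoveredParserNames() {
        List<String> names = new ArrayList<>();
        for (DemoParser parser : ServiceLoader.load(DemoParser.class)) {
            names.add(parser.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no provider jar on the classpath the list is simply empty.
        System.out.println(discoveredParserNames());
    }
}
```

Run on its own, the lookup finds nothing; it would only return entries once a jar containing an implementation and its service file is added with `-classpath`.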

Re: ... all major file formats

2012-01-02 Thread Nick Burch
On Mon, 2 Jan 2012, Albretch Mueller wrote: How can someone know that the heading for a PDF file corresponds to the heading of a MS Word and or RTF file or the title on an HTML file corresponds to the title of a media file? They can't - both formats allow you to make something look like a

Re: ... all major file formats

2011-12-31 Thread Nick Burch
On Sat, 31 Dec 2011, Albretch Mueller wrote: I think all major file formats should be somehow functionally specified through something like ~ core.tika.formatHandlers.getAll[DefinedFormat]Handlers In code: TikaConfig config = TikaConfig.getDefaultConfig(); Set<MediaType> supported =

Re: Writing my own parser

2011-12-30 Thread Nick Burch
On Fri, 30 Dec 2011, ola nowak wrote: I've added my parser to the list but I don't know how to explicitly tell AutoDetectParser to use my parser. You probably need to do two steps: * Add a custom mimetypes entry that detects your special XML files as a suitable (probably new) mimetype * Have

Re: Recursive parsing

2011-12-08 Thread Nick Burch
On Thu, 8 Dec 2011, Andrzej Bialecki wrote: I guess that could work, but it would be very messy - I would have to keep a list of all potentially interesting mime types in my code, which is difficult to maintain. Or a list of interesting parsers in your other case! It would be much better if

Re: parsers implementations for media files (mpeg, flv, webm)

2011-12-05 Thread Nick Burch
On 05/12/11 21:41, Albretch Mueller wrote: If you're interested in helping ... Yes, I can and would offer man/mind hours to including movie media files parsing (and eventually processing) in tika Great! I am definitely more inclined to use ffmpeg (your third option) but I think we

Re: Processing large amounts of PDFs in parallel without running out of memory

2011-12-05 Thread Nick Burch
On Mon, 5 Dec 2011, Paul Pearcy wrote: It appears that under the hood pdfbox can work with either a RandomAccessFile (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html ) or a RandomAccessBuffer

Re: how to parse password protected pdf files from tika.

2011-11-29 Thread Nick Burch
On Tue, 29 Nov 2011, chethan wrote: but there is no property called PASSWORD m.set(Metadata.PASSWORD, "NiceAndSecret");, it is throwing an error Ah, that key seems to be hard coded into the PDFBox parser Short term, you can get the metadata key from PDFBoxParser Medium term, any chance you could

Re: Problem indexing msg files

2011-11-12 Thread Nick Burch
On Fri, 11 Nov 2011, Swapna Vuppala wrote: Am using Tika to index .msg files of Outlook. It has been working very good for me but am facing problem while indexing some .msg files. The indexing fails with the below Solr exception SEVERE: org.apache.solr.common.SolrException: Invalid Date

Re: Metadata extracted by OutlookExtractor

2011-09-28 Thread Nick Burch
On Wed, 28 Sep 2011, Swapna Vuppala wrote: Am new to using Solr and Tika. Am trying to index .msg files (Outlook mails) into Solr. For this, I need a list of metadata extracted by Tika from emails. I would like to know what all fields from a .msg file are extracted by Tika's outlookextractor.

Re: Weird Eclipse errors?

2011-09-26 Thread Nick Burch
On Sun, 25 Sep 2011, Mattmann, Chris A (388J) wrote: Description Resource Path Location Type The method getBookmarkStartList() is undefined for the type CTP XWPFWordExtractorDecorator.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml

Re: Weird Eclipse errors?

2011-09-24 Thread Nick Burch
On Fri, 23 Sep 2011, Mattmann, Chris A (388J) wrote: Weird. I am not seeing that in my Eclipse .classpath file: [chipotle:~/tmp] mattmann% grep -R poi $HOME/src/tika/.classpath <classpathentry kind="var" path="M2_REPO/org/apache/poi/ooxml-schemas/1.0/ooxml-schemas-1.0.jar"/>

Re: Tika leaves files open

2011-08-30 Thread Nick Burch
On Tue, 30 Aug 2011, Jukka Zitting wrote: Yes, I think you're right. I believe the problem here is the openContainer field within TikaInputStream where the container-aware type detection code stores the already opened container (in this case an NPOIFSFileSystem object) to avoid having to

Re: Tika 0.8 failure rates

2011-08-10 Thread Nick Burch
On Tue, 9 Aug 2011, Charles wrote: FYI, here is a list of apparent Tika 0.8 conversion failures when run from Xapian's omindex on a Debian 6 Squeeze 64-bit system with 4 GB memory It'd be interesting to know if a recent nightly snapshot build does any better? Especially as we're gearing up

Re: How to get extension from MediaType

2011-07-24 Thread Nick Burch
On Sun, 24 Jul 2011, Jakub Liska wrote: currently it is only possible to getExtension from MimeType. But there is no way of getting MimeType from already detected MediaType, to get the file extension. Start with your TikaConfig, and call getMimeRepository() to get the MimeTypes. From there

Re: unparseable PDF - Unexpected RuntimeException

2011-07-20 Thread Nick Burch
On Wed, 20 Jul 2011, alexander sulz wrote: While indexing PDFs with solr I stumbled upon one copy which threw an Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@b9b618 Should I upload that PDF somewhere? If yes, where? This looks like an Apache PDFBox bug (Tika uses

Re: Installation of Apache Tika 0.9 on Ubuntu 10.04

2011-07-17 Thread Nick Burch
On Sat, 16 Jul 2011, Christian Zange wrote: My Installation always stops at the org.apache.tika.fork.ForkParserTest.xml I seem to recall there being some issues with a test in 0.9 on certain platforms. It might be worth trying a recent svn checkout and see if that fixes it One for us - is

Re: Adding Font Parsers

2011-07-16 Thread Nick Burch
On Fri, 15 Jul 2011, Fernando Arreola wrote: I actually have a AFM parser that I have been working on. Great! Not sure if you want to take a look and see if it is good enough. Should I just attach a reply on this thread or is there a better way to get it to you? Please attach it to jira

Re: non-West European languages support

2011-07-15 Thread Nick Burch
On Thu, 14 Jul 2011, Denis Voloshin wrote: As you asked, I send you unit test which demonstrates the problem. I've moved this to a JIRA - https://issues.apache.org/jira/browse/TIKA-683. Please see my comments there Nick

Re: Adding Font Parsers

2011-07-15 Thread Nick Burch
On Fri, 10 Jun 2011, Andrzej Bialecki wrote: I have a feeling that .pfa and .pbf are the fonts themselves, and the .pfm and .afm files are metadata about them. Can anyone confirm? If so, we should split this entry into two The files ending with m are font metrics. Thanks for the info. I've

Re: non-West European languages support

2011-07-13 Thread Nick Burch
On Wed, 13 Jul 2011, Denis Voloshin wrote: I'd like to know if there is any updates regard the question I submitted on 26/06/2011 Wasn't this just a problem with how you were rendering your text? If there's something else, can you try writing a small unit test that shows up the problem, and

Re: non-West European languages support

2011-07-13 Thread Nick Burch
On Wed, 13 Jul 2011, Denis Voloshin wrote: No need for unit test the problem is reproduced with tika-application command line tool Any chance you could work it up as a unit test though? We use the unit tests to both ensure a bug is fixed now, as well as to ensure no regressions occur in

Re: Tika / IKVM / C# - Information works in executable, but not in DLL

2011-06-29 Thread Nick Burch
On Tue, 28 Jun 2011, Trevor Watson wrote: When run from the executable, I get the following information (vs an MP3 file) xmpDM:releaseDate=2009 Content-Length=4136960 xmpDM:audioChannelType=Stereo xmpDM:album= Author=The B52's This has come from the MP3 parser When run from the DLL I get

Re: Cadkey prt parser?

2011-06-21 Thread Nick Burch
On Mon, 20 Jun 2011, Troy Witthoeft wrote: I made some changes, and brought inline with other tika parser examples I have seen. I've looked over IOUtils, however I'm a bit rusty on my Java. By rusty I mean inept. If you want, open a new jira and upload a sample small cadkey file along with

Re: Cadkey prt parser?

2011-06-16 Thread Nick Burch
On Wed, 15 Jun 2011, Troy Witthoeft wrote: I don't think it should be that hard to implement. For instance, opening prt files with a text editor shows that user inputted text fields are stored as ASCII(?) characters. Here's an image of a prt file open in notepad [http://i.imgur.com/CPTU0.png]

Re: Cadkey prt parser?

2011-06-16 Thread Nick Burch
On Thu, 16 Jun 2011, Troy Witthoeft wrote: I'm no file decoder, but I did review about a dozen prts created with different versions of the program, and different companies. The closest thing I can find to a common header or sequence of bytes is the occurrence of sextuple 3's and nine 0's just

Re: Cadkey prt parser?

2011-06-16 Thread Nick Burch
On Thu, 16 Jun 2011, Troy Witthoeft wrote: Thanks to your pointers, I did notice that there is a common delimiter [0A 00] that follows the ASCII text. 0x0a is \n 0x00 is null So your strings are usually terminated with a new line, but always with a null. I'd suggest you use the \n to decide
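The terminator observation above (strings usually end in `0x0A`, always in `0x00`) suggests a simple scan for text runs. The following is a sketch under that assumption only; the `NullTerminatedStrings` class, the `minLength` cut-off and the choice to treat only printable ASCII as text are illustrative and do not come from the thread or from any documented CADKEY layout.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner: pulls out runs of printable ASCII that end with a null byte,
// dropping the optional newline that precedes the null.
final class NullTerminatedStrings {

    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        int start = -1; // index where the current text run began, or -1 if none
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            boolean textByte = (b >= 0x20 && b <= 0x7E) || b == 0x0A;
            if (textByte) {
                if (start < 0) {
                    start = i;
                }
                continue;
            }
            if (b == 0x00 && start >= 0) {
                int end = i;
                // Strip the trailing newline(s) that usually precede the null.
                while (end > start && data[end - 1] == 0x0A) {
                    end--;
                }
                if (end - start >= minLength) {
                    found.add(new String(data, start, end - start, StandardCharsets.US_ASCII));
                }
            }
            // Any non-text byte, null or otherwise, ends the current run.
            start = -1;
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] sample = {'H', 'i', '!', 0x0A, 0x00, 0x01, 'a', 'b', 0x00, 'T', 'i', 'k', 'a', 0x00};
        System.out.println(extract(sample, 3));
    }
}
```

A minimum length filters out short runs that are likely binary noise rather than user-entered text; the right threshold would have to be tuned against real sample files.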

Re: Adding Font Parsers

2011-06-10 Thread Nick Burch
On Thu, 9 Jun 2011, Fernando Arreola wrote: I read through 5 minute quick start tutorial and started following the steps detailed there. I noticed that the tika-mimetypes.xml file already has an entry which contains the afm and pfb file types. <mime-type type="application/x-font-type1"> <glob

Re: Minimum jar for detection

2011-05-25 Thread Nick Burch
On Tue, 24 May 2011, Christanto Leonardo wrote: What is the minimum jar required to use the best Tika detection can offer? My hunch is it'd be tika-core, all the tika-core dependencies, tika-parsers, poi, and a few bits of commons, but you'd need to do some tests... Currently I am using

Re: Dependency policy

2011-05-10 Thread Nick Burch
On Tue, 10 May 2011, Shinichiro Abe wrote: For example, I used Tika 0.8 (POI-3.7, pdfbox-1.3.1). In fact I used solr3.1. But it raised a text extraction error in a special excel. This issue seemed to be fixed at POI-3.8-beta2. At this time, Can I replace POI with POI-3.8-beta2? You'd probably want

RE: XMP Metadata extraction

2011-04-06 Thread Nick Burch
On Wed, 6 Apr 2011, Withanage, Dulip wrote: 1. I tried calling the --metadata option and it gives me the metadataname:value. So this looks promising to me, if I could format the above output as xml. What is your advice to do it the best way? You'll probably want to write some java code at

Re: what are the list of supported file type in tika 0.9 ?

2011-03-28 Thread Nick Burch
On Mon, 28 Mar 2011, Roberto Martelloni wrote: I'm trying to find the list of all supported file type in tika 0.9, anyone can suggest to me where to find it ? Easiest way is to ask tika-app: java -jar tika-app-0.9.jar --list-parser-details That'll give you back the list of all the

Re: XMP Metadata extraction

2011-03-28 Thread Nick Burch
On Mon, 28 Mar 2011, Withanage, Dulip wrote: We are interested in extracting the raw metadata (not formatted in XHTML as SAX events) from the files using tika. Generally speaking, all of the metadata that is extracted is placed into the Metadata object you supply when parsing. The SAX events

Re: question and possible error about output xhtml

2010-10-22 Thread Nick Burch
On Fri, 22 Oct 2010, qubit wrote: Thank you for your reply -- I will look into making the patch; it will get me immersed in the code so I understand it better. The code you probably want to look at is TXTParser in the tika-parser package. The parser quickstart guide at

Re: Great 2-part blog article on Apache Tika

2010-09-25 Thread Nick Burch
On Fri, 24 Sep 2010, Mattmann, Chris A (388J) wrote: by our very own Nick Burch! :) See here: http://s.apache.org/JMu Glad you like it :) I'll hopefully do another few posts about Tika in the next week or so, but they'll be more about fine grained control of how Tika in Alfresco works

Re: How can I configure Tika to extract dates in single format?

2010-09-08 Thread Nick Burch
On Wed, 8 Sep 2010, Sergiy Karpenko wrote: When I tested content and metadata extraction by Tika, I ran into the following use cases: - Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED, MSOffice.CREATION_DATE) Date returned as String, but format is different for different document types. Probably you
