ANN: Textmining.org extractor library v1.0 released

2008-02-04 Thread Ryan Ackley
FYI, I just updated the textmining.org homepage with the following info. The tm-extractors library has a new release: v1.0. You can download it here: http://text-mining.googlecode.com/files/tm-extractors-1.0.jar The tm-extractors library is a pure Java library for extracting text from Word docum

Re: extracting non-english text from word, pdf, etc....??

2007-08-03 Thread Ryan Ackley
The textmining library (textmining.org) for Word docs should work fine with non-English text as well. Let me know if it doesn't. On 8/2/07, Ben Litchfield <[EMAIL PROTECTED]> wrote: > In terms of PDF documents... > > PDFBox should work just fine with any Latin-based languages; at this > time certai

Re: Related Article question

2007-07-07 Thread Ryan Ackley
I was playing around with MoreLikeThis and I noticed the problems you are talking about as well. One idea I thought of was for MoreLikeThis to focus only on proper nouns for the purposes of similarity or give a significant boost to those. Pretty much the same idea you had in #1. I found a list o

Re: index word files ( doc )

2007-03-26 Thread Ryan Ackley
The 512-byte thing is a limitation of POIFS, I think. I could be wrong though. Have you tried opening the file with just POIFS? On 3/26/07, Antony Bowesman <[EMAIL PROTECTED]> wrote: Ryan Ackley wrote: > Yes I do have plans for adding fast save support and support for more > file

Re: index word files ( doc )

2007-03-26 Thread Ryan Ackley
the rich formatting. On 3/26/07, jafarim <[EMAIL PROTECTED]> wrote: Good to know that your devised commercial feature is already offered by Enhydra Snapper as an open-source feature. Check here: http://www.enhydra.org/apps/snapper/index.html On 3/26/07, Ryan Ackley <[EMAIL PROTECTE

Re: index word files ( doc )

2007-03-25 Thread Ryan Ackley
so handles a greater variety of files. Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' files and any ideas on Word files older than the Word 6 format? Regards Antony Ryan Ackley wrote: > As the author of both Word POI and

Re: index word files ( doc )

2007-03-24 Thread Ryan Ackley
to on this is in the "Lucene in Action" book. On 3/24/07, jafarim <[EMAIL PROTECTED]> wrote: Can anyone make a comparison between the two, namely POI API and the one from textmining.org? On 3/24/07, Ryan Ackley <[EMAIL PROTECTED]> wrote: > > The site is down but you c

Re: index word files ( doc )

2007-03-24 Thread Ryan Ackley
The site is down but you can download the word extractor library direct here: http://www.textmining.org/textmining.zip Going to fix the site this weekend. On 3/24/07, Sami Siren <[EMAIL PROTECTED]> wrote: Antony Bowesman wrote: >> Are there other solutions? There's also antiword [1] which c

Re: TextMining.org Word extractor

2007-03-21 Thread Ryan Ackley
[EMAIL PROTECTED]> wrote: Last I remember, it was being voted on by the Incubator committee. Good to hear TextMining is back in action! Does that mean you are back on POI Word again too? -Grant On Mar 20, 2007, at 10:35 PM, Ryan Ackley wrote: > Someone pointed me there already. Looks interes

Re: TextMining.org Word extractor

2007-03-20 Thread Ryan Ackley
http://wiki.apache.org/incubator/TikaProposal Better home for your lib, perhaps? Otis -- Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Ryan Ackley <[EMAIL PROTECTED]> T

Re: TextMining.org Word extractor

2007-03-20 Thread Ryan Ackley
I've been out of the loop for a while. I just saw this recent thread and re-subscribed to the list. In the next month or two I will be able to put some time into the textmining library. Fast saved files are on the list of improvements as well as other features that have been requested. I would al

Re: Zilverline Search Engine version 1.4.0 released

2005-06-11 Thread Ryan Ackley
Michael, Cool, looks nice. I downloaded the distribution and I notice that you are using several BSD-licensed libraries besides Lucene, including the textmining.org library. I couldn't find any acknowledgement of those libraries in your documentation. The Apache 2.0 license lets you just inclu