Re: Lucene appreciation

2004-12-16 Thread Ben
, it doesn't make sense for me to see: company%3Amicrosoft It does make sense if you display: company:microsoft Cheers, Ben On Thu, 16 Dec 2004 11:38:20 -0500, Erik Hatcher [EMAIL PROTECTED] wrote: Rony - nice work! I subscribed to an alert already. The wiki is self-serve, just log
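A minimal sketch of the display fix suggested above, assuming the query reaches the page URL-encoded. The class and method names are illustrative only and do not come from the thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: turns an escaped query such as company%3Amicrosoft
// back into the form a user typed, for display purposes only.
class QueryDisplay {
    static String decodeForDisplay(String encoded) {
        try {
            // %3A is the URL escape for ':'
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeForDisplay("company%3Amicrosoft")); // prints company:microsoft
    }
}
```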

Re: PDFBox deprecated methods

2005-01-05 Thread ben
after parsing PDDocument.load() - A convenience method that does all the PDFParser stuff and returns a PDDocument LucenePDFDocument.getDocument() - to go straight from a File/URL to a lucene document object Ben Quoting Daniel Cortes [EMAIL PROTECTED]: Ok I reply myself the method deprecated

Search results excerpt similar to Google

2005-01-27 Thread Ben
Hi Is it hard to implement a function that displays search result excerpts similar to Google's? Is it just string manipulation, or is there some logic behind it? I like their excerpts. Thanks - To unsubscribe, e-mail:
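For the plain string-manipulation end of that question, a rough sketch might look like the following. It only cuts a window of characters around the first match and is not how Google or any Lucene highlighter works; the class name, method name, and radius parameter are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical excerpt helper: returns the text surrounding the first
// case-insensitive occurrence of a term, with "..." where text was cut.
class ExcerptSketch {
    static String excerpt(String text, String term, int radius) {
        // Assumes lower-casing does not change string length (true for ASCII text)
        int idx = text.toLowerCase(Locale.ROOT).indexOf(term.toLowerCase(Locale.ROOT));
        int start;
        int end;
        if (idx < 0) {
            // Term not found: fall back to the beginning of the text
            start = 0;
            end = Math.min(text.length(), 2 * radius);
        } else {
            start = Math.max(0, idx - radius);
            end = Math.min(text.length(), idx + term.length() + radius);
        }
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append("...");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(excerpt(text, "fox", 6)); // prints ...brown fox jumps...
    }
}
```

A real excerpt generator would also respect word boundaries and rank the best matching fragment, which is where the extra logic comes in.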

MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
this MultiFieldQueryParser with Lucene 1.4.3. Of course I changed some of the boolean stuff to make it work with the production release. Thanks, Ben

Re: MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
Thanks On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber [EMAIL PROTECTED] wrote: On Saturday 19 February 2005 15:26, Ben wrote: When I try to search for phrases using the MultiFieldQueryParser v1.8 from CVS, it gives me NullPointerException. This has just been fixed in SVN (I assume

Sorting isn't working for my date field

2005-02-21 Thread Ben
)); searcher.search(query, new SortField(date, false)); they both return the same order. Any idea? Thanks. Ben

Sorting date stored in milliseconds time

2005-02-25 Thread Ben
Hi I store my date in milliseconds, how can I do a sort on it? SortField has INT, FLOAT and STRING. Do I need to create a new sort class, to sort the long value? Thanks Ben
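One workaround that avoids a custom sort class, offered here as an assumption rather than as the answer given on the list, is to index the timestamp as a fixed-width, zero-padded string so that string order matches numeric order. The helper below is hypothetical and uses only the JDK.

```java
import java.util.Locale;

// Hypothetical helper: encodes a non-negative millisecond timestamp as a
// 19-character, zero-padded string. Comparing two such strings
// lexicographically gives the same order as comparing the long values.
class SortableDate {
    static String toSortableString(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative timestamps are not handled in this sketch");
        }
        // 19 digits is enough for Long.MAX_VALUE; Locale.ROOT keeps the digits ASCII
        return String.format(Locale.ROOT, "%019d", millis);
    }

    public static void main(String[] args) {
        String earlier = toSortableString(5L);
        String later = toSortableString(10L);
        System.out.println(earlier.compareTo(later) < 0); // prints true
    }
}
```

The padded value could then be stored in its own field and sorted as a string.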

Multiple indexes

2005-03-01 Thread Ben
on jGuru, it just mentions them using multiple indexes. I would like to do something like them. Any resources on the Internet that I can learn from? Thanks, Ben

Re: Multiple indexes

2005-03-01 Thread Ben
Is it true that for each index I have to create a separate instance for FSDirectory, IndexWriter and IndexReader? Do I need to create a separate locking mechanism as well? I have already implemented a program using just one index. Thanks, Ben On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher

PDF Text Stripper

2002-07-09 Thread Ben Litchfield
any bugs or feature requests. The library can be retrieved from http://www.csh.rit.edu/~ben/projects/pdfparser/ -Ben Litchfield

Re: PDF Text Stripper

2002-07-09 Thread Ben Litchfield
Can you send me the PDF document that you are having problems with and I will look into it. There are still some issues that I am working out with the spacing of characters. -Ben On Tue, 9 Jul 2002, Keith Gunn wrote: On Tue, 9 Jul 2002, Ben Litchfield wrote: Hi, I have written a PDF

Re: problems with HTML Parser

2002-08-14 Thread Ben Litchfield
Maurits, You can get a PDF parser from http://www.pdfbox.org -Ben On Wed, 14 Aug 2002, Maurits van Wijland wrote: Keith, I haven't noticed the problem with the Parser...but you trigger me by saying that you have a PDFParser!!! Are you able to contribute this PDFParser?? Maurits

Re: pdfbox on solaris

2002-08-28 Thread Ben Litchfield
. The easiest workaround is to increase the maximum heap size(mhs) of the jvm using the -Xmx option of the jvm. Example: java -Xmx128m java app The default mhs of java is 64m since JDK1.2 so maybe try 128 or 256. -Ben http://www.pdfbox.org On Wed, 28 Aug 2002, Deenesh wrote: Hi, i am using
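To confirm that an -Xmx setting actually took effect, a small check like the one below can be run with the same options as the application. This is only a sketch; the class name is made up for illustration.

```java
// Hypothetical check: reports the maximum heap the JVM will attempt to use,
// which reflects the -Xmx value passed on the command line.
class HeapCheck {
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as `java -Xmx128m HeapCheck` should report a value close to 128, though the exact figure varies by JVM.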

IOException not a directory

2002-10-28 Thread Ben Litchfield
Has anybody seen this type of error before? This used to work and all of a sudden broke. That path is a folder. Ben Litchfield 2002-10-28 12:51:31,109 [Default] java.io.IOException: \\Finsrv04\JBoss-2.4.1_Tomcat-3.2.3\fast_generated_output\lucene\website\index not a directory 2002-10-28 12

Re: PDF Text extraction

2002-12-27 Thread Ben Litchfield
String line = null; while( (line = contentsReader.readLine() ) != null ) { System.out.println( line ); } I have not tested if this compiles but it should be pretty close. Ben Litchfield On Fri, 27 Dec 2002, Suhas Indra wrote: Hello List I am using PDFBox to index some of the PDF documents
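The read loop in that snippet can be tried on its own with a self-contained version such as the one below. Here the reader wraps a plain string instead of PDFBox output; how contentsReader is obtained from a PDF is not shown in the excerpt, so that part is left out, and the class and method names are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the readLine() loop from the message above.
class LineDump {
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<String>();
        BufferedReader contentsReader = new BufferedReader(source);
        try {
            String line = null;
            // readLine() returns null once the end of the stream is reached
            while ((line = contentsReader.readLine()) != null) {
                lines.add(line);
            }
            contentsReader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : readLines(new StringReader("first line\nsecond line"))) {
            System.out.println(line);
        }
    }
}
```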

RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-18 Thread Ben Litchfield
I am aware of the issues with parsing certain PDF documents. I am currently working on refactoring PDFBox to deal with large documents. You will see this in the next release. I would like to thank people for feedback and sending problem documents. Ben Litchfield http://www.pdfbox.org On Tue

[ANN] PDFBox 0.6.0

2003-03-05 Thread Ben Litchfield
would fail with some pdfs with double endobj definitions -Added PDF document summary fields to the lucene document Thank you, Ben Litchfield http://www.pdfbox.org

Re: [ANN] PDFBox 0.6.0

2003-03-06 Thread Ben Litchfield
continue? If you continued, is this the only error that you got? -Ben -- On Thu, 6 Mar 2003, Eric Anderson wrote: Ben- In attempting to use the PDFBox-0.6.0, I rec'd the following error when attempting to scan a reasonably sized PDF repository. Any thoughts? caught a class

Re: [ANN] PDFBox 0.6.0

2003-03-09 Thread Ben Litchfield
I believe this problem has been fixed with 0.6.1. Please give it a try. Ben Litchfield -- On Thu, 6 Mar 2003, Eric Anderson wrote: When it throws the exception, the indexer fails, so I cannot continue the index. It appears that it's only related to some files, as I have been able

Re: getting PDFBox O/P into a stream

2003-03-25 Thread Ben Litchfield
can be called from the command line to create an index. Ben Litchfield -- On Tue, 25 Mar 2003, Ramrakhiani, Vikas wrote: Can some one please help me with the command to get O/P from PDFBox on command line or into streams rather than dumping it into a text file. thanks, vikas

RE: out of memory

2003-04-02 Thread Ben Litchfield
It is possible that it is one single PDF that is having an issue. Can you track it down to that one and let me know which it is. It would be very helpful if you could send it to me as well. Ben http://www.pdfbox.org On Wed, 2 Apr 2003, Eoghan S wrote: i have tried every memory setting

Re: Lucene demo ideas?

2003-09-17 Thread Ben Litchfield
- Index text and HTML files. Any others? What, no PDF files!! Ben -- http://www.pdfbox.org

Re: Does the Lucene search engine work with PDF's?

2003-10-17 Thread Ben Litchfield
You need to be able to extract the text from them and feed that to lucene. http://www.pdfbox.org can extract text from pdf documents. Ben On Fri, 17 Oct 2003, Andre Hughes wrote: Hello, Can the Lucene search engine index and search though PDF documents? What are the file format limits

Re: Exotic format indexing?

2003-10-30 Thread Ben Litchfield
Unfortunately, it is not quite so easy. I am not sure about Word documents but PDFs usually have their contents compressed so a raw fishing around for text would be pointless. Your best bet is to use a package like the one from textmining.org that handles various formats for you. Ben On Thu

Re: Missing pdf document title

2003-11-10 Thread Ben Litchfield
/) to verify that lucene is getting the field. Other than that I would double check your code that gets the Title field correctly. Ben On Mon, 10 Nov 2003, Zhou, Oliver wrote: Hi, I'm using lucene demo IndexHTML.java with pdfbox-0.6.4 to index pdf files. It created the index files. However, the pdf

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield
Yes, just add the log4j configuration. The easiest way to do that is as a system parameter like this java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML -create -index c:\\index .. Where log4j.xml is the path to your log4j config, PDFBox has an example one you can use. Ben

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield
Logging uses log4j and can be configured. If you are having issues with specific PDFs then you can post a bug on the sourceforge site or mail me the PDFs directly and I will look at them. Ben http://www.pdfbox.org On Tue, 25 Nov 2003, Zhou, Oliver wrote: I do have other problems with PDFBox

RE: use Lucene LOCAL (looking for a frontend)

2004-01-28 Thread Ben Keeping
Not being funny, but if you have no experience in Java, then why are you using a Java API for index building/text searching ? -Original Message- From: Sebastian Fey [mailto:[EMAIL PROTECTED] Sent: 28 January 2004 14:01 To: Lucene Users List Subject: RE: use Lucene LOCAL (looking for a

RE: use Lucene LOCAL (looking for a frontend)

2004-01-28 Thread Ben Keeping
For an out of the box job, I found searchblox pretty impressive, and easy to install. -Original Message- From: Sebastian Fey [mailto:[EMAIL PROTECTED] Sent: 28 January 2004 14:23 To: Lucene Users List Subject: AW: use Lucene LOCAL (looking for a frontend) Not being funny, but if you

Re: Indexing japanese PDF documents

2004-03-22 Thread Ben Litchfield
Yes he did, but I was away the past couple days. As this is more of a PDFBox issue I responded in the PDFBox forums, please follow the thread there if you are interested. Ben On Mon, 22 Mar 2004, Otis Gospodnetic wrote: I have not tried these other tools yet. Have you asked Ben Litchfield

Re: Problem while Indexing Pdf files

2004-03-25 Thread Ben Litchfield
The latest release of PDFBox changed the way it dealt with fonts and introduced this bug, please try the version in CVS and let me know if you are still having a problem. Ben On Thu, 25 Mar 2004, Ankur Goel wrote: Hi, I have to index PDF files. For that I am using pdfbox. But when I try

Re: too many files open error

2004-03-26 Thread Ben Litchfield
As PDFBox is an all Java solution there is no specific linux/unix version. The source that is available with the downloaded package should suit your needs. What does the sourceforge site not provide for you? Ben On Fri, 26 Mar 2004, Charlie Smith wrote: Is there another source

building a search query

2004-06-09 Thread Ben Pryor
these two different ways of modeling complex queries (in the addClause method). Is this the best approach? What have others done? Thanks, Ben

Re: PDFBox problem.

2004-07-23 Thread Ben Litchfield
I usually use -Dlog4j.configuration=log4j.xml when invoking java from the command line, but I believe this depends on your environment. ex java -Dlog4j.configuration=log4j.xml org.pdfbox.ExtractText input.pdf Ben On Fri, 23 Jul 2004, Christiaan Fluit wrote: We invoke the following

Re: pdfbox performance.

2004-07-28 Thread Ben Litchfield
Different PDFs will exhibit different extraction speeds because of the way that PDF documents are structured. I assume you are using the latest version 0.6.6, could you give 0.6.5 a try and see if you notice faster speeds. Ben On Thu, 29 Jul 2004, Miroslaw Milewski wrote: Paul Smith wrote

Re: PDFBox Issue

2004-08-17 Thread Ben Litchfield
it appears that you might have an older log4j in your classpath Logger.getLogger( Class ) is available in 1.2.5 and 1.2.8 Ben On Tue, 17 Aug 2004, Don Vaillancourt wrote: Wow, this is an old message. I managed to get my code to work by using the previous version of PDFBox. I had used

Re: Fw: pdf search

2004-08-20 Thread Ben Litchfield
through the intro tutorial to understand how to index/search text using lucene. Ben On Fri, 20 Aug 2004, Santosh wrote: How can I search through PDF? - Original Message - From: Santosh To: Lucene Users List Sent: Friday, August 20, 2004 5:59 PM Subject: pdf search Hi, I am new

Re: integration of lucene with pdfbox

2004-08-23 Thread Ben Litchfield
If you can use lucene on its own then you already know how to add a lucene Document to the index. So you need to be able to take a PDF and get a lucene Document. org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument() does that for you. Ben On Mon, 23 Aug 2004, Santosh wrote: I

Moving from a single server to a cluster

2004-09-07 Thread Ben Sinclair
. I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project. What are other people doing to solve this problem? -- Ben Sinclair [EMAIL PROTECTED]

Re: pdf in Chinese

2004-09-08 Thread Ben Litchfield
This appears to be more of a PDFBox issue than a lucene issue, please post an issue to the PDFBox site. Also note, that because of certain encodings that a PDF writer can use, it is impossible to extract text from all PDF documents. Ben On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote

PDF-Text Performance comparison

2004-09-08 Thread Ben Litchfield
PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well done. http://snowtide.com/home/PDFTextStream/Performance PDFBox: slow PDF text extraction for Java applications http://www.pdfbox.org :) Ben

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Ben Litchfield
I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben

Re: Google Desktop Could be Better

2004-10-16 Thread Ben Litchfield
that the capability of indexing PDF documents would outweigh the extra time for the download. Ben On Sat, 16 Oct 2004, Bill Tschumy wrote: On Oct 16, 2004, at 9:47 PM, Ben Litchfield wrote: types. It uses Lucene underneath. I'm thinking about extending it in the direction that Google Desktop

RE: Highlighting PDF file after the search

2004-09-27 Thread Ben Litchfield
://sourceforge.net/tracker/index.php?func=detail&aid=1035635&group_id=78314&atid=552835 Ben On Mon, 27 Sep 2004 [EMAIL PROTECTED] wrote: Bruce, You are right, i tried this morning and when i try to stream the highlighter output as pdf, acrobat was not able to read or open it!! Which project do you recommend

Re: Need advice: what pdf lib to use?

2004-10-22 Thread Ben Litchfield
://www.etymon.com/ Ben http://www.pdfbox.org On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote: Hello all, I need a piece of advice/experience.. What pdf parser (written in java) u'd recommend? I played now with PDFBox-0.6.7a and would not say I was satisfied too much with it On certain pdf's

Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield
PDFBox does not 'stumble' when it gives that message, that is correct functionality if that permission is not allowed. If your company is willing to pay a 'fortune' why not sponsor a change to an open source project for half a fortune. Ben http://www.pdfbox.org On Mon, 25 Oct 2004 [EMAIL

Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield
not then they are in violation of copyright law. That being said, PDFBox is open source so a user could make modifications to the source code, or as a PDF library could change permissions on a document. Ben On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote: Yes Ben, You are right. This would be correct functionality from

Re: PDF Index Time

2004-11-18 Thread Ben Litchfield
be just as easy to integrate as PDFBox is. They list pricings on their site as well, which is nice that it is not hidden as some software companies do. Ben On Thu, 18 Nov 2004, Luke Shannon wrote: Hi; I am using the PDFBox's getLuceneDocument method to parse my PDF documents. It returns good

.NET Version of Lucene

2004-12-06 Thread Ben Litchfield
://www.gnu.org/software/classpath/license.html for more details. Ben

QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
(unexpected excpetion trying to execute search, e); } } } thanks in advance for any help ben

Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
mechanism or does the SearchIndexer cache the results already? if it caches them already, then to clear the cache, is it again removing any references to the SearchIndexer instance? thanks again, ben On Tue, 2004-07-12 at 15:18 -0500, Erik Hatcher wrote: On Dec 7, 2004, at 3:06 PM, Ben Rooney

Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
? thanks ben On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote: : executes the search, i would keep a static reference to SearchIndexer : and then when i want to invalidate the cache, set it to null or create : design of your system. But, yes, you do need to keep a reference

Re: C# Ports

2004-12-15 Thread Ben Litchfield
I have created a DLL from the lucene jars for use in the PDFBox project. It uses IKVM(http://www.ikvm.net) to create a DLL from a jar. The binary version can be found here http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip This includes the ant script used

Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield
Are you indexing the FOP PDF's differently than other PDF documents? Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() method? Ben On Fri, 21 Jan 2005, Luke Shannon wrote: Hello; Our CMS now allows users to create PDF documents (uses FOP) and then search them. I

Re: Use an executable from java ...

2005-01-31 Thread Ben Litchfield
I will assume you are asking this question on the lucene mailing list because you now want to index that PDF document. Have you tried PDFBox? It can't create an html file for you but it can extract text. Ben http://www.pdfbox.org On Mon, 31 Jan 2005, Bertrand VENZAL wrote: Hi all, I ve

Re: Investingating Lucene For Project

2005-03-01 Thread Ben Litchfield
, but it sounds like your requirements are pretty basic so it shouldn't be that hard. If all the above will work, what kind of license does this require? I have not been able to find a link to that yet on the jakarta site. http://www.apache.org/licenses/LICENSE-2.0 Ben