, it
doesn't make sense for me to see:
company%3Amicrosoft
It does make sense if you display:
company:microsoft
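
A minimal sketch of decoding the query for display, assuming the raw string is standard percent-encoded form data. The class and method names here are made up for illustration; only java.net.URLDecoder is a real API. Note that URLDecoder also turns '+' into a space, which may or may not be what you want.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryDisplay {
    // Decode a percent-encoded query string so it can be shown to the user.
    // (Hypothetical helper; not part of Lucene.)
    public static String forDisplay(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forDisplay("company%3Amicrosoft")); // prints company:microsoft
    }
}
```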
Cheers,
Ben
On Thu, 16 Dec 2004 11:38:20 -0500, Erik Hatcher
[EMAIL PROTECTED] wrote:
Rony - nice work! I subscribed to an alert already.
The wiki is self-serve, just log
after
parsing
PDDocument.load() - A convenience method that does all the PDFParser stuff and
returns a PDDocument
LucenePDFDocument.getDocument() - to go straight from a File/URL to a lucene
document object
Ben
Quoting Daniel Cortes [EMAIL PROTECTED]:
Ok, I'll reply to myself
the method deprecated
Hi
Is it hard to implement a function that displays the search results
excerpts similar to Google?
Is it just string manipulation, or is there some logic behind it? I
like their excerpts.
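
For the plain string-manipulation end of the question, here is a rough sketch: find the first hit of the term and cut a window around it. This is only an illustration with made-up names, not how Google or any Lucene highlighter actually works, and it ignores analysis, stemming and multiple terms.

```java
import java.util.Locale;

public class Excerpt {
    // Returns up to `radius` characters on each side of the first
    // case-insensitive occurrence of `term`. If the term is absent,
    // returns the start of the text instead.
    public static String around(String text, String term, int radius) {
        // Assumes lower-casing does not change string length, which holds
        // for plain ASCII text but not for every Unicode character.
        int hit = text.toLowerCase(Locale.ROOT).indexOf(term.toLowerCase(Locale.ROOT));
        if (hit < 0) {
            return text.length() <= 2 * radius ? text : text.substring(0, 2 * radius) + "...";
        }
        int start = Math.max(0, hit - radius);
        int end = Math.min(text.length(), hit + term.length() + radius);
        StringBuilder sb = new StringBuilder();
        if (start > 0) sb.append("...");
        sb.append(text, start, end);
        if (end < text.length()) sb.append("...");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(around("Lucene is a search library written in Java", "search", 5));
    }
}
```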
Thanks
-
To unsubscribe, e-mail:
this MultiFieldQueryParser with Lucene 1.4.3.
Of course I changed some of the boolean stuff to make it work with
the production release.
Thanks,
Ben
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL
Thanks
On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber
[EMAIL PROTECTED] wrote:
On Saturday 19 February 2005 15:26, Ben wrote:
When I try to search for phrases using the MultiFieldQueryParser v1.8
from CVS, it gives me NullPointerException.
This has just been fixed in SVN (I assume
));
searcher.search(query, new SortField(date, false));
they both return the same order.
Any idea? Thanks.
Ben
Hi
I store my date in milliseconds, how can I do a sort on it? SortField
has INT, FLOAT and STRING. Do I need to create a new sort class, to
sort the long value?
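
One workaround that is sometimes used when only string sorting is available, sketched here as an assumption and not as an answer given on the list: index the milliseconds as a fixed-width, zero-padded string, so that plain string order matches numeric order. The class below is hypothetical, uses no Lucene API, and only handles non-negative values.

```java
import java.util.Locale;

public class SortableLong {
    // Long.MAX_VALUE has 19 decimal digits, so a width of 19 is enough
    // for any non-negative long.
    public static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%019d", millis);
    }

    // Turns the padded string back into the original value.
    public static long decode(String padded) {
        return Long.parseLong(padded);
    }

    public static void main(String[] args) {
        String earlier = encode(9L);
        String later = encode(10L);
        // Without padding, "9" would sort after "10" as a string.
        System.out.println(earlier.compareTo(later) < 0); // prints true
    }
}
```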
Thanks
Ben
on jGuru, it just
mentions them using multiple indexes. I would like to do something
like them.
Any resources on the Internet that I can learn from?
Thanks,
Ben
Is it true that for each index I have to create a separate instance
of FSDirectory, IndexWriter and IndexReader? Do I need to create a
separate locking mechanism as well?
I have already implemented a program using just one index.
Thanks,
Ben
On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher
any bugs or feature requests.
The library can be retrieved from
http://www.csh.rit.edu/~ben/projects/pdfparser/
-Ben Litchfield
Can you send me the PDF document that you are having problems with and I
will look into it.
There are still some issues that I am working out with the spacing of
characters.
-Ben
On Tue, 9 Jul 2002, Keith Gunn wrote:
On Tue, 9 Jul 2002, Ben Litchfield wrote:
Hi,
I have written a PDF
Maurits,
You can get a PDF parser from http://www.pdfbox.org
-Ben
On Wed, 14 Aug 2002, Maurits van Wijland wrote:
Keith,
I haven't noticed the problem with the Parser...but you trigger me
by saying that you have a PDFParser!!!
Are you able to contribute this PDFParser??
Maurits
.
The easiest workaround is to increase the maximum heap size (mhs) of the
JVM using the -Xmx option.
Example:
java -Xmx128m java app
The default mhs of java is 64m since JDK1.2 so maybe try 128 or 256.
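
To check that the setting actually took effect, a small sketch like the following can be run with the same -Xmx flag. HeapCheck is a made-up class name; Runtime.maxMemory() is the standard call. The reported figure can come out a little below the value passed on the command line.

```java
public class HeapCheck {
    // Maximum heap the JVM will try to use, in megabytes.
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

For example: java -Xmx128m HeapCheck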
-Ben
http://www.pdfbox.org
On Wed, 28 Aug 2002, Deenesh wrote:
Hi,
i am using
Has anybody seen this type of error before? This used to work and all of
a sudden broke. That path is a folder.
Ben Litchfield
2002-10-28 12:51:31,109 [Default] java.io.IOException:
\\Finsrv04\JBoss-2.4.1_Tomcat-3.2.3\fast_generated_output\lucene\website\index
not a directory
2002-10-28 12
String line = null;
while( (line = contentsReader.readLine() ) != null )
{
System.out.println( line );
}
I have not tested if this compiles but it should be pretty close.
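
For reference, a self-contained version of the same read loop that does compile. Since the part that creates contentsReader is cut off above, this sketch assumes it is a BufferedReader and feeds it from a StringReader as a stand-in for the real source; the class name is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReadLines {
    // Reads every line from the given reader and closes it afterwards.
    public static List<String> readAll(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader contentsReader = new BufferedReader(source)) {
            String line = null;
            while ((line = contentsReader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // StringReader stands in for whatever stream holds the extracted text.
        for (String line : readAll(new StringReader("first line\nsecond line"))) {
            System.out.println(line);
        }
    }
}
```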
Ben Litchfield
On Fri, 27 Dec 2002, Suhas Indra wrote:
Hello List
I am using PDFBox to index some of the PDF documents
I am aware of the issues with parsing certain PDF documents. I am
currently working on refactoring PDFBox to deal with large documents. You
will see this in the next release. I would like to thank people for
feedback and sending problem documents.
Ben Litchfield
http://www.pdfbox.org
On Tue
would fail with some pdfs with double endobj
definitions
-Added PDF document summary fields to the lucene document
Thank you,
Ben Litchfield
http://www.pdfbox.org
continue? If you continued, is this the
only error that you got?
-Ben
--
On Thu, 6 Mar 2003, Eric Anderson wrote:
Ben-
In attempting to use the PDFBox-0.6.0, I rec'd the following error when
attempting to scan a reasonably sized PDF repository.
Any thoughts?
caught a class
I believe this problem has been fixed with 0.6.1. Please give it a try.
Ben Litchfield
--
On Thu, 6 Mar 2003, Eric Anderson wrote:
When it throws the exception, the indexer fails, so I cannot continue the index.
It appears that it's only related to some files, as I have been able
can be called from the command line to create an
index.
Ben Litchfield
--
On Tue, 25 Mar 2003, Ramrakhiani, Vikas wrote:
Can someone please help me with the command to get O/P from PDFBox on
command line or into streams rather than dumping it into a text file.
thanks,
vikas
It is possible that it is one single PDF that is having an issue. Can you
track it down to that one and let me know which it is? It would be very
helpful if you could send it to me as well.
Ben
http://www.pdfbox.org
On Wed, 2 Apr 2003, Eoghan S wrote:
i have tried every memory setting
- Index text and HTML files. Any others?
What, no PDF files!!
Ben
--
http://www.pdfbox.org
You need to be able to extract the text from them and feed that to lucene.
http://www.pdfbox.org can extract text from pdf documents.
Ben
On Fri, 17 Oct 2003, Andre Hughes wrote:
Hello,
Can the Lucene search engine index and search through PDF documents?
What are the file format limits
Unfortunately, it is not quite so easy. I am not sure about Word
documents, but PDFs usually have their contents compressed so a raw
fishing around for text would be pointless. Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.
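
To see why scanning the raw bytes does not work, here is a small sketch using java.util.zip.Deflater. PDF content streams are commonly Flate-compressed, which is the same algorithm; everything else here (class name, sample text) is made up for the demonstration and has nothing to do with how a real PDF is laid out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressedTextDemo {
    // Compresses the input with the default Deflater settings.
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Naive "fishing": treat the bytes as Latin-1 and look for the word.
    public static boolean containsAscii(byte[] data, String needle) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append("Lucene indexes text. ");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
        byte[] packed = deflate(raw);
        System.out.println("found in raw bytes:        " + containsAscii(raw, "indexes"));
        System.out.println("found in compressed bytes: " + containsAscii(packed, "indexes"));
    }
}
```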
Ben
On Thu
/) to verify that lucene is
getting the field.
Other than that I would double check your code that gets the Title field
correctly.
Ben
On Mon, 10 Nov 2003, Zhou, Oliver wrote:
Hi,
I'm using lucene demo IndexHTML.java with pdfbox-0.6.4 to index pdf files.
It created the index files. However, the pdf
Yes, just add the log4j configuration. The easiest way to do that is as a
system parameter like this
java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML
-create -index c:\\index ..
Where log4j.xml is the path to your log4j config, PDFBox has an example
one you can use.
Ben
Logging uses log4j and can be configured. If you are having issues with
specific PDFs then you can post a bug on the sourceforge site or mail me
the PDFs directly and I will look at them.
Ben
http://www.pdfbox.org
On Tue, 25 Nov 2003, Zhou, Oliver wrote:
I do have other problems with PDFBox
Not being funny, but if you have no experience in Java, then why are you using a Java
API for index building/text searching ?
-Original Message-
From: Sebastian Fey [mailto:[EMAIL PROTECTED]
Sent: 28 January 2004 14:01
To: Lucene Users List
Subject: RE: use Lucene LOCAL (looking for a
For an out of the box job, I found searchblox pretty impressive, and easy to install.
-Original Message-
From: Sebastian Fey [mailto:[EMAIL PROTECTED]
Sent: 28 January 2004 14:23
To: Lucene Users List
Subject: AW: use Lucene LOCAL (looking for a frontend)
Not being funny, but if you
Yes he did, but I was away the past couple days. As this is more of a
PDFBox issue I responded in the PDFBox forums, please follow the thread
there if you are interested.
Ben
On Mon, 22 Mar 2004, Otis Gospodnetic wrote:
I have not tried these other tools yet.
Have you asked Ben Litchfield
The latest release of PDFBox changed the way it dealt with fonts and
introduced this bug, please try the version in CVS and let me know if you
are still having a problem.
Ben
On Thu, 25 Mar 2004, Ankur Goel wrote:
Hi,
I have to index PDF files. For that I am using pdfbox. But when I try
As PDFBox is an all Java solution there is no specific linux/unix version.
The source that is available with the downloaded package should suit your
needs. What does the sourceforge site not provide for you?
Ben
On Fri, 26 Mar 2004, Charlie Smith wrote:
Is there another source
these two different ways of modeling complex queries (in the
addClause method). Is this the best approach? What have others done?
Thanks,
Ben
I usually use -Dlog4j.configuration=log4j.xml when invoking java from
the command line, but I believe this depends on your environment.
ex
java -Dlog4j.configuration=log4j.xml org.pdfbox.ExtractText input.pdf
Ben
On Fri, 23 Jul 2004, Christiaan Fluit wrote:
We invoke the following
Different PDFs will exhibit different extraction speeds because of the way
that PDF documents are structured.
I assume you are using the latest version, 0.6.6. Could you give 0.6.5 a
try and see if you notice faster speeds?
Ben
On Thu, 29 Jul 2004, Miroslaw Milewski wrote:
Paul Smith wrote
it appears that you might have
an older log4j in your classpath
Logger.getLogger( Class ) is available in 1.2.5 and 1.2.8
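
One way to find out which jar a class is really coming from is to ask its protection domain, roughly as sketched below. WhereFrom is a made-up helper name. In an application with log4j on the classpath you would pass org.apache.log4j.Logger.class; the example uses a JDK class only so that it runs without extra jars, and JDK classes usually report no location.

```java
import java.security.CodeSource;

public class WhereFrom {
    // Returns the URL of the jar or directory a class was loaded from,
    // or a placeholder when the JVM does not report one.
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        System.out.println(locationOf(WhereFrom.class));
        System.out.println(locationOf(String.class));
    }
}
```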
Ben
On Tue, 17 Aug 2004, Don Vaillancourt wrote:
Wow, this is an old message.
I managed to get my code to work by using the previous version of
PDFBox. I had used
through the intro tutorial to understand how to index/search text using
lucene.
Ben
On Fri, 20 Aug 2004, Santosh wrote:
How can I search through PDF?
- Original Message -
From: Santosh
To: Lucene Users List
Sent: Friday, August 20, 2004 5:59 PM
Subject: pdf search
Hi,
I am new
If you can use lucene on its own then you already know how to add a lucene
Document to the index. So you need to be able to take a PDF and get a
lucene Document.
org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument()
does that for you.
Ben
On Mon, 23 Aug 2004, Santosh wrote:
I
.
I looked into JDBC Directory, but it's not tested under Oracle and
doesn't seem like a very mature project.
What are other people doing to solve this problem?
--
Ben Sinclair
[EMAIL PROTECTED]
This appears to be more of a PDFBox issue than a lucene issue, please post
an issue to the PDFBox site.
Also note that, because of certain encodings that a PDF writer can use, it
is not possible to extract text from every PDF document.
Ben
On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote
PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well
done.
http://snowtide.com/home/PDFTextStream/Performance
PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org
:)
Ben
I can say that gc is not collecting these objects since I forced gc
runs when indexing every now and then (when parsing pdf-type objects,
that is): No effect.
What PDF parser are you using? Is the problem within the parser and not
lucene? Are you releasing all resources?
Ben
that
the capability of indexing PDF documents would outweigh the extra time for
the download.
Ben
On Sat, 16 Oct 2004, Bill Tschumy wrote:
On Oct 16, 2004, at 9:47 PM, Ben Litchfield wrote:
types. It uses Lucene underneath. I'm thinking about extending it in
the direction that Google Desktop
://sourceforge.net/tracker/index.php?func=detail&aid=1035635&group_id=78314&atid=552835
Ben
On Mon, 27 Sep 2004 [EMAIL PROTECTED] wrote:
Bruce,
You are right, i tried this morning and when i try to stream the
highlighter output as pdf, acrobat was not able to read or open it!!
Which project do you recommend
://www.etymon.com/
Ben
http://www.pdfbox.org
On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote:
Hello all,
I need a piece of advice/experience..
What pdf parser (written in java) u'd recommend?
I played now with PDFBox-0.6.7a and would not say I was satisfied too much
with it
On certain pdf's
PDFBox does not 'stumble' when it gives that message, that is correct
functionality if that permission is not allowed.
If your company is willing to pay a 'fortune', why not sponsor a change to
an open source project for half a fortune?
Ben
http://www.pdfbox.org
On Mon, 25 Oct 2004 [EMAIL
not
then they are in violation of copyright law.
That being said, PDFBox is open source, so a user could make modifications
to the source code or, since it is a PDF library, use it to change the
permissions on a document.
Ben
On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
Yes Ben, You are right.
This would be correct functionality from
be just as easy to integrate as
PDFBox is. They list pricing on their site as well; it is nice that
it is not hidden, as some software companies do.
Ben
On Thu, 18 Nov 2004, Luke Shannon wrote:
Hi;
I am using the PDFBox's getLuceneDocument method to parse my PDF
documents. It returns good
://www.gnu.org/software/classpath/license.html for more details.
Ben
(unexpected exception trying to execute search,
e);
}
}
}
thanks in advance for any help
ben
mechanism or does
the SearchIndexer cache the results already? if it caches them already,
then to clear the cache, is it again removing any references to the
SearchIndexer instance?
thanks again,
ben
On Tue, 2004-07-12 at 15:18 -0500, Erik Hatcher wrote:
On Dec 7, 2004, at 3:06 PM, Ben Rooney
?
thanks
ben
On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote:
: executes the search, i would keep a static reference to SearchIndexer
: and then when i want to invalidate the cache, set it to null or create
: design of your system. But, yes, you do need to keep a reference
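
The "keep a reference, drop it to invalidate" idea from the quoted text can be sketched roughly as follows. This is a generic holder with made-up names, not the poster's SearchIndexer and not a Lucene class; an instance of it could be kept in a static field as described above.

```java
import java.util.function.Supplier;

public class CachedReference<T> {
    private final Supplier<T> factory;
    private T instance;

    public CachedReference(Supplier<T> factory) {
        this.factory = factory;
    }

    // Creates the object on first use and hands back the same one afterwards.
    public synchronized T get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    // Drops the reference; the next get() builds a fresh object.
    public synchronized void invalidate() {
        instance = null;
    }

    public static void main(String[] args) {
        CachedReference<Object> ref = new CachedReference<>(Object::new);
        Object first = ref.get();
        System.out.println(first == ref.get()); // prints true
        ref.invalidate();
        System.out.println(first == ref.get()); // prints false
    }
}
```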
I have created a DLL from the lucene jars for use in the PDFBox project.
It uses IKVM(http://www.ikvm.net) to create a DLL from a jar.
The binary version can be found here
http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip
This includes the ant script used
Are you indexing the FOP PDFs differently than other PDF documents?
Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
method?
Ben
On Fri, 21 Jan 2005, Luke Shannon wrote:
Hello;
Our CMS now allows users to create PDF documents (uses FOP) and then search
them.
I
I will assume you are asking this question on the lucene mailing list
because you now want to index that PDF document.
Have you tried PDFBox? It can't create an html file for you but it can
extract text.
Ben
http://www.pdfbox.org
On Mon, 31 Jan 2005, Bertrand VENZAL wrote:
Hi all,
I ve
, but it
sounds like your requirements are pretty basic so it shouldn't be that
hard.
If all the above will work, what kind of license does this require? I
have not been able to find a link to that yet on the jakarta site.
http://www.apache.org/licenses/LICENSE-2.0
Ben