Re: Highlighting problem

2004-03-02 Thread Matt Tucker
Clandes,

I'm not sure what the "body" field is that you're indexing. If it's a 
database record, one option may be to store the database primary key for 
the record as a field value -- basically a document ID. In that case, 
you would do the search and get back matching document IDs. You could 
then pull the bodies from the database and send them through the 
highlighting routine without ever needing to store the full body value in 
Lucene.
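
A minimal sketch of that flow, under stated assumptions: the class name, the BODIES map (standing in for the database), searchIds (standing in for a Lucene search that returns stored document IDs), and the naive highlight method (standing in for the highlighting package) are all hypothetical. It only illustrates the shape of the approach, not any real Lucene or highlighter API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: the index knows only document IDs,
// while the full bodies live in an external store.
class BodyStoreHighlightSketch {

    // Stand-in for the database table, keyed by primary key.
    static final Map<String, String> BODIES = new LinkedHashMap<>();

    // Stand-in for a search that returns the IDs of matching documents.
    static List<String> searchIds(String term) {
        List<String> ids = new ArrayList<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> row : BODIES.entrySet()) {
            if (row.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                ids.add(row.getKey());
            }
        }
        return ids;
    }

    // Naive stand-in for a highlighter: wraps each case-insensitive
    // occurrence of the term in <b> tags.
    static String highlight(String body, String term) {
        if (term.isEmpty()) {
            return body;
        }
        String lowerBody = body.toLowerCase(Locale.ROOT);
        String lowerTerm = term.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = lowerBody.indexOf(lowerTerm, from)) >= 0) {
            sb.append(body, from, hit);
            sb.append("<b>").append(body, hit, hit + term.length()).append("</b>");
            from = hit + term.length();
        }
        sb.append(body, from, body.length());
        return sb.toString();
    }

    // Search first, then fetch each body by ID and highlight it.
    static List<String> highlightedHits(String term) {
        List<String> results = new ArrayList<>();
        for (String id : searchIds(term)) {
            results.add(highlight(BODIES.get(id), term));
        }
        return results;
    }
}
```

The trade-off is one extra database lookup per displayed hit in exchange for a smaller index.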

Regards,
Matt
Clandes Tino wrote:

Hi all, 
I have incorporated highlighting package
(http://home.clara.net/markharwood/lucene/highlight.htm)
but I am worried about the following issue.

If I want to display the best segments of the "body" field
content, with the terms from the query highlighted, I
have to define the "body" Field as Stored.
So, complete process would be like this:
Index related work:
1. parse uploaded document into temp ASCII file
2. read ASCII file and append its content to String 
3. make Field as Text(String name, String value)

Search related work:
1. Retrieve the String value of the “body” field from the hit
(again, the only way to do this, as I understand it,
is to declare the “body” Field as Stored)
2. pass the String value to Highlighter methods.
 
Besides that, I have read in the Lucene FAQ that “body”
fields are not good candidates to be declared as
Stored. Index size is one obvious reason, but I am
wondering how it affects Lucene search performance in
general.

Does anybody have an idea how to provide highlighting
for an Unstored Field?
Regards and thanks in advance,
Milan 



	
	
		

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Did you mean...

2004-02-11 Thread Matt Tucker
Timo,

We implemented that type of system using a spelling engine by Wintertree:

http://www.wintertree-software.com

There are some free Java spelling packages out there too that you could 
likely use.
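
For anyone curious about the general idea, here is a rough sketch of a "did you mean" suggestion using plain Levenshtein edit distance against a word list. This is not the Wintertree engine or any Lucene API; the class, the method names, and the distance threshold are assumptions made for illustration only.

```java
import java.util.List;

// Hypothetical sketch: suggest the closest known word for a mistyped one.
class DidYouMeanSketch {

    // Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest dictionary word within maxDistance edits, or null
    // when the typed word is already known or nothing is close enough.
    static String suggest(String typed, List<String> dictionary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String word : dictionary) {
            int d = editDistance(typed, word);
            if (d == 0) {
                return null; // exact match, no correction needed
            }
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        return best;
    }
}
```

In practice the word list could be drawn from the terms already in the index, so that suggestions always lead to results; that choice is also an assumption, not something described above.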

Regards,
Matt
[EMAIL PROTECTED] wrote:

Hi!

Can I do things like Google's "Did you mean...?" correction for mistyped words 
with Lucene?

Warm Regards,
Timo


Re: Lucene 1.3 Release Schedule

2003-07-15 Thread Matt Tucker
Scott,

Not sure on the official release schedule, but I don't think there's any 
reason to wait to upgrade. We've been using 1.3 CVS builds and then the 
RC in Jive Forums for many, many months and it's a very solid 
improvement over 1.2.

-Matt

Scott Farquhar wrote:
Morning all!

Is there a document where I can find the release schedule for Lucene 
1.3, or a list of issues that are still remaining to be fixed?

I would like to upgrade our product (JIRA) to a later Lucene version, 
and I was waiting for 1.3 to be released.  However, I see that it has 
been in rc1 for 4 months - so I was curious to hear if I should wait for 
a final release.

Thanks for your time.  If there is a url / document answering my 
concerns - please feel free to just point me at that.

Cheers,
Scott




Re: java.io.IOException: Cannot delete deletetable

2003-06-19 Thread Matt Tucker
Rob,

Yep, we've only ever seen the error on Windows. And, yes, 1.3 RC1 has 
the fix but 1.2 does not.

Regards,
Matt
Rob Outar wrote:

We use Windows and Linux, but I have only seen this error on Windows so far.
I will check the JAR file I am using to make sure it is the most recent. I
am assuming the most recent is Lucene 1.3 RC1?
Thanks,

Rob




Re: java.io.IOException: Cannot delete deletetable

2003-06-19 Thread Matt Tucker
Rob,

Are you using the very latest Lucene code? The standard File.renameTo 
operation fails every once in a while, especially on Windows. I sent in a 
patch that was put in somewhat recently. It fixed all the errors we were 
seeing with renames.

Regards,
Matt
Rob Outar wrote:

Hi all,

I am intermittently getting the above exception while building an index.  I
have been trying for an hour to reproduce it but can't as of yet.  But in
any case I was wondering if anyone knew anything about the above error and
if so how to stop it from occurring.  In the stack trace I printed out, it
looked like it was in the rename method of FSDirectory that the exception
occurred.  As soon as I can replicate I will post the exception and any
additional information requested.
Thanks as always,

Rob



Re: Finding out which field caused the hit?

2003-05-27 Thread Matt Tucker
Dan,

I don't have an answer to this question, unfortunately, but just wanted 
to say that we'd really love to see a better API for this too. :)

Regards,
Matt
Armbrust, Daniel C. wrote:

Is there a (better) way to figure out which field in a document caused the document to be returned from a query?  Currently, after I do a search across all of my fields and documents, I re-run the search on each document that had a hit, against each field individually, keeping track of the scores.  The highest-scoring field is the one that I credit with returning the document. 

This is fine for a small index, with a small number of fields, but it definitely doesn't seem like the correct way to go about getting this information.
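
The bookkeeping step of that workaround can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the per-field scores are assumed to have been collected already by re-running the query against each field.

```java
import java.util.Map;

// Hypothetical sketch of the "credit the highest-scoring field" step.
class BestFieldSketch {

    // fieldScores maps a field name to the score the document received
    // when the query was run against that field alone.
    // Returns the field with the highest score, or null for an empty map.
    static String bestField(Map<String, Float> fieldScores) {
        String best = null;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (Map.Entry<String, Float> entry : fieldScores.entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The cost grows with hits times fields, since every hit needs one extra query per field, which is why it only suits small indexes.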

Any suggestions would be appreciated, 

Thanks, 

Dan





RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-18 Thread Matt Tucker
Rob,

We ran into this problem too, and our solution was to use a native PDF
text extractor (PDFBox just can't seem to handle large PDFs well).
Basically, we try to parse with the native app first, and if that fails,
we parse with PDFBox. We used:

http://www.foolabs.com/xpdf/

A code snippet for using this is:

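// Requires: import java.io.*;
// PATH_TO_XPDF: path to the xpdf text-extraction executable
// (presumably pdftotext); filename: the PDF to convert.
String[] cmd = new String[] {
    PATH_TO_XPDF,
    "-enc", "UTF-8", "-q", filename, "-" };  // "-" sends the text to stdout
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
// Declared here so the snippet stands alone (it was undeclared originally).
StringWriter out = new StringWriter();
char[] buf = new char[512];
int len;
// Copy the extracted text until the process closes its output.
while ((len = reader.read(buf)) >= 0) {
    out.write(buf, 0, len);
}
reader.close();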
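
The "native app first, PDFBox second" order described above could be wired up roughly like this. It is a sketch under assumptions: the TextExtractor interface and the extractWithFallback method are hypothetical names, and the two extractors are passed in by the caller rather than tied to xpdf or PDFBox.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of a try-primary-then-fallback extraction strategy.
class ExtractorFallbackSketch {

    // Anything that can turn a file into text, e.g. a wrapper around an
    // external process or around a Java PDF library.
    interface TextExtractor {
        String extract(String filename) throws IOException;
    }

    // Tries the primary extractor; if it fails, tries the fallback.
    // A failure of the fallback is rethrown as an unchecked exception.
    static String extractWithFallback(String filename,
                                      TextExtractor primary,
                                      TextExtractor fallback) {
        try {
            return primary.extract(filename);
        } catch (IOException | RuntimeException primaryFailure) {
            try {
                return fallback.extract(filename);
            } catch (IOException fallbackFailure) {
                throw new UncheckedIOException(fallbackFailure);
            }
        }
    }
}
```

Keeping the extractors behind one small interface means the order can be swapped, or a third extractor added, without touching the indexing code.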

Regards,
Matt

> -----Original Message-----
> From: Pinky Iyer [mailto:[EMAIL PROTECTED]] 
> Sent: Tuesday, February 18, 2003 5:23 PM
> To: Lucene Users List
> Subject: RE: OutOfMemoryException while Indexing an XML file/PdfParser
> 
> 
> 
> I am having a similar problem, but when indexing PDF documents using 
> the PDFBox parser (available at www.pdfbox.com). I get an 
> exception saying 'Exception in thread "main" 
> java.lang.OutOfMemoryError'. Has anybody implemented the 
> above code? Any help appreciated. Thanks! PI
> 
> Rob Outar <[EMAIL PROTECTED]> wrote:
> We are aware of DOM limitations/memory problems, but I am using SAX 
> to parse the file and index elements and attributes in my content handler.
> 
> Thanks,
> 
> Rob
> 
> -----Original Message-----
> From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 14, 2003 8:18 PM
> To: Lucene Users List
> Subject: Re: OutOfMemoryException while Indexing an XML file
> 
> 
> On Friday 14 February 2003 07:27, Aaron Galea wrote:
> > I had this problem when using xerces to parse xml documents. The 
> > problem I think lies in the Java garbage collector. The way I solved 
> > it was to create
> 
> It's unlikely that GC is the culprit. Current ones are good 
> at purging objects that are unreachable, and only throw 
> OutOfMem exception when they really have no other choice. 
> Usually it's the app that has some dangling references to 
> objects that prevent GC from collecting objects not useful any more.
> 
> However, it's good to note that Xerces (and DOM parsers in 
> general) generally use more memory than the input XML files 
> they process; this is because they usually have to keep the 
> whole document structure in memory, and there is overhead on top 
> of text segments. So it's likely to be at least 2 * input 
> file size (files usually use UTF-8, which most of the time 
> uses 1 byte per char; in memory, 16-bit Unicode (UCS-2) chars are 
> used for performance), plus some additional overhead for 
> storing element structure information and all that.
> 
> And since the default max. Java heap size is 64 megs, big XML 
> files can cause problems.
> 
> More likely, however, is that references to already processed 
> DOM trees are not nulled in a loop or something like that. 
> Especially if running one JVM process per item solves the problem.
> 
> > a shell script that invokes a java program for each xml file that adds 
> > it to the index.
> 
> -+ Tatu +-
> 
> 






RE: corrupted index

2002-03-17 Thread Matt Tucker

Hey all,

Actually, using shutdown hooks might not be the best idea since Lucene is very 
often used in server-side Java environments. Many app-servers throw security 
errors when trying to add shutdown hooks, and I've seen Weblogic crash before 
when having them in a webapp. Has anyone else run into this?
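
For setups where a hook is still worth trying, a guarded registration might look like the sketch below. The class and method names are hypothetical, and the cleanup work (for example, closing whatever index resources are open) is left to the caller as a Runnable.

```java
// Hypothetical sketch: register a JVM shutdown hook, but treat a refusal
// by the container as a normal outcome instead of a fatal error.
class ShutdownHookSketch {

    // Returns true if the hook was registered, false if the environment
    // refused (security policy) or the JVM is already shutting down.
    static boolean tryRegister(Runnable cleanup) {
        try {
            Runtime.getRuntime().addShutdownHook(new Thread(cleanup, "index-cleanup"));
            return true;
        } catch (SecurityException | IllegalStateException refused) {
            return false;
        }
    }
}
```

A caller that gets false back would need some other cleanup path, such as a lifecycle callback offered by the app server.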

This all brings up a key issue with Lucene, which is that there is little way 
to recover from errors gracefully. I'd love to see a number of checked 
exceptions added. For example:

 IndexNotFoundException -- when trying to open an index that doesn't exist
 IndexLockedException -- when a lock file prevents you from getting an index
 IndexCorruptException -- maybe this would be thrown when an index appears to 
be broken?

At the moment, Lucene throws many undocumented IOExceptions and even 
NullPointerExceptions when an error case comes up. I catch these in my app, but 
there's really not an intelligent way to recover from them. Adding checked 
exceptions would be a change of the API, but it seems worth it. I'd be happy to 
make a more specific proposal if other people feel like this would be a 
worthwhile direction to go in.
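
To make the idea concrete, here is a rough sketch of what such exceptions could look like. None of these classes exist in Lucene as discussed here; the names come from the list above, and making them subclasses of IOException is an assumption chosen so that existing catch (IOException e) blocks would keep working.

```java
import java.io.IOException;

// Hypothetical: thrown when opening an index that does not exist.
class IndexNotFoundException extends IOException {
    IndexNotFoundException(String message) {
        super(message);
    }
}

// Hypothetical: thrown when a lock file prevents access to an index.
class IndexLockedException extends IOException {
    IndexLockedException(String message) {
        super(message);
    }
}

// Hypothetical: thrown when an index appears to be broken.
class IndexCorruptException extends IOException {
    IndexCorruptException(String message) {
        super(message);
    }
}

// Shows how calling code could choose a recovery strategy per error type.
class IndexErrorHandlingSketch {
    static String describe(IOException e) {
        if (e instanceof IndexNotFoundException) {
            return "not found: create a new index";
        }
        if (e instanceof IndexLockedException) {
            return "locked: retry later or clear a stale lock";
        }
        if (e instanceof IndexCorruptException) {
            return "corrupt: rebuild the index";
        }
        return "unknown I/O error: " + e.getMessage();
    }
}
```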

Regards,
Matt

Quoting "Spencer, Dave" <[EMAIL PROTECTED]>:

> Runtime.addShutdownHook:
> 
> 
> 
> http://java.sun.com/j2se/1.3/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread)
> 
> -----Original Message-----
> From: Otis Gospodnetic [ mailto:[EMAIL PROTECTED]]
> Sent: Sunday, March 17, 2002 12:06 AM
> To: Lucene Users List
> Subject: Re: corrupted index
> 
> 
> Oh, I just thought of something (wine does body good).
> Perhaps one could use Runtime (the class) to catch the JVM shutdown and
> do whatever is needed to prevent index corruption.  I believe there are
> some shutdown hook methods in there that may let you do that.  I'm too
> lazy to look up the API docs now, but I remember reading about that
> once, and perhaps it was even mentioned on one of the 2 Lucene mailing
> lists.
> 
> On the other hand, it would be great to have a tool that can verify an
> existing index.  I don't know enough about the actual file structure
> yet to write something like that, but maybe somebody else has done that
> already or would like to contribute.
> 
> Otis
> 
> 
> --- "Steven J. Owens" <[EMAIL PROTECTED]> wrote:
> > Otis,
> >
> > > You can remove the .lock file and try re-indexing or continuing
> > > indexing where you left off.
> > > I am not sure about the corrupt index.  I have never seen it happen,
> > > and I believe I recall reading some messages from Doug Cutting saying
> > > that index should never be left in an inconsistent state. 
> >
> >  Obviously never "should" be, but if something's pulling the rug
> > out from under his JRE, changes could be only partially written,
> > right? 
> >
> >  Or is the writing format in some sense transactionally safe?
> > I've never worked directly on something like this, but I worked at a
> > database software company where they used transaction semantics and a
> > journaling scheme to fake a "bulletproof" file system.  Is this how
> > the index-writing code is implemented?
> >
> >  In general, I can guess Doug's response - just torch the old
> > index directory and rebuild it; Lucene's indexing is fast enough that
> > you don't need to get clever.  This seems to be Doug's stance in
> > general (i.e. "don't get fancy, I already put all the fanciness you'll
> > need into extremely fast indexing and searching").  So far, it seems
> > to work :-).
> >
> > > I could be making this up, though, so I suggest you search through
> > > lucene-user and lucene-dev archives on www.mail-archive.com.
> > > A search for "corrupt" should do it.
> > > Once you figure things out maybe you can post a summary here.
> >
> >  I got a little curious, so I went and did the searches.  There is
> > exactly one message in each list archive (dev and users) with the
> > keyword "corrupt" in it.  The lucene-users instance is irrelevant:
> >
> > http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html
> >
> >  The lucene-dev instance is more useful:
> >
> > http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html
> >
> >  It's a post from Doug, dated Sept 27, 2001, about adding not just
> > thread-safety but process-safety:
> >
> >   It should be impossible to corrupt an index through the Lucene API.
> >   However if a Lucene process exits unexpectedly it can leave the index
> >   locked.  The remedy is simply to, at a time when it is certain that no
> >   processes are accessing the index, remove all lock files.
> >  
> >  So it sounds like it's worth trying just removing the lock files.
> > Hm, is there a way to come up with a "sanity check" you can run on an
> > index to make sure it's not corrupted?  This might be an excellent
> > thing to reassure yourself with: something went wrong?  Run a sanity
> > check, if it fails just reindex.
> >
> > Steven J. Owens
>

Questions on index locking

2002-01-29 Thread Matt Tucker

Hey all,

I'm integrating the newest version of Lucene into our codebase, and ran
into a few questions about the directory locking. First, I'd like to
suggest that it might help to add some comments to the Javadocs of
IndexReader and IndexWriter about when directories are locked and what
it means. Second, is there a particular exception I can catch when
trying to open a directory, to handle the case where a lock is already
held on that dir? Would it make sense to add a DirectoryLockedException
or something similar?

Regards,
Matt






RE: JDK 1.1 vs 1.2+

2002-01-22 Thread Matt Tucker

Hey all,

I'd just like to chime in with support for dropping JDK 1.1, especially if it
would aid i18n in Lucene. There just doesn't seem to be a compelling
reason to build anything for JDK 1.1 anymore.

Regards,
Matt
Jive Software

> -----Original Message-----
> From: Andrew C. Oliver [mailto:[EMAIL PROTECTED]] 
> Sent: Tuesday, January 22, 2002 10:52 AM
> To: Lucene Users List
> Subject: JDK 1.1 vs 1.2+
> 
> 
> Hello everyone,
> 
> I originally posted this question to the developers list, but 
> was asked to repeat it here.
> 
> I'm working on some new functionality I plan to submit for 
> Lucene.  In doing this I've noticed that Lucene currently 
> maintains compatibility with JDK 1.1.  This has some 
> disadvantages, for instance the use of Vector versus some of 
> the newer collections.  Next, some of the functionality I plan 
> to add requires JDK 1.2.  Finally, some of the 
> internationalization features of Java do not work well in 
> 1.1.  For these reasons I suggest a move to 1.2+.  While it 
> seems reasonable to me to drop support for a >4 year old 
> version of the JDK, I realize it may still present a problem 
> to some users and would like to raise a discussion on this.
> 
> How many people are still using 1.1 and would be negatively 
> affected by Lucene's use of 1.2 features?  Of those, how many 
> people can not move to 1.2 for server side development?
> 
> -Andy
> -- 
> www.superlinksoftware.com
> www.sourceforge.net/projects/poi - port of Excel format to 
> java 
> http://developer.java.sun.com/developer/bugParade/bugs/4487555.html
> - fix java generics!


The avalanche has already started. It is too late for the pebbles to
vote. -Ambassador Kosh






Re: Industry Use of Lucene?

2001-12-06 Thread Matt Tucker

Jeff,

We've been using Lucene as a part of our product Jive Forums 
(http://www.jivesoftware.com) for about a year now. So, that means it's in use 
on thousands upon thousands of our customers' production websites including 
Sun, Sony Music, and many others. I'd be happy to provide more detailed 
information about our integration if you'd like. Lucene has been great for us 
and we never mind raving about it. :)

Regards,
Matt

Quoting Jeff Kunkle <[EMAIL PROTECTED]>:

> Does anyone know of any companies or agencies using Lucene for their
> products/projects?  I am attempting to make a marketing pitch for Lucene to
> my manager and I know one of the first questions will be, "Who else is using
> it?"  I know Lucene is a very powerful, fast, and flexible full-text search
> engine but my manager will need a little more coercing.  Any help on this
> topic is greatly appreciated.
> 
> Thanks,
> Jeff
> 





RE: javac -O ?

2001-11-28 Thread Matt Tucker

Winton,

I'm not sure that javac -O actually does anything. From the 1.3 tool
documentation: 

"Note: the -O option does nothing in the current implementation of javac
and oldjavac."

In fact, the JDK 1.4 tool documentation doesn't even mention the -O
option (even though "javac -help" still lists the option).

With regard to your earlier email about JVM optimization -- you may want
to check out the JRockit JVM if you have a chance. I haven't used it
yet, but the features sound interesting for server-side Java
performance. http://www.jrockit.com

Regards,
Matt

> -----Original Message-----
> From: Winton Davies [mailto:[EMAIL PROTECTED]] 
> Sent: Wednesday, November 28, 2001 4:51 PM
> To: Lucene Users List
> Subject: javac -O ?
> 
> 
> Hi,
> 
>   Is the nightly build compiled optimized? If not, has anyone ever 
> tried compiling optimized and using that? Does it help improve 
> performance? It would seem to me that, given the compute-intensive 
> nature of querying, even slightly improved compilation would 
> speed things up.
> 
>   Cheers,
>Winton
> 
> Winton Davies
> Lead Engineer, Overture (NSDQ: OVER)
> 1820 Gateway Drive, Suite 360
> San Mateo, CA 94404
> work: (650) 403-2259
> cell: (650) 867-1598
> http://www.overture.com/
> 

