Re: Filters for Openoffice File Indexing available (Java)

2004-11-27 Thread Peter Becker
Joachim Arrasz wrote:
> Hello List.
> We have written an application which integrates OpenOffice
> into an open-source CMS (OpenCms).
>
> For this CMS there is a Lucene integration available on SourceForge.
> So now we are looking for search and index filters for Lucene, so that
> we're able to include our OpenOffice files in the search results as well.
>
> Is there any project or code available for doing this, or must we write
> everything ourselves? Does anybody know good beginner tutorials for doing
> things like this?

We wrote something like that a while ago. It seems to work, although we
did only some basic testing:

http://cvs.sourceforge.net/viewcvs.py/tockit/java/applications/docco/source/org/tockit/docco/documenthandler/OpenOfficeDocumentHandler.java?view=markup
HTH,
  Peter
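
For readers who want to see the general idea without opening the linked handler: an OpenOffice file is a ZIP archive whose document text lives in an entry named content.xml, so a filter can unzip that entry and collect its character data. The sketch below is not the code behind the link above. It uses only the JDK, and the class name, the sampleArchive helper and the tiny stand-in XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class OpenOfficeTextExtractor {

    /** Collects all character data and separates elements with a space. */
    private static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            text.append(' ');
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Older files reference a DTD that is not on disk; resolve it to nothing.
            return new InputSource(new StringReader(""));
        }
    }

    /** Returns the plain text of content.xml, or "" if the archive has no such entry. */
    static String extractText(InputStream officeFile) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(officeFile)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!"content.xml".equals(e.getName())) {
                    continue;
                }
                // Copy the entry first so the parser cannot close the ZIP stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                for (int n = zip.read(buf); n != -1; n = zip.read(buf)) {
                    xml.write(buf, 0, n);
                }
                TextCollector collector = new TextCollector();
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new ByteArrayInputStream(xml.toByteArray()), collector);
                // Collapse runs of whitespace left over from the markup.
                return collector.text.toString().replaceAll("\\s+", " ").trim();
            }
        }
        return "";
    }

    /** Builds a tiny stand-in archive in memory (not a complete OpenOffice file). */
    static byte[] sampleArchive(String contentXml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(contentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>Hello Lucene</p><p>second paragraph</p></doc>";
        System.out.println(extractText(new ByteArrayInputStream(sampleArchive(xml))));
    }
}
```

The returned string could then be added to a Lucene Document as a tokenized field, alongside a field holding the file path.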
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: URGENT: Help indexing large document set

2004-11-27 Thread John Wang
Hi Chuck:

 Thanks for your help and the info.

 Through some experimentation, I found that calling
FSWriter.addIndex(ramDirectory) actually performs a merge
with the existing index. So when doing 2000 batches of 500, the index
grows after each batch and the time to do each merge increases.

 I guess in this implementation, doing it this way is not optimal.
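
To put a rough number on that: the sketch below assumes, as the experiment above suggests, that every batch triggers a merge that rewrites the whole index built so far. It only does the arithmetic and does not call Lucene; the class and method names are made up for illustration.

```java
final class MergeCostSketch {

    /** Documents rewritten if every batch triggers a merge of the whole index. */
    static long docsRewrittenMergingEveryBatch(int batches, int batchSize) {
        long total = 0;
        long indexSize = 0;
        for (int i = 0; i < batches; i++) {
            indexSize += batchSize; // the new batch joins the index
            total += indexSize;     // assumed: the merge rewrites every document
        }
        return total;
    }

    public static void main(String[] args) {
        long perBatch = docsRewrittenMergingEveryBatch(2000, 500);
        long once = 2000L * 500;
        System.out.println("merge per batch: " + perBatch + " documents rewritten");
        System.out.println("single pass:     " + once + " documents written");
    }
}
```

Under that assumption the work grows with the square of the number of batches, which would explain why each batch takes longer than the one before.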

Thanks

-John


On Sat, 27 Nov 2004 13:14:31 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Hi John,
> 
> I don't use a RAMDirectory and so don't have the answer for you.  There
> have been a number of messages about RAMDirectory performance on
> lucene-user, including some reported benchmarks.  Some people have
> reported a significant benefit from RAMDirectory, but most others have
> seen little or no benefit.  I'm not sure which factors determine the
> nature or magnitude of the impact.  You sent the message below just to me
> -- you might want to post a question on lucene-user.
> 
> I've included a couple messages below on the subject that I saved.
> 
> Chuck
> 
> Included messages:
> 
> -Original Message-
> From: Jonathan Hager [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 24, 2004 2:27 PM
> To: Lucene Users List
> Subject: Re: Index in RAM - is it realy worthy?
> 
> When comparing RAMDirectory and FSDirectory it is important to mention
> what OS you are using.  Linux caches recently accessed disk data in
> memory.  Here is a good article that describes its
> strategy: http://forums.gentoo.org/viewtopic.php?t=175419
> 
> The 2% difference you are seeing is the memory copy.  With other OSes
> you may see a speedup when using the RAMDirectory, because not all
> OSes keep a disk cache in memory, and those must access the disk to read
> the index.
> 
> Another consideration is that there is currently a 2GB limit on the
> size of a RAMDirectory.  Indexes over 2GB cause an overflow in the
> int used to create the buffer.  [see int len = (int) is.length(); in
> RAMDirectory]
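
The effect of that cast can be reproduced in isolation. This is only an illustration of narrowing a long to an int, not the RAMDirectory source; the class and method names are made up.

```java
final class IntOverflowSketch {

    /** Mirrors the quoted cast: a long file length squeezed into an int. */
    static int asBufferLength(long fileLength) {
        return (int) fileLength;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        System.out.println(asBufferLength(oneGb));     // still fits in an int
        System.out.println(asBufferLength(3 * oneGb)); // wraps to a negative value
        try {
            // Math.toIntExact refuses the conversion instead of wrapping silently.
            Math.toIntExact(3 * oneGb);
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A negative length used to size a byte array fails at allocation time, which is why files beyond Integer.MAX_VALUE bytes cannot be loaded this way.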
> 
> I ended up using RAMDirectory for a very different reason.  The index
> is 1 to 2MB and is rebuilt every few hours.  It takes 3 to 4 minutes
> to query the database and rebuild the index.  But search should be
> available 100% of the time.  Since the index is so small I do the
> following:
> 
> on server startup:
> - look for semaphore, if it is there delete the index
> - if there is no index, build it to FSDirectory
> - load the index from FSDirectory into RAMDirectory
> 
> on reindex:
> - create semaphore
> - rebuild index to FSDirectory
> - delete semaphore
> - load index from FSDirectory into RAMDirectory
> 
> to search:
> - search the RAMDirectory
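
The three lists above can be sketched with plain files. This is a sketch under stated assumptions, not Jonathan's code: a byte array stands in for the RAMDirectory, a single file for the FSDirectory, and the IndexBuilder interface and the file names index.bin and reindex.lock are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class IndexLifecycleSketch {

    /** Hypothetical stand-in for the slow database query and index build. */
    interface IndexBuilder {
        byte[] build();
    }

    private final Path indexFile;     // stands in for the FSDirectory
    private final Path semaphore;     // exists only while a rebuild is running
    private final IndexBuilder builder;
    private volatile byte[] inMemory; // stands in for the RAMDirectory

    IndexLifecycleSketch(Path dir, IndexBuilder builder) {
        this.indexFile = dir.resolve("index.bin");
        this.semaphore = dir.resolve("reindex.lock");
        this.builder = builder;
    }

    /** On server startup. */
    void startup() throws IOException {
        if (Files.exists(semaphore)) {
            // A leftover semaphore means the last rebuild never finished.
            Files.deleteIfExists(indexFile);
            Files.delete(semaphore);
        }
        if (Files.exists(indexFile)) {
            inMemory = Files.readAllBytes(indexFile);
        } else {
            reindex();
        }
    }

    /** On reindex: searches keep using the old in-memory copy until the last line. */
    void reindex() throws IOException {
        Files.createFile(semaphore);          // fails if a rebuild is already marked
        Files.write(indexFile, builder.build());
        Files.delete(semaphore);
        inMemory = Files.readAllBytes(indexFile);
    }

    /** To search: always read the in-memory copy. */
    byte[] searchableIndex() {
        return inMemory;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index-sketch");
        IndexLifecycleSketch sketch = new IndexLifecycleSketch(
                dir, () -> "index built at startup".getBytes(StandardCharsets.UTF_8));
        sketch.startup();
        System.out.println(new String(sketch.searchableIndex(), StandardCharsets.UTF_8));
    }
}
```

The point of the semaphore is that an index file left behind by an interrupted rebuild is never trusted: startup discards it and rebuilds instead of loading it.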
> 
> RAMDirectory could be replaced by a regular FSDirectory, but it seemed
> silly to copy the index from disk to disk, when it ultimately needs to
> be in memory.
> 
> FSDirectory could be replaced by a RAMDirectory, but this means that
> it would take the server 3 to 4 minutes longer to start up every time.
> By persisting the index, this time would only be necessary if indexing
> was interrupted.
> 
> Jonathan
> 
> On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton
> <[EMAIL PROTECTED]> wrote:
> > Otis Gospodnetic wrote:
> >
> > >For the Lucene book I wrote some test cases that compare FSDirectory
> > >and RAMDirectory.  What I found was that with certain settings
> > >FSDirectory was almost as fast as RAMDirectory.  Personally, I would
> > >push FSDirectory and hope that the OS and the Filesystem do their share
> > >of work and caching for me before looking for ways to optimize my code.
> > >
> > >
> > Yes... I performed the same benchmark and in my situation RAMDirectory
> > for searches was about 2% slower.
> >
> > I'm willing to bet that it has to do with the fact that it's a Hashtable
> > and not a HashMap (which isn't synchronized).
> >
> > Also adding a constructor for the term size could make loading a
> > RAMDirectory faster since you could prevent rehashing.
> >
> > If you're on a modern machine your filesystem cache will end up
> > buffering your disk anyway, which I'm sure was happening in my situation.
> >
> > Kevin

RE: Are similarity scores computed when using sort?

2004-11-27 Thread Aphinyanaphongs, Yindalon
Erik!
Thanks for the response.  I'll take a look and see about customizing a solution.
Yin



From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sat 11/27/2004 5:04 PM
To: Lucene Users List
Subject: Re: Are similarity scores computed when using sort?



On Nov 27, 2004, at 1:01 PM, Aphinyanaphongs, Yindalon wrote:
> Thanks for the quick response.  I don't necessarily need to calculate
> the similarity score. It was my understanding that the inverted index
> had a list of all the hits for each term so technically, no document
> returned from the inverted index should have a score of 0.

It's certainly true that the inverted index has a list of all documents
that contain each term.  However the word "hit" is by definition a
document with score > 0 given a query, not just a term.  Queries can be
quite sophisticated.  For example, I was very easily able to create an
XOR query by using a custom similarity.

> Would you know in what java class the call is made to Similarity and
> where that code may be commented out, or would it be preferable to
> write my own similarity sub-class instead?

Similarity use is spread out in several places.  Get a good IDE(A!) and
surf the code and you'll easily be able to see where methods of
Similarity are being used.

Erik







Re: Are similarity scores computed when using sort?

2004-11-27 Thread Erik Hatcher
On Nov 27, 2004, at 1:01 PM, Aphinyanaphongs, Yindalon wrote:
> Thanks for the quick response.  I don't necessarily need to calculate
> the similarity score. It was my understanding that the inverted index
> had a list of all the hits for each term so technically, no document
> returned from the inverted index should have a score of 0.

It's certainly true that the inverted index has a list of all documents
that contain each term.  However the word "hit" is by definition a
document with score > 0 given a query, not just a term.  Queries can be
quite sophisticated.  For example, I was very easily able to create an
XOR query by using a custom similarity.

> Would you know in what java class the call is made to Similarity and
> where that code may be commented out, or would it be preferable to
> write my own similarity sub-class instead?

Similarity use is spread out in several places.  Get a good IDE(A!) and
surf the code and you'll easily be able to see where methods of
Similarity are being used.

Erik


Zilverline Search Engine version 1.0-final released

2004-11-27 Thread Zilverline info
All,
I've just released Zilverline version 1.0.
New features include incremental indexing and scheduling of the indexing
process, as well as a few minor updates.
The source will be made available as well very soon.
Zilverline is protected by a Collaborative Source License. You can read
more on this type of licensing at http://www.zilverline.org
Zilverline is a search engine based on Lucene that's ready to
roll, and can be simply dropped into a servlet engine. It runs out of the
box, supports PDF, WORD, HTM, TXT, RTF and CHM, and can index zip,
rar, and many other formats, on both Windows and Linux.
Zilverline supports plugins. You can create your own extractors
for various file formats. I've provided Extractors for RTF, Text, PDF,
Word, and HTML.
Zilverline supports collections. A collection is a set of files and
directories in a directory. A collection can be indexed, and searched.
The results of the search can be retrieved from local disk or remotely,
if you run a webserver on your machine. Files inside zip, rar and chm
files are extracted, indexed and can be cached. The cache can be mapped
to sit behind your webserver as well.
It's also possible to specify your own handlers for archives. Say you
have a RAR archive, and you have a program on your system that can
extract the content from it, then you can specify that Zilverline should
use this program.
Please take a look at http://www.zilverline.org, and have a swing at it.
cheers,
  Michael Franken




RE: Are similarity scores computed when using sort?

2004-11-27 Thread Aphinyanaphongs, Yindalon
Erik
Thanks for the quick response.  I don't necessarily need to calculate the
similarity score. It was my understanding that the inverted index had a list of 
all the hits for each term so technically, no document returned from the 
inverted index should have a score of 0.  Thus, if I have some numerical field 
to sort by, the inverted index itself handles returning the documents with the 
term.
 
Would you know in what java class the call is made to Similarity and where that 
code may be commented out, or would it be preferable to write my own similarity 
sub-class instead?
 
Thanks,
Yin
 



From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sat 11/27/2004 4:43 AM
To: Lucene Users List
Subject: Re: Are similarity scores computed when using sort?



Yes, Similarity is still computed.  It has to be in order to determine
if the documents considered are a hit or not.  Scores of 0 are not a
hit.

You certainly could simplify the Similarity computations though, by
creating your own implementation and returning 1 from all the methods.

Erik

On Nov 27, 2004, at 2:46 AM, Aphinyanaphongs, Yindalon wrote:

> I have a search application that is very performance-conscious.  I've
> looked through the IndexSearcher code, and haven't been able to
> clarify whether a similarity score is calculated if the results are
> sorted by some numerical field value? Basically, it would be
> preferable to not incur the computational cost of generating a
> similarity score if it is never used.
>
> Thanks
> Yin
>







Re: Are similarity scores computed when using sort?

2004-11-27 Thread Erik Hatcher
Yes, Similarity is still computed.  It has to be in order to determine 
if the documents considered are a hit or not.  Scores of 0 are not a 
hit.

You certainly could simplify the Similarity computations though, by 
creating your own implementation and returning 1 from all the methods.
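
A rough sketch of what "returning 1 from all the methods" buys. So that it runs without Lucene on the classpath, it uses a small local interface (Factors) as a stand-in; that interface, the comparison factors and the one-term score method are assumptions for illustration and not Lucene's Similarity API.

```java
final class FlatSimilaritySketch {

    /** Local stand-in for the scoring factors; deliberately not Lucene's own class. */
    interface Factors {
        float tf(float freq);
        float idf(int docFreq, int numDocs);
        float lengthNorm(int numTokens);
    }

    /** Classic tf-idf style factors, included only for comparison. */
    static final class TfIdfFactors implements Factors {
        public float tf(float freq) {
            return (float) Math.sqrt(freq);
        }
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
        public float lengthNorm(int numTokens) {
            return (float) (1.0 / Math.sqrt(numTokens));
        }
    }

    /** Every factor is 1, so no square roots or logarithms are evaluated. */
    static final class FlatFactors implements Factors {
        public float tf(float freq) { return 1f; }
        public float idf(int docFreq, int numDocs) { return 1f; }
        public float lengthNorm(int numTokens) { return 1f; }
    }

    /** Score of one term in one document; 0 means the document is not a hit. */
    static float score(Factors f, int freq, int docFreq, int numDocs, int numTokens) {
        if (freq == 0) {
            return 0f; // the term does not occur in the document
        }
        return f.tf(freq) * f.idf(docFreq, numDocs) * f.lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        System.out.println(score(new TfIdfFactors(), 3, 10, 1000, 50));
        System.out.println(score(new FlatFactors(), 3, 10, 1000, 50));
        System.out.println(score(new FlatFactors(), 0, 10, 1000, 50));
    }
}
```

With flat factors a matching document still gets a score above 0 and so still counts as a hit, while the per-document arithmetic becomes trivial. In Lucene itself the same idea would be a Similarity subclass installed on the searcher.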

Erik

On Nov 27, 2004, at 2:46 AM, Aphinyanaphongs, Yindalon wrote:
> I have a search application that is very performance-conscious.  I've
> looked through the IndexSearcher code, and haven't been able to
> clarify whether a similarity score is calculated if the results are
> sorted by some numerical field value? Basically, it would be
> preferable to not incur the computational cost of generating a
> similarity score if it is never used.
>
> Thanks
> Yin
