Re: knowing which field contributed the search result

2005-02-22 Thread John Wang
Hi David:

Can you further explain which calls specifically would solve my problem?

Thanks

-John

On Mon, 21 Feb 2005 12:20:15 -0800, David Spencer
[EMAIL PROTECTED] wrote:
 John Wang wrote:
 
  Does anyone have any thoughts on this?
 
 Does this help?
 
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searchable.html#explain(org.apache.lucene.search.Query,%20int)
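 
 For example, something along these lines (an untested sketch; the field names
 are just the ones from your example):
 
 Query query = QueryParser.parse(
     "contents1:\"brown fox\" OR contents2:\"black bear\"",
     "contents1", new StandardAnalyzer());
 Hits hits = searcher.search(query);
 for (int i = 0; i < hits.length(); i++) {
   // Explanation breaks the score down per clause, so the field(s)
   // that actually matched show up with a non-zero contribution.
   Explanation expl = searcher.explain(query, hits.id(i));
   System.out.println("doc " + hits.id(i) + ": " + expl);
 }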
 
  Thanks
 
  -John
 
 
  On Wed, 16 Feb 2005 14:39:52 -0800, John Wang [EMAIL PROTECTED] wrote:
 
 Hi:
 
 Is there a way, given a hit from a search, to find out which
 fields contributed to the hit?
 
 e.g.
 
 If my search for:
 
 contents1=brown fox OR contents2=black bear
 
 can the document found by this query also carry information on
 whether it was found via contents1, contents2, or both?
 
 Thanks
 
 -John
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: knowing which field contributed the search result

2005-02-21 Thread John Wang
Does anyone have any thoughts on this?

Thanks

-John


On Wed, 16 Feb 2005 14:39:52 -0800, John Wang [EMAIL PROTECTED] wrote:
 Hi:
 
 Is there a way, given a hit from a search, to find out which
 fields contributed to the hit?
 
 e.g.
 
 If my search for:
 
 contents1=brown fox OR contents2=black bear
 
 can the document found by this query also carry information on
 whether it was found via contents1, contents2, or both?
 
 Thanks
 
 -John


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



knowing which field contributed the search result

2005-02-16 Thread John Wang
Hi:

   Is there a way, given a hit from a search, to find out which
fields contributed to the hit?

e.g.

If my search for:

contents1=brown fox OR contents2=black bear

can the document found by this query also carry information on
whether it was found via contents1, contents2, or both?


Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: google mini? who needs it when Lucene is there

2005-01-27 Thread John Wang
I think Google mini also includes crawling and a server wrapper. So it
is not entirely a 1-to-1 comparison.

Of course, extending Lucene to have those features is not at all
difficult anyway.

-John


On Thu, 27 Jan 2005 16:04:54 -0800 (PST), Xiaohong Yang (Sharon)
[EMAIL PROTECTED] wrote:
 Hi,
 
 I agree that Google mini is quite expensive.  It might be similar to the 
 desktop version in quality.  Anyone knows google's ratio of index to text?   
 Is it true that Lucene's index is about 500 times the original text size (not 
 including image size)?  I don't have one installed, so I cannot measure.
 
 Best,
 
 Sharon
 
 jian chen [EMAIL PROTECTED] wrote:
 Hi,
 
 I was searching using google and just found that there was a new
 feature called google mini. Initially I thought it was another free
 service for small companies. Then I realized that it costs quite some
 money ($4,995) for the hardware and software. (I guess the proprietary
 software costs a whole lot more than actual hardware.)
 
 The nice feature is that you can only index up to 50,000 documents
 at this price. If you need to index more, sorry, send in the
 check...
 
 It seems to me that any small biz will be ripped off if they install
 this google mini thing, compared to using Lucene to implement an easy-to-use
 search application, which could search up to whatever number of
 documents you could imagine.
 
 I hope the lucene project could get exposed more to the enterprise so
 that people know that they have not only cheaper but more importantly,
 BETTER alternatives.
 
 Jian
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



lucene2.0 and transaction support

2005-01-20 Thread John Wang
Hi:

   When is lucene 2.0 scheduled to be released? Is there a javadoc
somewhere so we can check out the new APIs?

Is there a plan to add transaction support into lucene? This is
something we need and if we do implement it ourselves, is it too large
of a change for a patch?

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: reading fields selectively

2005-01-07 Thread John Wang
Thanks guys for the info!

After looking at the patch code I have two problems:

1) The patch implementation doesn't help with performance. It still
reads the data for every field in the document, just without storing all
of them. So this implementation helps if there are memory
restrictions, but not if you are after performance.

2) We are bundling Lucene in our application, and we are trying very hard
not to change the Lucene code and thus diverge from the Lucene code
base. This patch requires changes to SegmentReader.java, which
I am hoping to avoid.


Any ideas?

Thanks

-John


On Fri, 7 Jan 2005 08:59:25 + (GMT), mark harwood
[EMAIL PROTECTED] wrote:
 There is no API for this, but I recall somebody
  talking about adding support for this a few months
  back
 
 See
 http://marc.theaimsgroup.com/?l=lucene-dev&m=109485996612177&w=2
 
 This implementation was working on a version of Lucene
 before compression was introduced so things may have
 changed a little.
 
 Cheers,
 Mark
 
 
 ___
 ALL-NEW Yahoo! Messenger - all new features - even more fun! 
 http://uk.messenger.yahoo.com
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: setting Similarity at search time

2005-01-07 Thread John Wang
Hi Chuck:

 Trying to follow up on this thread. Do you know if this feature
will be incorporated in the next Lucene release?

 How would someone find out which patches will go into the next release?
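
 In the meantime, the workaround I can see without the patch is to share one
 IndexReader but give each request its own IndexSearcher and set the
 Similarity there, since an IndexSearcher over an already-open reader is
 cheap to create. A rough sketch (the path and variable names are illustrative):

 // shared across requests/threads:
 IndexReader sharedReader = IndexReader.open("/path/to/index");

 // per search request:
 IndexSearcher searcher = new IndexSearcher(sharedReader);
 searcher.setSimilarity(requestSimilarity);   // the Similarity chosen for this request
 Hits hits = searcher.search(query);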

Thanks

-John


On Mon, 15 Nov 2004 13:05:36 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 Take a look at this:
 
 http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
 
 Not my initial patch, but the latest patch from Wolf Siberski.  I
 haven't used it yet, but it looks like what you are looking for, and
 something I want to use too.
 
 Chuck
 
-Original Message-
From: Ken McCracken [mailto:[EMAIL PROTECTED]
Sent: Monday, November 15, 2004 11:31 AM
To: Lucene Users List
Subject: setting Similarity at search time
   
Hi,
   
Is there a way to set the Similarity at search(...) time, rather than
just setting it on the (Index)Searcher object itself?  I'd like to be
able to specify different similarities in different threads searching
concurrently, using the same IndexSearcher instance.
   
In my use case, the choice of Similarity is a parameter of the search
request, and hence may be different for each request.
   
Can such a method be added to override the search(...) method?
   
Thanks,
-Ken
   
   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: multi-threaded thru-put in lucene

2005-01-06 Thread John Wang
I actually ran a few tests, but I am seeing similar behavior.

After removing all the possible variations, this is what I used:

1 index, doc count is 15,000.
Using FSDirectory, i.e. new IndexSearcher(String path); by default I
think it uses FSDirectory.

each thread is doing 100 iterations of search, e.g.

for (int i = 0; i < 100; ++i) {
    idxSearcher.search(q);
}

for each thread and each iteration, I am using the same query.
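
Each thread is constructed roughly like this (a sketch; idxSearcher and q are
instance fields set up once before the test):

Thread[] thread = new Thread[threadCount];
for (int i = 0; i < threadCount; ++i) {
  thread[i] = new Thread(new Runnable() {
    public void run() {
      for (int j = 0; j < 100; ++j) {
        try {
          idxSearcher.search(q);   // same shared IndexSearcher, same Query
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  });
}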

I am timing them the following way:

long start = System.currentTimeMillis();

for (int i = 0; i < threadCount; ++i) {
   thread[i].start();
}

for (int i = 0; i < threadCount; ++i) {
   thread[i].join();
}


long duration = System.currentTimeMillis() - start;

duration numbers I am getting are:

1 thread: 445 ms.
2 threads: 870 ms.
5 threads: 2200 ms.

Pretty much the same numbers you'd get if you are running them sequentially.

Any ideas? Am I doing something wrong?

Thanks in advance for all your help

-John

On Thu, 6 Jan 2005 00:06:09 -0800 (PST), Chris Hostetter
[EMAIL PROTECTED] wrote:
 
 : This is what we found:
 :
 :  1 thread, search takes 20 ms.
 :
 :   2 threads, search takes 40 ms.
 :
 :   5 threads, search takes 100 ms.
 
 how big is your index?  What are the term frequencies like in your index?
 how many different queries did you try? what was the structure of your
 query objects like?  were you using a RAMDirectory or an FSDirectory? what
 hardware were you running on?
 
 Is your test application small enough that you can post it to the list?
 
 I haven't done a lot of PMA testing of Lucene, but from what limited
 testing I have done I'm a little surprised at those numbers; you'd get
 results just as good if you ran the queries sequentially.
 
 -Hoss
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: multi-threaded thru-put in lucene

2005-01-06 Thread John Wang
Is the operation IndexSearcher.search I/O or CPU bound if I am doing
100's of searches on the same query?

Thanks

-John


On Thu, 06 Jan 2005 10:31:49 -0800, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
  1 thread: 445 ms.
  2 threads: 870 ms.
  5 threads: 2200 ms.
 
  Pretty much the same numbers you'd get if you are running them sequentially.
 
  Any ideas? Am I doing something wrong?
 
 If you're performing compute-bound work on a single-processor machine
 then threading should give you no better performance than sequential,
 perhaps a bit worse.  If you're performing io-bound work on a
 single-disk machine then threading should again provide no improvement.
   If the task is evenly compute and i/o bound then you could achieve at
 best a 2x speedup on a single CPU system with a single disk.
 
 If you're compute-bound on an N-CPU system then threading should
 optimally be able to provide a factor of N speedup.
 
 Java's scheduling of compute-bound threads when no threads call
 Thread.sleep() can also be very unfair.
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: multi-threaded thru-put in lucene

2005-01-06 Thread John Wang
Thanks Doug! You are right, adding a Thread.sleep() helped greatly.

Mysteries of Java...

Another Java threading question.
With 1 thread and 100 search iterations, it took about 850 ms.
Adding a Thread.sleep(10) inside the loop brings it to about 2200 ms.

The sleeps alone account for 100 * 10 = 1000 ms, so I would expect about
850 + 1000 = 1850 ms. That leaves 2200 - 1850 = 350 ms unaccounted for. Is that due to
thread scheduling/context switching?

Thanks

-John


On Thu, 6 Jan 2005 10:36:12 -0800, John Wang [EMAIL PROTECTED] wrote:
 Is the operation IndexSearcher.search I/O or CPU bound if I am doing
 100's of searches on the same query?
 
 Thanks
 
 -John
 
 
 On Thu, 06 Jan 2005 10:31:49 -0800, Doug Cutting [EMAIL PROTECTED] wrote:
  John Wang wrote:
   1 thread: 445 ms.
   2 threads: 870 ms.
   5 threads: 2200 ms.
  
   Pretty much the same numbers you'd get if you are running them 
   sequentially.
  
   Any ideas? Am I doing something wrong?
 
  If you're performing compute-bound work on a single-processor machine
  then threading should give you no better performance than sequential,
  perhaps a bit worse.  If you're performing io-bound work on a
  single-disk machine then threading should again provide no improvement.
If the task is evenly compute and i/o bound then you could achieve at
  best a 2x speedup on a single CPU system with a single disk.
 
  If you're compute-bound on an N-CPU system then threading should
  optimally be able to provide a factor of N speedup.
 
   Java's scheduling of compute-bound threads when no threads call
  Thread.sleep() can also be very unfair.
 
  Doug
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



reading fields selectively

2005-01-06 Thread John Wang
Hi:

   Is there some way to read only 1 field value from an index given a docID?

   From the current API, in order to get a field given a docID, I
would call:
 
IndexSearcher.document(docID)

 which in turn reads in all fields from the disk.

   Here is my problem:

   After the search, I have a set of docIDs. For each
document, I have a unique string identifier. At this point I only need
these identifiers but with the above API, I am forced to read the
entire row of fields for each document in the search result, which in
my case can be very large.

   Is there an alternative?

I am thinking more on the lines of a call:

   Field[] getFields(int docID,String fieldName);
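
In the meantime, one workaround I can think of (untested sketch): if the
identifier is also indexed as a keyword field, a docID-to-identifier array can
be built once per IndexReader by walking the terms, so no stored fields need to
be read per hit. The field name "uid" below is just illustrative:

String[] uidByDoc = new String[reader.maxDoc()];
TermEnum terms = reader.terms(new Term("uid", ""));
TermDocs termDocs = reader.termDocs();
try {
  while (terms.term() != null && terms.term().field().equals("uid")) {
    termDocs.seek(terms.term());
    while (termDocs.next()) {
      uidByDoc[termDocs.doc()] = terms.term().text();
    }
    if (!terms.next()) break;
  }
} finally {
  termDocs.close();
  terms.close();
}

After a search, uidByDoc[docID] gives the identifier without reading any
stored fields.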

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



multi-threaded thru-put in lucene

2005-01-05 Thread John Wang
Hi folks:

We are trying to measure thru-put lucene in a multi-threaded environment. 

This is what we found:

 1 thread, search takes 20 ms.

  2 threads, search takes 40 ms.

  5 threads, search takes 100 ms.


 Seems like under a multi-threaded scenario, thru-put isn't good,
performance is not any better than that of 1 thread.

 I tried to share an IndexSearcher amongst all threads as well as
having an IndexSearcher per thread. Both yield same numbers.

 Is this consistent with what you'd expect?

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Remotely Index

2004-12-16 Thread John Wang
one way is to create a reader from a URL to your file:

(Assuming the file is hosted somewhere reachable by a URL)

Reader r = new InputStreamReader(url.openStream());

Document doc = new Document();
doc.add(Field.Keyword("url", url.toString()));
doc.add(Field.Text("contents", r));

iw.addDocument(doc);


-John


On Thu, 16 Dec 2004 16:07:57 +0530, Natarajan.T
[EMAIL PROTECTED] wrote:
 Hi All,
 
 How do I index remotely?
 
 For example, I have some documents on machine A and the Lucene indexing and
 searching server on machine B.
 
 How can I do the indexing...
 
 Regards,
 
 Natarajan.
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


File locking using java.nio.channels.FileLock

2004-12-15 Thread John Wang
Hi:

  When is Lucene planning on moving to Java 1.4+?

   I see there are some problems caused by the current lock file
implementation, e.g. Bug# 32171. The problems would be easily fixed by
using the java.nio.channels.FileLock object.
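
For reference, the java.nio locking pattern would look roughly like this
(a sketch only; the lock file name and error handling are illustrative):

RandomAccessFile raf = new RandomAccessFile("write.lock", "rw");
FileChannel channel = raf.getChannel();
FileLock lock = channel.tryLock();   // null if another process holds the lock
if (lock == null) {
  throw new IOException("index is locked by another process");
}
try {
  // ... do the guarded work ...
} finally {
  lock.release();
  raf.close();
}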

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: finalize delete without optimize

2004-12-14 Thread John Wang
Hi Otis:

 Thanks for your reply.

 I am looking for more of an API call than a tool. e.g.
IndexWriter.finalizeDelete()

 If I implement this, how would I go about submitting a patch?

thanks

-John


On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Hello John,
 
 I believe you didn't get any replies to this.  What you are describing
 cannot be done using the public API, but may (no source code on this
 machine, so I can't double-check) be doable if you use some of the
 'internal' methods.
 
 I don't have the need for this, but others might, so it may be worth
 developing a tool that purges Documents marked as deleted without the
 expensive segment merging, iff that is possible.  If you put this tool
 under the appropriate org.apache.lucene... package, you'll get access to
 'internal' methods, of course.  If you end up creating this, we could
 stick it in the Sandbox, where we should really create a new section
 for handy command-line tools that manipulate the index.
 
 Otis
 
 
 
 
 --- John Wang [EMAIL PROTECTED] wrote:
 
  Hi:
 
  Is there a way to finalize deletes, i.e. actually remove them from
  the segments and make sure the docIDs are contiguous again?
 
 The only explicit way to do this is by calling
  IndexWriter.optimize(). But this call does a lot more (it also merges all
  the segments), hence is very expensive. Is there a way to simply just
  finalize the deletes without having to merge all the segments?
 
  If not, I'd be glad to submit an implementation of this feature
  if
  the Lucene devs agree this is useful.
 
  Thanks
 
  -John
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Vs Ixiasoft

2004-12-08 Thread John Wang
I thought Lucene implements the Boolean model.

-John


On Thu, 9 Dec 2004 00:19:21 +0100, Nicolas Maisonneuve
[EMAIL PROTECTED] wrote:
 hi,
 think first about the relevance model of these 2 search engines for
 XML document retrieval.
 
 Lucene is a classic full-text search engine using the vector space
 model. This model is efficient for indexing unstructured documents
 (like plain text files) and is not made for structured documents like XML.
 There is an XML demo in the Lucene sandbox, but it's not really very
 efficient because it doesn't take advantage of the document structure
 in the indexing and the ranking model, so it loses semantic information
 and relevance.
 
 i don't know Ixiasoft; check its documentation to see how it indexes and
 ranks XML documents.
 
 nicolas
 
 On Wed, 8 Dec 2004 14:20:45 -0500, Praveen Peddi
 
 
 [EMAIL PROTECTED] wrote:
  Does anyone know about the Ixiasoft server? It's an XML repository/search engine. 
  If anyone knows about it, do they also know how it compares to 
  Lucene? Which is faster?
 
  Praveen
  **
  Praveen Peddi
  Sr Software Engg, Context Media, Inc.
  email:[EMAIL PROTECTED]
  Tel:  401.854.3475
  Fax:  401.861.3596
  web: http://www.contextmedia.com
  **
  Context Media- The Leader in Enterprise Content Integration
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



finalize delete without optimize

2004-12-06 Thread John Wang
Hi:

   Is there a way to finalize deletes, i.e. actually remove them from
the segments and make sure the docIDs are contiguous again?

   The only explicit way to do this is by calling
IndexWriter.optimize(). But this call does a lot more (it also merges all
the segments), hence is very expensive. Is there a way to simply just
finalize the deletes without having to merge all the segments?

If not, I'd be glad to submit an implementation of this feature if
the Lucene devs agree this is useful.

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-04 Thread John Wang
We've found something interesting about mergeFactors.

We are indexing a million documents with a batch of 1000.
We first set the mergeFactor to 1000.

What we found is at every 10th commit, we see a significant spike in
indexing time.

The reason is that the indexer is trying to merge the segments every
10th commit, i.e. every 10*mergeFactor documents; since the mergeFactor is
large, the merge time is also long.

The example given in the previous email thread indexes identical
documents, merge time is very fast since no new terms are introduced
as indexing proceeds. Hence it may hide this overhead.

We found mergeFactor=100 worked well for our application.

Cheers

-John

On Fri, 3 Dec 2004 16:38:34 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 In my experiments with mergeFactor I found the point of diminishing/no
 returns.  If I remember correctly, I hit the limit at mergeFactor of
 50.
 
 But here is something from Lucene in Action that you can use to play
 with various index tuning factors and see their effect on indexing
 performance.  It's simple, and if you want to test all 3 of your
 scenarios, you will have to modify it.
 
 package lia.indexing;
 
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.SimpleAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 
 /**
 *
 */
 public class IndexTuningDemo {
 
  public static void main(String[] args) throws Exception {
int docsInIndex  = Integer.parseInt(args[0]);
 
// create an index called 'index-dir' in a temp directory
    Directory dir = FSDirectory.getDirectory(
      System.getProperty("java.io.tmpdir", "tmp") +
      System.getProperty("file.separator") + "index-dir", true);
Analyzer analyzer = new SimpleAnalyzer();
IndexWriter writer = new IndexWriter(dir, analyzer, true);
 
// set variables that affect speed of indexing
writer.mergeFactor   = Integer.parseInt(args[1]);
writer.maxMergeDocs  = Integer.parseInt(args[2]);
writer.minMergeDocs  = Integer.parseInt(args[3]);
writer.infoStream= System.out;
 
    System.out.println("Merge factor:   " + writer.mergeFactor);
    System.out.println("Max merge docs: " + writer.maxMergeDocs);
    System.out.println("Min merge docs: " + writer.minMergeDocs);
 
long start = System.currentTimeMillis();
    for (int i = 0; i < docsInIndex; i++) {
      Document doc = new Document();
      doc.add(Field.Text("fieldname", "Bibamus"));
      writer.addDocument(doc);
    }
writer.close();
long stop = System.currentTimeMillis();
    System.out.println("Time: " + (stop - start) + " ms");
  }
 }
 
 Otis
 
 
 
 
 --- Chuck Williams [EMAIL PROTECTED] wrote:
 
  I'm wondering what values of mergeFactor, minMergeDocs and
  maxMergeDocs
  people have found to yield the best performance for different
  configurations.  Is there a repository of this information anywhere?
 
 
 
  I've got about 30k documents and have 3 indexing scenarios:
 
  1.   Full indexing and optimize
 
  2.   Incremental indexing and optimize
 
  3.   Parallel incremental indexing without optimize
 
 
 
  Search performance is critical.  For both cases 1 and 2, I'd like the
  fastest possible indexing time.  For case 3, I'd like minimal pauses
  and
  no noticeable degradation in search performance.
 
 
 
  Based on reading the code (including the javadocs comments), I'm
  thinking of values along these lines:
 
 
 
  mergeFactor:  1000 during Full indexing, and during optimize (for
  both
  cases 1 and 2); 10 during incremental indexing (cases 2 and 3)
 
  minMergeDocs:  1000 during Full indexing, 10 during incremental
  indexing
 
  maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
  incremental indexing
 
 
 
  Do these values seem reasonable?  Are there better settings before I
  start experimenting?
 
 
 
  Since mergeFactor is used in both addDocument() and optimize(), I'm
  thinking of using two different values in case 2:  10 during the
  incremental indexing, and then 1000 during the optimize.  Is changing
  the value like this going to cause a problem?
 
 
  Thanks for any advice,
 
 
 
  Chuck
 
 
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: URGENT: Help indexing large document set

2004-11-27 Thread John Wang
Hi Chuck:

 Thanks for your help and the info.

 By some experimentation, I found that when calling
FSWriter.addIndexes(ramDirectory), it actually performs a merge
with the existing index. So doing 2000 batches of 500, as the index
grows after each batch, the time to do the merge increases.

 I guess with this implementation, doing it this way is not optimal.
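
For reference, the batching pattern I am using looks roughly like this (a
sketch; fsDir, analyzer and the batch list are set up elsewhere):

RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
for (int i = 0; i < batch.size(); i++) {
  ramWriter.addDocument((Document) batch.get(i));
}
ramWriter.close();

IndexWriter fsWriter = new IndexWriter(fsDir, analyzer, false);
// this addIndexes() call is where the merge with the existing index happens
fsWriter.addIndexes(new Directory[] { ramDir });
fsWriter.close();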

Thanks

-John


On Sat, 27 Nov 2004 13:14:31 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 Hi John,
 
 I don't use a RamDirectory and so don't have the answer for you.  There
 have been a number of messages about RamDirectory performance on
 lucene-user, including some reported benchmarks.  Some people have
 reported a significant benefit from RamDirectory's, but most others have
 seen little or no benefit.  I'm not sure which factors indicate the
 nature or magnitude of impact.   You sent the message below just to me
 -- you might want to post a question on lucene-user.
 
 I've included a couple messages below on the subject that I saved.
 
 Chuck
 
 Included messages:
 
 -Original Message-
 From: Jonathan Hager [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, November 24, 2004 2:27 PM
 To: Lucene Users List
 Subject: Re: Index in RAM - is it realy worthy?
 
 When comparing RAMDirectory and FSDirectory it is important to mention
 what OS you are using.  When using linux it will cache the most recent
 disk access in memory.  Here is a good article that describes its
 strategy: http://forums.gentoo.org/viewtopic.php?t=175419
 
 The 2% difference you are seeing is the memory copy.  With other OSes
 you may see a speed up when using the RAMDirectory, because not all
 OSes contain a disk cache in memory and must access the disk to read
 the index.
 
 Another consideration is that there is currently a 2GB limitation on the
 size of the RAMDirectory.  Indexes over 2GB cause an overflow in the
 int used to create the buffer.  [see "int len = (int) is.length();" in
 RAMDirectory]
 
 I ended up using RAM directory for a very different reason.  The index
 is 1 to 2MB and is rebuilt every few hours.  It takes 3 to 4 minutes
 to query the database and rebuild the index.  But the search should be
 available 100% of the time.  Since the index is so small I do the
 following:
 
 on server startup:
 - look for semaphore, if it is there delete the index
 - if there is no index, build it to FSdirectory
 - load the index from FSDirectory into RAMDirectory
 
 on reindex:
 - create semaphore
 - rebuild index to FSDirectory
 - delete semaphore
 - load index from FSDirecttory into RAMDirectory
 
 to search:
 - search the RAMDirectory
 
 RAMDirectory could be replaced by a regular FSDirectory, but it seemed
 silly to copy the index from disk to disk, when it ultimately needs to
 be in memory.
 
 FSDirectory could be replaced by a RAMDirectory, but this means that
 it would take the server 3 to 4 minutes longer to startup every time.
 By persisting the index, this time would only be necessary if indexing
 was interrupted.
 
 Jonathan
 
 On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton
 [EMAIL PROTECTED] wrote:
  Otis Gospodnetic wrote:
 
  For the Lucene book I wrote some test cases that compare FSDirectory
  and RAMDirectory.  What I found was that with certain settings
  FSDirectory was almost as fast as RAMDirectory.  Personally, I would
  push FSDirectory and hope that the OS and the Filesystem do their
 share
  of work and caching for me before looking for ways to optimize my
 code.
  
  
  Yes... I performed the same benchmark and in my situation RAMDirectory
  for searches was about 2% slower.
 
  I'm willing to bet that it has to do with the fact that it's a
  Hashtable and not a HashMap (which isn't synchronized).
 
  Also adding a constructor for the term size could make loading a
  RAMDirectory faster since you could prevent rehash.
 
  If you're on a modern machine your filesystme cache will end up
  buffering your disk anyway which I'm sure was happening in my
 situation.
 
  Kevin
 
  --
 
  Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an
  invite!  Also see irc.freenode.net #rojo if you want to chat.
 
  Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
 
  If you're interested in RSS, Weblogs, Social Networking, etc... then
 you
  should work for Rojo!  If you recommend someone and we hire them
 you'll
  get a free iPod!
 
  Kevin A. Burton, Location - San Francisco, CA
 AIM/YIM - sfburtonator,  Web - http://peerfear.org/
  GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 
 
 
 
  -
 
 
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 -Original Message-
 From: John Wang [mailto:[EMAIL

Re: URGENT: Help indexing large document set

2004-11-24 Thread John Wang
Thanks Paul!

Using your suggestion, I have changed the update check code to use
only the indexReader:

try {
  localReader = IndexReader.open(path);

  while (keyIter.hasNext()) {
key = (String) keyIter.next();
term = new Term("key", key);
TermDocs tDocs = localReader.termDocs(term);
if (tDocs != null) {
  try {
while (tDocs.next()) {
  localReader.delete(tDocs.doc());
}
  } finally {
tDocs.close();
  }
}
  }
} finally {

  if (localReader != null) {
localReader.close();
  }

}


Unfortunately it didn't seem to make any dramatic difference.

I also see the CPU is only 30-50% busy, so I am guessing it's spending
a lot of time in I/O. Any way of making the CPU work harder?

Is batch size of 500 too small for 1 million documents?

Currently I am seeing a linear speed degradation of 0.3 milliseconds
per document.

Thanks

-John


On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot [EMAIL PROTECTED] wrote:
 On Wednesday 24 November 2004 00:37, John Wang wrote:
 
 
  Hi:
 
 I am trying to index 1M documents, with batches of 500 documents.
 
 Each document has an unique text key, which is added as a
  Field.KeyWord(name,value).
 
 For each batch of 500, I need to make sure I am not adding a
  document with a key that is already in the current index.
 
To do this, I am calling IndexSearcher.docFreq for each document and
  delete the document currently in the index with the same key:
 
 while (keyIter.hasNext()) {
  String objectID = (String) keyIter.next();
  term = new Term("key", objectID);
  int count = localSearcher.docFreq(term);
 
 To speed this up a bit make sure that the iterator gives
 the terms in sorted order. I'd use an index reader instead
 of a searcher, but that will probably not make a difference.
 
 Adding the documents can be done with multiple threads.
 Last time I checked that, there was a moderate speed up
 using three threads instead of one on a single CPU machine.
 Tuning the values of minMergeDocs and maxMergeDocs
 may also help to increase performance of adding documents.
 
 Regards,
 Paul Elschot
 
 -
 
 
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Too many open files issue

2004-11-24 Thread John Wang
I have also seen this problem.

In the Lucene code, I don't see where the reader specified when
creating a field is closed. That holds on to the file.

I am looking at DocumentWriter.invertDocument()
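
In the meantime, a defensive pattern on the application side is to keep a
reference to the Reader and close it after addDocument() returns; the reader
is consumed during addDocument(), so this is safe whether or not Lucene closes
it internally (sketch; the file name is illustrative):

Reader contents = new FileReader("/path/to/file.txt");
try {
  Document doc = new Document();
  doc.add(Field.Text("contents", contents));
  writer.addDocument(doc);   // tokenizes from the reader here
} finally {
  contents.close();          // release the file handle ourselves
}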

Thanks

-John


On Mon, 22 Nov 2004 16:21:35 -0600, Chris Lamprecht
[EMAIL PROTECTED] wrote:
 A useful resource for increasing the number of file handles on various
 operating systems is the Volano Report:
 
 http://www.volano.com/report/
 
 
 
  I had requested help on an issue we have been facing with the Too many
  open files Exception garbling the search indexes and crashing the
  search on the web site.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Hi:

   I am trying to index 1M documents, with batches of 500 documents.

   Each document has a unique text key, which is added as a
Field.KeyWord(name,value).

   For each batch of 500, I need to make sure I am not adding a
document with a key that is already in the current index.

  To do this, I am calling IndexSearcher.docFreq for each document and
deleting the document currently in the index with the same key:
 
   while (keyIter.hasNext()) {
     String objectID = (String) keyIter.next();
     term = new Term("key", objectID);
     int count = localSearcher.docFreq(term);

     if (count != 0) {
       localReader.delete(term);
     }
   }

Then I proceed with adding the documents.

This turns out to be extremely expensive. I looked into the code and I see that
TermInfosReader.get(Term term) is doing a linear lookup for each
term. So as the index grows, the above operation degrades at a linear
rate. So for each commit, we are doing a docFreq for 500 documents.

I also tried to create a BooleanQuery composed of 500 TermQueries and
do 1 search for each batch, and the performance didn't get better. And
if the batch size increases to say 50,000, creating a BooleanQuery
composed of 50,000 TermQuery instances may introduce huge memory
costs.

Is there a better way to do this?

Can TermInfosReader.get(Term term) be optimized to do a binary lookup
instead of a linear walk? Of course that depends on whether the terms
are stored in sorted order, are they?

This is very urgent, thanks in advance for all your help.

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Thanks Chuck! I missed the call: getIndexOffset.
I am profiling it again to pinpoint where the performance problem is.

-John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 Are you sure you have a performance problem with
 TermInfosReader.get(Term)?  It looks to me like it scans sequentially
 only within a small buffer window (of size
 SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
 See TermInfosReader.getIndexOffset(Term).
 
 Chuck
 
 
 
   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 3:38 PM
   To: [EMAIL PROTECTED]
   Subject: URGENT: Help indexing large document set
  
   Hi:
  
  I am trying to index 1M documents, with batches of 500 documents.
  
  Each document has an unique text key, which is added as a
   Field.KeyWord(name,value).
  
  For each batch of 500, I need to make sure I am not adding a
   document with a key that is already in the current index.
  
 To do this, I am calling IndexSearcher.docFreq for each document
 and
   delete the document currently in the index with the same key:
  
   while (keyIter.hasNext()) {
     String objectID = (String) keyIter.next();
     term = new Term("key", objectID);
     int count = localSearcher.docFreq(term);
   
     if (count != 0) {
       localReader.delete(term);
     }
   }
  
   Then I proceed with adding the documents.
  
   This turns out to be extremely expensive, I looked into the code and
 I
   see in
   TermInfosReader.get(Term term) it is doing a linear look up for each
   term. So as the index grows, the above operation degrades at a
 linear
   rate. So for each commit, we are doing a docFreq for 500 documents.
  
   I also tried to create a BooleanQuery composed of 500 TermQueries
 and
   do 1 search for each batch, and the performance didn't get better.
 And
   if the batch size increases to say 50,000, creating a BooleanQuery
   composed of 50,000 TermQuery instances may introduce huge memory
   costs.
  
   Is there a better way to do this?
  
   Can TermInfosReader.get(Term term) be optimized to do a binary
 lookup
   instead of a linear walk? Of course that depends on whether the
 terms
   are stored in sorted order, are they?
  
   This is very urgent, thanks in advance for all your help.
  
   -John
  
  
 -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



indexing benchmark

2004-11-22 Thread John Wang
Hi folks:

 Is there an indexing benchmark somewhere? I see a search
benchmark on the lucene home site.

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index in RAM - is it realy worthy?

2004-11-22 Thread John Wang
In my test, I have 12,900 documents. Each document is small: a few
discrete fields (Keyword type) and 1 Text field containing only 1
sentence.

with both mergeFactor and maxMergeDocs being 1000

using RamDirectory, the indexing job took about 9.2 seconds

not using RamDirectory, the indexing job took about 122 seconds.

I am not calling optimize.

This is on windows Xp running java 1.5.

Is there something very wrong or different in my setup to cause such a
big difference?


Thanks

-John


On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 For the Lucene book I wrote some test cases that compare FSDirectory
 and RAMDirectory.  What I found was that with certain settings
 FSDirectory was almost as fast as RAMDirectory.  Personally, I would
 push FSDirectory and hope that the OS and the Filesystem do their share
 of work and caching for me before looking for ways to optimize my code.
 
 Otis
 
 
 
 --- [EMAIL PROTECTED] wrote:
 
 
  I did following test:
  I created  the RAM folder on my Red Hat box and copied   c. 1Gb of
  indexes
  there.
  I expected the queries to run much quicker.
  In reality it was even sometimes slower(sic!)
 
  Lucene has it's own RAM disk functionality. If I implement it, would
  it
  bring any benefits?
 
  Thanks in advance
  J.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



lucene file locking question

2004-11-11 Thread John Wang
Hi folks:

  My application builds a super-index around the Lucene index,
i.e. it stores some additional information outside of Lucene.

   I am using my own locking outside of the lucene index via
FileLock object in the jdk1.4 nio package.

   My code does the following:

FileLock lock = null;
try {
    lock = myLockFileChannel.lock();

    // index into Lucene

    // index additional information

} finally {
    try {
        // commit the Lucene index by closing the IndexWriter instance
    } finally {
        if (lock != null) {
            lock.release();
        }
    }
}


Now here is the weird thing: if I terminate the process in the middle
of indexing and run the program again, I get a "Lock obtain
timed out" exception; as long as I delete the stale lock file, the
index remains uncorrupted.

However, if I turn Lucene file locking off, since I have a lock outside it anyway
(by doing:
static {
    System.setProperty("disableLuceneLocks", "true");
}
)

and do the same thing, I instead get an unrecoverably corrupted index.

Does the Lucene lock really guarantee index integrity under this kind of
abuse, or am I just getting lucky?
If so, can someone shed some light on how?

Thanks in advance

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene customized indexing

2004-07-21 Thread John Wang
Hi Eric and Grant:

 Thanks for the replies; this is certainly encouraging. As
suggested, I will post further such discussions to the dev list.

Thanks

-John

On Tue, 20 Jul 2004 15:37:35 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
 It seems to me the answer to this is not necessarily to open up the API, but to 
 provide a mechanism for adding Writers and Readers to the indexing/searching process 
 at the application level.  These readers and writers could be passed to Lucene and 
 used to read and write to separate files (thus, not harming the index file format).  
 They could be used to read/write an arbitrary amount of metadata at the term, 
 document and/or index level w/o affecting the core Lucene index.  Furthermore, 
 previous versions could still work b/c they would just ignore the new files and the 
 indexes could be used by other applications as well.
 
 This is just a thought in the infancy stage, but it seems like it would solve the 
 problem.  Of course, the trick is figuring out how it fits into the API (or maybe it 
 becomes a part of 2.0).  Not sure if it is even feasible, but it seems like you 
 could define interfaces for Readers and Writers that met the requirements to do this.
 
 This may be better discussed on the dev list.
 
  [EMAIL PROTECTED] 07/20/04 11:28AM 
 
 
 Hi:
    I am trying to store some database-like field values in Lucene.
 I have my own way of storing field values in a customized format.
 
    I guess my question is whether we can make the Reader/Writer
 classes, e.g. FieldReader, FieldWriter, DocumentReader/Writer classes
 non-final?
 
   I have asked to make the Lucene API less restrictive many many many
 times but got no replies. Is this request feasible?
 
 Thanks
 
 -John
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: speeding up lucene search

2004-07-21 Thread John Wang
In general, yes.
By splitting up a large index into smaller indicies, you are
linearizing the search time.
Furthermore, that allows you to make your search distributable; see the
sketch below.
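
A rough sketch of searching several sub-indexes together (class names are from
the org.apache.lucene.search package; the index paths are illustrative):

Searchable[] shards = new Searchable[] {
  new IndexSearcher("/indexes/part1"),
  new IndexSearcher("/indexes/part2"),
  new IndexSearcher("/indexes/part3")
};
// MultiSearcher searches the shards sequentially;
// ParallelMultiSearcher uses one thread per shard.
Searcher searcher = new ParallelMultiSearcher(shards);
Hits hits = searcher.search(query);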

-John

On Wed, 21 Jul 2004 13:00:28 +1000, Anson Lau [EMAIL PROTECTED] wrote:
 Hello guys,
 
 What are some general techniques to make lucene search faster?
 
 I'm thinking about splitting up the index.  My current index has approx 1.8
 million documents (small documents) and index size is about 550MB.  Am I
 likely to get much gain out of splitting it up and use a
 multiparallelsearcher?
 
 Most of my search queries search queries search on 5-10 fields.
 
 Are there other things I should look at?
 
 Thanks to all,
 Anson
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



lucene customized indexing

2004-07-20 Thread John Wang
Hi:
   I am trying to store some database-like field values in Lucene.
I have my own way of storing field values in a customized format.

   I guess my question is whether we can make the Reader/Writer
classes, e.g. FieldReader, FieldWriter, DocumentReader/Writer classes
non-final?

   I have asked to make the Lucene API less restrictive many many many
times but got no replies. Is this request feasible?

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene customized indexing

2004-07-20 Thread John Wang
Hi Daniel:

 There are a few things I want to do to be able to customize Lucene:

1) to be able to plug in a different similarity model (e.g. bayesian,
vector space etc.)

2) to be able to store certain fields in their own format and provide
corresponding readers. I may not want to store every field in the
lexicon/inverted index structure. I may have fields for which it doesn't make
sense to store position or frequency information.

3) to be able to customize analyzers to add more information to the
Token while doing tokenization.

Oleg mentioned the HayStack project. In the HayStack source
code, they had to modify many Lucene classes to make them non-final in
order to customize. They make sure during deployment that their versions
get loaded before the same classes in the Lucene .jar. It is
cumbersome, but it is a Lucene restriction they had to live with.

I believe there are many other users who feel the same way.

If I write some classes that derive from the Lucene API and they
break, then it is my responsibility to fix them. I don't understand why
it would add a burden to the Lucene developers.

Thanks

-John

On Tue, 20 Jul 2004 17:56:26 +0200, Daniel Naber
[EMAIL PROTECTED] wrote:
 On Tuesday 20 July 2004 17:28, John Wang wrote:
 
 I have asked to make the Lucene API less restrictive many many many
  times but got no replies.
 
 I suggest you just change it in your source and see if it works. Then you can
 still explain what exactly you did and why it's useful. From the developers
 point-of-view having things non-final means more stuff is exposed and making
 changes is more difficult (unless one accepts that derived classes may break
 with the next update).
 
 Regards
 Daniel
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene customized indexing

2004-07-20 Thread John Wang
On Tue, 20 Jul 2004 13:40:28 -0400, Erik Hatcher
[EMAIL PROTECTED] wrote:
 On Jul 20, 2004, at 12:12 PM, John Wang wrote:
   There are few things I want to do to be able to customize lucene:
 
 [...]
 
  3) to be able to customize analyzers to add more information to the
  Token while doing tokenization.
 
 I have already provided my opinion on this one - I think it would be
 fine to allow Token to be public.  I'll let others respond to the
 additional requests you've made.

Great, what processes need to be in place before this gets in the code base? 
 
  Oleg mentioned about the HayStack project. In the HayStack source
  code, they had to modifiy many lucene class to make them non-final in
  order to customzie. They make sure during deployment their versions
  gets loaded before the same classes in the lucene .jar. It is
  cumbersome, but it is a Lucene restriction they had to live with.
 
 Wow - I didn't realize that they've made local changes.  Did they post
 with requests for opening things up as you have?  Did they submit
 patches with their local changes?
 
  I believe there are many other users feel the same way.
 
 Then they should speak up :)

Well, I AM speaking up. So have some other people in earlier emails.
But like me, they are getting ignored. The HayStack changes were needed
specifically because many classes are declared final and are not
extensible.

 
  If I write some classes that derives from the lucene API and it
  breaks, then it is my responsibility to fix it. I don't understand why
  it would add burden to the Lucene developers.
 
 Making things extensible for no good reason is asking for maintenance
 troubles later when you need more control internally.  Lucene has been
 well designed from the start with extensibility only where it was
 needed in mind.  It has evolved to be more open in very specific areas
 after careful consideration of the performance impact has been weighed.
  Breaking is not really the concern with extensibility, I don't
 think.  Real-world use cases are needed to show that changes need to be
 made.

I thought I gave many real-world use cases in the previous email.
And evidently they also apply to the Haystack project. What other
information do we need to provide?

I don't want to diverge from the Lucene codebase like Haystack has
done. But I may not have a choice.

Thanks

-John

 
Erik
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene customized indexing

2004-07-20 Thread John Wang
That is exactly what they did, and that's probably what I will have to do.
But that means we are diverging from the Lucene code base, and future
fixes and enhancements need to be synchronized, which may be a pain.

-John

On Tue, 20 Jul 2004 20:03:05 +0200, Daniel Naber
[EMAIL PROTECTED] wrote:
 On Tuesday 20 July 2004 18:12, John Wang wrote:
 
  They make sure during deployment their versions
  gets loaded before the same classes in the lucene .jar.
 
 I don't see why people cannot just make their own lucene.jar. Just remove
 the final and recompile. Finally, Lucene is Open Source.
 
 Regards
 Daniel
 
 --
 http://www.danielnaber.de
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why is Field.java final?

2004-07-10 Thread John Wang
I was running into similar problems with Lucene classes being
final, in my case the Token class. I sent out an email but no one
responded :(

-John

On Sat, 10 Jul 2004 15:50:28 -0700, Kevin A. Burton
[EMAIL PROTECTED] wrote:
 I was going to create a new IDField class which just calls super( name,
 value, false, true, false) but noticed I was prevented because
 Field.java is final?
 
 Why is this?  I can't see any harm in making it non-final...
 
 Kevin
 
 --
 
 Please reply using PGP.
 
http://peerfear.org/pubkey.asc
 
NewsMonster - http://www.newsmonster.org/
 
 Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
 GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing help

2004-07-08 Thread John Wang
Hi Grant:
 Thanks for the options. How likely is the Lucene file format to change?

 Are there really no more options? :(...

Thanks

-John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
 Hi John,
 
 The source code is available from CVS, make it non-final and do what you need to do. 
  Of course, you may have a hard time finding help later if you aren't using 
 something everyone else is and your solution doesn't work...  :-)
 
 If I understand correctly what you are trying to do, you already know all of the 
 answers for indexing, you just want Lucene to do the retrieval side of the coin, 
 correct?  I suppose a crazy idea might be to write a program that took your info and 
 output it in the Lucene file format, but that seems a bit like overkill.
 
 -Grant
 
  [EMAIL PROTECTED] 07/07/04 07:37PM 
 
 
 Hi Doug:
 Thanks for the response!
 
 The solution you proposed is still a derivative of creating a
 dummy document stream. Taking the same example, java (5), lucene (6),
  VectorTokenStream would create a total of 11 Tokens whereas only 2 are
  necessary.
 
Given many documents with many terms and frequencies, it would
 create many extra Token instances.
 
    The reason I was looking at deriving from the Field class is because I
 can directly manipulate the FieldInfo by setting the frequency. But
 the class is final...
 
   Any other suggestions?
 
 Thanks
 
 -John
 
 On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
  John Wang wrote:
While lucene tokenizes the words in the document, it counts the
   frequency and figures out the position, we are trying to bypass this
    stage: For each document, I have a set of words with a known frequency,
   e.g. java (5), lucene (6) etc. (I don't care about the position, so it
   can always be 0.)
  
What I can do now is to create a dummy document, e.g. java java
   java java java lucene lucene lucene lucene lucene and pass it to
   lucene.
  
This seems hacky and cumbersome. Is there a better alternative? I
   browsed around in the source code, but couldn't find anything.
 
  Write an analyzer that returns terms with the appropriate distribution.
 
  For example:
 
  public class VectorTokenStream extends TokenStream {
    private String[] terms;
    private int[] freqs;
    private int term;
    private int freq;
    public VectorTokenStream(String[] terms, int[] freqs) {
      this.terms = terms;
      this.freqs = freqs;
    }
    public Token next() {
      if (freq == 0) {
        term++;
        if (term >= terms.length)
          return null;
        freq = freqs[term];
      }
      freq--;
      return new Token(terms[term], 0, 0);
    }
  }
 
  Document doc = new Document();
  doc.add(Field.Text("content", ""));
  indexWriter.addDocument(doc, new Analyzer() {
    public TokenStream tokenStream(String field, Reader reader) {
      return new VectorTokenStream(new String[] {"java", "lucene"},
                                   new int[] {5, 6});
    }
  });
 
 Too bad the Field class is final, otherwise I can derive from it
   and do something on that line...
 
  Extending Field would not help.  That's why it's final.
 
  Doug
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing help

2004-07-08 Thread John Wang
Hi Grant:

  I have something that would extract only the important words from
a document along with their importance; furthermore, these important
words may not be physically in the document, they could be synonyms of
some of the words in the document. So the output of the process for a
document is a list of word/importance pairs.

I want to be able to query using only these words on the document. 

   I don't think Lucene has such a capability.

   Can you suggest what I can do with the analyzer process to do
this without replicating words/tokens?

Thanks

-John

On Thu, 08 Jul 2004 11:10:07 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
 Hey John,
 
 Those are just options, didn't say they were good ones!  :-)
 
 I guess the real question is, what is the background of what you are trying to do?  
 Presumably you have some other program that is generating frequencies for you, do 
 you really need that in the current form?  Can't the Lucene indexing engine act as a 
 stand-in for this process since your end result _should_ be the same?  The Lucene 
 Analyzer process is quite flexible, I bet you could even find a way to hook in your 
 existing tools into the Analyzer process.
 
 -Grant
 
  [EMAIL PROTECTED] 07/08/04 10:42AM 
 
 
 Hi Grant:
  Thanks for the options. How likely is the Lucene file format to change?
  
  Are there really no more options? :(...
 
 Thanks
 
 -John
 
 On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
  Hi John,
 
  The source code is available from CVS, make it non-final and do what you need to 
  do.  Of course, you may have a hard time finding help later if you aren't using 
  something everyone else is and your solution doesn't work...  :-)
 
  If I understand correctly what you are trying to do, you already know all of the 
  answers for indexing, you just want Lucene to do the retrieval side of the coin, 
  correct?  I suppose a crazy idea might be to write a program that took your info 
  and output it in the Lucene file format, but that seems a bit like overkill.
 
  -Grant
 
   [EMAIL PROTECTED] 07/07/04 07:37PM 
 
 
  Hi Doug:
  Thanks for the response!
 
  The solution you proposed is still a derivative of creating a
  dummy document stream. Taking the same example, java (5), lucene (6),
  VectorTokenStream would create a total of 11 Tokens whereas only 2 is
  neccessary.
 
 Given many documents with many terms and frequencies, it would
  create many extra Token instances.
 
The reason I was looking to derving the Field class is because I
  can directly manipulate the FieldInfo by setting the frequency. But
  the class is final...
 
Any other suggestions?
 
  Thanks
 
  -John
 
  On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
   John Wang wrote:
 While lucene tokenizes the words in the document, it counts the
frequency and figures out the position, we are trying to bypass this
stage: For each document, I have a set of words with a know frequency,
e.g. java (5), lucene (6) etc. (I don't care about the position, so it
can always be 0.)
   
 What I can do now is to create a dummy document, e.g. java java
java java java lucene lucene lucene lucene lucene and pass it to
lucene.
   
 This seems hacky and cumbersome. Is there a better alternative? I
browsed around in the source code, but couldn't find anything.
  
   Write an analyzer that returns terms with the appropriate distribution.
  
   For example:
  
    public class VectorTokenStream extends TokenStream {
      private String[] terms;
      private int[] freqs;
      private int term;
      private int freq;
      public VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
      }
      public Token next() {
        if (freq == 0) {
          term++;
          if (term >= terms.length)
            return null;
          freq = freqs[term];
        }
        freq--;
        return new Token(terms[term], 0, 0);
      }
    }
   
    Document doc = new Document();
    doc.add(Field.Text("content", ""));
    indexWriter.addDocument(doc, new Analyzer() {
      public TokenStream tokenStream(String field, Reader reader) {
        return new VectorTokenStream(new String[] {"java", "lucene"},
                                     new int[] {5, 6});
      }
    });
  
  Too bad the Field class is final, otherwise I can derive from it
and do something on that line...
  
   Extending Field would not help.  That's why it's final.
  
   Doug
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands

Re: indexing help

2004-07-08 Thread John Wang
Thanks Doug. I will do just that.

Just for my education, can you elaborate on the "implement an
IndexReader that delivers a synthetic index" approach?

Thanks in advance

-John

On Thu, 08 Jul 2004 10:01:59 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
   The solution you proposed is still a derivative of creating a
  dummy document stream. Taking the same example, java (5), lucene (6),
  VectorTokenStream would create a total of 11 Tokens whereas only 2 is
  neccessary.
 
 That's easy to fix.  We just need to reuse the token:
 
  public class VectorTokenStream extends TokenStream {
    private String[] terms;
    private int[] freqs;
    private int term = -1;
    private int freq = 0;
    private Token token;
    public VectorTokenStream(String[] terms, int[] freqs) {
      this.terms = terms;
      this.freqs = freqs;
    }
    public Token next() {
      if (freq == 0) {
        term++;
        if (term >= terms.length)
          return null;
        token = new Token(terms[term], 0, 0);
        freq = freqs[term];
      }
      freq--;
      return token;
    }
  }
 
 Then only two tokens are created, as you desire.
 
 If you for some reason don't want to create a dummy document stream,
 then you could instead implement an IndexReader that delivers a
 synthetic index for a single document.  Then use
 IndexWriter.addIndexes() to turn this into a real, FSDirectory-based
 index.  However that would be a lot more work and only very marginally
 faster.  So I'd stick with the approach I've outlined above.  (Note:
 this code has not been compiled or run.  It may have bugs.)
 
 
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing help

2004-07-07 Thread John Wang
Hi Doug:
 Thanks for the response!

 The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 are
necessary.

Given many documents with many terms and frequencies, it would
create many extra Token instances.

   The reason I was looking at deriving from the Field class is because I
can directly manipulate the FieldInfo by setting the frequency. But
the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
   While lucene tokenizes the words in the document, it counts the
  frequency and figures out the position, we are trying to bypass this
  stage: For each document, I have a set of words with a known frequency,
  e.g. java (5), lucene (6) etc. (I don't care about the position, so it
  can always be 0.)
 
   What I can do now is to create a dummy document, e.g. java java
  java java java lucene lucene lucene lucene lucene and pass it to
  lucene.
 
   This seems hacky and cumbersome. Is there a better alternative? I
  browsed around in the source code, but couldn't find anything.
 
 Write an analyzer that returns terms with the appropriate distribution.
 
 For example:
 
  public class VectorTokenStream extends TokenStream {
    private String[] terms;
    private int[] freqs;
    private int term;
    private int freq;
    public VectorTokenStream(String[] terms, int[] freqs) {
      this.terms = terms;
      this.freqs = freqs;
    }
    public Token next() {
      if (freq == 0) {
        term++;
        if (term >= terms.length)
          return null;
        freq = freqs[term];
      }
      freq--;
      return new Token(terms[term], 0, 0);
    }
  }

  Document doc = new Document();
  doc.add(Field.Text("content", ""));
  indexWriter.addDocument(doc, new Analyzer() {
    public TokenStream tokenStream(String field, Reader reader) {
      return new VectorTokenStream(new String[] {"java", "lucene"},
                                   new int[] {5, 6});
    }
  });
 
   Too bad the Field class is final, otherwise I could derive from it
  and do something along those lines...
 
 Extending Field would not help.  That's why it's final.
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]