Re: Multiple Keywords/Keyphrases fields

2005-02-16 Thread Paul Elschot
On Wednesday 16 February 2005 06:49, Owen Densmore wrote:
  From: Erik Hatcher [EMAIL PROTECTED]
  Date: February 12, 2005 3:09:15 PM MST
  To: Lucene Users List lucene-user@jakarta.apache.org
  Subject: Re: Multiple Keywords/Keyphrases fields
 
 
  The real question to answer is what types of queries you're planning 
  on making.  Rather than look at it from indexing forward, consider it 
  from searching backwards.
 
  How will users query using those keyword phrases?
 
 Hi Erik.  Good point.
 
 There are two uses we are making of the keyphrases:
 
   - Graphical Navigation: A Flash graphical browser will allow users to 
 fly around in a space of documents, choosing what to be viewing: 
 Authors, Keyphrases and Textual terms.  In any of these cases, the 
 closeness of any of the fields will govern how close they will appear 
 graphically.  In the case of authors, we will weight collaboration .. 
 how often the authors work together.  In the case of Keyphrases, we 
 will want to use something like distance vectors like you show in the 
 book using the cosine measure.  Thus the keyphrases need to be separate 
 entities within the document .. it would be a bug for us if the terms 
 leaked across the separate keyphrases within the document.
 
   - Textual Search: In this case, we will have two ways to search the
 keyphrases.  The first would be like the graphical navigation above,
 where searching for "complex system" should require the terms to be in
 a single keyphrase.  The second way will be looser, where we may simply
 pool the keyphrases with titles and abstract, and allow them all to be
 searched together within the document.
 
 Does this make sense?  So the question from the search standpoint is:
 do multiple instances of a field act like there are barriers across the
 instances, or are they treated as a single instance?

Multiple field instances with the same name in a document are concatenated in
the index in the order in which they were added to the document.
For each instance of a field in the document, even when it has the same name,
the analyzer is asked to provide a new token stream.

This happens in org.apache.lucene.index.DocumentWriter.invertDocument().
The last position offset in the field as indexed is maintained for this
purpose.

 In terms of the closeness calculation, for example, can we get separate 
 term vectors for each instance of the keyphrase field, or will we get a 
 single vector combining all the keyphrase terms within a single 
 document?

The positions in the TermVectors are treated in the same way.

To put a barrier between field instances with the same name,
one can put a gap in the indexed term positions. Matching across this gap
requires a correspondingly larger query proximity. AND-like queries will
still match anywhere in the indexed field.

A gap is implemented by having the analyzer provide a token stream whose
first token has a position increment equal to the gap.
For the first field instance with a given name the gap is not needed.
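
As an illustration of the position arithmetic described above, here is a small
self-contained Java model. It does not use Lucene at all: the class
PositionGapSketch, its Posting type and the gap parameter are hypothetical
names chosen for this sketch, and tokenizing on whitespace is an assumption.
The first token of every field instance after the first receives a position
increment equal to the gap; all other tokens advance by one.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of term positions for several same-named field instances.
// NOT Lucene code: it only mimics the position bookkeeping described above.
class PositionGapSketch {

    /** One indexed term with its position inside the concatenated field. */
    static final class Posting {
        final String term;
        final int position;

        Posting(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Assigns positions to the whitespace-separated terms of each instance.
     * The first token of every instance after the first gets a position
     * increment of `gap`; every other token gets an increment of 1.
     */
    static List<Posting> index(List<String> instances, int gap) {
        List<Posting> postings = new ArrayList<>();
        int position = -1;            // position of the last token emitted so far
        boolean firstInstance = true;
        for (String instance : instances) {
            boolean firstToken = true;
            for (String term : instance.trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;         // an empty instance contributes no tokens
                }
                int increment = (firstToken && !firstInstance) ? gap : 1;
                position += increment;
                postings.add(new Posting(term, position));
                firstToken = false;
            }
            firstInstance = false;
        }
        return postings;
    }

    /** Smallest position distance between terms a and b, or -1 if either is absent. */
    static int distance(List<Posting> postings, String a, String b) {
        int best = -1;
        for (Posting p : postings) {
            if (!p.term.equals(a)) {
                continue;
            }
            for (Posting q : postings) {
                if (!q.term.equals(b)) {
                    continue;
                }
                int d = Math.abs(p.position - q.position);
                if (best < 0 || d < best) {
                    best = d;
                }
            }
        }
        return best;
    }
}
```

With a gap of 1 the instances behave like plain concatenation, so the last term
of one keyphrase sits directly next to the first term of the next. With a large
gap, a proximity query tuned for terms inside one keyphrase would no longer
reach across the boundary.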

Regards,
Paul Elschot

 
 I hope this is clear!  Kinda hard to articulate.
 
 Owen
 
  Erik
 
  On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
 
  I'm getting a bit more serious about the final form of our lucene 
  index.  Each document has DocNumber, Authors, Title, Abstract, and 
  Keywords.  By Keywords, I mean a comma separated list, each entry 
  having possibly many terms in a phrase like:
 temporal infomax, finite state automata, Markov chains,
 conditional entropy, neural information processing
 
  I presume I should be using a field "Keywords" which has many
  entries or instances per document (one per comma-separated
  phrase).  But I'm not sure of the right way to handle all this.  My
  assumption is that I should analyze them individually, just as we do
  for free text (the Abstract, for example), thus in the example above
  having 5 entries of the form
 doc.add(Field.Text("Keywords", "finite state automata"));
  etc., analyzing them because these are author-supplied strings with no
  canonical form.
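
A minimal sketch of the splitting step implied above, in plain Java with no
Lucene dependency. The class and method names (KeyphraseSplitter, split) are
hypothetical, and collapsing internal whitespace is an assumption made so that
a phrase wrapped across lines stays one phrase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns one comma-separated keyword string into
// individual keyphrases, one per future field instance.
class KeyphraseSplitter {

    /** Splits on commas, trims each entry, collapses inner whitespace, drops empties. */
    static List<String> split(String keywords) {
        List<String> phrases = new ArrayList<>();
        for (String part : keywords.split(",")) {
            String phrase = part.trim().replaceAll("\\s+", " ");
            if (!phrase.isEmpty()) {
                phrases.add(phrase);
            }
        }
        return phrases;
    }
}
```

Each returned phrase would then be added as its own "Keywords" field instance,
in the manner of the doc.add(...) call shown in the message.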
 
  For guidance, I looked in the archive and found the attached email,
  but I didn't see the answer.  (I'm not concerned about the dups; I
  presume that is equivalent to a boost of some sort.)  Does this seem
  right?
 
  Thanks once again.
 
  Owen
 
  From: [EMAIL PROTECTED] [EMAIL PROTECTED]
  Subject: Multiple equal Fields?
  Date: Tue, 17 Feb 2004 12:47:58 +0100
 
  Hi!
  What happens if I do this:
 
  doc.add(Field.Text("foo", "bar"));
  doc.add(Field.Text("foo", "blah"));
 
  Is there a field "foo" with value "blah", or are there two "foo"s
  (actually not possible), or is there one "foo" with the values "bar"
  and "blah"?
 
  And what happens in this case:
 
  doc.add(Field.Text("foo", "bar"));
  doc.add(Field.Text("foo", "bar"));
  doc.add(Field.Text("foo", "bar"));
 
  Does lucene store this only once?
 
  Timo
 
 
 
 

Re: Concurrent searching re-indexing

2005-02-16 Thread Otis Gospodnetic
Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that each create an IndexWriter for the same directory.
That's a no-no.

This section (first hit) describes the various concurrency issues with
regard to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
the segments file should be in a complete state.  It also reads index
files when searching, of course.

Otis


--- Paul Mellor [EMAIL PROTECTED] wrote:

 Hi,
 
 I've read from various sources on the Internet that it is perfectly safe to
 simultaneously search a Lucene index that is being updated from another
 Thread, as long as all write access to the index is synchronized.  But does
 this apply only to updating the index (i.e. deleting and adding documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
 
 I have a class which encapsulates all access to my index, so that writes can
 be synchronized.  This class also exposes a method to obtain an
 IndexSearcher for the index.  I'm running unit tests to test this which
 create many threads - each thread does a complete re-indexing and then
 obtains an IndexSearcher and does a query.
 
 I'm finding that with sufficiently high numbers of threads, I'm getting the
 occasional failure, with the following exception thrown when attempting to
 construct a new IndexWriter (during the reindexing) -
 
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
 at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
 at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
 at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
 ...
 
 The exception occurs quite infrequently (usually for somewhere between 1-5%
 of the Threads).
 
 Does the IndexSearcher take a 'snapshot' of the index at creation?  Or does
 it access the filesystem whilst searching?  I am also synchronizing creation
 of the IndexSearcher with the write lock, so that the IndexSearcher is not
 created whilst the index is being recreated (and vice versa).  But do I need
 to ensure that the IndexSearcher cannot search whilst the index is being
 recreated as well?
 
 Note that a similar unit test where the threads update the index (rather
 than recreate it from scratch) works fine, as expected.
 
 This is running on Windows 2000.
 
 Any help would be much appreciated!
 
 Paul
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Concurrent searching re-indexing

2005-02-16 Thread Paul Mellor
But all write access to the index is synchronized, so that although multiple
threads are creating an IndexWriter for the same directory and using it to
totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform
queries on an index that is in the process of being recreated from scratch,
but I guess that if the IndexSearcher takes a snapshot of the index when it
is created (and in my code this creation is synchronized with the write
operations as well so that the threads wait for the write operations to
finish before instantiating an IndexSearcher, and vice versa) this can't be
a problem.
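
The arrangement described above (one lock guarding both complete rebuilds and
the creation of searchers) can be modelled in a few lines of plain Java. This
is only a sketch under stated assumptions: GuardedIndex, rebuild and
openSnapshot are hypothetical names, the "index" is an in-memory list rather
than a Lucene directory, and nothing here reproduces file-system behaviour
such as files that are still held open by an earlier searcher.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// In-memory stand-in for an index. NOT Lucene; all names are hypothetical.
class GuardedIndex {
    private final Object writeLock = new Object();
    private List<String> documents = Collections.emptyList();

    /** Rebuilds the whole index from scratch; only one thread at a time gets in. */
    void rebuild(List<String> newDocuments) {
        synchronized (writeLock) {
            // Build the replacement completely before publishing it.
            List<String> fresh =
                    Collections.unmodifiableList(new ArrayList<>(newDocuments));
            documents = fresh;
        }
    }

    /**
     * Returns an immutable snapshot of the current contents. It is taken under
     * the same lock as rebuild(), so it never observes a half-built index, and
     * later rebuilds do not change a snapshot that was already handed out.
     */
    List<String> openSnapshot() {
        synchronized (writeLock) {
            return documents;
        }
    }
}
```

In this model a snapshot stays valid while the index is rebuilt because the old
list is never modified. Whether the same holds for on-disk index files depends
on the platform and is outside what the sketch can show.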

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching  re-indexing


Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that each create an IndexWriter for the same directory.
That's a no-no.

This section (first hit) describes the various concurrency issues with
regard to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
the segments file should be in a complete state.  It also reads index
files when searching, of course.

Otis



big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello.

I use PyLucene, the Python port of Lucene.

I have a problem using a big index (50 GB) with IndexSearcher
from many threads.
I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper
around a Java/libgcj thread that Python is tricked into thinking
is one of its own.

The core of the problem:
When I have many threads (more than 5) I receive this exception:
  File "/usr/lib/python2.4/site-packages/PyLucene.py", line 2241, in search
    def search(*args): return _PyLucene.Searcher_search(*args)
ValueError: java.lang.OutOfMemoryError
   No stacktrace available

When I decrease the number of threads to 3 or even 1, the search works.
How can the number of threads affect this exception?

I have 2 GB of memory. With one thread the process takes about
1200-1300 MB.

Andi Vajda suggested that "There may be overhead involved in having
multiple threads against a given index."

Does anyone here have experience in handling big indexes with many
threads?

Any ideas are appreciated.

Yura Smolsky.






Re: big index and multi threaded IndexSearcher

2005-02-16 Thread Erik Hatcher
Are you using multiple IndexSearcher instances?  Or only one and
sharing it across multiple threads?

If using a single shared IndexSearcher instance doesn't help, it may be 
beneficial to port your code to Java and try it there.
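
To make the "single shared instance" arrangement concrete, here is a hedged
Java sketch. FakeSearcher is a hypothetical stand-in for an expensive-to-open
searcher and is not a Lucene class; its search method just returns the query
length as a placeholder result. The point is only the shape: the searcher is
opened once and every worker thread receives the same reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class SharedSearcherSketch {

    /** Hypothetical stand-in for a searcher that is costly to open. NOT Lucene. */
    static final class FakeSearcher {
        static final AtomicInteger OPENED = new AtomicInteger();

        FakeSearcher() {
            OPENED.incrementAndGet();   // counts how many instances were created
        }

        int search(String query) {
            return query.length();      // placeholder for a real search result
        }
    }

    /** Runs all queries on a thread pool, sharing one searcher instance. */
    static List<Integer> searchAll(List<String> queries, int threads) {
        final FakeSearcher shared = new FakeSearcher();   // opened exactly once
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (final String query : queries) {
                Callable<Integer> task = () -> shared.search(query);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());   // results come back in submission order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The alternative, opening a new searcher inside each task, would multiply
whatever per-instance memory the searcher holds by the number of threads.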

I'm just now getting into PyLucene myself - building a demo for a Unix 
User's Group presentation I'm giving.

Erik



Re[2]: big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello, PA.


 Does anyone here have experience in handling big indexes with many
 threads?
P What about turning the problem around and splitting your index into
P several chunks? Then you could search those (smaller) indices in 
P parallel and consolidate the final result, no?

Well, I don't have 6 CPUs in one box :)
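
For reference, the "consolidate the final result" step of that suggestion can
be sketched in plain Java as follows. Hit, ShardMergeSketch and mergeTopK are
hypothetical names, not Lucene classes, and the sketch assumes that scores
from the different chunks are directly comparable, which may not hold when
each chunk computes its scores from its own term statistics.

```java
import java.util.ArrayList;
import java.util.List;

class ShardMergeSketch {

    /** A scored search hit from one index chunk (hypothetical type). */
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Merges the per-chunk hit lists and keeps the k best by descending score. */
    static List<Hit> mergeTopK(List<List<Hit>> perChunk, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perChunk) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Float.compare(b.score, a.score));   // highest score first
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

The merge itself does not care whether the chunks were searched in parallel or
one after another.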

Yura Smolsky.






Re[2]: big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello, Erik.

EH Are you using multiple IndexSearcher instances?  Or only one and
EH sharing it across multiple threads?

EH If using a single shared IndexSearcher instance doesn't help, it may be
EH beneficial to port your code to Java and try it there.

I have a single instance of IndexSearcher and I pass a reference to it to
each thread. I will port the code to Java if no other ideas come to
mind...


Yura Smolsky.






knowing which field contributed the search result

2005-02-16 Thread John Wang
Hi:

   Is there a way, given a hit from a search, to find out which
fields contributed to the hit?

e.g.

If I search for:

contents1=brown fox OR contents2=black bear

can the document found by this query also carry information on
whether it was found via contents1, contents2, or both?
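
One generic way to think about the question is to evaluate each field clause
on its own against a returned document. The sketch below does this in plain
Java on in-memory maps. It is not Lucene code: FieldMatchSketch and
matchingFields are hypothetical names, and a case-insensitive substring test
stands in for real phrase matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FieldMatchSketch {

    /**
     * doc maps field name to field text; fieldQueries maps field name to the
     * phrase searched in that field. Returns the names of the fields whose
     * text contains their phrase, in the iteration order of fieldQueries.
     */
    static List<String> matchingFields(Map<String, String> doc,
                                       Map<String, String> fieldQueries) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, String> clause : fieldQueries.entrySet()) {
            String text = doc.get(clause.getKey());
            if (text == null) {
                continue;   // the document has no such field
            }
            String haystack = text.toLowerCase(Locale.ROOT);
            String needle = clause.getValue().toLowerCase(Locale.ROOT);
            if (haystack.contains(needle)) {
                matched.add(clause.getKey());
            }
        }
        return matched;
    }
}
```

For the example query above, a document whose contents1 holds "the quick brown
fox" and whose contents2 holds "a polar bear" would be reported as matched via
contents1 only.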


Thanks

-John




bookkeeping documents cause problem in Sort

2005-02-16 Thread aurora
I understand that, unlike a relational database, Lucene is flexible in allowing
documents with different sets of fields. My index has documents with a date
and a content field. There are also a few bookkeeping documents that do
not have the date field. Things work well except in one case:

  Sort sort = new Sort("date");
  searcher.search(query, sort);

In this case an exception is thrown:

  java.lang.RuntimeException: field "date" does not appear to be indexed

It does not make sense to sort by 'date' when a document does not have
'date'. On the other hand, I don't expect search() to return any
bookkeeping documents at all, since the current query looks for fields that
are not in those documents. Is this an implementation issue, or is there an
inherent reason all documents need to have the 'date' field if it is sorted on?
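
As a point of comparison only, this is how a sort that tolerates a missing
key can look in plain Java: documents without the field are simply ordered
last. It says nothing about how Lucene's sorting is implemented. The names
MissingFieldSortSketch and sortByField are hypothetical, and documents are
modelled as maps from field name to string value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class MissingFieldSortSketch {

    /**
     * Returns a copy of docs sorted by the given field in ascending string
     * order; documents that lack the field are placed at the end.
     */
    static List<Map<String, String>> sortByField(List<Map<String, String>> docs,
                                                 String field) {
        List<Map<String, String>> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparing(
                (Map<String, String> doc) -> doc.get(field),   // null when absent
                Comparator.nullsLast(Comparator.<String>naturalOrder())));
        return sorted;
    }
}
```

Dates stored as sortable strings such as 20050216 (an assumed format) order
correctly under this comparison.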
