MergerIndex + Searchables

2004-12-21 Thread Karthik N S
Hi guys,

Apologies...

I have several MERGERINDEXES [MGR1, MGR2, MGR3].

To search across these MERGERINDEXES I use the following code:

    IndexSearcher[] indexToSearch = new IndexSearcher[CNTINDXDBOOK];

    for (int all = 0; all < CNTINDXDBOOK; all++) {
        indexToSearch[all] = new IndexSearcher(INDEXEDBOOKS[all]);
        System.out.println(all + " ADDED TO SEARCHABLES " + INDEXEDBOOKS[all]);
    }

    MultiSearcher searcher = new MultiSearcher(indexToSearch);

Question:

During a search, how can I display which MGR a relevant document ID
originated from?

[Something like this: search word 'ISBN12345' is available from MGRx]



  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Synonyms for AND/OR/NOT operators

2004-12-21 Thread Sanyi
Hi!

What is the simplest way to add synonyms for AND/OR/NOT operators?
I'd like to support two sets of operator words, so people can use either the
original English operators or my custom ones for our local language.

Thank you for your attention!
Sanyi






Re: MergerIndex + Searchables

2004-12-21 Thread Nader Henein
As obvious as it may seem, you could always store the ID of the index in
which you are indexing the document in the document itself and have that
fetched with the search results. Or is there something stopping you from
doing that?

Nader Henein


Re: index size doubled?

2004-12-21 Thread Paul Elschot
On Tuesday 21 December 2004 05:49, aurora wrote:
> I'm testing the rebuilding of the index. I add several hundred documents,
> optimize and add another few hundred and so on. Right now I have around
> 7000 files. I observed that after the index gets to a certain size, every
> time after optimize, there are two files of roughly the same size, like below:
>
> 12/20/2004  01:57p          13 deletable
> 12/20/2004  01:57p          29 segments
> 12/20/2004  01:53p  14,460,367 _5qf.cfs
> 12/20/2004  01:57p  15,069,013 _5zr.cfs
>
> The total index size is double what I expect. This is not always
> reproducible. (I'm constantly tuning my program and the set of documents.)
> Sometimes I get a decent single document after optimize. What was happening?

Lucene tried to delete the older version (_5qf.cfs above), but got an error
back from the file system. After that it put the name of that segment in
the deletable file, so it can try to delete that segment later.

This is known behaviour on FAT file systems. These randomly take some time
for themselves to finish closing a file after it has been correctly closed by
a program.

Regards,
Paul Elschot





Re: MergerIndex + Searchables

2004-12-21 Thread Paul Elschot
Karthik,

On Tuesday 21 December 2004 09:04, Karthik N S wrote:
> I have several MERGERINDEXES [MGR1, MGR2, MGR3].
> [...]
> Question:
>
> During a search, how can I display which MGR a relevant document ID
> originated from?

I think you are looking for the methods subSearcher() and subDoc() on
MultiSearcher.
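
A rough sketch of what those two methods do, assuming the combined searcher numbers documents by laying the sub-indexes end to end: subSearcher(n) gives the position of the sub-index that holds combined document number n, and subDoc(n) gives that document's number inside it. The class below is a stand-alone model in plain Java, not Lucene code; SubIndexLocator, the labels and the document counts are made up for illustration.

```java
// Stand-alone model of mapping a combined document number back to its sub-index.
// Not Lucene code: the class name, labels and counts are illustrative assumptions.
class SubIndexLocator {
    private final String[] names; // hypothetical labels, e.g. "MGR1"
    private final int[] starts;   // starts[i] = first combined doc number of sub-index i

    SubIndexLocator(String[] names, int[] docCounts) {
        if (names.length != docCounts.length) {
            throw new IllegalArgumentException("one name per sub-index expected");
        }
        this.names = names.clone();
        this.starts = new int[docCounts.length + 1];
        for (int i = 0; i < docCounts.length; i++) {
            starts[i + 1] = starts[i] + docCounts[i]; // last entry = total doc count
        }
    }

    /** Position of the sub-index that holds combined document number n. */
    int subSearcher(int n) {
        if (n < 0 || n >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("document number out of range: " + n);
        }
        int i = 0;
        while (n >= starts[i + 1]) {
            i++; // linear scan is fine for a handful of sub-indexes
        }
        return i;
    }

    /** Document number of n inside its own sub-index. */
    int subDoc(int n) {
        return n - starts[subSearcher(n)];
    }

    /** Label of the sub-index that holds combined document number n. */
    String nameOf(int n) {
        return names[subSearcher(n)];
    }

    public static void main(String[] args) {
        SubIndexLocator loc = new SubIndexLocator(
                new String[] {"MGR1", "MGR2", "MGR3"}, new int[] {3, 4, 5});
        int hit = 5; // combined document number of some search hit
        System.out.println("doc " + hit + " is doc " + loc.subDoc(hit) + " of " + loc.nameOf(hit));
    }
}
```

With a real MultiSearcher, the value returned by subSearcher(hits.id(i)) should index into the same array that was passed to the constructor, so it can be used to look up the matching entry of INDEXEDBOOKS.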

Regards,
Paul Elschot







Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Erik Hatcher
On Dec 21, 2004, at 3:04 AM, Sanyi wrote:
> What is the simplest way to add synonyms for AND/OR/NOT operators?
> I'd like to support two sets of operator words, so people can use
> either the original english
> operators and my custom ones for our local language.

There are two options that I know of: 1) add synonyms during indexing 
and 2) add synonyms during querying.  Generally this would be done 
using a custom analyzer.

If the synonym mappings are static and you don't mind a larger index, 
adding them during indexing avoids the complexity of rewriting the 
query.  Injecting synonyms during querying allows the synonym mappings 
to change dynamically, though does produce more complex queries.  
Here's an example you'll find with the source code distribution of 
Lucene in Action which uses WordNet to look up synonyms.

Erik
p.s. I'm sensitive to over-marketing Lucene in Action in this forum as 
it would bother me to constantly see an advertisement.  You can be sure 
that any mentions of it from me will coincide with concrete examples 
(which are freely available) that are directly related to questions 
being asked.

% ant -emacs SynonymAnalyzerViewer
Buildfile: build.xml
check-environment:
compile:
build-test-index:
build-perf-index:
prepare:
SynonymAnalyzerViewer:
  Using a custom SynonymAnalyzer, two fixed strings are
  analyzed with the results displayed.  Synonyms, from the
  WordNet database, are injected into the same positions
  as the original words.
  See the Analysis chapter for more on synonym injection and
  position increments.  The Tools and extensions chapter covers
  the WordNet feature found in the Lucene sandbox.
Press return to continue...
Running lia.analysis.synonym.SynonymAnalyzerViewer...
1: [quick] [warm] [straightaway] [spry] [speedy] [ready] [quickly] 
[promptly] [prompt] [nimble] [immediate] [flying] [fast] [agile]
2: [brown] [brownness] [brownish]
3: [fox] [trick] [throw] [slyboots] [fuddle] [fob] [dodger] 
[discombobulate] [confuse] [confound] [befuddle] [bedevil]
4: [jumps]
5: [over] [o] [across]
6: [lazy] [faineant] [indolent] [otiose] [slothful]
7: [dogs]

1: [oh]
2: [we]
3: [get] [acquire] [aim] [amaze] [arrest] [arrive] [baffle] [beat] 
[become] [beget] [begin] [bewilder] [bring] [can] [capture] [catch] 
[cause] [come] [commence] [contract] [convey] [develop] [draw] [drive] 
[dumbfound] [engender] [experience] [father] [fetch] [find] [fix] 
[flummox] [generate] [go] [gravel] [grow] [have] [incur] [induce] [let] 
[make] [may] [mother] [mystify] [nonplus] [obtain] [perplex] [produce] 
[puzzle] [receive] [scram] [sire] [start] [stimulate] [stupefy] 
[stupify] [suffer] [sustain] [take] [trounce] [undergo]
4: [both]
5: [kinds]
6: [country] [state] [nationality] [nation] [land] [commonwealth] [area]
7: [western] [westerly]
8: [bb]

BUILD SUCCESSFUL
Total time: 10 seconds


Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Sanyi
Hi!

I think we're talking about different things.
My question is about using synonyms for the AND/OR/NOT operators, not about
synonyms of words in the index.
For example, in some language: AND = AANNDD; OR = OORR; NOT = NNOOTT

So, the user can enter either:

(cat OR kitty) AND black AND tail

or:

(cat OORR kitty) AANNDD black AANNDD tail

Both sets of operators must work.
It must be some kind of query parser modification/parameterization, so it has
nothing to do with the index.

I hope I was more specific now ;)

Thanx,
Sanyi







Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Morus Walter
Erik Hatcher writes:
> On Dec 21, 2004, at 3:04 AM, Sanyi wrote:
> > What is the simplest way to add synonyms for AND/OR/NOT operators?
> > I'd like to support two sets of operator words, so people can use
> > either the original english
> > operators and my custom ones for our local language.
>
> There are two options that I know of: 1) add synonyms during indexing
> and 2) add synonyms during querying.  Generally this would be done
> using a custom analyzer.

I guess you misunderstood the question.

I think he wants to know how to create a query parser that understands
something like 'a UND b' as well as 'a AND b', to support localized
operator names (German in this case).

AFAIK that can only be done by copying the query parser's JavaCC source and
adding the operators there.
It shouldn't be difficult, though it's a bit ugly since it implies code
duplication. And there will be no way of choosing the operators dynamically
at runtime. One will need to have different query parsers for different
languages.

Morus




Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Erik Hatcher
Wow, I really did misunderstand.  My apologies.
Yes, you will need to fork QueryParser.jj and install JavaCC to build 
your custom parser.  It should be pretty trivial to add alternatives to 
AND(+)/OR/NOT(-).

Erik


Lucene index files from two different applications.

2004-12-21 Thread Gururaja H
Hi !
 
We have two applications. Both are supposed to write Lucene index files, and
the WebApplication is supposed to read these index files.

Here are the questions:
1.  Can two applications write index files in the same directory at the same
    time?
2.  If two applications cannot write index files in the same directory at the
    same time, how should we resolve this? I would appreciate any solutions.
3.  My thought is to write the index files in two different directories and
    read both indexes from the WebApplication (as though they form a single
    index; search results should consider the documents in both indexes).
    How do I go about implementing this with the Lucene API? I need input on
    which of the Lucene APIs to use.
 
  
 
Thanks,
Gururaja


Re: Lucene index files from two different applications.

2004-12-21 Thread Sergiu Gordea
Gururaja H wrote:
> Hi !
> Have two applications.  Both are supposed
> to write Lucene index files and the WebApplication is supposed to read
> these index files.
> Here are the questions:
> 1.  Can two applications write index files, in the same directory, at the same time ?

If you implement the synchronisation between these 2 applications, yes.

> 2.  If two applications cannot write index files, in the same directory, at the same time.
> How should we resolve this ?  Would appriciate any solutions to this...

... see 1. and 3.

> 3.  My thought is to write the index files in two different directories and read both the indexes
> (as though it forms a single index, search results should consider the documents in both the indexes) from the WebApplication.  How to go about implementing this, using Lucene API ?  Need inputs on which of the Lucene API's to use ?

If your requirements allow you to create two independent indices, then you
can use the MultiSearcher to search both indices.
Maybe this will be the most cost-effective solution in your case.

Best,
 Sergiu
 


Re: Lucene index files from two different applications.

2004-12-21 Thread Erik Hatcher
On Dec 21, 2004, at 5:51 AM, Gururaja H wrote:
> 1.  Can two applications write index files, in the same directory, at
> the same time ?

If you mean to the same Lucene index, the answer is no.  Only a single
IndexWriter instance may be writing to an index at one time.

> 2.  If two applications cannot write index files, in the same
> directory, at the same time.
>  How should we resolve this ?  Would appriciate any solutions to
> this...

You may consider writing a queuing system so that two applications 
queue up a document to index, and a single indexer application reads 
from the queue.  Or the applications could wait until the index is 
available for writing.  Or...
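
A minimal sketch of the queuing idea, assuming both applications can hand their documents to one shared in-process queue: any number of producers submit documents, and a single writer thread is the only code that ever touches the index. Everything here is hypothetical (the class name, the String stand-in for a document, and the list stand-in for the index); in a real setup the writer thread would own the one IndexWriter, and separate processes would need a cross-process queue instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Many producers, one consumer: only the writer thread modifies the "index".
class SingleWriterQueue {
    // Sentinel object, recognised by identity, that tells the writer to stop.
    private static final String STOP = new String("<stop>");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> indexed = new ArrayList<>(); // stand-in for the index
    private final Thread writer;

    SingleWriterQueue() {
        writer = new Thread(() -> {
            try {
                for (String doc = queue.take(); doc != STOP; doc = queue.take()) {
                    indexed.add(doc); // the single place where documents are written
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
    }

    /** Called by any application thread; never touches the index directly. */
    void submit(String doc) {
        queue.add(doc);
    }

    /** Drains the queue, stops the writer and returns what was indexed. */
    List<String> shutdown() {
        queue.add(STOP);
        try {
            writer.join(); // join also makes the writer's updates visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writer", e);
        }
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        SingleWriterQueue q = new SingleWriterQueue();
        Thread appA = new Thread(() -> { for (int i = 0; i < 3; i++) q.submit("A-" + i); });
        Thread appB = new Thread(() -> { for (int i = 0; i < 3; i++) q.submit("B-" + i); });
        appA.start();
        appB.start();
        appA.join();
        appB.join();
        System.out.println(q.shutdown().size() + " documents indexed");
    }
}
```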

> 3.  My thought is to write the index files in two different
> directories and read both the indexes
> (as though it forms a single index, search results should consider the
> documents in both the indexes) from the WebApplication.  How to go
> about implementing this, using Lucene API ?  Need inputs on which of
> the Lucene API's to use ?

Lucene can easily search multiple indexes using MultiSearcher.
This merges the results together as you'd expect.

Erik


Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Sanyi
Well, I guess I'd better recognize the operator synonyms and replace them with
their original forms before passing them to QueryParser. I don't feel
comfortable tampering with Lucene's source code.

Anyway, thanx for the answers.

Sanyi




Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Morus Walter
Sanyi writes:
> Well, I guess I'd better recognize and replace the operator synonyms to their
> original format
> before passing them to QueryParser. I don't feel comfortable tampering with
> Lucene's source code.

Apart from knowing how to compile Lucene (including the JavaCC code
generation), you should only need to change

<DEFAULT> TOKEN : {
  <AND:   ("AND" | "&&") >
| <OR:    ("OR" | "||") >
| <NOT:   ("NOT" | "!") >

to

<DEFAULT> TOKEN : {
  <AND:   ("AND" | "insert your version of and here" | "&&") >
| <OR:    ("OR" | "insert your version of or here" | "||") >
| <NOT:   ("NOT" | "insert your version of not here" | "!") >

in jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj

Replacing the operators before the query is parsed might be hard to do if you
want to handle cases like »"a AND b" OR c«, which is a query for the
phrase "a AND b" or the token c, correctly.
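
For comparison, a minimal sketch of the pre-processing route, assuming it is enough to rewrite whole words that sit outside double-quoted phrases. The class OperatorSynonyms is hypothetical and knows nothing about Lucene; it uses the AANNDD/OORR/NNOOTT words from earlier in the thread and does not handle escaped quotes, field prefixes or case variants, which a real query syntax would require.

```java
import java.util.HashMap;
import java.util.Map;

// Rewrites localized operator words to AND/OR/NOT, leaving quoted phrases alone.
class OperatorSynonyms {
    private final Map<String, String> synonyms = new HashMap<>();

    OperatorSynonyms add(String localWord, String operator) {
        synonyms.put(localWord, operator);
        return this;
    }

    String normalize(String query) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        boolean inPhrase = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (inPhrase) {
                out.append(c);            // copy phrase content untouched
                if (c == '"') {
                    inPhrase = false;     // closing quote ends the phrase
                }
            } else if (c == '"' || c == '(' || c == ')' || Character.isWhitespace(c)) {
                flush(word, out);         // a word just ended: translate it if needed
                out.append(c);
                if (c == '"') {
                    inPhrase = true;      // opening quote starts a phrase
                }
            } else {
                word.append(c);
            }
        }
        flush(word, out);
        return out.toString();
    }

    // Appends the pending word, replaced by its operator if it is a known synonym.
    private void flush(StringBuilder word, StringBuilder out) {
        if (word.length() > 0) {
            String w = word.toString();
            out.append(synonyms.getOrDefault(w, w));
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        OperatorSynonyms ops = new OperatorSynonyms()
                .add("AANNDD", "AND").add("OORR", "OR").add("NNOOTT", "NOT");
        System.out.println(ops.normalize("(cat OORR kitty) AANNDD black AANNDD tail"));
        System.out.println(ops.normalize("\"a AANNDD b\" OORR c"));
    }
}
```

The normalized string would then be handed to the unmodified QueryParser. The trade-off against editing QueryParser.jj is that the pre-processor has to mirror the parser's quoting rules by hand.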

Morus






Re: index size doubled?

2004-12-21 Thread Otis Gospodnetic
Another possibility is that you are using an older version of Lucene,
which was known to have a bug with similar symptoms.  Get the latest
version of Lucene.

You shouldn't really have multiple .cfs files after optimizing your
index.  Also, optimize only at the end, if you care about indexing
speed.

Otis




sorting on a field that can have null values (resend)

2004-12-21 Thread Praveen Peddi
I sent this mail yesterday but had no luck in receiving responses. Trying it
again.

Hi all,
I am getting a NullPointerException when I sort on a field that has a null
value for some documents. ORDER BY in SQL does work on such fields, and I
think it puts all results with null values at the end of the list. Shouldn't
Lucene do the same thing instead of throwing a NullPointerException? Is
this expected behaviour? Does Lucene always expect some value in the
sortable fields?

I thought of putting empty strings instead of null values, but I think empty
strings are put first in the list while sorting, which is the reverse of what
anyone would want.

Following is the exception I saw in the error log:

java.lang.NullPointerException
 at org.apache.lucene.search.SortComparator$1.compare(Lorg.apache.lucene.search.ScoreDoc;Lorg.apache.lucene.search.ScoreDoc;)I(SortComparator.java:36)
 at org.apache.lucene.search.FieldSortedHitQueue.lessThan(Ljava.lang.Object;Ljava.lang.Object;)Z(FieldSortedHitQueue.java:95)
 at org.apache.lucene.util.PriorityQueue.upHeap()V(PriorityQueue.java:120)
 at org.apache.lucene.util.PriorityQueue.put(Ljava.lang.Object;)V(PriorityQueue.java:47)
 at org.apache.lucene.util.PriorityQueue.insert(Ljava.lang.Object;)Z(PriorityQueue.java:58)
 at org.apache.lucene.search.IndexSearcher$2.collect(IF)V(IndexSearcher.java:130)
 at org.apache.lucene.search.Scorer.score(Lorg.apache.lucene.search.HitCollector;)V(Scorer.java:38)
 at org.apache.lucene.search.IndexSearcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;ILorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.TopFieldDocs;(IndexSearcher.java:125)
 at org.apache.lucene.search.Hits.getMoreDocs(I)V(Hits.java:64)
 at org.apache.lucene.search.Hits.init(Lorg.apache.lucene.search.Searcher;Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;Lorg.apache.lucene.search.Sort;)V(Hits.java:51)
 at org.apache.lucene.search.Searcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.Hits;(Searcher.java:41)

If it's a bug in Lucene, will it be fixed in the next release? Any suggestions
would be appreciated.
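
To make the ordering described above concrete, here is a small plain-Java sketch of "missing values sort last". It does not touch Lucene and is not a fix for the exception; NullsLastSort and sentinelKey are hypothetical names. The second method shows one possible workaround under the assumption that a placeholder value may be stored in the sort field: a sentinel that compares after ordinary strings, instead of an empty string that compares before them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of SQL-like "nulls last" ordering.
class NullsLastSort {
    // Assumed placeholder: U+FFFF compares after any ordinary text character.
    static final String SENTINEL = "\uFFFF";

    /** Returns a sorted copy with null entries placed after all real values. */
    static List<String> sortNullsLast(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.nullsLast(Comparator.<String>naturalOrder()));
        return copy;
    }

    /** Value to store in a sort field: the real value, or the sentinel if missing. */
    static String sentinelKey(String value) {
        return value == null ? SENTINEL : value;
    }

    public static void main(String[] args) {
        System.out.println(sortNullsLast(Arrays.asList("b", null, "a", "")));
    }
}
```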

Praveen

** 
Praveen Peddi
Sr Software Engg, Context Media, Inc. 
email:[EMAIL PROTECTED] 
Tel:  401.854.3475 
Fax:  401.861.3596 
web: http://www.contextmedia.com 
** 
Context Media- The Leader in Enterprise Content Integration 


Lucene working with a DB

2004-12-21 Thread Daniel Cortes
I have read in a lot of messages that Lucene can index a DB because it uses
the INPUTSTREAM type.
I don't understand how to do this. For example, if I have a forum backed by
MySQL and a lot of files on my website, then for every search I have to select
the index that I want to use, true? But I don't know how to make Lucene write
an index of the information in the forum's DB (for example MySQL).



Stopwords in phrases

2004-12-21 Thread Ravi
I want to be able to use stopwords in exact phrase searches. I have
looked at Nutch and used the same approach (replace common words with
n-grams; see net.nutch.analysis.CommonGrams).

So if "to", "be", "or" and "not" are stop words, for the string "to be
or not to be" the analyzer produces the following tokens:

[to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
or-not-to-be, not-to, not-to-be, to-be]

This is exactly what I wanted from the analyzer during indexing.
But I'm having a problem with the search.
When I search on "not to be", the analyzer converts my search into
content:"not-to not-to-be to-be", because the analyzer produces the
tokens not-to, not-to-be, to-be.

I'm getting 0 results for this, as there is no token "not-to not-to-be
to-be" in the index.

I want just not-to-be from the analyzer during the search, so that when I
search on "not to be" I get the document which has not-to-be as a
token.

How can I use the same analyzer to get different results in indexing
and searching?

Thanks in advance,
Ravi. 




Re: Lucene working with a DB

2004-12-21 Thread Erik Hatcher
On Dec 21, 2004, at 10:39 AM, Daniel Cortes wrote:
> I read a lot of messages that Lucene can index a DB because it use
> that INPUTSTREAM type

Where have you read that?  This is incorrect.

> I don't understand how to do this. For example if I've a forum with
> Mysql  and a lot of files on my web, for every search I've to select
> the index that I want use in my search, true? But I don't know how to
> do that Lucene writes an index about the information of the DB of
> forum (for example  MySQL)

To index data in a database into a Lucene index, you must write code 
that pulls the records from the database and adds them to a Lucene 
index, slicing into fields in whatever manner you need.  You will want 
to be sure to update the index when your database changes by either 
removing, or updating (remove and re-add) documents.  There is 
nothing built-in that will do these steps for you.
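
The shape of that code can be sketched without either a database or Lucene. The toy class below is an assumption-laden stand-in: RecordIndex is a hypothetical in-memory inverted index keyed by record ID, where add() plays the role of adding a document built from a database row, and update() is implemented as remove followed by re-add, as described above. Real code would read rows through JDBC and write them with Lucene's own classes instead.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index over database-like records (id + text). Not Lucene.
class RecordIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // term -> record ids
    private final Map<Integer, String> stored = new HashMap<>();        // id -> indexed text

    /** Indexes a new record; for an id that already exists, call update(). */
    void add(int id, String text) {
        stored.put(id, text);
        for (String term : terms(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<Integer>()).add(id);
        }
    }

    /** Removes every trace of a record, e.g. after its row was deleted. */
    void remove(int id) {
        String text = stored.remove(id);
        if (text == null) {
            return; // nothing indexed under this id
        }
        for (String term : terms(text)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                ids.remove(id);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
    }

    /** A changed row is handled as remove followed by re-add. */
    void update(int id, String text) {
        remove(id);
        add(id, text);
    }

    /** Ids of the records containing the given term. */
    Set<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : new TreeSet<Integer>(ids);
    }

    // Splits text into distinct lower-cased terms on non-word characters.
    private static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }
}
```

The point of keeping the record ID alongside the text is that it gives the indexer a handle for removing or replacing exactly the entry that belongs to a changed row.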

Erik


Re: Stopwords in phrases

2004-12-21 Thread Erik Hatcher
On Dec 21, 2004, at 10:41 AM, Ravi wrote:
> I want to be able to use stopwords in exact phrase searches. I have
> looked at Nutch and used the same approach (replace common words with
> n-grams. Look at net.nutch.analysis.CommonGrams).
> So if to,be,or and not are stop words, for the string "to be
> or not to be", the analyzer produces the following tokens
> [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
> be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
> or-not-to-be, not-to, not-to-be, to-be]

You've gone a bit beyond what Nutch is using.  It creates bigrams, 
where you've expanded it to many more than that.

Are you also using the position increment of 0 for the gram tokens 
like Nutch does?

  But I'm having a problem with the search.
 when I do a search on not to be the analyzer is converting my search
into
  content:not-to not-to-be to-be because the analyzer produces the
tokens not-to,not-to-be,to-be
  I'm getting 0 results on this as there is no token not-to not-to-be
to-be in the index.
  I want just not-to-be from the analyzer during the search so when I
search on not to be I will get the document which has not-to-be as 
a
token.

   How can I use the same analyzer to get different results in indexing
and searching?
Nutch does some different stuff between indexing and parsing queries...
 [java] 1: [the:WORD] [the-quick:gram]
 [java] 2: [quick:WORD]
 [java] 3: [brown:WORD]
 [java] 4: [fox:WORD]
 [java] query = (+url:"the quick brown"^4.0) (+anchor:"the quick brown"^2.0) (+content:"the-quick quick brown")

The first four lines show the analysis of "the quick brown fox".  The 
last line is the resultant Lucene query for "the quick brown".  Notice 
that only the content field gets analyzed specially, and also that 
only gram tokens are considered in that field, not the WORD tokens 
if there is also a gram.
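
For illustration only, here is a minimal plain-Java sketch of the bigram idea shown in the output above. It is not Nutch's CommonGrams code: the class name, the analyze() helper and the rule "emit a gram only when a common word is followed by another word" are assumptions made for this example. Tokens listed together share one position, which is what a position increment of 0 achieves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonBigramSketch {

    // Hypothetical helper (not part of Nutch or Lucene): returns, for each
    // word position, the tokens that would share that position.
    static List<List<String>> analyze(String text, Set<String> common) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            List<String> atPosition = new ArrayList<>();
            atPosition.add(words[i] + ":WORD");
            // Assumption: a gram is emitted only when a common word is
            // followed by another word.
            if (common.contains(words[i]) && i + 1 < words.length) {
                atPosition.add(words[i] + "-" + words[i + 1] + ":gram");
            }
            positions.add(atPosition);
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        List<List<String>> out = analyze("the quick brown fox", common);
        for (int i = 0; i < out.size(); i++) {
            System.out.println((i + 1) + ": " + out.get(i));
        }
    }
}
```

With "the" as the only common word, position 1 carries both the word and the the-quick gram, and the other positions carry a single word each.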

Does this help with your situation?
Erik


RE: Lucene index files from two different applications.

2004-12-21 Thread Chuck Williams
Depending on what you are doing, there are some problems with
MultiSearcher.   See
http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 for a
description of the issues and possible patch(es) to fix.

Chuck

   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 21, 2004 3:09 AM
   To: Lucene Users List
   Subject: Re: Lucene index files from two different applications.
   
   
   On Dec 21, 2004, at 5:51 AM, Gururaja H wrote:
1.  Can two applications write index files, in the same directory, at
the same time ?
   
   If you mean to the same Lucene index, the answer is no.  Only a single
   IndexWriter instance may be writing to an index at one time.
   
2.  If two applications cannot write index files in the same directory
at the same time, how should we resolve this ?  Would appreciate any
solutions to this...
   
   You may consider writing a queuing system so that two applications
   queue up a document to index, and a single indexer application reads
   from the queue.  Or the applications could wait until the index is
   available for writing.  Or...
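
   As a rough sketch of the queuing idea above, using only the Java standard library: several producers hand documents to a queue and one indexer thread drains it. The class name, the sentinel value and the list standing in for the index are all assumptions for this example; a real indexer would presumably call IndexWriter.addDocument where the list is appended to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleIndexerQueue {
    // Hypothetical sentinel; compared by identity so that no real
    // document text can be mistaken for it.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Stands in for the index; only the indexer thread writes to it.
    private final List<String> indexed = new ArrayList<>();
    private final Thread indexer;

    public SingleIndexerQueue() {
        indexer = new Thread(() -> {
            try {
                while (true) {
                    String doc = queue.take();   // blocks until work arrives
                    if (doc == STOP) {
                        break;
                    }
                    indexed.add(doc);            // the single writer does all writes
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();
    }

    // Called by any application thread; never touches the index itself.
    public void submit(String doc) {
        queue.add(doc);
    }

    // Stops the indexer after the queued documents have been processed.
    public List<String> shutdown() {
        queue.add(STOP);
        try {
            indexer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        SingleIndexerQueue q = new SingleIndexerQueue();
        Thread appA = new Thread(() -> q.submit("doc from application A"));
        Thread appB = new Thread(() -> q.submit("doc from application B"));
        appA.start();
        appB.start();
        appA.join();
        appB.join();
        System.out.println("indexed " + q.shutdown().size() + " documents");
    }
}
```

   This only covers producers inside one JVM. Two separate applications would need some shared queue between processes, which is left out here.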
   
3.  My thought is to write the index files in two different directories
and read both the indexes (as though they form a single index; search
results should consider the documents in both the indexes) from the
WebApplication.  How to go about implementing this, using the Lucene
API ?  Need inputs on which of the Lucene APIs to use.
   
   Lucene can easily search from multiple indexes using MultiSearcher.
   This merges the results together as you'd expect.
   
   Erik
   
   
  



Re: Lucene working with a DB

2004-12-21 Thread [EMAIL PROTECTED]
Hello
I'll just paste the relevant MySQL code; you add the calls to it per 
your needs. It has no checking of anything, so better add that as well.
It's possible I didn't copy/paste everything, but you should get the idea 
of where this is going...

-pedja
--

import java.sql.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class sqlTest {
  public static void main(String[] args) throws Exception {
    String sTable = args[0];
    String sThing = args[1];
    String indexDir = "/path/to/lucene/index";
    try {
      Analyzer analyzer = new StandardAnalyzer();
      // false = add to an existing index rather than create a new one
      IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, false);
      addSQLDoc(fsWriter, sTable, sThing);
      fsWriter.close();
    } catch (Exception e) {
      throw new Exception("caught a " + e.getClass() + "\n with message: "
          + e.getMessage());
    }
  }

  // static so that main() can call it without an instance
  private static void addSQLDoc(IndexWriter writer, String sqlTable,
      String somethingElse) throws Exception {

    String cs =
        "jdbc:mysql://HOST/DATABASE?user=SQLUSER&password=SQLPASSWORD";
    // NOTE: the values are concatenated unchecked; validate them first
    String sql = "SELECT * FROM " + sqlTable + " WHERE something=\""
        + somethingElse + "\"";

    // establish a connection to MySQL database
    try {
      Class.forName("com.mysql.jdbc.Driver").newInstance();
    } catch (Exception e) {
      System.out.println("Lucene: ERROR: Unable to load driver");
      e.printStackTrace();
    }
    // get the record data...
    try {
      Connection conn = DriverManager.getConnection(cs);
      Statement Stmt = conn.createStatement();
      ResultSet RS = Stmt.executeQuery(sql);
      while (RS.next()) {
        // make a new, empty document
        Document doc = new Document();
        // get the database fields
        String field1 = RS.getString(1);
        String field2 = RS.getString(2);
        String field3 = RS.getString(3);
        String field4 = RS.getString(4);
        String field5 = RS.getString(5);
        // add the first group of fields
        //
        doc.add(Field.Keyword("FIELD1", field1));
        doc.add(Field.Keyword("FIELD2", field2));
        doc.add(Field.Keyword("FIELD3", field3));
        doc.add(Field.Keyword("FIELD4", field4));
        doc.add(Field.Text("FIELD5", field5));
        // add the document
        writer.addDocument(doc);
      } // close while(..)
      RS.close();
      Stmt.close();
      conn.close();
    } catch (Exception e) {
      e.printStackTrace();
      throw new Exception("indexing failed: " + e.getMessage());
    }
  }
}
--
Daniel Cortes said the following on 12/21/2004 10:39 AM:
I have read a lot of messages saying that Lucene can index a DB because it
uses the INPUTSTREAM type, but I don't understand how to do this. For
example, if I have a forum backed by MySQL and a lot of files on my web
site, for every search I have to select the index that I want to use,
true? But I don't know how to make Lucene write an index from the
information in the forum's DB (for example MySQL).



Re: index size doubled?

2004-12-21 Thread aurora
Thanks for the heads up. I'm using Lucene 1.4.2.
I tried to run optimize() again, but it had no effect. Adding just a tiny
dummy document would get rid of it.

I'm optimizing every few hundred documents because I am trying to simulate
incremental updates. This leads to another question that I will post separately.

Thanks.

Another possibility is that you are using an older version of Lucene,
which was known to have a bug with similar symptoms.  Get the latest
version of Lucene.
You shouldn't really have multiple .cfs files after optimizing your
index.  Also, optimize only at the end, if you care about indexing
speed.
Otis
--- Paul Elschot [EMAIL PROTECTED] wrote:
On Tuesday 21 December 2004 05:49, aurora wrote:
 I'm testing the rebuilding of the index. I add several hundred
 documents, optimize, add another few hundred, and so on. Right now I
 have around 7000 files. I observed that after the index gets to a
 certain size, every time after optimize there are two files of roughly
 the same size, like below:

 12/20/2004  01:57p  13 deletable
 12/20/2004  01:57p  29 segments
 12/20/2004  01:53p  14,460,367 _5qf.cfs
 12/20/2004  01:57p  15,069,013 _5zr.cfs

 The total index size is double what I expect. This is not always
 reproducible. (I'm constantly tuning my program and the set of
 documents.) Sometimes I get a decent single document after optimize.
 What was happening?
Lucene tried to delete the older version (_5qf.cfs above), but got an
error back from the file system. After that it put the name of that
segment in the deletable file, so it can try later to delete that
segment.
This is known behaviour on FAT file systems. These randomly take some
time for themselves to finish closing a file after it has been correctly
closed by a program.
Regards,
Paul Elschot




how often to optimize?

2004-12-21 Thread aurora
Right now I am incrementally adding about 100 documents to the index a day
and then optimizing after that. I find that optimize essentially rebuilds
the entire index into a single file. So the size of the disk write is
proportional to the total index size, not to the size of the documents
incrementally added.
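
To put rough numbers on that, here is a small back-of-the-envelope sketch. It rests on two assumptions that are mine and not measured: the index starts empty, and every optimize pass rewrites every document currently in the index. The class and method names are made up for this example, and the segment merges that happen during normal adds are ignored.

```java
public class OptimizeCostSketch {

    // Total number of documents rewritten by optimize passes over `days`
    // days, assuming each pass rewrites the whole index as it stands.
    static long docsRewritten(int days, int docsPerDay, int optimizeEveryDays) {
        long total = 0;
        for (int day = 1; day <= days; day++) {
            if (day % optimizeEveryDays == 0) {
                // Index size on this day, counted in documents.
                total += (long) day * docsPerDay;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("daily:  " + docsRewritten(28, 100, 1));
        System.out.println("weekly: " + docsRewritten(28, 100, 7));
    }
}
```

Under those assumptions, four weeks of daily optimizing rewrites 40600 documents in total against 7000 for weekly optimizing, although the index holds only 2800 documents at the end.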

So my question is: would it be overkill to optimize every day? Is there
any guideline on how often to optimize? Every 1000 documents or more?
Every week? Is there any concern if a lot of documents are added
without optimizing?

Thanks.


RE: Stopwords in phrases

2004-12-21 Thread Ravi

Are you also using the position increment of 0 for the gram tokens
like Nutch does?
Yes. 

I don't think considering only gram tokens will work for me because
Nutch uses only bi-grams. It can only have one gram per token. In my
case I have more than one and even if I get only the grams, I still will
have the same problem. 
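
To make the difference concrete, here is a minimal plain-Java sketch of the two behaviours discussed in this thread: producing every multi-word gram at index time, and producing only the gram that spans the whole phrase at search time. It is not a Lucene Analyzer; the class and method names are made up, and it assumes every word in the text is a stop word, as in the "to be or not to be" example.

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseGramSketch {

    // Index side: every contiguous run of two or more words, joined by '-'.
    static List<String> indexGrams(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder(words[start]);
            for (int end = start + 1; end < words.length; end++) {
                gram.append('-').append(words[end]);
                grams.add(gram.toString());
            }
        }
        return grams;
    }

    // Query side: only the single gram that spans the whole phrase.
    static String queryGram(String phrase) {
        return String.join("-", phrase.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(indexGrams("to be or not to be"));
        System.out.println(queryGram("not to be"));
    }
}
```

The index side yields the fifteen tokens listed earlier for "to be or not to be", while the query side yields only not-to-be for "not to be", which is one of the indexed tokens. How to wire two such behaviours into one analyzer is a separate question that this sketch does not answer.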

Ravi.






Re: how often to optimize?

2004-12-21 Thread Otis Gospodnetic
Hello,

I think some of these questions may be answered in the jGuru FAQ.

 So my question is would it be an overkill to optimize everyday?

Only if lots of documents are being added/deleted, and you end up with
a lot of index segments.

 Is there any guideline on how often to optimize? Every 1000 documents
 or more?

Are unoptimized indices causing you any problems (e.g. slow searches,
high number of open file handles)?  If not, then you don't even need to
optimize until those issues become... issues.

 Every week? Is there any concern if there are a lot of documents
 added without optimizing?

Possibly, see my answer above.

Otis





[ANNOUNCE] dotLucene1.4.3 RC1 (port of Jakarta Lucene to C#)

2004-12-21 Thread George Aroush
Hi Folks, 

I am pleased to announce the availability of dotLucene 1.4.3 RC1 build-001.
This is the first Release Candidate of version 1.4.3 of Jakarta
Lucene ported to C# and is intended to be Final.

Please visit http://www.sourceforge.net/projects/dotlucene/ to learn more
about dotLucene and to download the source code.

Best regards,

-- George Aroush

