Limo 0.5

2004-11-22 Thread Chandrashekhar
Hi,

With Limo 0.5, can I find out whether a certain word from some document is
indexed or not?

With Regards,
Chandrashekhar V Deshmukh


Re: Using multiple analysers within a query

2004-11-22 Thread Paul Elschot
On Monday 22 November 2004 05:02, Kauler, Leto S wrote:
 Hi Lucene list,
 
 We have the need for analysed and 'not analysed/not tokenised' clauses
 within one query.  Imagine an unparsed query like:
 
 +title:Hello World +path:Resources\Live\1
 
 In the above example we would want the first clause to use
 StandardAnalyser and the second to use an analyser which returns the
 term as a single token.  So a parsed result might look like:
 
 +(title:hello title:world) +path:Resources\Live\1
 
 Would anyone have any suggestions on how this could be done?  I was
 thinking maybe the QueryParser would have to be changed/extended to
 accept a separator other than colon :, something like = for example
 to indicate this clause is not to be tokenised.  Or perhaps this can all
 be done using a single analyser?

Overriding QueryParser.getFieldQuery() might work for you.
It is given the field and the query text, so an analyzer can be chosen
depending on the field.
If you aren't using the latest CVS head, it may be worthwhile to
have a look: some of the getFieldQuery() methods have been
deprecated, though I don't know exactly when.
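
A minimal sketch of that idea (the class name, the "path" field, and the
exact getFieldQuery() signature are illustrative; the signature varies
across Lucene versions, as noted above):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class PerFieldQueryParser extends QueryParser {
        public PerFieldQueryParser(String field, Analyzer analyzer) {
            super(field, analyzer);
        }

        protected Query getFieldQuery(String field, String queryText)
                throws ParseException {
            // Treat "path" as a keyword field: one term, no analysis.
            if ("path".equals(field)) {
                return new TermQuery(new Term(field, queryText));
            }
            // Everything else goes through the normal analysis chain.
            return super.getFieldQuery(field, queryText);
        }
    }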

Regards,
Paul.





Re: Using multiple analysers within a query

2004-11-22 Thread Morus Walter
Kauler, Leto S writes:
 
 Would anyone have any suggestions on how this could be done?  I was
 thinking maybe the QueryParser would have to be changed/extended to
 accept a separator other than colon :, something like = for example
 to indicate this clause is not to be tokenised.  

I suggested that in a recent discussion and Erik Hatcher objected that
it isn't a good idea to require that users know which field to query
in which way. I guess he is right.
If your query isn't entered by users, you shouldn't use the query parser
in most cases anyway.

 Or perhaps this can all
 be done using a single analyser?
 
Look at PerFieldAnalyzerWrapper.
You will probably have to write a keyword analyzer (unless you can use
WhitespaceAnalyzer in your case).
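
For example (a sketch; "path" is the untokenized field, and userInput is
a placeholder for the query string):

    // StandardAnalyzer for every field except "path", which is left
    // as close to untokenized as WhitespaceAnalyzer allows.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("path", new WhitespaceAnalyzer());

    Query query = QueryParser.parse(userInput, "title", analyzer);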

HTH
Morus




Re: disadvantages

2004-11-22 Thread Erik Hatcher
On Nov 22, 2004, at 12:36 AM, Luke Francl wrote:
> Well that really depends on how big your index is and what they search
> for, now doesn't it? ;)

Everything is relative.

> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Sun 11/21/2004 2:52 PM
> To: Lucene Users List
> Subject: Re: disadvantages
>
> On Nov 21, 2004, at 12:00 PM, Miguel Angel wrote:
>> What are the disadvantages of Lucene?
>
> The users of your system won't have time to get coffee when running
> searches.
>
> Erik


Re: Using multiple analysers within a query

2004-11-22 Thread Erik Hatcher
On Nov 22, 2004, at 2:56 AM, Morus Walter wrote:
> Kauler, Leto S writes:
>> Would anyone have any suggestions on how this could be done?  I was
>> thinking maybe the QueryParser would have to be changed/extended to
>> accept a separator other than colon ':', something like '=' for example
>> to indicate this clause is not to be tokenised.
>
> I suggested that in a recent discussion and Erik Hatcher objected that
> it isn't a good idea to require that users know which field to query
> in which way. I guess he is right.

QueryParser is a one-size-fits-(?)-all sort of beast.  It has plenty of
negatives, no question.

> If your query isn't entered by users, you shouldn't use the query parser
> in most cases anyway.

I'd go even further and say in all cases.

>> Or perhaps this can all be done using a single analyser?
>
> Look at PerFieldAnalyzerWrapper.
> You will probably have to write a keyword analyzer (unless you can use
> WhitespaceAnalyzer in your case).

We should probably add a KeywordAnalyzer to Lucene's core at some point.
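
A minimal sketch of what such a KeywordAnalyzer could look like (this is
illustrative, not something in the core yet; it simply emits the entire
field value as a single token):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class KeywordAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, final Reader reader) {
            return new TokenStream() {
                private boolean done = false;
                public Token next() throws IOException {
                    if (done) return null;
                    done = true;
                    // Read the entire field value and return it as one token.
                    StringBuffer text = new StringBuffer();
                    char[] buffer = new char[256];
                    int length;
                    while ((length = reader.read(buffer)) != -1) {
                        text.append(buffer, 0, length);
                    }
                    return new Token(text.toString(), 0, text.length());
                }
            };
        }
    }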
Erik


Re: Using multiple analysers within a query

2004-11-22 Thread Morus Walter
Erik Hatcher writes:

>> If your query isn't entered by users, you shouldn't use the query parser
>> in most cases anyway.
>
> I'd go even further and say in all cases.

If you use Lucene as a search server you have to provide the query somehow.
E.g. we have a PHP application that sends queries to a Lucene search
servlet.
In this case it's justifiable to serialize the query into query-parser
syntax on the client side and have the query parser read the query again on
the server side.
I don't recall any problems with the approach since we clean up the user
input before constructing the query.

Morus




Question about multi-searching [re-post]

2004-11-22 Thread Cocula Remi



 Hi,
 
 (First of all: what is the plural of "index" in English; indexes or
 indices?)
 
 
 I want to search several indexes (indices?).
 For that, I parse a new query using QueryParser or MultiFieldQueryParser.
 Then I search my indexes using the MultiSearcher class.
 
 OK, but the problem comes when different analyzers are used for each index.
 QueryParser requires an analyzer to parse the query, but a query parsed
 with one analyzer is not suitable for searching an index that uses another
 analyzer.
 
 Does anyone know a trick to cope with this problem?
 
 Eventually I could run a different query on each index to obtain several
 Hits objects.
 Then I could write some collector that collects Hits in the order of
 highest scores.
 I wonder if this could work and if it would be as efficient as the
 MultiSearcher.  In this situation does it make sense to compare the scores
 of two different Hits?


Re: How much time indexing doc ??

2004-11-22 Thread Luke Shannon
PDF(s) can definitely slow things down, depending on their size.

If there are a few larger PDF documents, that time is definitely possible.

Luke

- Original Message - 
From: Miguel Angel [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, November 20, 2004 11:25 AM
Subject: How much time indexing doc ??


 Hi, I have 1000 docs (Word, PDF and HTML); those documents indexed
 in 5 min.  Is this correct, or do I have a problem with my Analyzer? I
 used StandardAnalyzer.
 -- 
 Miguel Angel Angeles R.
 Asesoria en Conectividad y Servidores
 Telf. 97451277
 





Re: Optimized??

2004-11-22 Thread Luke Shannon
As I understand it, optimization is when you merge several segments into
one, allowing for faster queries.

The FAQs and API have further details.

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q24
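
In code, optimizing is a single call (the path and analyzer here are
illustrative):

    IndexWriter writer =
        new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.optimize();  // merges all segments down to a single segment
    writer.close();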

Luke

- Original Message - 
From: Miguel Angel [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, November 20, 2004 5:19 PM
Subject: Optimized??


What does an "optimized" index mean in Lucene?
-- 
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277








Re: Question about multi-searching [re-post]

2004-11-22 Thread Erik Hatcher
On Nov 22, 2004, at 9:18 AM, Cocula Remi wrote:
> (First of all: what is the plural of "index" in English; indexes or
> indices?)

We used "indexes" in "Lucene in Action".  It's a bit ambiguous in English,
but "indexes" sounds less formal and is acceptable.

> For that, I parse a new query using QueryParser or
> MultiFieldQueryParser.
> Then I search my indexes using the MultiSearcher class.
>
> OK, but the problem comes when different analyzers are used for each
> index.
> QueryParser requires an analyzer to parse the query, but a query
> parsed with one analyzer is not suitable for searching an index
> that uses another analyzer.
>
> Does anyone know a trick to cope with this problem?

Nothing built into Lucene solves this problem specifically.  You'll
have to come up with your own MultiSearcher-like facility that can
apply different queries to different indexes and merge the results back
together.  This will be awkward when it comes to scoring though, since
each index is using a different query.

> Eventually I could run a different query on each index to obtain
> several Hits objects.
> Then I could write some collector that collects Hits in the order of
> highest scores.
> I wonder if this could work and if it would be as efficient as the
> MultiSearcher.  In this situation does it make sense to compare the
> scores of two different Hits?

No, it won't make good sense to compare the scores between the queries,
but I suspect your queries are pretty close to one another if all that
varies is the analyzer.  It still will be an awkward comparison though,
but maybe good enough for your needs?
Erik


Too many open files issue

2004-11-22 Thread Neelam Bhatnagar
Hi,
 
I had requested help on an issue we have been facing with the "Too many
open files" exception garbling the search indexes and crashing the
search on the web site.
As a suggestion, you had asked us to look at the articles on the O'Reilly
Network which had specific context around this exact problem.
One of the suggestions was to increase the limit on the number of file
descriptors on the file system. We tried it by first lowering the limit
to 200 from 256 in order to reproduce the exception. The exception did
get reproduced, but even after increasing the limit to 500 it kept
coming, until after several rounds of trying to rebuild the index we
finally got it working with the default file descriptor limit of 256.
This makes us wonder if your first suggestion of optimizing indexes is a
prerequisite to trying this option.

Another piece of relevant information is that we have the default merge
factor of 10.

Kindly give us pointers to what it is that we are doing wrong, or should
we be trying something completely different?
 
Thanks and regards
Neelam Bhatnagar
 


Re: Using multiple analysers within a query

2004-11-22 Thread Erik Hatcher
On Nov 22, 2004, at 9:17 AM, Morus Walter wrote:
> Erik Hatcher writes:
>>> If your query isn't entered by users, you shouldn't use the query
>>> parser in most cases anyway.
>>
>> I'd go even further and say in all cases.
>
> If you use Lucene as a search server you have to provide the query
> somehow.
> E.g. we have a PHP application that sends queries to a Lucene search
> servlet.
> In this case it's justifiable to serialize the query into query-parser
> syntax on the client side and have the query parser read the query again
> on the server side.

Ah, good point!  I hadn't considered this scenario.
Erik


Re: Limo 0.5

2004-11-22 Thread Luke Francl
On Mon, 2004-11-22 at 02:27, Chandrashekhar wrote:
 Hi,
 
 With Limo 0.5, can I find out whether a certain word from some document
 is indexed or not?

This feature doesn't exist as such.

You could search for the word; if results come up, then the word is in
the documents returned.

I'll add enumerating the terms in an index to my list of things to add.

Regards,
Luke Francl





RE: Question about multi-searching [re-post]

2004-11-22 Thread Chuck Williams
If you are going to compare scores across multiple indices, I'd suggest
considering one of the patches here:

http://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Chuck




Index in RAM - is it realy worthy?

2004-11-22 Thread iouli . golovatyi

I did the following test:
I created a RAM folder on my Red Hat box and copied c. 1 GB of indexes
there.
I expected the queries to run much quicker.
In reality it was sometimes even slower (sic!).

Lucene has its own RAM disk functionality. If I implement it, would it
bring any benefits?

Thanks in advance
J.






Re: Index in RAM - is it realy worthy?

2004-11-22 Thread Otis Gospodnetic
For the Lucene book I wrote some test cases that compare FSDirectory
and RAMDirectory.  What I found was that with certain settings
FSDirectory was almost as fast as RAMDirectory.  Personally, I would
push FSDirectory and hope that the OS and the Filesystem do their share
of work and caching for me before looking for ways to optimize my code.
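
If you do want to test Lucene's own RAM-based directory, the experiment
is small (the path is illustrative, and I believe RAMDirectory can be
seeded from an existing on-disk index):

    // Load the on-disk index into the JVM heap, then search it.
    Directory ramDir = new RAMDirectory("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(ramDir);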

Otis






Re: Need help with filtering

2004-11-22 Thread Edwin Tang
Hello again,

I've modified DateFilter to filter on document IDs as suggested. All seemed
to be running well until I tried a specific test case. All my documents have
IDs in the 400,000 range. If I set my lower limit to 5, nothing comes back.
After examining the code, I found the issue to be at the following line:

TermEnum enumerator = reader.terms(new Term(field, start));

Is there a way to retrieve a set of documents with IDs using an integer
comparison versus a string comparison? If I set start to 0, I get
everything, but that's not very efficient.

Thanks in advance,
Ed

--- Paul Elschot [EMAIL PROTECTED] wrote:

 On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
  Hello,
  
  I have been using DateFilter to limit my search results to a certain date
  range. I am now asked to replace this filter with one where my search 
 results
  have document IDs greater than a given document ID. This document ID is
  assigned during indexing and is a Keyword field.
  
  I've browsed around the FAQs and archives and see that I can either use
  QueryFilter or BooleanQuery. I've tried both approaches to limit the 
 document
  ID range, but am getting the BooleanQuery.TooManyClauses exception in both
  cases. I've also tried bumping max number of clauses via 
 setMaxClauseCount(),
  but that number has gotten pretty big.
  
  Is there another approach to this? ...
 
 Recoding DateFilter to a DocumentIdFilter should be straightforward.
 
 The trick is to use only one document enumerator at a time for all
 terms. Document enumerators take buffer space, and that is the
 reason why BooleanQuery has an exception for too many clauses.
 
 Regards,
 Paul
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 





RE: Need help with filtering

2004-11-22 Thread Chuck Williams
It sounds like you need to pad your numbers with leading zeroes, i.e.
use the same type of encoding as is required by RangeQuery.  If you
query with 05 instead of 5, do you get what you expect?  If all your
document IDs are fixed length, then string comparison will be
isomorphic to integer comparison.
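
A sketch of the padding (a width of 10 digits is an arbitrary choice; it
just has to exceed the width of your largest id):

    // Fixed-width ids make lexicographic order match numeric order.
    public static String pad(long id) {
        StringBuffer padded = new StringBuffer(Long.toString(id));
        while (padded.length() < 10) {
            padded.insert(0, '0');
        }
        return padded.toString();
    }

    // pad(5)      returns "0000000005"
    // pad(400000) returns "0000400000", and now
    // "0000000005" < "0000400000" as strings, matching 5 < 400000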

Chuck




auto-generate uid?

2004-11-22 Thread aurora
Is there a way to auto-generate UIDs in Lucene? Even just a way to
query the highest UID and let the application add one to it would do.

Thanks.


indexing benchmark

2004-11-22 Thread John Wang
Hi folks:

     Is there an indexing benchmark somewhere? I see a search
benchmark on the Lucene home site.

Thanks

-John




Re: auto-generate uid?

2004-11-22 Thread Erik Hatcher
What would the purpose of an auto-generated UID be?

But no, Lucene does not generate UIDs for you.  Documents are numbered
internally by their insertion order.  This number changes, however,
when documents are deleted in the middle and the index is optimized.

Erik

On Nov 22, 2004, at 1:50 PM, aurora wrote:
> Is there a way to auto-generate UIDs in Lucene? Even just a way to
> query the highest UID and let the application add one to it would do.
>
> Thanks.


RE: Too many open files issue

2004-11-22 Thread Will Allen
If you are on Linux, the number of file handles for a session is much
lower than that for the whole machine.  "ulimit -n" will tell you.  There
are instructions on the web for changing this setting; it involves
/etc/security/limits.conf and setting the values for nofile.

(bulkadm is my user)

bulkadm  soft  nofile  8192
bulkadm  hard  nofile  65536

Also, if you use the compound file format you will have many fewer files.




downloading Lucene 1.4.2

2004-11-22 Thread Sullivan, Sean C - MWT

According to the Lucene homepage, Lucene 1.4.2 was released 
on October 1, 2004

However, the dist on www.apache.org does not have a copy of 
Lucene 1.4.2

   http://www.apache.org/dist/jakarta/lucene/binaries/

Where can I download Lucene 1.4.2?

-Sean





Re: downloading Lucene 1.4.2

2004-11-22 Thread Hoss

In the same Lucene News section where the announcement about 1.4.2 is
listed, there is a link that says "Binary and source distributions are
available here."

http://cvs.apache.org/dist/jakarta/lucene/v1.4.2/

I got really confused yesterday after I already had the binary version
and I was looking for the source and found the link you listed.  Does
anyone know how to go about getting http://www.apache.org/dist/ updated?

: According to the Lucene homepage, Lucene 1.4.2 was released
: on October 1, 2004
:
: However, the dist on www.apache.org does not have a copy of
: Lucene 1.4.2
:
:http://www.apache.org/dist/jakarta/lucene/binaries/
:
: Where can I download Lucene 1.4.2?



--

---
Oh, you're a tricky one.Chris M Hostetter
 -- Trisha Weir[EMAIL PROTECTED]





Re: Index in RAM - is it realy worthy?

2004-11-22 Thread John Wang
In my test, I have 12,900 documents. Each document is small: a few
discrete fields (Keyword type) and one Text field containing only one
sentence.

With both mergeFactor and maxMergeDocs set to 1000:

using RAMDirectory, the indexing job took about 9.2 seconds;

not using RAMDirectory, the indexing job took about 122 seconds.

I am not calling optimize.

This is on Windows XP running Java 1.5.

Is there something very wrong or different in my setup to cause such a
big difference?


Thanks

-John





Re: Index in RAM - is it realy worthy?

2004-11-22 Thread Kevin A. Burton
Yes... I performed the same benchmark, and in my situation RAMDirectory
for searches was about 2% slower.

I'm willing to bet that has to do with the fact that it is backed by a
Hashtable and not a HashMap (which isn't synchronized).

Also, adding a constructor for the term size could make loading a
RAMDirectory faster, since you could prevent rehashing.

If you're on a modern machine your filesystem cache will end up
buffering your disk anyway, which I'm sure was happening in my situation.

Kevin


Re: Index in RAM - is it realy worthy?

2004-11-22 Thread Kevin A. Burton
Also, another note: doing an index merge in memory is probably faster
if you just use a RAMDirectory and perform addIndexes() on it.

This would almost certainly be faster than optimizing on disk, but I
haven't benchmarked it.
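
Something along these lines (an untested sketch; the paths and analyzer
are made up):

    // Merge several on-disk indexes in memory...
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    writer.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/indexes/one", false),
        FSDirectory.getDirectory("/indexes/two", false)
    });
    writer.close();

    // ...then write the merged result back to disk in one pass.
    IndexWriter diskWriter =
        new IndexWriter("/indexes/merged", new StandardAnalyzer(), true);
    diskWriter.addIndexes(new Directory[] { ramDir });
    diskWriter.close();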

Kevin


Re: downloading Lucene 1.4.2

2004-11-22 Thread Erik Hatcher
Click the "here" link on the Lucene home page.

We Lucene committers have been very, very lame and have not published
the binary distribution appropriately for the mirrors to pick up.  One
of these days we'll correct this, but for now you can click the link
from the announcement on the home page.

Erik
On Nov 22, 2004, at 3:05 PM, Sullivan, Sean C - MWT wrote:
> According to the Lucene homepage, Lucene 1.4.2 was released
> on October 1, 2004
>
> However, the dist on www.apache.org does not have a copy of
> Lucene 1.4.2
>
>    http://www.apache.org/dist/jakarta/lucene/binaries/
>
> Where can I download Lucene 1.4.2?
>
> -Sean


Re: auto-generate uid?

2004-11-22 Thread aurora
Just to clarify: I have a field 'uid' whose value is a unique integer. I
use it as a key to the document stored externally. I don't mean Lucene's
internal document number.

I was wondering if there is a method to query the highest value of a
field, perhaps something like:

  IndexReader.maxTerm('uid')


--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/


Re: auto-generate uid?

2004-11-22 Thread Erik Hatcher
On Nov 22, 2004, at 4:39 PM, aurora wrote:
> Just to clarify: I have a field 'uid' whose value is a unique integer.
> I use it as a key to the document stored externally. I don't mean
> Lucene's internal document number.
>
> I was wondering if there is a method to query the highest value of a
> field, perhaps something like:
>
>   IndexReader.maxTerm('uid')

There isn't quite that type of API, though you can skip to a known term
and enumerate from there:

	http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermEnum.html#skipTo(org.apache.lucene.index.Term)

IndexReader gives you a TermEnum from either the terms() method or the
terms(Term) method.
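
As a sketch, scanning to the last term of the field gives the
lexicographically largest uid (which only equals the numeric maximum if
the uids are zero-padded to a fixed width; the path is illustrative):

    IndexReader reader = IndexReader.open("/path/to/index");
    // terms(Term) positions the enumerator at the first term >= the given one.
    TermEnum terms = reader.terms(new Term("uid", ""));
    String maxUid = null;
    while (terms.term() != null && "uid".equals(terms.term().field())) {
        maxUid = terms.term().text();
        if (!terms.next()) {
            break;
        }
    }
    terms.close();
    reader.close();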

Erik




Re: auto-generate uid?

2004-11-22 Thread Bernhard Messer
> Just to clarify: I have a field 'uid' whose value is a unique integer.
> I use it as a key to the document stored externally. I don't mean
> Lucene's internal document number.
>
> I was wondering if there is a method to query the highest value of a
> field, perhaps something like:
>
>   IndexReader.maxTerm('uid')

What you could do is write your own IndexWriter class by extending the
original one found in org.apache.lucene.index.IndexWriter. Then you have
direct access to Lucene's segment counter, which could provide you a
unique id for each document in the index. Those ids would stay sticky
even if you modify the index after the initial creation process.

Is that the hint you need to get started?
regards
Bernhard



Re: auto-generate uid?

2004-11-22 Thread Terry Steichen
Not exactly sure what you're trying to do.  You can easily generate a
number when you index each Document and insert it in a uid field (which
is, BTW, what I do), and if you base it on a timestamp plus some
characteristic of the document (which is also what I do), it should
always be unique.  As you add more documents, they will each get their
own unique id.  When you delete documents and optimize, these ids won't
be affected.
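
A sketch of that scheme (using a hash of the document's path as the
characteristic is just one example):

    // Timestamp plus a per-document characteristic; unique in practice.
    public static String makeUid(String docPath) {
        return System.currentTimeMillis() + "-" + Math.abs(docPath.hashCode());
    }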

However, in your subsequent clarification, you indicated you already have
a unique id and want to find its maximum value.  So why did you say you
want one auto-generated?

Terry



Re: Too many open files issue

2004-11-22 Thread Dmitry
I'm sorry, I wasn't involved in the original conversation but maybe I 
can jump in with some info that will help.

The number of files depends on the merge factor, number of segments, and 
number of indexed fields in your index. It also depends on whether you 
are using compound files or not (this is a flag on the IndexWriter). 
With compound files flag on, segments have fixed number of files, 
regardless of how many fields you use. Without the flag, each field is a 
separate file.

Let's say you have 10 segments (per your merge factor) that are being 
merged into a new segment (via an optimize call or just because you have 
reached the merge factor). This means there are 11 segments open at the 
same time. If you have 20 indexed fields and are not using compound 
files, that's 20 * 11 = 220 files. There are a few other files open as 
well, plus whatever other files and sockets that your JVM process is 
holding open at that time. This would include incoming connections, for 
example, if this is running inside a web server. If you are running in 
an application server, this could include connections and files open by 
other applications in that same app server.

So the numbers run up quite a bit.
By the way, it is usual to have the file descriptor limit set at 9000
or so for Unix machines running production web applications. Also, on
Solaris you will need to modify a value in /etc/system to get up to
this level. Not sure about Linux or other flavors.

Another suggestion: you may want to look into a tool called "lsof". It
is a utility that shows the file handles open by a particular process.
It could be that some other part of your process (or of the application
server, VM, etc.) is not closing files. This tool will help you see what
files are open, and you can validate that all of them really need to be
open.

Best of luck.
Dmitry.


Re: Too many open files issue

2004-11-22 Thread Chris Lamprecht
A useful resource for increasing the number of file handles on various
operating systems is the Volano Report:

http://www.volano.com/report/

 I had requested help on an issue we have been facing with the Too many
 open files Exception garbling the search indexes and crashing the
 search on the web site.




JDBCDirectory to prevent optimize()?

2004-11-22 Thread Kevin A. Burton
It seems that when compared to other datastores Lucene starts to fall
down.  For example, Lucene doesn't perform online index optimizations,
so if you add 10 documents you have to run optimize() again, and this
isn't exactly a fast operation.

I'm wondering about the potential for a generic JDBCDirectory for
keeping the Lucene index within a database.

It sounds somewhat unconventional, but it would allow you to perform
live addDirectory updates without performing an optimize() again.

Has anyone looked at this?  How practical would it be?
Kevin


multi-dimensional scaling

2004-11-22 Thread DES
Is it possible to combine Lucene and multi-dimensional scaling in some way?

Numeric Range Restrictions: Queries vs Filters

2004-11-22 Thread Hoss
(NOTE: numbers in [] indicate Footnotes)

I'm rather new to Lucene (and this list), so if I'm grossly
misunderstanding things, forgive me.

One of my main needs as I investigate Search technologies is to restrict
results based on Ranges of numeric values.  Looking over the archives of
this list, it seems that lots of people have run into problems dealing
with this.  In particular, whenever someone asks a question about Numeric
Ranges the question seem to always involve one (or more) of the
following:

   (a) Lexical sorting puts 11 in the range 1 TO 5
   (b) Dates (or Dates and Times)
   (c) BooleanQuery$TooManyClauses Exceptions
   (d) Should I use a filter?

(a) is a solved problem as long as you use a formatter like
LongField.java[1]

(b) is really nothing more than a special case of dealing with generic
numeric values.  While there are certainly special-purpose solutions that
sometimes apply to dealing with Date ranges, any good solution for dealing
with raw numeric ranges can be applied to Dates (and Times)

(c) is a situation that seems to come up a lot because of the way
RangeQuery works.  The rewrite method walks all of the Terms in the index
starting with lowerTerm and builds up BooleanQuery containing a separate
TermQuery for every Term found, until it reaches the upperTerm.  This
causes a range search of 0001 TO 1000 to generate a BooleanQuery with N
clauses, where N is the quantity of unique values in the field which are
lexically greater than 0001 and lexically less than 1000.  Depending on
the nature of your data, this might be 0 BooleanClauses, or it might be
1000 BooleanClauses; but the list is built before the search is ever even
executed.

At first, this may seem really strange -- I know I was certainly confused
-- but there is a very good reason for it: Ultimately RangeQuery still
provides you with a meaningful score for each document, based on the
frequency (and quantity) of terms that document has in the range [2].  In
order to do that, it has to expand itself, but what if you don't care if
your Range restriction impacts the Score? [3]

Which brings us to...

(d) Filtering.  Filters in general make a lot of sense to me.  They are a
way to specify (at query time) that only a certain subset of the index
should be considered for results.  The Filter class has a very straight
forward API that seems very easy to subclass to get the behavior I want.
The Query API on the other hand ... I freely admit, that I can't make
heads or tails out of it.  I don't even know where I would begin to try
and write a new subclass of Query if I wanted to.

I would think that most people who want to do a numeric range
restriction on their data probably don't care about the Scoring benefits
of RangeQuery.  Looking at the code base, the way DateFilter works seems
like it provides an ideal solution to any sort of Range restriction (not
just Dates) that *should* be more efficient than using RangeQuery when
dealing with an unbounded value set. (Both approaches need to iterate over
all of the terms in the specified field using TermEnum, but RangeQuery has
to build up a set of BooleanQuery objects for each matching term, and
then each of those queries has to help score the documents -- DateFilter
on the other hand only has to maintain a single BitSet of documents that
it finds as it iterates.)
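
For reference, the core of such a filter is small.  This is an
illustrative reconstruction in the spirit of DateFilter, not the
attachment itself:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.Filter;

    public class RangeFilter extends Filter {
        private final String field;
        private final String lowerTerm;
        private final String upperTerm;

        public RangeFilter(String field, String lowerTerm, String upperTerm) {
            this.field = field;
            this.lowerTerm = lowerTerm;
            this.upperTerm = upperTerm;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermEnum enumerator = reader.terms(new Term(field, lowerTerm));
            TermDocs termDocs = reader.termDocs();
            try {
                Term stop = new Term(field, upperTerm);
                // Walk the term range once, marking every matching document.
                while (enumerator.term() != null
                        && enumerator.term().compareTo(stop) <= 0) {
                    termDocs.seek(enumerator.term());
                    while (termDocs.next()) {
                        bits.set(termDocs.doc());
                    }
                    if (!enumerator.next()) {
                        break;
                    }
                }
            } finally {
                enumerator.close();
                termDocs.close();
            }
            return bits;
        }
    }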

But I was surprised then to see the following quote from Erik Hatcher in
the archives:

  In fact, DateFilter by itself is practically of no use, I think. [4]

...Erik goes on to suggest that given a set of canned date ranges, it
doesn't really matter if you use a RangeQuery or a DateFilter -- as long
as you cache them to reuse them (with something like CachingWrappingFilter
or QueryFilter).  I'm hoping that he might elaborate on that comment?

As a test, I wrote a RangeFilter which borrows heavily from DateFilter
to both convince myself it could work and to do a comparison between it
and RangeQuery. [5] Based on my limited tests, using a Filter to restrict
to a Range is a lot faster than using RangeQuery -- independent of
caching.

The attachment contains my RangeFilter, a unit test that demonstrates it,
and a Benchmarking unit test that does a side-by-side comparison with
RangeQuery [6].  If developers feel that this class is useful, then by all
means roll it into the code base.  (90% of it is cut/pasted from
DateFilter/RangeQuery anyway)


Comments? ... Questions? ... Answers?



Footnotes:

[1] It seems to me this class is extremely useful, does anyone know
if there's a particular reason it hasn't been added to the main Lucene
codebase?
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg04790.html

[2] Take a look at RangeQueryScoreDemo.java in the attachment, which
produces output something like this...
   Range Search for: 'apple' TO 'dog'
   0.40924072 ... bed dog emu
   0.38014847 ... DOG
   0.2825246 ... cat
   0.17657787 ... apple emu
   0.12671615 ... dog


Re: Numeric Range Restrictions: Queries vs Filters

2004-11-22 Thread Hoss

Of course, not only did I manage to forget to include the attachment, but
when I sent a reply with the code, mail.apache.org rejected it because it
was a ZIP file.

So let's see how mail.apache.org feels about 6 separate text files.

