Re: Permissioning Documents

2004-12-10 Thread Paul Elschot
On Friday 10 December 2004 07:10, Steve Skillcorn wrote:
 Hi;
  
 I'm currently using Lucene (which I am extremely impressed with, BTW) to
 index a knowledge base of documents.  One issue I have is that only certain
 documents are available to certain users (or groups).  The number of
 documents is large, into the 100,000s, and the number of users can be in
 the 1000s.  Obviously, the users permitted to see certain documents can
 change regularly, so storing the user IDs in the Lucene document is
 undesirable, as a permission change could mean a delete and re-add of
 potentially 100s of documents.
  
 Does anyone have any guidance as to how I should approach this?

A typical solution would be to use a Filter for each user group.
Each Filter would be built from categories indexed with the documents.
The moment to build a group Filter could be the first time a user from
a group queries an index after it is opened.
Filters can be cached; see the recent discussion on CachingWrapperFilter
and friends.
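A rough plain-Java sketch of that caching pattern (class and method names invented here for illustration; in Lucene you would wrap a QueryFilter in CachingWrapperFilter rather than hand-roll the cache):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One BitSet of allowed doc ids per group, built lazily the first time a
// user from that group searches, and reused until the index is reopened.
public class GroupFilterCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    private final Map<Integer, String> docCategories; // docId -> category

    public GroupFilterCache(Map<Integer, String> docCategories) {
        this.docCategories = docCategories;
    }

    // Returns the cached bitset for a group, building it on first request.
    public synchronized BitSet filterFor(String group, List<String> allowedCategories) {
        return cache.computeIfAbsent(group, g -> {
            BitSet bits = new BitSet();
            for (Map.Entry<Integer, String> e : docCategories.entrySet()) {
                if (allowedCategories.contains(e.getValue())) {
                    bits.set(e.getKey());
                }
            }
            return bits;
        });
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(0, "hr", 1, "eng", 2, "hr");
        GroupFilterCache cache = new GroupFilterCache(docs);
        System.out.println(cache.filterFor("hr-group", List.of("hr"))); // {0, 2}
    }
}
```

The second call for the same group returns the cached BitSet without rebuilding it, which is the point of the pattern.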

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene in Action e-book now available!

2004-12-10 Thread Erik Hatcher
The Lucene in Action e-book is now available at Manning's site:
http://www.manning.com/hatcher2
Manning also put lots of other goodies there, the table of contents, 
about this book, preface, the foreword from Doug Cutting himself 
(thanks Doug!!!), and a couple of sample chapters.  The complete source 
code is there as well.

Now comes the exciting part: finding out what others think of the work 
Otis and I spent 14+ months of our lives on.

Erik


Re: Permissioning Documents

2004-12-10 Thread mark harwood
Hi Steve,
Possibly the easiest way to handle this is to tag the
documents with a field listing the permitted
roles/groups (not the individual users). 
I would be tempted to keep the information that
associates users to groups outside of the Lucene index
eg in a relational DB. 
This way you do not need to worry about updating the
Lucene index everytime a new user is added or is
granted membership to a group. 

When you search, simply use a QueryFilter which lists
the current user's roles e.g. groups:(admin,
projectManager) - this will restrict the search
results to only those docs associated with the user's
roles.
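The check a filter like groups:(admin projectManager) performs can be sketched in plain Java (names invented for illustration; the user-to-role lookup would come from the external DB):

```java
import java.util.Set;

// A document passes the filter if it is tagged with at least one of the
// current user's roles. User -> role membership lives outside the index.
public class RoleMatch {
    public static boolean visible(Set<String> docRoles, Set<String> userRoles) {
        for (String r : userRoles) {
            if (docRoles.contains(r)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(visible(Set.of("admin"), Set.of("admin", "projectManager"))); // true
        System.out.println(visible(Set.of("finance"), Set.of("admin"))); // false
    }
}
```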

Cheers
Mark









HITCOLLECTOR+SCORE+DELIMA

2004-12-10 Thread Karthik N S

Hi guys

Apologies.



I am still in a dilemma about how to use a HitCollector to return hits
with scores between 0.2f and 1.0f.

There is no simple example for this, yet there's lots of talk about its
usage on the forum.

Please, somebody, spare a bit of code (your intelligence) on this forum.





Thx in advance
Karthik

  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







RE: Lucene in Action e-book now available!

2004-12-10 Thread William W
Am I the first one who bought the Lucene in Action book ?
Thanks Erik and Otis.
William W. Silva

From: Erik Hatcher [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: Lucene User [EMAIL PROTECTED],Lucene List 
[EMAIL PROTECTED]
Subject: Lucene in Action e-book now available!
Date: Fri, 10 Dec 2004 03:52:55 -0500

The Lucene in Action e-book is now available at Manning's site:
http://www.manning.com/hatcher2
Manning also put lots of other goodies there, the table of contents, about 
this book, preface, the foreword from Doug Cutting himself (thanks 
Doug!!!), and a couple of sample chapters.  The complete source code is 
there as well.

Now comes the exciting part to find out what others think of the work Otis 
and I spent 14+ months of our lives on.

Erik


Re: HITCOLLECTOR+SCORE+DELIMA

2004-12-10 Thread Erik Hatcher
On Dec 10, 2004, at 7:39 AM, Karthik N S wrote:
I am still in a dilemma about how to use a HitCollector to return hits
with scores between 0.2f and 1.0f.

There is no simple example for this, yet there's lots of talk about its
usage on the forum.
Unfortunately there isn't a clean way to stop a HitCollector - it will 
simply collect all hits.

Also, scores are _not_ normalized when passed to a HitCollector, so you 
may get scores > 1.0.  Hits, however, does normalize, and you're 
guaranteed that scores will be <= 1.0.  Hits are in descending score 
order, so you may just want to use Hits and filter based on the score 
provided by hits.score(i).
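Since Hits come back in descending score order, the early-exit loop looks like this in plain Java (the score array stands in for hits.score(i)):

```java
import java.util.ArrayList;
import java.util.List;

// Walk the descending scores and stop at the first one below the cutoff;
// everything after it is guaranteed to be lower still.
public class ScoreCutoff {
    public static List<Integer> keepAbove(float[] descendingScores, float min) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < descendingScores.length; i++) {
            if (descendingScores[i] < min) break; // rest are lower still
            kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.5f, 0.19f, 0.1f};
        System.out.println(keepAbove(scores, 0.2f)); // [0, 1]
    }
}
```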

Erik


Re: SEARCH +HITS+LIMIT

2004-12-10 Thread Erik Hatcher
On Dec 10, 2004, at 8:24 AM, Andraz Skoric wrote:
Displaytag (http://displaytag.sourceforge.net/) is for displaying 
search results in multiple pages
I don't know displaytag internals, but be cautious with such things.  
What you do not want to happen is all the results to be grabbed and 
cached somehow.  You only want to retrieve the actual documents being 
shown on that specific page.  It looks like displaytag can support 
this, as long as you provide your own custom pruned document set.
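The page-at-a-time arithmetic can be sketched in plain Java (names invented; with Lucene you would then fetch only hits.doc(i) for the computed range, not the whole result set):

```java
// Compute which hit indexes belong on the requested page so only those
// documents are retrieved, rather than caching every result.
public class PageSlice {
    public static int[] range(int totalHits, int page, int pageSize) {
        int from = page * pageSize;
        int to = Math.min(from + pageSize, totalHits);
        if (from >= totalHits) return new int[] {0, 0}; // past the end
        return new int[] {from, to}; // fetch hits.doc(i) for from <= i < to
    }

    public static void main(String[] args) {
        int[] r = range(43, 2, 10);
        System.out.println(r[0] + ".." + r[1]); // 20..30
    }
}
```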

Personally, I use Tapestry :)
Erik


Re: Lucene in Action e-book now available!

2004-12-10 Thread Luke Shannon
Nice Work!

Congratulations Guys.

- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene User [EMAIL PROTECTED]; Lucene List
[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 3:52 AM
Subject: Lucene in Action e-book now available!


 The Lucene in Action e-book is now available at Manning's site:

 http://www.manning.com/hatcher2

 Manning also put lots of other goodies there, the table of contents,
 about this book, preface, the foreword from Doug Cutting himself
 (thanks Doug!!!), and a couple of sample chapters.  The complete source
 code is there as well.

 Now comes the exciting part to find out what others think of the work
 Otis and I spent 14+ months of our lives on.

 Erik










Re: Lucene in Action e-book now available!

2004-12-10 Thread Robinson Raju
Congrats!
I went through sample chapter 1. Well written.


On Fri, 10 Dec 2004 09:58:25 -0500, Luke Shannon
[EMAIL PROTECTED] wrote:
 Nice Work!
 
 Congratulations Guys.
 
 
 
 - Original Message -
 From: Erik Hatcher [EMAIL PROTECTED]
 To: Lucene User [EMAIL PROTECTED]; Lucene List
 [EMAIL PROTECTED]
 Sent: Friday, December 10, 2004 3:52 AM
 Subject: Lucene in Action e-book now available!
 
  The Lucene in Action e-book is now available at Manning's site:
 
  http://www.manning.com/hatcher2
 
  Manning also put lots of other goodies there, the table of contents,
  about this book, preface, the foreword from Doug Cutting himself
  (thanks Doug!!!), and a couple of sample chapters.  The complete source
  code is there as well.
 
  Now comes the exciting part to find out what others think of the work
  Otis and I spent 14+ months of our lives on.
 
  Erik
 
 
 
 
 
 
 


-- 
Regards,
Robin
9886394650
The merit of an action lies in finishing it to the end




Re: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Justin Swanhart
You probably need to increase the amount of RAM available to your JVM.  

See the parameters:
-Xmx   :Maximum memory usable by the JVM
-Xms   :Initial memory allocated to JVM

My params are:  -Xmx2048m -Xms128m  (2 GB max, 128 MB initial)


On Fri, 10 Dec 2004 11:17:29 -0600, Sildy Augustine
[EMAIL PROTECTED] wrote:
 I think you should close your files in a finally clause in case of
 exceptions with file system and also print out the exception.
 
 You could be running out of file handles.
 
 
 
 -Original Message-
 From: Jin, Ying [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 10, 2004 11:15 AM
 To: [EMAIL PROTECTED]
 Subject: OutOfMemoryError with Lucene 1.4 final
 
 Hi, Everyone,
 
 We're trying to index ~1500 archives but get OutOfMemoryError about
 halfway through the index process. I've tried to run program under two
 different Redhat Linux servers: One with 256M memory and 365M swap
 space. The other one with 512M memory and 1G swap space. However, both
 got OutOfMemoryError at the same place (at record 898).
 
 Here is my code for indexing:
 
 ===
 
 Document doc = new Document();
 doc.add(Field.UnIndexed("path", f.getPath()));
 doc.add(Field.Keyword("modified",
     DateField.timeToString(f.lastModified())));
 doc.add(Field.UnIndexed("eprintid", id));
 doc.add(Field.Text("metadata", metadata));
 
 FileInputStream is = new FileInputStream(f);  // the text file
 BufferedReader reader = new BufferedReader(new InputStreamReader(is));
 
 StringBuffer stringBuffer = new StringBuffer();
 String line = "";
 
 try {
   while ((line = reader.readLine()) != null) {
     stringBuffer.append(line);
   }
   doc.add(Field.Text("contents", stringBuffer.toString()));
   // release the resources
   is.close();
   reader.close();
 } catch (java.io.IOException e) {}
 
 =
 
 Is there anything wrong with my code, or do I need more memory?
 
 Thanks for any help!
 
 Ying
 
 
 





Re: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Xiangyu Jin

I am not sure, but I guess there are three possibilities:

(1) I see that you use
Field.Text("contents", stringBuffer.toString())
This stores the whole text string in the Document object,
and it might be long.

I do not know the details of how Lucene is implemented,
but try Field.UnStored first to see
if the same problem happens.

BTW, how large are your documents? My index has 1M docs with a
max length of less than 1 MB, usually about several KB each.

(2) Another possibility is that record 898 is a very long
document; maybe Java's String object has a max length?
Just trace the code and see where the exception occurs.

(3) Moreover, the Java VM has its own memory setting, independent
of the hardware you are running on. I ran into this before when I
used a directory's list-of-files function, which easily exceeded
the max memory when there were 1M docs under the same dir (a stupid
mistake I made). But after I expanded the VM's memory, it was OK.

:)





On Fri, 10 Dec 2004, Jin, Ying wrote:

 Hi, Everyone,



 We're trying to index ~1500 archives but get OutOfMemoryError about
 halfway through the index process. I've tried to run program under two
 different Redhat Linux servers: One with 256M memory and 365M swap
 space. The other one with 512M memory and 1G swap space. However, both
 got OutOfMemoryError at the same place (at record 898).



 Here is my code for indexing:

 ===

 Document doc = new Document();
 doc.add(Field.UnIndexed("path", f.getPath()));
 doc.add(Field.Keyword("modified",
     DateField.timeToString(f.lastModified())));
 doc.add(Field.UnIndexed("eprintid", id));
 doc.add(Field.Text("metadata", metadata));
 
 FileInputStream is = new FileInputStream(f);  // the text file
 BufferedReader reader = new BufferedReader(new InputStreamReader(is));
 
 StringBuffer stringBuffer = new StringBuffer();
 String line = "";
 
 try {
   while ((line = reader.readLine()) != null) {
     stringBuffer.append(line);
   }
   doc.add(Field.Text("contents", stringBuffer.toString()));
   // release the resources
   is.close();
   reader.close();
 } catch (java.io.IOException e) {}

 =

 Is there anything wrong with my code, or do I need more memory?



 Thanks for any help!

 Ying






sorting tokenized field

2004-12-10 Thread Praveen Peddi
I read that tokenized fields cannot be sorted. In order to sort a tokenized 
field, the application either has to duplicate the field under a different name 
and not tokenize it, or come up with something else. But shouldn't the search 
engine take care of this? Are there any plans to build this functionality into 
Lucene?

Praveen
** 
Praveen Peddi
Sr Software Engg, Context Media, Inc. 
email:[EMAIL PROTECTED] 
Tel:  401.854.3475 
Fax:  401.861.3596 
web: http://www.contextmedia.com 
** 
Context Media- The Leader in Enterprise Content Integration 


Re: sorting tokenized field

2004-12-10 Thread Erik Hatcher
On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
I read that tokenized fields cannot be sorted. In order to sort a 
tokenized field, the application either has to duplicate the field under 
a different name and not tokenize it, or come up with something else. 
But shouldn't the search engine take care of this? Are there any plans 
to build this functionality into Lucene?
It would be wasteful for Lucene to assume any field you add should be 
available for sorting.

Adding one more line to your indexing code to accommodate your sorting 
needs seems a pretty small price to pay.  Do you have suggestions to 
improve how this works?   Or how it is documented?
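That extra line amounts to keeping a second, untokenized copy of the value just for ordering; in Lucene 1.4 that would be something like Field.Keyword("titleSort", title), where the "titleSort" name is made up for this sketch. The idea in plain Java:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sort by an untokenized key derived once at index time, while still
// displaying (and matching on) the original tokenized value.
public class SortKeyDemo {
    // Derives the single untokenized sort value from the display title.
    public static String sortKey(String title) {
        return title.toLowerCase().trim();
    }

    public static void main(String[] args) {
        String[] titles = {"Zebra tools", "apple pie"};
        Arrays.sort(titles, Comparator.comparing(SortKeyDemo::sortKey));
        System.out.println(Arrays.toString(titles)); // [apple pie, Zebra tools]
    }
}
```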

Erik


Re: sorting tokenized field

2004-12-10 Thread Praveen Peddi
I was only thinking in terms of other search engines. I have worked with other 
search engines and hadn't seen this requirement before. I think you are 
right that it's wasteful to duplicate all tokenized fields. Not sure if there 
is a smarter way of dealing with it.

Praveen
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 1:53 PM
Subject: Re: sorting tokenized field


On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
I read that tokenized fields cannot be sorted. In order to sort a 
tokenized field, the application either has to duplicate the field under a 
different name and not tokenize it, or come up with something else. But 
shouldn't the search engine take care of this? Are there any plans to build 
this functionality into Lucene?
It would be wasteful for Lucene to assume any field you add should be 
available for sorting.

Adding one more line to your indexing code to accommodate your sorting 
needs seems a pretty small price to pay.  Do you have suggestions to 
improve how this works?   Or how it is documented?

Erik




Re: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Jin, Ying
Great!!! It works perfectly after I set the -Xms and -Xmx JVM command-line 
parameters:
java -Xms128m -Xmx128m

It turns out that my JVM was running out of memory. And Otis was right on
my reader closing too:
reader.close() will close the reader and release any system resources 
associated with it.

I really appreciate everyone's help!
Ying



No of docs using IndexSearcher

2004-12-10 Thread Ravi
 How do I get the number of docs in an index if I just have access to a
searcher on that index?

Thanks in advance
Ravi.




Re: No of docs using IndexSearcher

2004-12-10 Thread [EMAIL PROTECTED]
numDocs()
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs()

Ravi said the following on 12/10/2004 2:42 PM:
How do I get the number of docs in an index If I just have access to a
searcher on that index?
Thanks in advance
Ravi.

 



Re: No of docs using IndexSearcher

2004-12-10 Thread [EMAIL PROTECTED]
If your index is open, shouldn't there be an instance of IndexReader 
already there?

Ravi said the following on 12/10/2004 3:13 PM:
I already have a field with a constant value in my index. How about
using IndexSearcher.docFreq(new Term("field", "value"))? Then I don't have
to instantiate IndexReader.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 10, 2004 2:59 PM
To: Lucene Users List
Subject: Re: No of docs using IndexSearcher

numDocs()
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs()

Ravi said the following on 12/10/2004 2:42 PM:
How do I get the number of docs in an index if I just have access to a
searcher on that index?
Thanks in advance
Ravi.



RE: No of docs using IndexSearcher

2004-12-10 Thread Ravi
I'm fairly new to Lucene.  The main reason I didn't use the
IndexReader constructor for the searcher is that we organize the indexes as
different partitions depending on a document's date, and during searching I
instantiate a MultiSearcher object over these partitions, depending on the
from-date and to-date of the search. I was getting a runtime exception
during search if an index did not have any documents. That's why I was
looking for some method on the searcher object that gives me the number of
documents.

Thanks
Ravi


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 10, 2004 3:25 PM
To: Lucene Users List
Subject: Re: No of docs using IndexSearcher

If your index is open shouldnt there be an instance of IndexReader
already there?


Ravi said the following on 12/10/2004 3:13 PM:

I already have a field with a constant value in my index. How about
using IndexSearcher.docFreq(new Term(field,value))? Then I don't have
to
instantiate IndexReader. 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 10, 2004 2:59 PM
To: Lucene Users List
Subject: Re: No of docs using IndexSearcher

numDocs()
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs()

Ravi said the following on 12/10/2004 2:42 PM:

How do I get the number of docs in an index if I just have access to a
searcher on that index?

Thanks in advance
Ravi.




Re: sorting tokenized field

2004-12-10 Thread Praveen Peddi
Since I am not familiar with the Lucene code, I couldn't make much out of 
your patch. But is this patch already tested and proved to be efficient? If 
so, why can't it be merged into the Lucene code and made part of the 
release? I think the bug is valid. It's very likely that people want to sort 
on tokenized fields.

If I apply this patch to the Lucene code and use it for myself, I will have a 
hard time maintaining it in the future (while upgrading the Lucene library). 
If the patch is applied to the Lucene release code, it would be very easy for 
Lucene users.

If possible, can someone explain what the patch does? I am trying to 
understand what exactly changed but could not figure it out.

Praveen
- Original Message - 
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 2:30 PM
Subject: RE: sorting tokenized field


I have suggested a solution for this problem (
http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ) you can use the
patch suggested there and recompile lucene.
Aviran
http://www.aviransplace.com
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 13:53 PM
To: Lucene Users List
Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
I read that tokenized fields cannot be sorted. In order to sort a
tokenized field, the application either has to duplicate the field under a
different name and not tokenize it, or come up with something else. But
shouldn't the search engine take care of this? Are there any plans to
build this functionality into Lucene?
It would be wasteful for Lucene to assume any field you add should be
available for sorting.
Adding one more line to your indexing code to accommodate your sorting
needs seems a pretty small price to pay.  Do you have suggestions to
improve how this works?   Or how it is documented?
Erik


Sorting based on calculations at search time

2004-12-10 Thread Gurukeerthi Gurunathan
Hello,
 
 I'd like some suggestions on the following scenario.
 Say I have an index with a stored, indexed field called
'weight' (essentially an int stored as a string). I'd like to sort the
search results in descending order of a final weight computed at search
time from the Lucene score of each hit. For our discussion, the calculation
can be as simple as multiplying the Lucene score by the value of the
'weight' field to get the final weight. The search results can run into
thousands of documents. Though I may finally need only the top X documents,
I wouldn't know what the top X are until I perform this calculation and
sort.
 The obvious way is to post-process the Hits iterator: store it in memory,
perform the calculation, and sort. Is there any better solution?
 
Thanks,
Guru.
 
 
*
Gurukeerthi Gurunathan
Third Pillar Systems
San Mateo, CA
650-372-1200x229
 


Re: Lucene in Action e-book now available!

2004-12-10 Thread Jonathan Hager
Congratulations on the book.  I ordered my copy the other day via
regular post and am eagerly awaiting it.  It looks like it will make
lucene available to a much wider audience.

Based on the table of contents, I wanted to toss out a couple of ideas
for your next book or articles.

1. I didn't see any examples of indexing a database table, although
it was mentioned in Chapter 1.  At the company I am currently
consulting at, we index the data from the database because it's cleaner
than indexing the web.  This discussion should include why you would
want to use Lucene to index a database table rather than just using
the database indexes.  (The top reasons we chose Lucene instead of
just database indexes are: it allows stem-word recognition; it allows
fuzzy searching; it ranks the results based on how good the match is;
it contains a parser that will parse natural-language queries; it has
better analyzers.)

2. This one is a cookbook idea: I think it would be possible to index
the access log of a web server.  Then when a user views product X, the
searcher could search for other products that were viewed by people
who also looked at product X.  In this way you can create basic
cross-selling opportunities.  This feature is a big seller to
managers for commercial search offerings.

3. A lot of search applications built using Lucene are web
applications.  I didn't see any reference to the two different
strategies for paging a hit list: repeating the search, and caching
a search.  An example of this would be good.
[I know that I have seen this online; it's just nice to have a
reference in book form.]

Please don't take this as criticism, first of all because I have not
read the book, and secondly because I am excluding the other 17 topics
I thought should be in a book (for example, indexing PDFs, highlighting
search results, creating a thesaurus, suggesting alternative spellings,
filtering by ACLs, etc.) because they are clearly in your table of
contents.

I look forward to reading the book and appreciate your 14+ months of
hard work to create a concise but valuable book for Lucene.

Jonathan


On Fri, 10 Dec 2004 03:52:55 -0500, Erik Hatcher
[EMAIL PROTECTED] wrote:
 The Lucene in Action e-book is now available at Manning's site:
 
 http://www.manning.com/hatcher2
 
 Manning also put lots of other goodies there, the table of contents,
  about this book, preface, the foreword from Doug Cutting himself
 (thanks Doug!!!), and a couple of sample chapters.  The complete source
 code is there as well.
 
 Now comes the exciting part to find out what others think of the work
 Otis and I spent 14+ months of our lives on.
 
 Erik
 
 





RE: Sorting based on calculations at search time

2004-12-10 Thread Gurukeerthi Gurunathan
Thanks Otis for your response and compliments (wish I was a lucene guru
like you guys :-)

I believe you are talking about the boost factor for fields or documents
while searching. That does not apply in my case - maybe I am missing a
point here. 
The weight field I was talking about is only for the calculation;
I am not searching on that field (it can be just a stored,
unindexed field). The main searching happens on other fields (like title,
keywords, etc.), for which I am already using boost factors. The
problem starts after I search and get some set of results: all I want
is the results ordered by a number that is the Lucene score multiplied
by the weight field value for each document.

I understand that I cannot retrieve the score and the weight for each
document without iterating through the hits, which is why I'd like this
calculation and ordering to happen while searching, so that I can avoid
iterating over the entire hits. If it involves working on the Lucene
source code, please point me to the right class or package that I should
be dealing with.
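One way to keep the post-processing cheap is a bounded min-heap over score * weight, sketched here in plain Java (the arrays are hypothetical stand-ins for the Hits scores and the stored weight field):

```java
import java.util.PriorityQueue;

// Keep only the top X combined scores in a small min-heap while walking
// the hits once, instead of materializing and sorting the whole result set.
public class WeightedTopX {
    public static double[] topX(double[] scores, double[] weights, int x) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (int i = 0; i < scores.length; i++) {
            heap.add(scores[i] * weights[i]);
            if (heap.size() > x) heap.poll(); // drop the current smallest
        }
        double[] top = new double[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) top[i] = heap.poll();
        return top; // descending order
    }

    public static void main(String[] args) {
        double[] s = {0.9, 0.5, 0.8};
        double[] w = {1.0, 3.0, 1.0};
        System.out.println(java.util.Arrays.toString(topX(s, w, 2))); // [1.5, 0.9]
    }
}
```

This is O(n log X) over n hits rather than O(n log n) for a full sort, which matters when only the top X are needed.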

Thanks again,
Guru.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 10, 2004 3:13 PM
To: Lucene Users List
Subject: Re: Sorting based on calculations at search time

Guru (I thought my first name was OK until now),

Have you tried using boosts for that?  You can boost individual Document
Fields when indexing, and/or you can boost individual Documents, thus
giving some more and some less 'weight', which will have an effect on
the final score.

Otis



--- Gurukeerthi Gurunathan [EMAIL PROTECTED] wrote:

 Hello,
  
  I'd like some suggestions on the following scenario. 
  Say I have an index with a stored, indexed field called 
 'weight'(essentially an int stored as string). I'd like to sort in 
 descending order of final weight, the search results by performing a 
 calculation involving the lucene score for each hits. For our 
 discussion, the calculation can be as simple as multiplying the lucene

 score with the value from the field 'weight' to get final weight. The 
 search results can run into thousands of documents. Though finally I 
 may need only the top X number of documents, I wouldn't know what the 
 top X would be until I perform this calculation and sort it.
  The obvious way is to do a post processing of the hits iterator, 
 storing it in memory, performing this calculation and sorting it. Is 
 there any other better solution for this?
  
 Thanks,
 Guru.
  
  
 *
 Gurukeerthi Gurunathan
 Third Pillar Systems
 San Mateo, CA
 650-372-1200x229
  
 







A simple Query Language

2004-12-10 Thread Dongling Ding
Hi,

 

I am going to implement a search service and plan to use Lucene. Is
there any simple query language that is independent of any particular
search engine out there?

 

Thanks

 

 

Dongling

 









RE: A simple Query Language

2004-12-10 Thread Chuck Williams
You could support only terms with no operators at all, which will work
in most search engines (except those that require combining operators).
Using just terms and phrases embedded in double quotes is pretty universal.
After that, you might want to add +/- required/prohibited restrictions,
which many engines support.  Beyond that, I think you're getting pretty
engine-specific.  Lucene supports all of these and many more.
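A minimal sketch of parsing just that portable subset (bare terms, quoted phrases, +/- prefixes) in plain Java; this is illustrative only, not Lucene's QueryParser:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits a query into clauses: each is a term or a phrase, optionally
// marked required ('+') or prohibited ('-').
public class TinyQueryParser {
    public static class Clause {
        public final String text;
        public final boolean phrase;
        public final char prefix; // '+', '-', or ' ' for none

        Clause(String text, boolean phrase, char prefix) {
            this.text = text;
            this.phrase = phrase;
            this.prefix = prefix;
        }
    }

    private static final Pattern TOKEN =
        Pattern.compile("([+-]?)\"([^\"]+)\"|([+-]?)(\\S+)");

    public static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            boolean phrase = m.group(2) != null;
            String sign = phrase ? m.group(1) : m.group(3);
            String text = phrase ? m.group(2) : m.group(4);
            clauses.add(new Clause(text, phrase, sign.isEmpty() ? ' ' : sign.charAt(0)));
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<Clause> c = parse("+lucene -beta \"query language\"");
        System.out.println(c.size() + " clauses; last is a phrase: " + c.get(2).phrase);
    }
}
```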

Chuck

   -Original Message-
   From: Dongling Ding [mailto:[EMAIL PROTECTED]
   Sent: Friday, December 10, 2004 5:08 PM
   To: Lucene Users List
   Subject: A simple Query Language
   
   Hi,
   
   
   
   I am going to implement a search service and plan to use Lucene. Is
   there any simple query language that is independent of any
particular
   search engine out there?
   
   
   
   Thanks
   
   
   
   
   
   Dongling
   
   
   
   
   
  

   
   





Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-10 Thread Chris Lamprecht
Very cool, thanks for posting this!  

Google's feature doesn't seem to do a search on every keystroke
necessarily.  Instead, it waits until you haven't typed a character
for a short period (I'm guessing about 100 or 150 milliseconds).  So
if you type fast, it doesn't hit the server until you pause.  There
are some more detailed postings on slashdot about how it works.
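That wait-for-a-pause behavior (debouncing) is easy to sketch independently of the UI layer. A minimal stand-alone Java version using a ScheduledExecutorService follows; the 150 ms figure is a guess, not Google's actual value, and the counter stands in for the real Lucene search:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal debounce sketch: only fire the search if no new keystroke
// arrives within DELAY_MS of the previous one.
public class Debouncer {
    private static final long DELAY_MS = 150;   // assumed pause threshold
    private final ScheduledExecutorService exec =
        Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pending;
    final AtomicInteger searchesFired = new AtomicInteger();

    public synchronized void keystroke(final String query) {
        if (pending != null) pending.cancel(false);   // reset the timer
        pending = exec.schedule(new Runnable() {
            public void run() {
                searchesFired.incrementAndGet();      // stand-in for the search
            }
        }, DELAY_MS, TimeUnit.MILLISECONDS);
    }

    public void shutdown() { exec.shutdown(); }

    public static void main(String[] args) throws Exception {
        Debouncer d = new Debouncer();
        for (char c : "rollback".toCharArray()) {     // fast typing: each key
            d.keystroke(String.valueOf(c));           // cancels the prior task
            Thread.sleep(20);
        }
        Thread.sleep(400);                            // pause -> one search fires
        System.out.println(d.searchesFired.get());    // prints 1
        d.shutdown();
    }
}
```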

On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
[EMAIL PROTECTED] wrote:
 
 Google just came out with a page that gives you feedback as to how many
 pages will match your query and variations on it:
 
 http://www.google.com/webhp?complete=1&hl=en
 
 I had an unexposed experiment I had done with Lucene a few months ago
 that this has inspired me to expose - it's not the same, but it's
 similar in that as you type in a query you're given *immediate* feedback
 as to how many pages match.
 
 Try it here: http://www.searchmorph.com/kat/isearch.html
 
 This is my SearchMorph site which has an index of ~90k pages of open
 source javadoc packages.
 
 As you type in a query, on every keystroke it does at least one Lucene
 search to show results in the bottom part of the page.
 
 It also gives spelling corrections (using my NGramSpeller
 contribution) and also suggests popular tokens that start the same way
 as your search query.
 
 For one way to see corrections in action, type in "rollback" character
 by character (don't do a cut and paste).
 
 Note that:
 -- this is not how the Google page works - just similar to it
 -- I do single word suggestions while google does the more useful whole
 phrase suggestions (TBD I'll try to copy them)
 -- They do lots of javascript magic, whereas I use old school frames mostly
 -- this is relatively expensive, as it does 1 query per character, and
 when it's doing spelling correction there is even more work going on
 -- this is just an experiment and the page may be unstable as I fool w/ it
 
 What's nice is when you get used to immediate results, going back to the
 batch way of searching seems backward, slow, and old fashioned.
 
 There are too many idle CPUs in the world - this is one way to keep them
 busier :)
 
 -- Dave
 
 PS Weblog entry updated too:
 http://www.searchmorph.com/weblog/index.php?id=26
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Sorting based on calculations at search time

2004-12-10 Thread Chris Hostetter
: I believe you are talking about the boost factor for fields or documents
: while searching. That does not apply in my case - maybe I am missing a
: point here.
: The weight field I was talking about is only for the calculation

Otis is suggesting that you set the boost of the document to be your
weight value.  That way Lucene will automatically do your multiplication
calculation when determining the score.

The down side of this is that I don't think there's any way to keep
it from influencing the score on every search, so it's not something you
could use only on some queries.


-Hoss
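If the per-document boost route is out (for instance because the weight should only affect some queries), one fallback is to re-sort the top hits yourself, multiplying each raw score by the stored weight. A stand-alone sketch of that re-ranking step; the Hit class here is hypothetical, standing in for whatever you pull out of Lucene's Hits:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical re-ranking step: combine the engine's raw score with a
// per-document weight read from a stored field, then sort by the product.
public class WeightedRerank {
    static class Hit {
        final String id; final float score; final float weight;
        Hit(String id, float score, float weight) {
            this.id = id; this.score = score; this.weight = weight;
        }
    }

    static void rerank(List hits) {
        Collections.sort(hits, new Comparator() {
            public int compare(Object a, Object b) {
                float pa = ((Hit) a).score * ((Hit) a).weight;
                float pb = ((Hit) b).score * ((Hit) b).weight;
                return pa > pb ? -1 : (pa < pb ? 1 : 0);  // descending product
            }
        });
    }

    public static void main(String[] args) {
        List hits = new ArrayList();
        hits.add(new Hit("a", 0.9f, 1.0f));   // product 0.9
        hits.add(new Hit("b", 0.5f, 3.0f));   // product 1.5 -- ranks first
        rerank(hits);
        System.out.println(((Hit) hits.get(0)).id);  // prints b
    }
}
```

This only applies the weight to queries where you choose to call the re-rank, at the cost of fetching the weight field for each candidate hit.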


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SEARCH +HITS+LIMIT

2004-12-10 Thread Andraz Skoric
Displaytag (http://displaytag.sourceforge.net/) is useful for displaying
search results across multiple pages

lp, a
Karthik N S wrote:
Hi Guys,
Apologies...

One question for the forum [especially Erik]:
1) I have a MERGED index with 100,000 files indexed into it (Content is
one of the fields, of type 'Text').
2) A search for a simple word "Camera" returns 6000 hits.
3) Since the search process is via a web app, a simple JSP is used to
display the content.
Question:
How to display the contents for the hits in incremental order?
[Each time a re-hit to the merged index with an incremental X value.]
This would solve the Out of Memory problem caused by prefetching all the
hits in one straight-go process.
Ex:
Total hits 6000
1st page  -  hits returned (1   to   25)
2nd page -  hits returned (26  to  50)
.
.
.
.
   N th page -  hits returned (5975 to 6000)
Hint: This is similar to an SQL query   SELECT * FROM LUCENE LIMIT 10, 5

 WITH WARM REGARDS
 HAVE A NICE DAY
 [ N.S.KARTHIK]
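The LIMIT-style windowing asked about above is just index arithmetic: re-run the query for each page and only fetch the hits in that page's range (Lucene's Hits object loads documents lazily, so the untouched tail is never pulled in). A stand-alone sketch of the window computation; page numbers and page size are arbitrary examples:

```java
// Compute the [start, end) window of hit indices for a given page, so
// only that slice of the result set is ever fetched from the index.
public class HitWindow {
    public static int[] window(int page, int pageSize, int totalHits) {
        int start = (page - 1) * pageSize;          // page is 1-based
        int end = Math.min(start + pageSize, totalHits);
        if (start >= totalHits) start = end;        // past the last page: empty
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int[] w = window(2, 25, 6000);
        System.out.println(w[0] + ".." + w[1]);     // prints 25..50
    }
}
```

In the JSP you would then loop i from start to end and call hits.doc(i) for just those documents.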

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Sildy Augustine
I think you should close your files in a finally clause, in case of
exceptions from the file system, and also print out the exception
instead of swallowing it.

You could be running out of file handles.
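Applied to the reader loop from the original message, the finally-clause pattern looks roughly like this (reduced to a self-contained helper; the Lucene calls are left out):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Read a Reader's contents with the stream closed in finally, so an
// IOException mid-read cannot leak the underlying file handle.
public class SafeRead {
    public static String readAll(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        try {
            StringBuffer sb = new StringBuffer();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
            return sb.toString();
        } finally {
            reader.close();   // also closes the wrapped Reader/stream
        }
    }

    public static void main(String[] args) throws IOException {
        // prints: abcdef
        System.out.println(readAll(new StringReader("abc\ndef")));
    }
}
```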

-Original Message-
From: Jin, Ying [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 10, 2004 11:15 AM
To: [EMAIL PROTECTED]
Subject: OutOfMemoryError with Lucene 1.4 final

Hi, Everyone,

 

We're trying to index ~1500 archives but get OutOfMemoryError about
halfway through the index process. I've tried to run program under two
different Redhat Linux servers: One with 256M memory and 365M swap
space. The other one with 512M memory and 1G swap space. However, both
got OutOfMemoryError at the same place (at record 898). 

 

Here is my code for indexing:

===

Document doc = new Document();
doc.add(Field.UnIndexed("path", f.getPath()));
doc.add(Field.Keyword("modified",
                      DateField.timeToString(f.lastModified())));
doc.add(Field.UnIndexed("eprintid", id));
doc.add(Field.Text("metadata", metadata));

FileInputStream is = new FileInputStream(f);  // the text file
BufferedReader reader = new BufferedReader(new InputStreamReader(is));

StringBuffer stringBuffer = new StringBuffer();
String line = "";
try{
  while((line = reader.readLine()) != null){
    stringBuffer.append(line);
  }
  doc.add(Field.Text("contents", stringBuffer.toString()));
  // release the resources
  is.close();
  reader.close();
}catch(java.io.IOException e){}

=

Is there anything wrong with my code or I need more memory?

 

Thanks for any help!

Ying


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Jin, Ying
Hi, Everyone,

 

We're trying to index ~1500 archives but get OutOfMemoryError about
halfway through the index process. I've tried to run program under two
different Redhat Linux servers: One with 256M memory and 365M swap
space. The other one with 512M memory and 1G swap space. However, both
got OutOfMemoryError at the same place (at record 898). 

 

Here is my code for indexing:

===

Document doc = new Document();
doc.add(Field.UnIndexed("path", f.getPath()));
doc.add(Field.Keyword("modified",
                      DateField.timeToString(f.lastModified())));
doc.add(Field.UnIndexed("eprintid", id));
doc.add(Field.Text("metadata", metadata));

FileInputStream is = new FileInputStream(f);  // the text file
BufferedReader reader = new BufferedReader(new InputStreamReader(is));

StringBuffer stringBuffer = new StringBuffer();
String line = "";
try{
  while((line = reader.readLine()) != null){
    stringBuffer.append(line);
  }
  doc.add(Field.Text("contents", stringBuffer.toString()));
  // release the resources
  is.close();
  reader.close();
}catch(java.io.IOException e){}

=

Is there anything wrong with my code or I need more memory?

 

Thanks for any help!

Ying



RE: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Otis Gospodnetic
Ying,

You should follow this finally block advice below.  In addition, I
think you can just close the reader, and it will close the underlying
stream (I'm not sure about that, double-check it).

You are not running out of file handles, though.  Your JVM is running
out of memory.  You can play with:

1) -Xms and -Xmx JVM command-line parameters
2) IndexWriter's parameters: mergeFactor and minMergeDocs - check the
Javadocs for more info.  They will let you control how much memory your
indexing process uses.

Otis
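For point 1, the command line looks like this (the sizes are just examples; tune them to your machine):

```shell
# Example JVM heap settings: start the heap at 256 MB, allow growth to 512 MB.
java -Xms256m -Xmx512m MyIndexer
```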


--- Sildy Augustine [EMAIL PROTECTED] wrote:

 I think you should close your files in a finally clause in case of
 exceptions with file system and also print out the exception. 
 
 You could be running out of file handles.
 
 -Original Message-
 From: Jin, Ying [mailto:[EMAIL PROTECTED] 
 Sent: Friday, December 10, 2004 11:15 AM
 To: [EMAIL PROTECTED]
 Subject: OutOfMemoryError with Lucene 1.4 final
 
 Hi, Everyone,
 
  
 
 We're trying to index ~1500 archives but get OutOfMemoryError about
 halfway through the index process. I've tried to run program under
 two
 different Redhat Linux servers: One with 256M memory and 365M swap
 space. The other one with 512M memory and 1G swap space. However,
 both
 got OutOfMemoryError at the same place (at record 898). 
 
  
 
 Here is my code for indexing:
 
 ===
 
 Document doc = new Document();
 doc.add(Field.UnIndexed("path", f.getPath()));
 doc.add(Field.Keyword("modified",
                       DateField.timeToString(f.lastModified())));
 doc.add(Field.UnIndexed("eprintid", id));
 doc.add(Field.Text("metadata", metadata));
 
 FileInputStream is = new FileInputStream(f);  // the text file
 BufferedReader reader = new BufferedReader(new InputStreamReader(is));
 
 StringBuffer stringBuffer = new StringBuffer();
 String line = "";
 try{
   while((line = reader.readLine()) != null){
     stringBuffer.append(line);
   }
   doc.add(Field.Text("contents", stringBuffer.toString()));
   // release the resources
   is.close();
   reader.close();
 }catch(java.io.IOException e){}
 
 =
 
 Is there anything wrong with my code or I need more memory?
 
  
 
 Thanks for any help!
 
 Ying
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Xiangyu Jin

Ok, I see. Seems most people think it is the third possibility.

On Fri, 10 Dec 2004, Xiangyu  Jin wrote:


 I am not sure, but I'd guess there are three possibilities:

 (1) You use
 Field.Text("contents", stringBuffer.toString())
 This stores the whole text string in the Document object,
 and it might be long ...

 I do not know the details of how Lucene is implemented.
 You could try an unstored field first to see
 if the same problem happens.

 BTW, how large are your documents? Mine is 1M docs with
 max length less than 1 MB, usually several KB each.

 (2) Another possibility is that record 898 is a very long
 document; maybe Java's String object has a max length?
 Just trace the code and see where the exception occurs.

 (3) Moreover, if you run on a Java VM, the VM has its own
 memory limit, which has nothing to do with the hardware
 you are running on. I hit this before when I used the directory's
 list-of-files function, which easily exceeded the max memory when
 there were 1M docs under the same dir (a stupid mistake I made).
 But after I expanded the VM's memory, it was OK.

 :)





 On Fri, 10 Dec 2004, Jin, Ying wrote:

  Hi, Everyone,
 
 
 
  We're trying to index ~1500 archives but get OutOfMemoryError about
  halfway through the index process. I've tried to run program under two
  different Redhat Linux servers: One with 256M memory and 365M swap
  space. The other one with 512M memory and 1G swap space. However, both
  got OutOfMemoryError at the same place (at record 898).
 
 
 
  Here is my code for indexing:
 
  ===
 
  Document doc = new Document();
  doc.add(Field.UnIndexed("path", f.getPath()));
  doc.add(Field.Keyword("modified",
                        DateField.timeToString(f.lastModified())));
  doc.add(Field.UnIndexed("eprintid", id));
  doc.add(Field.Text("metadata", metadata));
 
  FileInputStream is = new FileInputStream(f);  // the text file
  BufferedReader reader = new BufferedReader(new InputStreamReader(is));
 
  StringBuffer stringBuffer = new StringBuffer();
  String line = "";
  try{
    while((line = reader.readLine()) != null){
      stringBuffer.append(line);
    }
    doc.add(Field.Text("contents", stringBuffer.toString()));
    // release the resources
    is.close();
    reader.close();
  }catch(java.io.IOException e){}
 
  =
 
  Is there anything wrong with my code or I need more memory?
 
 
 
  Thanks for any help!
 
  Ying
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: sorting tokenized field

2004-12-10 Thread Aviran
I have suggested a solution for this problem
(http://issues.apache.org/bugzilla/show_bug.cgi?id=30382); you can apply the
patch suggested there and recompile Lucene.


Aviran
http://www.aviransplace.com

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 10, 2004 13:53 PM
To: Lucene Users List
Subject: Re: sorting tokenized field



On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
 I read that tokenised fields cannot be sorted. In order to sort a
 tokenized field, the application either has to duplicate the field with
 a different name and not tokenize it, or come up with something else.
 But shouldn't the search engine take care of this? Are there any plans
 to build this functionality into Lucene?

It would be wasteful for Lucene to assume any field you add should be 
available for sorting.

Adding one more line to your indexing code to accommodate your sorting 
needs seems a pretty small price to pay.  Do you have suggestions to 
improve how this works?   Or how it is documented?

Erik
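The one extra line Erik mentions is an untokenized twin of the field. Roughly, from memory of the 1.4 API (field names are just examples, so double-check against the Javadocs):

```java
// Index the title twice: tokenized for searching, untokenized for sorting.
doc.add(Field.Text("title", title));        // searchable, tokenized
doc.add(Field.Keyword("titleSort", title)); // one term per doc: sortable

// At search time, sort on the untokenized copy:
Hits hits = searcher.search(query, new Sort("titleSort"));
```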


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


MultiSearcher close

2004-12-10 Thread Ravi
 If I close a MultiSearcher, does it close all the associated searchers
too? I was getting a bad file descriptor error, if I close the
MultiSearcher object and open it again for another search without
reinstantiating the underlying searchers. 

Thanks in advance,
Ravi

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: MultiSearcher close

2004-12-10 Thread Erik Hatcher
On Dec 10, 2004, at 4:16 PM, Ravi wrote:
 If I close a MultiSearcher, does it close all the associated searchers
too?
It sure does:
  public void close() throws IOException {
    for (int i = 0; i < searchables.length; i++)
  searchables[i].close();
  }

 I was getting a bad file descriptor error, if I close the
MultiSearcher object and open it again for another search without
reinstantiating the underlying searchers.
Thanks in advance,
Ravi
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-10 Thread David Spencer
Google just came out with a page that gives you feedback as to how many 
pages will match your query and variations on it:

http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago 
that this has inspired me to expose - it's not the same, but it's 
similar in that as you type in a query you're given *immediate* feedback 
as to how many pages match.

Try it here: http://www.searchmorph.com/kat/isearch.html
This is my SearchMorph site which has an index of ~90k pages of open 
source javadoc packages.

As you type in a query, on every keystroke it does at least one Lucene 
search to show results in the bottom part of the page.

It also gives spelling corrections (using my NGramSpeller 
contribution) and also suggests popular tokens that start the same way 
as your search query.

For one way to see corrections in action, type in "rollback" character 
by character (don't do a cut and paste).

Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole 
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and 
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it

What's nice is when you get used to immediate results, going back to the 
batch way of searching seems backward, slow, and old fashioned.

There are too many idle CPUs in the world - this is one way to keep them 
busier :)

-- Dave
PS Weblog entry updated too: 
http://www.searchmorph.com/weblog/index.php?id=26



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]