Re: BooleanQuery - Too Many Clauses on date range.

2004-10-04 Thread Stephane James Vaucher
BTW, what's wrong with the DateFilter solution I mentioned earlier?

I've used it before (before lucene-1.4 though) without memory problems,
thus I always assumed that it avoided the allocation problems with prefix
queries.

sv

On Mon, 4 Oct 2004, Chris Fraschetti wrote:

 Surely some folks out there have used lucene on a large scale and have
 had to compensate for this somehow, any other solutions? Morus, thank
 you very much for your input, and I am looking into your solution,
 just putting my feelers out there once more.

 The lucene API is very limited as to its descriptions of its
 components; short of digging into the code, is there a good doc
 somewhere out there that explains the workings of lucene?


 On Mon, 4 Oct 2004 01:57:06 -0700, Chris Fraschetti
 [EMAIL PROTECTED] wrote:
  So before I spend a significant amount of time digging into the lucene
  code, how does your experience with lucene give light to my
  situation? Our current index is pretty huge, and with each
  increase in size I've had, I've experienced a problem like this...
  Without taking up too much of your time.. because obviously this is my
  task, I thought I'd ask you if you'd had any experience with this
  boolean clause nonsense...  of course it can be overcome, but if you
  know a quick hack, awesome, otherwise.. no big, but off to work I go
  :)
 
  -Fraschetti
 
 
  -- Forwarded message --
  From: Morus Walter [EMAIL PROTECTED]
  Date: Mon, 4 Oct 2004 09:01:50 +0200
  Subject: Re: BooleanQuery - Too Many Clauses on date range.
  To: Lucene Users List [EMAIL PROTECTED], Chris
  Fraschetti [EMAIL PROTECTED]
 
  Chris Fraschetti writes:
   So I decided to move my epoch date to the 20040608 date, which fixed
   my boolean query problem in regard to my current data size (approx
   600,000)

   but now as soon as I do a query like ...  a*
   I get the boolean error again. Google obviously can handle this query,
   and I'm pretty sure lucene can handle it.. any ideas? With or
   without a date range specified I still get the TooManyClauses error.
 
 
   I tried cranking the maxclauses up to Integer.MAX_VALUE, but java gave me
   an out of memory error. Is this b/c the boolean search tried to
   allocate that many clauses by default or because my query actually
   needed that many clauses?
 
  boolean search allocates clauses for all tokens having the prefix or
  matching the wildcard expression.
 
   Why does it work on small indexes but not
   large?
  Because there are fewer tokens starting with a.
 
   Is there any way to have the parser create as many clauses as
   it can and then search with what it has? w/o recompiling the source?
  
  You need to create your own version of Wildcard- and Prefix-Query
  that takes a maximum term number and ignores further clauses.
  And you need a variant of the query parser that uses these queries.
 
  This can be done, even without recompiling lucene, but you will have to
  do some programming at the level of lucene queries.
  Shouldn't be hard, since you can use the sources as a starting point.
 
   I guess this does not exist because the lucene developers decided to prefer
   a query error to incomplete results.
 
  Morus
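
A minimal sketch of the capped expansion Morus describes, assuming the Lucene 1.4 TermEnum API; the class name and the cap are illustrative, not from the thread:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CappedPrefixExpander {
  /** Expands prefix* on a field into at most maxTerms optional TermQuery
   *  clauses, silently ignoring any further matching terms. */
  public static Query expand(IndexReader reader, String field,
                             String prefix, int maxTerms) throws IOException {
    BooleanQuery query = new BooleanQuery();
    TermEnum terms = reader.terms(new Term(field, prefix));
    try {
      int count = 0;
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)
            || !t.text().startsWith(prefix)) {
          break;                                   // past the prefix range
        }
        if (count++ >= maxTerms) {
          break;                                   // cap reached: drop the rest
        }
        query.add(new TermQuery(t), false, false); // optional, not prohibited
      } while (terms.next());
    } finally {
      terms.close();
    }
    return query;
  }
}

A query-parser variant would then call something like this instead of letting the stock prefix expansion add every matching term as a clause.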
 
 
  --
  ___
  Chris Fraschetti, Student CompSci System Admin
  University of San Francisco
  e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
 









Re: BooleanQuery - Too Many Clauses on date range.

2004-10-04 Thread Stephane James Vaucher
Ok, got it; a small comment though.

For large wildcard queries, please note that google does not support
wildcards. Search for hell*, and there will be no matches for hello.

Is there a reason why you wish to allow such large queries? We might
be able to find alternative ways of helping you out. No one will use a
query a*. If someone does, the results would be completely meaningless
(many false positives for a user). However a query like program* might be
interesting to a user.

The problem with hacking term expansion is that the rules of this
expansion might be hard to define (as in: maybe one should use the
first terms found, the most frequent terms, or even the least frequent,
depending on your app).

sv
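
If the "most frequent terms" rule were chosen, one hedged way to implement it against the Lucene 1.4 API is to collect the matching terms with their document frequencies and keep the top N; pre-generics collections to match the era, and all names are illustrative:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class FrequentTermSelector {
  /** Returns the maxTerms most frequent index terms starting with prefix. */
  public static Term[] mostFrequent(IndexReader reader, String field,
                                    String prefix, int maxTerms)
      throws IOException {
    List candidates = new ArrayList();  // elements: Object[]{Term, Integer}
    TermEnum terms = reader.terms(new Term(field, prefix));
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)
            || !t.text().startsWith(prefix)) break;
        candidates.add(new Object[] { t, new Integer(terms.docFreq()) });
      } while (terms.next());
    } finally {
      terms.close();
    }
    Collections.sort(candidates, new Comparator() {
      public int compare(Object a, Object b) {     // descending docFreq
        return ((Integer) ((Object[]) b)[1]).intValue()
             - ((Integer) ((Object[]) a)[1]).intValue();
      }
    });
    int n = Math.min(maxTerms, candidates.size());
    Term[] result = new Term[n];
    for (int i = 0; i < n; i++) {
      result[i] = (Term) ((Object[]) candidates.get(i))[0];
    }
    return result;
  }
}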

On Mon, 4 Oct 2004, Chris Fraschetti wrote:

 The date portion of my code works great now, no problems there, so
 let me thank you now for your date filter solution... but my current
 problem is in regard to a stand-alone  a*  query giving me
 the too many clauses exception


  On Mon, 4 Oct 2004 12:47:24 -0400 (EDT), Stephane James Vaucher
  [EMAIL PROTECTED] wrote:
   BTW, what's wrong with the DateFilter solution I mentioned earlier?
   [...]







Re: BooleanQuery - Too Many Clauses on date range.

2004-10-04 Thread Stephane James Vaucher
I've used a simple message saying that the user's request was too vague and
that he should modify it. I haven't had too many complaints about this,
especially when I explained why to a client:

If one user of many does a*, the whole system will grind to a halt as that
one request will use up all of the available memory (wildcards aren't very
scalable...).

Here is an example of a working system:
http://theserverside.com/search/search.tss

I don't know if many people complain that no results appear when they do
a*, but a request for javap* returns javapro, javaplus, javapolis...

HTH,
sv

On Mon, 4 Oct 2004, Chris Fraschetti wrote:

 Absolutely, limiting the user's query is no problem here. I've
 currently implemented the lucene javascript to catch a lot of user
 queries that could cause issues.. blank queries, ? or * at the
 beginning of a query, etc etc... but I couldn't think of a way to
 prevent the user from doing  a*  but not  comment*  (wanting comments
 or commentary)...  any suggestions would be warmly welcomed.
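
One simple server-side guard along the lines Chris describes: reject any term whose literal prefix before the first wildcard is shorter than a minimum, so a* fails but comment* passes. A hedged sketch; the threshold of 3 is an arbitrary assumption to tune per corpus:

public class WildcardGuard {
  private static final int MIN_PREFIX = 3;  // assumed threshold

  /** Accepts comment* style terms; rejects a*, *foo, ab? and the like. */
  public static boolean isAcceptable(String userQuery) {
    String[] words = userQuery.trim().split("\\s+");
    for (int i = 0; i < words.length; i++) {
      int wild = indexOfWildcard(words[i]);
      if (wild >= 0 && wild < MIN_PREFIX) {
        return false;  // too little literal text before the wildcard
      }
    }
    return true;
  }

  private static int indexOfWildcard(String w) {
    for (int i = 0; i < w.length(); i++) {
      char c = w.charAt(i);
      if (c == '*' || c == '?') return i;
    }
    return -1;
  }
}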


  On Mon, 4 Oct 2004 14:08:00 -0400 (EDT), Stephane James Vaucher
  [EMAIL PROTECTED] wrote:
   Ok, got it; a small comment though.
   [...]

Re: WildCardQuery

2004-10-04 Thread Stephane James Vaucher
On Fri, 1 Oct 2004, Robinson Raju wrote:

 analyzer is StandardAnalyzer.
 I use MultiFieldQueryParser to parse.

 The flow is this:
 I have indexed a database view. Now I need to search against a few columns:
 I take in the search criteria and search field,
 construct a wildcard query and add it to a boolean query

 WildcardQuery wQuery = new WildcardQuery(new Term(searchFields[0],
 searchString));

What is the value of searchString? Is it a word? QueryParser syntax is not
applied here.
What does ab* return?

 booleanQuery.add(wQuery, true, false);
 Query queryFilter = MultiFieldQueryParser.parse(filterString,
 filterFields, flags, analyzer);
 hits = parallelMultiSearcher.search(booleanQuery, queryFilter);

 when I don't use wildcards, it is parsed as
 +((ITM_SHRT_DSC:natal ITM_SHRT_DSC:tylenol) (ITM_LONG_DSC:natal
 ITM_LONG_DSC:tylenol))
 But when a wildcard is used, it is parsed as
   +ITM_SHRT_DSC:nat* tylenol +ITM_LONG_DSC:nat* Tylenol

ITM_XXX fields are tokenized?

sv

 the first returns around 300 records, the second, 0.

 any help would be appreciated
 Thanks
 Robin
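
One thing worth illustrating here, since QueryParser processing is not applied to a hand-built WildcardQuery: the term reaches the index lookup exactly as constructed, while StandardAnalyzer lowercases terms at index time, so mixed-case input such as Nat* or Tylenol inside the wildcard clause can never match. A hedged sketch of the usual workaround (names illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class WildcardHelper {
  /** Wildcard terms bypass the analyzer, so match the index's case:
   *  StandardAnalyzer lowercases everything it indexes. */
  public static Query wildcard(String field, String userText) {
    return new WildcardQuery(new Term(field, userText.toLowerCase()));
  }
}

With that, wildcard("ITM_SHRT_DSC", "Nat*") actually searches for nat*.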

  On Fri, 1 Oct 2004 02:06:04 -0400 (EDT), Stephane James Vaucher
  [EMAIL PROTECTED] wrote:
   Can you be a little more precise about how you process your documents?
   [...]









RE: WildCardQuery

2004-10-01 Thread Stephane James Vaucher
Can you be a little more precise about how you process your documents?

1) What's your analyser? SimpleAnalyzer?
2) How do you parse the query? Out-of-the-box QueryParser?

 can we not enter space or do an OR search with two words one of which
 has a wildcard ?

Simple answer, yes.

Complicated answer: words are delimited by your tokeniser, which is part
of your analyser (hence my question above). The asterisk syntax comes
from using a query parser that transforms the query into a PrefixQuery
object.

sv

On Fri, 1 Oct 2004, Robinson Raju wrote:

 Hi,
 Would there be a problem if one enters a space while using wildcards?
 Say I search for 'abc': I get 100 hits as results.
 'man' gives 200.
 'abc man' gives 300.
 But
 'ab* man'
 'abc ma*'
 'ab* ma*'
 'ab* OR ma*'
 ..
 all of these return 0 results.
 Can we not enter a space, or do an OR search with two words one of which
 has a wildcard?

 Regards,
 Robin







Re: BooleanQuery - Too Many Clauses on date range.

2004-09-30 Thread Stephane James Vaucher
How about a DateFilter?

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DateFilter.html

I don't believe it's got the same restrictions as boolean queries.

HTH,
sv
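
A hedged sketch of the suggestion, assuming the date field was indexed with the DateField encoding that DateFilter expects; the field names are borrowed from Chris's query, everything else is illustrative:

import java.util.Date;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.DateFilter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DateFilterSearch {
  public static Hits search(String indexDir, String queryText,
                            Date from, Date to) throws Exception {
    IndexSearcher searcher = new IndexSearcher(indexDir);
    Query query = QueryParser.parse(queryText, "content_field",
                                    new StandardAnalyzer());
    // A filter restricts the result set without adding boolean clauses,
    // so no TooManyClauses no matter how many days the range spans.
    DateFilter filter = DateFilter.Between("date_field", from, to);
    return searcher.search(query, filter);
  }
}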

On Thu, 30 Sep 2004, Chris Fraschetti wrote:

 I recently read, in regard to my problem, that date_field:[0820483200
 TO 110448]
 is evaluated into a series of boolean queries ... which has a cap of
 1024 ... considering my documents will have dates spanning many
 years, and i need the granularity of 'by day' searching, are there
 any recommendations on how to make this work?

 Currently with query: +content_field:sometext +date_field:[0820483200
 TO 110448]
 I get the following exception:
 org.apache.lucene.search.BooleanQuery$TooManyClauses


 any suggestions on how I can still keep the granularity of by day, but
 without limiting my search results? Are there any date formats that I
 can change those numbers to that would allow me to complete the search
 (i.e.  Feb 15, 2004 ) .. can lucene's range do a proper search on
 formatted dates?

 Is there a combination of RangeQuery and Query/MultiTermQuery that I can use?

 your help is greatly appreciated.








Re: WordListLoader's whereabouts

2004-09-27 Thread Stephane James Vaucher
Hi Tate,

From the commit:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06510.html

I'd say you can use the german WordlistLoader (renaming it or using a
nightly cvs version of the refactored class). I think there might be a
versioning issue here; from:

http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard

it is mentioned that:
DONE: Move language-specific analyzers into separate downloads. Also move
analysis/de/WordlistLoader.java one level upwards, as it's not specific to
German at all.

That should only be applicable to lucene 1.9... Last version comment for
BrazilianAnalyzer:

move the word list loader from analysis.de to analysis, as it is not
specific to German at all; update the references to it

HTH,
sv

On Mon, 27 Sep 2004, Tate Avery wrote:

 Hello,

 I am trying to compile the analyzers from the Lucene sandbox
 contributions.  Many of them seem to import
 org.apache.lucene.analysis.WordlistLoader which is not currently in my
 classpath.

 Does anyone know where I can find this class?  It does not appear to be in Lucene 
 1.4, so I am assuming it is another contribution perhaps?  Any help in tracking it 
 down would be appreciated.

 Also, some of the analyzers appear to have their own copy of this class
 (i.e. org.apache.lucene.analysis.nl.WordlistLoader).  Could I just
 relocate that one to the shared package, perhaps?

 Thanks,
 Tate







Re: indexing size

2004-09-01 Thread Stephane James Vaucher
Hi Niraj,

I'd rather respond to the list as others may be interested in your
questions, and since I don't consider myself a guru, I appreciate being
corrected.

For a title, I'd say yes, use the Field.Text(String name, String value)
factory method, not the variants that take a Reader, as those do not
store the value.

You want the field to be:
1) tokenised (so its individual terms are saved for searching, not only
the text as a whole)
2) indexed, to make it searchable
3) stored, to make the field retrievable from the index

hth,
sv
p.s. my name is Stephane; I haven't been called James since I was in Oz
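
For reference, a small sketch summarizing the four Lucene 1.4 Field factory methods being discussed; the field names are made up:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldChoices {
  public static Document build(String title, String content, String path) {
    Document doc = new Document();
    // stored + indexed + tokenized: searchable AND retrievable (the title case)
    doc.add(Field.Text("title", title));
    // indexed + tokenized, NOT stored: searchable only, keeps the index small
    doc.add(Field.UnStored("content", content));
    // stored + indexed, NOT tokenized: exact-match key, retrievable as-is
    doc.add(Field.Keyword("path", path));
    // stored only, NOT indexed: retrievable but never searchable
    doc.add(Field.UnIndexed("rawPath", path));
    return doc;
  }
}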

On Wed, 1 Sep 2004, Niraj Alok wrote:

 Hi James,

 Since this is a minor issue, I am not posting it on the lucene list.

 Let's say I have one field title which has a value of George Bush.
 I would need to search on that title and also retrieve its value. So you are
 saying that I should have it as Field.Text?

 Also, if I need to just search on that title but want to retrieve the
 value of another field content, then title should be unstored while
 content should be stored?

 Regards,
 Niraj
 - Original Message -
 From: Stephane James Vaucher [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, September 01, 2004 10:59 AM
 Subject: Re: indexing size


  [...]






Re: indexing size

2004-08-31 Thread Stephane James Vaucher
On Wed, 1 Sep 2004, Niraj Alok wrote:
 I was also thinking along the same lines.
 Actually the original code was written by someone else who has left, and so
 I have to own this.

 At almost all the places it is Field.Text, and at a few places it's
 Field.UnIndexed.
 I looked at the javadocs and found that there is Field.UnStored also.

 The problem is I am not too sure which one to change to what. It would be
 really enlightening if you could point out the differences
 between those three and what I would need to change in my search code.

 If I make some of them Field.UnStored, I can see from the javadocs that
 it will be indexed and tokenized but not stored. If it is not stored,
 how can I use it while searching? Basically, what is meant by indexed and
 stored, indexed and not stored, and not indexed and stored?

If all you need is to search a field, you do not need to store it. If it
is not stored it can still be tokenised and analysed by lucene. It will
then be stored only as a set of tokens, not as a whole. You can thus use
this for fields that you never need to retrieve from the index.

For example:
the quick brown fox jumped over the lazy dog.

will be stored in lucene only as tokens, not as a whole. So, using a
whitespace analyser with a stopword list {the}, you will have these
tokens in lucene:
quick
brown
fox
jumped
over
lazy
dog

You will NOT be able to retrieve the original text, but you will be able
to search it.

HTH,
sv
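
To see exactly which tokens an analyzer would keep, here is a hedged snippet using Lucene 1.4's StopAnalyzer (which lowercases as well) with a one-word stop list standing in for the example above:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class ShowTokens {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StopAnalyzer(new String[] { "the" });
    TokenStream stream = analyzer.tokenStream("contents",
        new StringReader("the quick brown fox jumped over the lazy dog"));
    // prints: quick brown fox jumped over lazy dog (one token per line)
    for (Token t = stream.next(); t != null; t = stream.next()) {
      System.out.println(t.termText());
    }
  }
}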


 Regards,
 Niraj
 - Original Message -
 From: petite_abeille [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, August 31, 2004 8:57 PM
 Subject: Re: indexing size


 
  On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote:
 
   You also have a large number of
   fields, and it looks like a lot (all?) of them are stored and indexed.
   That's what that large .fdt file indicated.  That file is  206 MB in
   size.
 
  Try using Field.UnStored() to avoid storing all those data in your
  indices as it's usually not necessary.
 
  PA.
 
 
 
 






RE: Range query problem

2004-08-27 Thread Stephane James Vaucher
A description on how to search numerical fields is available on the wiki:

http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

sv
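
A sketch of the padding scheme that the wiki page and Daniel's message below describe; the four-digit width is an assumption matching values up to 9999:

import java.text.DecimalFormat;

public class PaddedNumbers {
  private static final DecimalFormat FOUR_DIGITS = new DecimalFormat("0000");

  /** 1 -> 0001, 42 -> 0042: lexicographic order now matches numeric order. */
  public static String pad(int value) {
    return FOUR_DIGITS.format(value);
  }
}

Index with doc.add(Field.Keyword("PERIOD", pad(value))) and query with the same padding, e.g. PERIOD:[0001 TO 0010], so that 10 no longer sorts before 9.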

On Thu, 26 Aug 2004, Alex Kiselevski wrote:


 Thanks, I'll try it

 -Original Message-
 From: Daniel Naber [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 26, 2004 12:59 PM
 To: Lucene Users List
 Subject: Re: Range query problem


 On Thursday 26 August 2004 11:02, Alex Kiselevski wrote:

  I have a strange problem with range query PERIOD:[1 TO 9]. It works
  only if the second parameter is equal to or less than 9; if it's
  greater than 9, it finds no documents.

 You have to store your numbers so that they will appear in the right
 order when sorted lexicographically, e.g. save 1 as 01 if you save
 numbers up to 99, or as 0001 if you save numbers up to 9999. You also
 have to use this format for searching, I think.

 Regards
  Daniel

 --
 http://www.danielnaber.de










Re: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
You'll have to give us more information than that...

What is the problem you are seeing? I'll assume that you get no results.

Tell us about the structure of your documents and how you index each field.

Concerning your syntax: if you are using the distributed query parser, you
don't need the + before name, nor the + before university, as they will be
added by the parser.

sv

On Wed, 25 Aug 2004, Alex Kiselevski wrote:


 Hi, pls,
 Tell me what is wrong with query:
 author:( +name AND full name~) AND book:( +university)


 Alex Kiselevsky
 Speech Technology     Tel: 972-9-776-43-46
 R&D, Amdocs - Israel  Mobile: 972-53-63 50 38
 mailto:[EMAIL PROTECTED]









RE: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
From: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Fuzzy Searches

Lucene supports fuzzy searches based on the Levenshtein Distance, or
Edit Distance algorithm. To do a fuzzy search use the tilde, ~, symbol
at the end of a Single word Term.

I haven't used fuzzy searches, but this seems to indicate that they can only
be used with single-word terms. The query parser might well have been written
to enforce that (the parser output below indicates as much).

HTH,
sv
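
A hedged illustration of that restriction with the 1.4-era static parse method; the field and term values are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class FuzzySyntax {
  public static void main(String[] args) throws Exception {
    // legal: the tilde is attached to a single term
    Query q = QueryParser.parse("author:(+name +nam~) AND book:(+university)",
                                "contents", new StandardAnalyzer());
    System.out.println(q);
    // a tilde after a multi-word target is what produces the
    // ParseException quoted below
  }
}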

On Wed, 25 Aug 2004, Alex Kiselevski wrote:


 I use QueryParser
 And I got an exception :
 org.apache.lucene.queryParser.ParseException: Encountered ~ at line 1,
 column 44.
 Was expecting one of:
 AND ...
 OR ...
 NOT ...
 + ...
 - ...
 ( ...
 ) ...
 ^ ...
 QUOTED ...
 TERM ...
 SLOP ...
 PREFIXTERM ...
 WILDTERM ...
 [ ...
 { ...
 NUMBER ...

 at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1045)
 at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:925)
 at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562)
 at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
 at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
 at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89)
 at com.stp.test.CVTest.main(CVTest.java:223)

 -Original Message-
 From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 25, 2004 10:07 AM
 To: Lucene Users List
 Subject: Re: what is wrong with query


 [...]






Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
I don't think that the demo parser is meant to be a production
system component. You can look at Tidy or NekoHTML; they clean up your
HTML and are probably optimised.

sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

 Hello all,
 
 Is there a way to reduce the indexing time taken when the indexer is
 indexing about 30,000+ files? It is roughly taking around 6-7 hours to
 do this. I am using the IndexHTML class to create the index out of HTML files.
 
 Another issue that I see is every once in a while I get the following 
 output on the screen.
 
 adding ../31/1104852.html
 Parse Aborted: Encountered \ at line 7, column 1.
 Was expecting one of:
  ArgName ...
  = ...
  TagEnd ...
 
 Any suggestions on preventing this from happening?
 
 Thanks in advance.
 -H
 
 
 





Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
JGuru explanation: 
http://www.jguru.com/faq/view.jsp?EID=1074228

I have no sample code for neko; I think nutch uses it though. For tidy,
you can look at ant in the sandbox:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup

HTH,
sv
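
For what a Tidy-based extractor might look like, a hedged sketch using JTidy's DOM parse, similar in spirit to the sandbox HtmlDocument linked above; all names are illustrative:

import java.io.FileInputStream;
import java.io.InputStream;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TidyText {
  /** Parses messy HTML into a DOM and concatenates its text nodes. */
  public static String extract(String htmlFile) throws Exception {
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);  // don't flood stdout on bad markup
    InputStream in = new FileInputStream(htmlFile);
    org.w3c.dom.Document dom = tidy.parseDOM(in, null);
    in.close();
    StringBuffer text = new StringBuffer();
    collectText(dom, text);
    return text.toString();
  }

  private static void collectText(Node node, StringBuffer out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue()).append(' ');
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), out);
    }
  }
}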

On Wed, 25 Aug 2004, Hetan Shah wrote:

 Do you have any pointers for sample code for them?
 Would highly appreciate it.
 Thanks.
 -H
 
 Stephane James Vaucher wrote:
 
   [...]
 
 
 





RE: Time to index documents

2004-08-25 Thread Stephane James Vaucher
Hetan,

If you are using a corpus with multiple editors, I suggest that you 
use a cleaner like tidy as there might be weird stuff appearing in the 
html.

sv

On Thu, 26 Aug 2004, Karthik N S wrote:

 Hi Hetan

 That's the major problem of the non-standardized tags in the HTML documents
 you are indexing, resulting in the lag time taken by the indexing process.

 You can tweak the HTMLParser.jj file within lucene.zip, under '/demo/html'
 [you have to have some knowledge of JavaCC for this].

 
 Karthik
 
 -Original Message-
 From: Hetan Shah [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 26, 2004 3:01 AM
 To: Lucene Users List
 Subject: Time to index documents
 
 
 Hello all,
 [...]
 
 
 





Re: Lucene PDF indexing

2004-08-24 Thread Stephane James Vaucher
You need to add log4j to your classpath:

http://logging.apache.org/log4j/docs/

sv
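
For example, a hedged Windows-style invocation; the jar names and locations are assumptions, adjust to the actual files:

java -classpath .;lucene-1.4.jar;PDFBox-0.6.6.jar;log4j.jar IndexPDF -create -index c:\lucene\pdf c:\pdfs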

On 24 Aug 2004, sivalingam T wrote:

 Hi

I have written one file for PDF indexing, as follows.

This is my IndexPDF file.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

import org.pdfbox.searchengine.lucene.LucenePDFDocument;

import java.io.File;
import java.util.Date;
import java.util.Arrays;

class IndexPDF {
  private static boolean deleting = false;  // true during deletion pass
  private static IndexReader reader;        // existing index
  private static IndexWriter writer;        // new index being built
  private static TermEnum uidIter;          // document id iterator

  public static void main(String[] argv) {
    try {
      String index = "index";
      boolean create = false;
      File root = null;

      String usage = "IndexPDF [-create] [-index <index>] <root_directory>";

      if (argv.length == 0) {
        System.err.println("Usage: " + usage);
        return;
      }

      for (int i = 0; i < argv.length; i++) {
        if (argv[i].equals("-index")) {           // parse -index option
          index = argv[++i];
        } else if (argv[i].equals("-create")) {   // parse -create option
          create = true;
        } else if (i != argv.length-1) {
          System.err.println("Usage: " + usage);
          return;
        } else
          root = new File(argv[i]);
      }

      Date start = new Date();

      if (!create) {                              // delete stale docs
        deleting = true;
        indexDocs(root, index, create);
      }

      writer = new IndexWriter(index, new StandardAnalyzer(), create);
      writer.maxFieldLength = 100;

      indexDocs(root, index, create);             // add new docs

      System.out.println("Optimizing index...");
      writer.optimize();
      writer.close();

      Date end = new Date();

      System.out.print(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");

    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
                         "\n with message: " + e.getMessage());
    }
  }

  /* Walk directory hierarchy in uid order, while keeping uid iterator from
   * existing index in sync.  Mismatches indicate one of: (a) old documents to
   * be deleted; (b) unchanged documents, to be left alone; or (c) new
   * documents, to be indexed.
   */

  private static void indexDocs(File file, String index, boolean create)
      throws Exception {
    if (!create) {                                // incrementally update

      reader = IndexReader.open(index);           // open existing index
      uidIter = reader.terms(new Term("uid", "")); // init uid iterator

      indexDocs(file);

      if (deleting) {                             // delete rest of stale docs
        while (uidIter.term() != null && uidIter.term().field() == "uid") {
          // uid2url comes from the demo's HTMLDocument class
          System.out.println("deleting " +
                             HTMLDocument.uid2url(uidIter.term().text()));
          reader.delete(uidIter.term());
          uidIter.next();
        }
        deleting = false;
      }

      uidIter.close();                            // close uid iterator
      reader.close();                             // close existing index

    } else                                        // don't have existing index
      indexDocs(file);
  }

  private static void indexDocs(File file) throws Exception
  {
    if (file.isDirectory())
    {                                             // if a directory
      String[] files = file.list();               // list its files
      Arrays.sort(files);                         // sort the files
      for (int i = 0; i < files.length; i++)
      {                                           // recursively index them
        indexDocs(new File(file, files[i]));
      }

    }
    if ((file.getPath().endsWith(".pdf")) || (file.getPath().endsWith(".PDF")))
    {
      System.out.println(" Indexing PDF document: " + file);
      try
      {
        //Document doc = LucenePDFDocument.getDocument( file );
        writer.addDocument(LucenePDFDocument.getDocument(file));
      }
      catch (Exception e)
      {}                                          // note: swallows per-file indexing errors
    }
  }

}

when I use the following command, the exception below is thrown; if anybody
knows why, please inform me.


C:\java org.apache.lucene.demo.IndexPDF -create -index c:\lucene\pdf c:\pdfs\Words.pdf

Indexing PDF document: c:\pdfs\Words.pdf
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Category
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:197)
        at

Re: pdfboxhelp

2004-08-23 Thread Stephane James Vaucher
Your classpath should point to a directory that contains log4j.properties, 
not the file directly, see below.

sv
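
For reference, a minimal log4j.properties that gives PDFBox's loggers an appender; the WARN level and the pattern are assumptions:

# minimal log4j.properties so PDFBox's loggers have an appender
log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%p %c - %m%n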

On Mon, 23 Aug 2004, Santosh wrote:

 Hi natarajan,
 I kept log4j.properties in the classpath
 my new classpath is
 
 C:\j2sdk1.4.1\lib\log4j.properties;

should be C:\j2sdk1.4.1\lib\
 
 but there is no difference in the output
 
 
 - Original Message -
 From: Natarajan.T [EMAIL PROTECTED]
 To: 'Lucene Users List' [EMAIL PROTECTED]
 Sent: Monday, August 23, 2004 10:56 AM
 Subject: RE: pdfboxhelp
 
 
  Hi Santhosh,
 
  The attached file must be in your class path.
 
 
  Natarajan.
 
 
 
  -Original Message-
  From: Santosh [mailto:[EMAIL PROTECTED]
  Sent: Monday, August 23, 2004 10:51 AM
  To: Lucene Users List
  Subject: Fw: pdfboxhelp
 
  hi karthik,
  did u find any solution? should I send the pdf to u?
  - Original Message -
  From: Santosh [EMAIL PROTECTED]
  To: Lucene Users List [EMAIL PROTECTED]
  Sent: Monday, August 23, 2004 10:23 AM
  Subject: Re: pdfboxhelp
 
 
   hi karthik,
I kept log4j in the classpath , I am sending classpath variable
  
    CLASSPATH

    .;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;C:\j2sdk1.4.1\lib\webclient.jar;
    C:\j2sdk1.4.1\lib\mail.jar;C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;
    D:\JAVAPRO;C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;
    E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;
    C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;
    C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;C:\j2sdk1.4.1\lib\xercesImpl.jar;
    C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;
    C:\j2sdk1.4.1\lib\xml.jar;C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;
    F:\apache-ant-1.6.1\lib\ant.jar;C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;
    C:\j2sdk1.4.1\lib\lucene-20030909.jar;D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar
  
   please check the error
  
  
  
   - Original Message -
   From: Karthik N S [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Monday, August 23, 2004 10:26 AM
   Subject: RE: pdfboxhelp
  
  
Hi Santosh
   
   I think your PDF code is using the Log4j package; try to set the
   classpath for the log4j.jar path.

   [Is it just a WARNING or an ERROR you are getting?

   Send me your configuration; let me help you with it.]  ;[
   
   
Karthik
   
-Original Message-
From: Santosh [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 10:11 AM
To: Lucene Users List
Cc: Ben Litchfield
Subject: Re: pdfboxhelp
   
   
hi karthik,
   
I have downloaded pdfbox and kept the PDFBox jar file in the classpath, but
when I type the following command at the command prompt I get this error:

D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText C:\test.pdf C:\test.txt
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly

Why am I getting this error? Please help.
   
   
- Original Message -
From: Karthik N S [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 9:21 AM
Subject: RE: pdfboxhelp
   
   
 Hi


 To begin with, try to build indexes offline [out of the Tomcat container]
 and, on completing the indexes, feed your search with the real path of
 the offline indexed folder. Start Tomcat and then use the search on it.
 As you experiment you will get comfortable with the requirements of
 indexing / search.   ;[

 Karthik

 -Original Message-
 From: Santosh [mailto:[EMAIL PROTECTED]
 Sent: Saturday, August 21, 2004 4:55 PM
 To: Lucene Users List
 Subject: Re: pdfboxhelp


 Yes I did the same.
 I copied all the classes into the classes folder, but
 now when I am building the index using IndexHTML the PDFs are not added to
 this index; only text and html files are added to the index.
 What changes should I make to IndexHTML.java to build the index with PDFs?
 - Original Message -
 From: Karthik N S [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Saturday, August 21, 2004 4:54 PM
 Subject: RE: pdfboxhelp


  Hi

  If you are using the jar file with a web interface for JSP/servlet
  development, place the jar file in webapps/yourapplication/WEB-INF/lib
  and also correct the classpath for the present modification.

  2) Create your own package, put all your java files in it, and copy the
  java files to /WEB-INF/classes/yourpackage

  Then use the same..;{
 
 
  Karthik
 
  -Original Message-
  From: Santosh 

Re: Lucene Search Applet

2004-08-23 Thread Stephane James Vaucher
Hi Simon,

Does this work? From FSDirectory api:

If the system property 'disableLuceneLocks' has the String value of 
true, lock creation will be disabled.

Otherwise, I think there was a Read-Only Directory hack:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html

HTH,
sv
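
For a plain application (or a signed applet) the property route is a one-liner, as in this hedged sketch; note that unsigned applets cannot call System.setProperty either, which is why the source hack comes up below. The index path and term are made up:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ReadOnlySearch {
  public static void main(String[] args) throws Exception {
    // must run before FSDirectory is first loaded; the flag is read
    // in a static initializer (see the Boolean.getBoolean note below)
    System.setProperty("disableLuceneLocks", "true");
    IndexSearcher searcher = new IndexSearcher("C:/Java/lucene/index-list");
    Hits hits = searcher.search(new TermQuery(new Term("contents", "memorial")));
    System.out.println(hits.length() + " hits");
  }
}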

On Mon, 23 Aug 2004, Simon mcIlwaine wrote:

 Thanks Jon, that works by putting the jar file in the archive attribute. Now
 I'm getting the disable-lock error because of the unsigned applet. Do I just
 comment out the code wherever System.getProperty() appears in the
 files that you specified and then update the JAR archive? Is it possible
 you could show me one of the hacked files so that I know what I'm modifying?
 Does anyone else know if there is another way of doing this without having
 to hack the source code?
 
 Many thanks.
 
 Simon
 
 - Original Message - 
 From: Jon Schuster [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Saturday, August 21, 2004 2:08 AM
 Subject: Re: Lucene Search Applet
 
 
  I have Lucene working in an applet and I've seen this problem only when
  the jar file really was not available (typo in the jar name), which is
  what you'd expect. It's possible that the classpath for your
  application is not the same as the classpath for the applet; perhaps
  they're using different VMs or JREs from different locations.
 
  Try referencing the Lucene jar file in the archive attribute of the
  applet tag.
 
  Also, to get Lucene to work from an unsigned applet, I had to modify a
  few classes that call System.getProperty(), because the properties that
  were being requested were disallowed for applets. I think the classes
  were IndexWriter, FSDirectory, and BooleanQuery.
 
  --Jon
 
 
  On Aug 20, 2004, at 6:57 AM, Simon mcIlwaine wrote:
 
  I'm a new Lucene user and I'm not too familiar with applets either, but
  I've been doing a bit of testing on java applet security, and if I'm
  correct in saying that applets can read anything below their codebase,
  then my problem is not a security restriction one. The error is
  java.lang.NoClassDefFoundError, and the classpath is set, as I have it
  working in a Swing app. Does someone actually have Lucene working in an
  applet? Can it be done?? Please help.
  
   Thanks
  
   Simon
  
   - Original Message -
  
   From: Terry Steichen [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Wednesday, August 18, 2004 4:17 PM
   Subject: Re: Lucene Search Applet
  
  
   I suspect it has to do with the security restrictions of the applet,
   'cause
   it doesn't appear to be finding your Lucene jar file.  Also, regarding
   the
   lock files, I believe you can disable the locking stuff just for
   purposes
   like yours (read-only index).
  
   Regards,
  
   Terry
 - Original Message -
 From: Simon mcIlwaine
 To: Lucene Users List
 Sent: Wednesday, August 18, 2004 11:03 AM
 Subject: Lucene Search Applet
  
  
  I'm developing a Lucene CD-ROM based search which will search html
  pages on CD-ROM, using an applet as the UI. I know that there's a
  problem with lock files and also security restrictions on applets, so I
  am using the RAMDirectory. I have it working in a Swing application;
  however, when I put it into an applet it's giving me problems. It
  compiles, but when I go to run the applet I get the error below. Can
  anyone help? Thanks in advance.
 Simon
  
 Error:
  
 Java.lang.noClassDefFoundError: org/apache/lucene/store/Directory
  
 At: Java.lang.Class.getDeclaredConstructors0(Native Method)
  
 At: Java.lang.Class.privateGetDeclaredConstructors(Class.java:1610)
  
 At: Java.lang.Class.getConstructor0(Class.java:1922)
  
 At: Java.lang.Class.newInstance0(Class.java:278)
  
 At: Java.lang.Class.newInstance(Class.java:261)
  
 At: sun.applet.AppletPanel.createApplet(AppletPanel.java:617)
  
 At: sun.applet.AppletPanel.runloader(AppletPanel.java:546)
  
 At: sun.applet.AppletPanel.run(AppletPanel.java:298)
  
 At: java.lang.Thread.run(Thread.java:534)
  
 Code:
  
 import org.apache.lucene.search.IndexSearcher;
  
 import org.apache.lucene.search.Query;
  
 import org.apache.lucene.search.TermQuery;
  
 import org.apache.lucene.store.RAMDirectory;
  
 import org.apache.lucene.store.Directory;
  
 import org.apache.lucene.index.Term;
  
 import org.apache.lucene.search.Hits;
  
 import java.awt.*;
  
 import java.awt.event.*;
  
 import javax.swing.*;
  
 import java.io.*;
  
 public class MemorialApp2 extends JApplet implements ActionListener{
  
 JLabel prompt;
  
 JTextField input;
  
 JButton search;
  
 JPanel panel;
  
 String indexDir = C:/Java/lucene/index-list;
  
 private static RAMDirectory idx;
  
 public void init(){
  
 Container cp = getContentPane();
  
 panel = new JPanel();
  
 

Re: Lucene Search Applet

2004-08-23 Thread Stephane James Vaucher
I haven't used it, and I'm a little confused from the code:
/** ...
 * <p>If the system property 'disableLuceneLocks' has the String value of
 * "true", lock creation will be disabled.
 */
public final class FSDirectory extends Directory {
  private static final boolean DISABLE_LOCKS =
      Boolean.getBoolean("disableLuceneLocks") || Constants.JAVA_1_1;

I don't see a System.getProperty(String).

You might have to patch this, if I'm correct. This should stop the 
Directory from trying to use locks.

HTH,
sv

On Mon, 23 Aug 2004, Simon mcIlwaine wrote:

 Hi Stephane,
 
  A bit of a stupid question, but how do you mean set the system property
  disableLuceneLocks=true? Can I do it via a call to the FSDirectory API, or do
  I have to actually hack the code? Also, if I do use RODirectory, how do I go
  about using it? Do I have to update the Lucene JAR archive file with the
  RODirectory class included? I tried using it and it's not recognising the
  class.
 
 Many Thanks
 
 Simon
 
 - Original Message - 
 From: Stephane James Vaucher [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Monday, August 23, 2004 2:22 PM
 Subject: Re: Lucene Search Applet
 
 
   Hi Simon,

   Does this work? From FSDirectory api:
   [...]

Re: Lucene Search Applet

2004-08-23 Thread Stephane James Vaucher
Thanks Erik for correcting me,

I feel a bit stupid: I actually looked at the api to make sure that I 
wasn't in left field, but I trusted common-sense and stopped at the 
constructor ;)

Should this property be changed in the next major release of lucene to 
org.apache...disableLuceneLocks?

sv
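
A two-line demonstration of the API quirk Erik points out below; the made-up property name aside, this is just standard java.lang behaviour:

public class GetBooleanDemo {
  public static void main(String[] args) {
    System.setProperty("disableLuceneLocks", "true");
    // Boolean.getBoolean(name) reads the *system property* called name;
    // it does not parse its argument as a boolean literal
    System.out.println(Boolean.getBoolean("disableLuceneLocks")); // true
    System.out.println(Boolean.getBoolean("true"));  // false: no such property
  }
}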

On Mon, 23 Aug 2004, Erik Hatcher wrote:

 
 On Aug 23, 2004, at 10:48 AM, Stephane James Vaucher wrote:
  I haven't used it, and I'm a little confused from the code:
  /** ...
   * <p>If the system property 'disableLuceneLocks' has the String value
   * of "true", lock creation will be disabled.
   */
  public final class FSDirectory extends Directory {
    private static final boolean DISABLE_LOCKS =
        Boolean.getBoolean("disableLuceneLocks") || Constants.JAVA_1_1;
  ...
 
  I don't see a System.getProperty(String).
 
 :)
 
  check the javadocs for Boolean.getBoolean()
  
  It's by far one of the dumbest and most confusing APIs ever!
  (Basically this does a System.getProperty("disableLuceneLocks") and
  converts it to a boolean.)
 
   Erik
 
 
 





Re: Index Size

2004-08-19 Thread Stephane James Vaucher
Stupid question:

Are you sure you have the right number of docs in your index? I.e., you're
not adding the same document twice, either directly or via your tmp index.

sv

On Thu, 19 Aug 2004, Rob Jose wrote:

 Paul
 Thank you for your response.  I have appended to the bottom of this message
 the field structure that I am using.  I hope that this helps.  I am using
 the StandardAnalyzer.  I do not believe that I am changing any default
 values, but I have also appended the code that adds the temp index to the
 production index.

 Thanks for you help
 Rob

 Here is the code that describes the field structure.
 public static Document Document(String contents, String path, Date modified,
     String runDate, String totalpages, String pagecount, String countycode,
     String reportnum, String reportdescr)
 {
     SimpleDateFormat showFormat = new
         SimpleDateFormat(TurbineResources.getString("date.default.format"));
     SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");

     Document doc = new Document();
     doc.add(Field.Keyword("path", path));
     doc.add(Field.Keyword("modified", showFormat.format(modified)));
     doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
     doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
     doc.add(Field.UnStored("searchRunDate",
         runDate == null ? "" : runDate.substring(6) + runDate.substring(0, 2)
             + runDate.substring(3, 5)));
     doc.add(Field.Keyword("reportnum", reportnum));
     doc.add(Field.Text("reportdescr", reportdescr));
     doc.add(Field.UnStored("cntycode", countycode));
     doc.add(Field.Keyword("totalpages", totalpages));
     doc.add(Field.Keyword("page", pagecount));
     doc.add(Field.UnStored("contents", contents));

     return doc;
 }



 Here is the code that adds the temp index to the production index.

 File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
 tempReader = IndexReader.open(tempFile);
 try
 {
     boolean createIndex = false;
     File f = new File(sIndex + File.separatorChar + sCntyCode);
     if (!f.exists())
     {
         createIndex = true;
     }
     prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
         new StandardAnalyzer(), createIndex);
 }
 catch (Exception e)
 {
     IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
         sCntyCode, false));
     CasesReports.log("Tried to Unlock " + sIndex);
     prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
     CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
         sCntyCode);
 }
 prodWriter.setUseCompoundFile(true);
 prodWriter.addIndexes(new IndexReader[] { tempReader });





 - Original Message -
 From: Paul Elschot [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Thursday, August 19, 2004 12:16 AM
 Subject: Re: Index Size


 On Wednesday 18 August 2004 22:44, Rob Jose wrote:
  Hello
  I have indexed several thousand (52 to be exact) text files and I keep
  running out of disk space to store the indexes.  The size of the documents
  I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
  287 GB.  Does this seem correct?  I am not storing the contents of the

 As noted, one would expect the index size to be about 35%
 of the original text, i.e. about 2.5 GB * 35% = 875 MB.
 That is two orders of magnitude off from what you have.

 Could you provide some more information about the field structure,
 ie. how many fields, which fields are stored, which fields are indexed,
 evt. use of non standard analyzers, and evt. non standard
 Lucene settings?

 You might also try changing to the non-compound format to have a look
 at the sizes of the individual index files; see the file formats page on
 the lucene web site.
 You can then see the total disk size of, for example, the stored fields.

 Regards,
 Paul Elschot


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-18 Thread Stephane James Vaucher
From: Doug Cutting
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08757.html

 An index typically requires around 35% of the plain text size.

By that rule of thumb, your index looks more than a little big.

sv

On Wed, 18 Aug 2004, Rob Jose wrote:

 Hello
 I have indexed several thousand (52 to be exact) text files and I keep 
 running out of disk space to store the indexes.  The size of the 
 documents I have indexed is around 2.5 GB.  The size of the Lucene 
 indexes is around 287 GB.  Does this seem correct?  I am not storing the 
 contents of the file, just indexing and tokenizing.  I am using Lucene 
 1.3 final.  Can you guys let me know what you are experiencing?  I don't 
 want to go into production with something that I should be configuring 
 better.  
 
 I am not sure if this helps, but I have a temp index and a real index.  I index the 
 file into the temp index, and then merge the temp index into the real index using 
 the addIndexes method on the IndexWriter.  I have also set the production writer 
 setUseCompoundFile to true.  I did not set this on the temp index.  The last thing 
 that I do before closing the production writer is to call the optimize method.  
 
 I would really appreciate any ideas to get the index size smaller if it is at all 
 possible.
 
 Thanks
 Rob


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Swapping Indexes?

2004-08-17 Thread Stephane James Vaucher
On Tue, 17 Aug 2004, Patrick Burleson wrote:

 Forward back to list.
 
 
 -- Forwarded message --
 From: Patrick Burleson [EMAIL PROTECTED]
 Date: Tue, 17 Aug 2004 11:30:19 -0400
 Subject: Re: Swapping Indexes?
 To: Stephane James Vaucher [EMAIL PROTECTED]
 
 Stephane,
 
 Thank you for the ideas. I'm going about implementing idea 1 (I like the
 idea of leaving the temp index around for recovery), but I have a
 question regarding your original index. Do you just copy over the
 temp index and not worry about cleaning up the old index directory?

Actually, I use an IndexWriter in overwrite mode on the master dir and
merge the temp dir in. This cleans up the old master.
 
 Right now I have my code deleting the files in the main index
 directory after telling the search controller to switch to the temp
 index. But by doing that, I need to manage existing searches and not
 break them while they are running. I also still run into the open
 files problem on Windows when trying to delete a file one of the
 searchers has open before it's closed.

I used to wait some time (~1 minute) for all searches on the old master to
finish after redirecting to the temp dir, then I would switch to the new
master.

 Thoughts?

If you apply a lease-like contract with your searchers, where they
borrow a reference to a searcher and then hand it back to the manager,
you can probably trace your open files.

HTH,
sv
 
 Patrick
 
 
 
 
 On Mon, 16 Aug 2004 18:22:20 -0400 (EDT), Stephane James Vaucher
 [EMAIL PROTECTED] wrote:
  I've tried two options that seem to work:
 
  1) Have a singleton that is responsible for controlling your searchers.
  This controller can temporarily redirect your searchers to
  c:/temp/myindex, allowing you to copy your index to c:/myindex. After
  that process completes, your controller can tell your searchers to use
  c:/myindex, allowing you to then erase your temp index.
 
  If you index nightly, you can always *not* erase your tmp dir; your
  index process will do this automatically if you create your IndexWriter
  with the overwrite option. This way, you have a backup index in case
  there is a system failure at some point (like when you copy/move
  directories).
 
  2) Use an incremental index. Regularly, I scan my files, see if there
  are modifications/additions, and update my master index: removing from
  the master index, adding to a temp dir, then merging. I haven't seen
  any weirdness on windows with this process.
 
  HTH,
  sv
 
 
 
  On Mon, 16 Aug 2004, Patrick Burleson wrote:
 
    I've read in the docs about updating an index and its suggestion
    regarding swapping out indexes with a directory rename.
  
   Here's my question, how to do this when searches are running live?
  
   Say I have a directory that holds the current valid index:
  
   C:\myindex
  
   and when I'm running my nightly process to generate the index, it gets
   temporarily indexed to:
  
   C:\temp\myindex
  
   How can I very quickly replace C:\myindex with C:\temp\myindex?
  
   I can't simply do a rename since C:\myindex will likely have open
   files. (Gotta love windows)
  
   And I can't delete all files in myindex, again because of the open files issue.
  
   Any ideas?
  
   Thanks,
   Patrick
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Swapping Indexes?

2004-08-17 Thread Stephane James Vaucher
On Tue, 17 Aug 2004, Patrick Burleson wrote:

 On Tue, 17 Aug 2004 13:17:10 -0400 (EDT), Stephane James Vaucher 
  
  Actually, I use an IndexWriter in overwrite mode on the master dir and
  merge the temp dir in. This cleans up the old master.
  
 
 I'm a bit of a Lucene newbie here, and I am trying to understand what
 you mean by "merge the temp dir".

IndexWriter.addIndexes()

 Do you copy your existing index to
 the temp location, then use the overwrite feature of IndexWriter to
 re-create the master, and then what do you merge? Shouldn't the master
 index now have everything?

What I mean is the following:

1) create tmp dir
2) redirect searchers to tmp dir
3) wait for everyone to be using the tmp dir (or some other mechanism)
4) open indexwriter on master dir erasing it
5) merge tmp directory, using addIndexes() method
6) redirect searchers to new master dir
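
In code, steps 4 and 5 look roughly like this (a sketch against the 1.x
API; paths are the ones from the example, error handling omitted):

    // step 4: open a writer on the master dir with create=true, erasing it
    IndexWriter master = new IndexWriter("C:\\myindex", new StandardAnalyzer(), true);
    // step 5: merge the temp index back into the master
    IndexReader tmp = IndexReader.open("C:\\temp\\myindex");
    master.addIndexes(new IndexReader[] { tmp });
    master.optimize();
    master.close();
    tmp.close();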
 
  
  I used to wait some time (~1 minute) for all searches on the old master
  to finish after redirecting to the temp dir, then I would switch to the
  new master.
  
 
 I'm going to make this a setting, so that tests won't have to wait a
 whole minute. But I think this is the cleanest solution without having
 to implement some sort of leasing scheme. Our searches should be
 fast, and 1 minute is a long time; they should all be done by then.
 
I used to reindex all my docs at 5:00 AM; I probably could have waited 10
minutes since I didn't have users. It's all about requirements ;)

 Thanks again,
 Patrick
 

sv


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Swapping Indexes?

2004-08-16 Thread Stephane James Vaucher
I've tried two options that seem to work:

1) Have a singleton that is responsible that will control your searchers. 
This controller can temporarilly redirect your searchers to 
c:/temp/myindex, allowing you to copy you index to c:/myindex. After that 
process completes, your controller can tell your searchers to use 
c:/myindex, allowing you to then erase your temp index.

If you index nightly, you can always *not* erase your tmp dir; your index
process will do this automatically if you create your IndexWriter with
the overwrite option. This way, you have a backup index in case there is
a system failure at some point (like when you copy/move directories).

2) Use an incremental index. Regularly, I scan my files, see if there are
modifications/additions, and update my master index: removing from the
master index, adding to a temp dir, then merging. I haven't seen any
weirdness on windows with this process.

HTH,
sv

On Mon, 16 Aug 2004, Patrick Burleson wrote:

 I've read in the docs about updating an index and its suggestion
 regarding swapping out indexes with a directory rename.
 
 Here's my question, how to do this when searches are running live? 
 
 Say I have a directory that holds the current valid index: 
 
 C:\myindex
 
 and when I'm running my nightly process to generate the index, it gets
 temporarily indexed to:
 
 C:\temp\myindex
 
 How can I very quickly replace C:\myindex with C:\temp\myindex? 
 
 I can't simply do a rename since C:\myindex will likely have open
 files. (Gotta love windows)
 
 And I can't delete all files in myindex, again because of the open files issue. 
 
 Any ideas?
 
 Thanks,
 Patrick
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: boost keywords

2004-08-13 Thread Stephane James Vaucher
Other indexing strategies:

- AFAIK, you could probably cheat by multiplying the number of tokens in
headers, thus affecting the scoring.

For example:
<h1>hello world</h1> <p>foo bar</p>
content -> hello world hello world foo bar

This is not very tweakable though.

- As Tate suggests, you can also use multiple fields and apply your search
on all of them:

<h1>hello world</h1> <p>foo bar</p>
content -> hello world foo bar
headers -> hello world

or even
<h1>hello world</h1> <h2>foo bar</h2>
content -> hello world foo bar
header1 -> hello world
header2 -> foo bar

The result of this is that you get fine-grained control over the different
fields. At this point, you can boost at indexing or at search time. I
personally opt for search time because it is more open for tweaking, as
opposed to reindexing everything whenever you want to change a boost
factor.
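
For example, boosting header matches at search time might look like this
(a sketch against the 1.x query API; field names are the ones above):

    TermQuery inContent = new TermQuery(new Term("content", "hello"));
    TermQuery inHeaders = new TermQuery(new Term("headers", "hello"));
    inHeaders.setBoost(3.0f); // tweak freely, no reindexing needed
    BooleanQuery query = new BooleanQuery();
    query.add(inContent, false, false); // optional clause
    query.add(inHeaders, false, false); // optional, boosted when it matches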

As for the complexities that Tate mentions for query parsing, he's right
that it's a pain when using the built-in query parser, but you can always
use the API directly to build whatever queries you need.

HTH,
sv

On Fri, 13 Aug 2004, Tate Avery wrote:


 Well, as far as I know you can boost 3 different things:

 - Field
 - Document
 - Query

 So, I think you need to craft a solution using one of those.

 Here are some possibilities for each:

 1) Field
   - make a keyword field which is alongside your content field
   - boost your keyword field during indexing
   - expand user queries to search 'content' and 'keywords'

 2) Document
   - I don't really think this one helps you in anyway

 3) Query
   - Scan a user query and selectively boost words that are known keywords
   - This requires a keyword list and is not really scalable

 That is all that comes to mind, at first glance.  So, IMO, the winner IS #1.

 For example:

    Field _headline = Field.Text("headline", ...);
    _headline.setBoost(3);

    Field _content = Field.Text("content", ...);

    _document.add(_headline);
    _document.add(_content);


 But, the tricky part is modifying queries to use both fields.  If a user
 enters "virus", it is easy (i.e. content:(virus) OR headline:(virus)).
 But, it quickly gets more complex with more complex queries (especially
 boolean queries with AND and such ... you probably would need something
 roughly like this:  "a AND b" = content:(a AND b) OR headline:(a AND b)
 OR (content:a AND headline:b) OR (headline:a AND content:b) and so on).

 That's my 2 cents.

 T



 -Original Message-
 From: news [mailto:[EMAIL PROTECTED] Behalf Of Leos Literak
 Sent: Friday, August 13, 2004 8:52 AM
 To: [EMAIL PROTECTED]
 Subject: Re: boost keywords


 Gerard Sychay napsal(a):
  Well, there is always the Lucene wiki. There's not a patterns page per
  se, but you could start one..

 of course I could. If I had something to add :-)

 but back to my issue. no reaction? So much people using
 Lucene and no one knows? I would be gratefull for any
 advice. Thanks

 Leos


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search exception in servlet!Please help me

2004-08-03 Thread Stephane James Vaucher
What is the exception? Is hits null or the index (i) out of bounds?

sv

On Tue, 3 Aug 2004, xuemei li wrote:

 hi all,

 I am using Lucene to search. When I run my code from the console it works
 fine, but after I put my code into a servlet it throws an exception. Here
 is the offending line:
   Document doc = hits.doc(i);  <-- exception
 But I can use the following code to get the hits.length() value:
   out.println("<center><p>There are " + hits.length() + " matches for the
   word you have entered!</p></center>");

 What's the problem? Any reply will be appreciated.

 thanks,
 Xuemei Li



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Setting up the index directory on tomcat

2004-07-29 Thread Stephane James Vaucher
Assuming you are using a FSDirectory and have the appropriate permissions, 
yup.

sv

On Thu, 29 Jul 2004, Ian McDonnell wrote:

 Is this done simply by saying:

 String indexDirectory = "/path of directory you want index to be stored in"
 
 Ian
 
 _
 Sign up for FREE email from SpinnersCity Online Dance Magazine  Vortal at 
 http://www.spinnerscity.com
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: continous index update

2004-07-28 Thread Stephane James Vaucher
I don't know if this helps, but this is what I do. I believe this is
correct, but I have just finished the implementation and haven't tested it
fully:

- keep a reference to the currently valid searcher
- open a reader on the old index
- open a writer on a tmp Directory (RAM or FS)
- find removed/modified files; remove them via the reader and add the
updated or new documents via the writer
- close the reader, then open a writer on the master dir and merge in the
tmp dir
- update the searcher reference

This seems to work while there are concurrent requests, but I need to be
more thorough.
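
The skeleton of that cycle (a sketch; findStaleDocuments() and the "uid"
field are made up for illustration):

    IndexReader reader = IndexReader.open(indexDir);
    RAMDirectory tmpDir = new RAMDirectory();
    IndexWriter tmpWriter = new IndexWriter(tmpDir, new StandardAnalyzer(), true);
    // remove stale versions via the reader, add fresh ones via the writer
    for (Iterator it = findStaleDocuments().iterator(); it.hasNext();) {
        Document doc = (Document) it.next();
        reader.delete(new Term("uid", doc.get("uid")));
        tmpWriter.addDocument(doc);
    }
    tmpWriter.close();
    reader.close(); // release the write lock before the merge
    IndexWriter master = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    master.addIndexes(new Directory[] { tmpDir });
    master.close();
    // finally, swap the searcher reference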
HTH,
sv

On Wed, 28 Jul 2004, jitender ahuja wrote:

 Hi all,
   I am trying to update the index automatically from a background thread,
 but it gives errors when deleting the existing index if (and only if) the
 server accesses the index at the same time, or has accessed it once
 before. Even if a different request is posed, i.e. for a different index
 directory or a different job, it makes no difference.
 Can anyone tell me how, in such a continuous-update scenario, the old
 index can be updated? I feel deletion of the earlier contents is a must
 in order to get the new contents in place.

 Regards,
 Jitender


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: When does IndexReader pick up changes?

2004-07-28 Thread Stephane James Vaucher
IIRC, if you use a searcher, changes are picked up right away. With a
reader, I would expect it to react the same way.

<disclaimer>I'm not a lucene guru, I might be wrong</disclaimer>
Where I'm less sure is with an FSDirectory, as it uses an internal
RAMDirectory. If two separate processes use different FSDirectories
(within the same classloader, FSDirectories with the same path are
reused), you might notice a flushing behaviour.

sv

On 28 Jul 2004 [EMAIL PROTECTED] wrote:

 Hi,

 Does anyone know if the IndexWriter has to be closed for an IndexReader
 to pick up the changes?

 Thanks.

 --- Lucene Users List [EMAIL PROTECTED]
 wrote:
 Hi,

  If I do this:

    - open index writer
    - add document
    - open reader
    - search with reader
    - close reader
    - close writer

  Will the reader pick up the document that was added to the index, since
  it was opened after the document was added?  Or will it only pick up
  changes that occur after the index writer is closed?

  Thanks for the help!


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory Requirements

2004-05-13 Thread Stephane James Vaucher
On Thu, 13 May 2004, Matt Quail wrote:

  do you know of any method to reduce the memory consumption of lucene
  when searching?

 It depends on the complexity of the search, I think. Also, I believe
 scoring might use more memory than the search itself (can anyone confirm
 this?). For example, I often use the HitCollector interface (and a
 BitSet) for queries where I am not interested in the score.

 Apart from that, I'm not aware of any other methods for reducing the
 memory consumption.

Avoid prefix queries and wildcards, since they can be rewritten into large
boolean queries. You can limit the rewriting by setting a maximum number of
clauses on BooleanQuery, to prevent requests from taking too much memory.
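
That knob in a sketch (searcher and queryParser assumed to exist):

    BooleanQuery.setMaxClauseCount(4096); // the default is 1024
    try {
        Hits hits = searcher.search(queryParser.parse("a*"));
    } catch (BooleanQuery.TooManyClauses e) {
        // the prefix still expands to too many terms; narrow the query
    }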

 =Matt

 Sascha Ottolski wrote:

  Am Donnerstag, 13. Mai 2004 12:56 schrieb Matt Quail:
 
 I noticed that  most users have +- 1G of RAM to run Lucene. Does
 anyone have experiences running it on a 128MB or 256MB machine?
 
  I regularly test my app that uses Lucene by passing -Xmx8m to the
  JVM; this is on a box with 1G of ram, but the JVM never uses more than
  8M. My app runs fine (though there is a little more garbage collection
  activity).
 
 
  do you know of any method to reduce the memory consumption of lucene
  when searching? I've just increased from 400 to -Xmx500m, since
  sometimes OutOfMemoryExceptions occurred (running Sun's java 1.4.2_04 and
  lucene-1.4rc3 on an index with ca. 18 million entries, building an index
  (including stored-only fields) of 9 GB).
 
  I've seen that there are ways to limit the memory need for indexing and
  optimizing(?) by reducing the MergeFactor, but that doesn't seem to
  apply for searching :-(
 
  Thanks,
 
  Sascha
 
 
 =Matt

sv


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Analysis of wildcard queries

2004-05-10 Thread Stephane James Vaucher
I've seen this:
http://www.jguru.com/faq/view.jsp?EID=538312

I've seen in the code that there is a method to set lowercasing, but I
need to remove accented chars as well. Any suggestions as to which is
preferable: preprocessing the input, or subclassing QueryParser and
overriding getWildcardQuery?
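
The subclassing route might look like this (a sketch; stripAccents() is a
placeholder for whatever normalisation you use):

    public class NormalizingQueryParser extends QueryParser {
        public NormalizingQueryParser(String field, Analyzer analyzer) {
            super(field, analyzer);
        }
        protected Query getWildcardQuery(String field, String termStr)
                throws ParseException {
            // normalise the same way the indexed terms were analysed
            return super.getWildcardQuery(field, stripAccents(termStr.toLowerCase()));
        }
    }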

cheers,
sv


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Range searches for numbers

2004-05-06 Thread Stephane James Vaucher
Quick reference:

http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

If you are stuck, you can always encode the long in a string format (the
date formatter in lucene might do this already). Or you could treat it
like a date and use your long with a date-style filter.
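
For instance, a fixed-width encoding with an offset (a sketch; the 10^9
offset assumes your values stay under a billion in magnitude):

    import java.text.DecimalFormat;

    private static final long OFFSET = 1000000000L;          // 10^9
    private static final DecimalFormat PAD = new DecimalFormat("0000000000");

    static String encodeForRangeSearch(long value) {
        // shift into non-negative territory, then zero-pad so that
        // lexicographic order matches numeric order
        return PAD.format(OFFSET + value);
    }

With this, -1 becomes 0999999999 and 1 becomes 1000000001, so string
ranges behave like numeric ranges.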

HTH,
sv

On 6 May 2004 [EMAIL PROTECTED] wrote:

 Hi,
 
 What's the best way to store numbers for range searching?  If someone
 has some info about this I'd love to see it.
 
 This is my current plan:
 When I convert the number to a string I will zero pad it so range
 searches work.  The conversions will be like this for integers:

       1 to 1000000001
       2 to 1000000002
    1000 to 1000001000

 I'm just adding a 1 to the start of the string (i.e. adding an offset of
 10^9).  This is so negative numbers work too!  They will just be
 subtracted from the offset (10^9):

      -1 to 0999999999
      -2 to 0999999998
   -1000 to 0999999000

 This works great for range searches.  But how do I convert negative
 longs?  I can't use an offset of 10^19, can I?  It's too big to fit in
 another long.
 
 Any advice is appreciated!
 
 -Reece
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Understanding Boolean Queries

2004-04-29 Thread Stephane James Vaucher
On Thu, 29 Apr 2004, Tate Avery wrote:

 Hello,

 I have been reviewing some of the code related to boolean queries and I
 wanted to see if my understanding is approximately correct regarding how
 they are handled and, more importantly, the limitations.

You can always submit requests for enhancements in bugzilla, so as to keep
track of this issue.

 Here is what I have come to understand so far:

 1) The QueryParser code generated from javacc will parse my boolean query
 and determine for each clause whether or not it is 'required' (based on a
 few conditions but, in short, whether or not it was introduced or followed
 by 'AND') or 'prohibited' (based, in short, on it being preceded by 'NOT').

Your usage seems pretty particular; why are you using the javacc
QueryParser?

 2) As my BooleanQuery is being constructed, it will throw a
 BooleanQuery.TooManyClauses exception if I exceed
 BooleanQuery.maxClauseCount (which defaults to 1024).

It's configurable through sys properties or by
BooleanQuery.setMaxClauseCount(int maxClauseCount)

 3) The maxClauseCount threshold appears not to care whether or not my
 clauses are 'required' or 'prohibited'... only how many of them there are in
 total.

 4) My BooleanQuery will prepare its own Scorer instance (i.e.
 BooleanScorer).  And, during this step, it will identify to the scorer
 which clauses are 'required' or 'prohibited'.  And, if more than 32 fall
 into this category, an IndexOutOfBoundsException ("More than 32
 required/prohibited clauses in query.") is thrown.
 That's as far as I got.
 Now, I am a bit confused at this point.  Does this mean I can make a
 boolean query consisting of up to 1024 clauses as long as no more than 32
 of them are required or prohibited?  This doesn't seem right.  So, am I
 missing something in the way I am understanding this?
 I am (as you may have guessed) generating large boolean queries.  And, in
 some rare cases, I am receiving the exception identified in #4 (above).  So,
 I am trying to figure out whether or not I need to change/filter my queries
 in a special way in order to avoid this exception.  And, in order to do
 this, I want to understand how these queries are being handled.
 Finally, is there something related to the query syntax that could be my
 mistake?  For example, what is the difference between:
   A B AND C D AND D E
 ... and...
   (A B) AND (C D) AND (D E)
 ... could that be the crux of it?

I can't help you here, and the doc seems rather thin (or nonexistent) for
this class. I don't know the relation between the query and how the
scorer will process it.

Sorry I can't be of assistance,
sv

 Thank you for your time,
 Tate Avery


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Understanding Boolean Queries

2004-04-29 Thread Stephane James Vaucher
Hi Tate,

Forgot to ask, what version of Lucene? (IIRC, <= 1.2 means no
maxClauseCount)

sv

On Thu, 29 Apr 2004, Tate Avery wrote:

 Thank you for the response.

 I am not using the QueryParser directly... it was just part of my overall
 understanding of how this exception is coming about.  Same thing,
 essentially, with the maxClauseCount.


 Here is some code to illustrate what is confusing me and what I am trying to
 ascertain:

   int _numClauses = XXX;
   boolean _required = XXX;  // 3 examples of these var settings below

   BooleanQuery _query = new BooleanQuery();

   for (int _i = 0; _i < _numClauses; _i++)
   {
       _query.add(
           new BooleanClause(
               new TermQuery(new Term("body", "term" + _i)),
               _required,
               false));
   }

   Hits _hits = new IndexSearcher(INDEX_DIR).search(_query);


 1) With _numClauses= and _required=false (for example), I have no
 problems.
 (This is confusing since  is more than maxClauseCount... but I won't
 complain).

 2) With _numClauses=32 and _required=true, I also have no problems.

 3) With _numClauses=33 and _required=true, I get
 java.lang.IndexOutOfBoundsException: "More than 32 required/prohibited
 clauses in query." as a runtime exception.


 So, I guess I am trying to ask the following:

 Is a query like (T1 AND T2 AND ... AND T32 AND T33) just completely illegal
 for Lucene?
 OR is there some way to extend this limit?
 OR am I missing something that is clouding my understanding?



 Thanks,
 Tate



 -Original Message-
 From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
 Sent: Thursday, April 29, 2004 1:10 PM
 To: Lucene Users List; [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: Understanding Boolean Queries


 On Thu, 29 Apr 2004, Tate Avery wrote:

  Hello,
 
  I have been reviewing some of the code related to boolean queries and I
  wanted to see if my understanding is approximately correct regarding how
  they are handled and, more importantly, the limitations.

 You can always submit requests for enhancements in bugzilla, so as to keep
 track this issue.

  Here is what I have come to understand so far:

  1) The QueryParser code generated from javacc will parse my boolean
  query and determine for each clause whether or not it is 'required'
  (based on a few conditions but, in short, whether or not it was
  introduced or followed by 'AND') or 'prohibited' (based, in short, on it
  being preceded by 'NOT').

 Your usage seems pretty particular; why are you using the javacc
 QueryParser?

  2) As my BooleanQuery is being constructed, it will throw a
  BooleanQuery.TooManyClauses exception if I exceed
  BooleanQuery.maxClauseCount (which defaults to 1024).

 It's configurable through sys properties or by
 BooleanQuery.setMaxClauseCount(int maxClauseCount)

  3) The maxClauseCount threshold appears not to care whether or not my
  clauses are 'required' or 'prohibited'... only how many of them there
  are in total.

  4) My BooleanQuery will prepare its own Scorer instance (i.e.
  BooleanScorer).  And, during this step, it will identify to the scorer
  which clauses are 'required' or 'prohibited'.  And, if more than 32 fall
  into this category, an IndexOutOfBoundsException ("More than 32
  required/prohibited clauses in query.") is thrown.
  That's as far as I got.
  Now, I am a bit confused at this point.  Does this mean I can make a
  boolean query consisting of up to 1024 clauses as long as no more than
  32 of them are required or prohibited?  This doesn't seem right.  So, am
  I missing something in the way I am understanding this?
  I am (as you may have guessed) generating large boolean queries.  And,
  in some rare cases, I am receiving the exception identified in #4
  (above).  So, I am trying to figure out whether or not I need to
  change/filter my queries in a special way in order to avoid this
  exception.  And, in order to do this, I want to understand how these
  queries are being handled.
  Finally, is there something related to the query syntax that could be my
  mistake?  For example, what is the difference between:
    A B AND C D AND D E
  ... and...
    (A B) AND (C D) AND (D E)
  ... could that be the crux of it?

 I can't help you here, and the doc seems rather thin (or nonexistent)
 for this class. I don't know the relation between the query and how the
 scorer will process it.

 Sorry I can't be of assistance,
 sv

  Thank you for your time,
  Tate Avery
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED

Re: Combining text search + relational search

2004-04-28 Thread Stephane James Vaucher
I'm a bit confused about why you want this.

As far as I know, relational db searches will return exact matches
without a measure of relevancy. To measure relevancy, you need a search
engine. For your results to be coherent, you would have to put everything
in the lucene index.

As for memory consumption: for searching, if the index is on disk, the
memory footprint depends on the type of queries you use. For indexing, it
depends on whether you use a tmp RAMDirectory to do merges; otherwise,
memory consumption is minimal.
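
For example, each row can become one Document carrying both the text and
its attributes (a sketch; the field names and the row accessor are made
up):

    Document doc = new Document();
    doc.add(Field.Keyword("dbKey", row.getId()));       // stored, exact match
    doc.add(Field.Keyword("author", row.getAuthor()));  // metadata attribute
    doc.add(Field.UnStored("contents", row.getText())); // full text, index only
    writer.addDocument(doc);
    // one query can then constrain both: +author:smith +contents:lucene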

HTH
sv

On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote:


 I need to somehow allow users to do a text search and query relational
 database attributes at the same time. The attributes are basically
 metadata about the documents that the text search will be performed on. I
 have the text of the documents indexed in Lucene. Does anyone have any
 advice or examples? I also need to make sure I don't gobble up all the
 memory on our server.

 Thanks
 Mike


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: status of LARM project

2004-04-27 Thread Stephane James Vaucher
I suggest you look at:
http://www.manageability.org/blog/stuff/open-source-web-crawlers-java

From what I know of nutch, it's meant as the basis for a competitor to the
big search engines (i.e. google). For a small web site, it might be
overkill, especially if it requires you to build from CVS (unless there
are distributions).

Note:
I've got the book Programming Spiders, Bots and Aggregators in Java, it
describes spiders using a project called: j-spider
http://sourceforge.net/projects/j-spider/
It could probably be adapted for your needs.

HTH,
sv

On Wed, 28 Apr 2004, Kelvin Tan wrote:

 As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal, that
 Clemens got a job which wasn't supportive of his continued development on LARM.
 AFAIK there aren't any other active developers of LARM (at least at the time it
 branched off to SF).

 Otis recently posted to use Nutch instead of LARM.

 Kelvin

 On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said:
  Hi
 
  I have look at LARM website and I get different results
 
  http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages
  It says that development has stopped for this project.
 
  LARM hosted on sourceforge.
  The last message was dated 2003 in the mailing list. Is it still
  supported and active?
 
  LARM hosted on apache.
  It says the project is moved to sourceforge.
 
  Any one here who is active in LARM can comment on the status?
 
  Regards
 
  Sebastian Ho
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Segments file get deleted?!

2004-04-26 Thread Stephane James Vaucher
I would have to agree with Surya's diagnosis, can you give us details on 
your update process?

Please include OS, and if there are some non-java processes (e.g. doing 
copies).

cheers,
sv

On Mon, 26 Apr 2004, Nader S. Henein wrote:

 Can you give us a bit of background? We've been using Lucene since the
 first stable release 2 years ago, and I've never had segments disappear
 on me. First, can you provide some background on your setup? Secondly,
 when you say a certain period of time, how much time are we talking
 about, and does that interval coincide with your indexing schedule? You
 may have the create flag on the indexer set to true, so it simply
 recreates the index at every update, deleting whatever was there; of
 course, if there are no files to index at any point it will just give you
 a blank index.
 
 
 Nader Henein
 
 -Original Message-
 From: Surya Kiran [mailto:[EMAIL PROTECTED] 
 Sent: Monday, April 26, 2004 7:48 AM
 To: [EMAIL PROTECTED]
 Subject: Segments file get deleted?!
 
 
 Hi all, we have implemented our portal search using Lucene. It works
 fine, but after a certain period of time the Lucene segments file gets
 deleted. Eventually all searches fail. Can anyone guess where the error
 could be?
 
 Thanks a lot.
 
 Regards
 Surya.
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Jakarta Lucene Wiki] Updated: NewFrontPage

2004-04-26 Thread Stephane James Vaucher
I don't know what you think of the NewFrontPage, but if you like it, I 
could do a switch, renaming old FrontPage to OldFrontPage and the new one 
to FrontPage.

Also, if anyone knows how to do this, it would be appreciated. I haven't 
figured out yet how to rename/destroy pages (are there permissions in 
MoinMoin?). Amongst other things, the doc says there should be an 
action=DeletePage.

cheers,
sv

On Mon, 26 Apr 2004 [EMAIL PROTECTED] wrote:

Date: 2004-04-26T07:38:29
Editor: StephaneVaucher [EMAIL PROTECTED]
Wiki: Jakarta Lucene Wiki
Page: NewFrontPage
URL: http://wiki.apache.org/jakarta-lucene/NewFrontPage
 
Added link to PoweredBy page
 
 Change Log:
 
 --
 @@ -17,6 +17,7 @@
   || IntroductionToLucene || Articles and Tutorials introducing Lucene ||
   || OnTheRoad || Information on presentations and courses ||
   || InformationRetrieval || Articles and Tutorials on information retrieval ||
 + || PoweredBy || Link to projects using Lucene ||
   || [LuceneFAQ]|| The Lucene FAQ ||
   || HowTo|| Lucene HOWTO's : small tutorials and code snippets ||
   || [Resources]|| Contains useful links ||
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Adding duplicate Fields to Documents

2004-04-24 Thread Stephane James Vaucher
From my experience (that is, little experience ;)), fields that are not
tokenised are stored separately. Someone more qualified can surely give
you more details.

You can look at your index with Luke; it might be insightful.
sv

On Thu, 22 Apr 2004, Gerard Sychay wrote:

 Hello,

 I am wondering what happens when you add two Fields with same names to
 a Document.  The API states that if the fields are indexed, their text
 is treated as though appended.  This much makes sense.  But what about
 the following two cases:

 - Adding two fields with same name that are indexed, not tokenized
 (keywords)?  E.g. given (field_name, keyword1) and (field_name,
 keyword2), would the final keyword field be (field_name,
 keyword1keyword2)?  Seems weird..

 - Adding two fields with same name that are stored, but not indexed and
 not tokenized (e.g. database keys)?  Are they appended (which would mess
 up the database key when retrieved from the Hit)?

 Thanks,
 Gerard

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searcher not aware of index changes

2004-04-21 Thread Stephane James Vaucher
This is not normal behaviour. Normally using a new IndexSearcher should
reflect the modified state of your index. Could you post a more
informative bit of code?

sv

On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote:

 Hi!

 My Searcher instance is not aware of changes to the index. I even create
 a new instance, but it seems only a complete restart helps(?):

 indexSearcher = new IndexSearcher(IndexReader.open(index));

 Timo

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searcher not aware of index changes

2004-04-21 Thread Stephane James Vaucher
Normally the code should work, provided you don't keep references to the
old Searcher (and don't try caching it). Make sure you aren't doing this
by mistake.

For the design of your facade, you could always implement Searchable and
delegate to the up-to-date instance of IndexSearcher.

Quick comment: you should call close() on your searcher before removing
the reference. If this causes exceptions in future searches, it would
indicate incorrect caching.

HTH,
sv

On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote:

 On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote:
  This is not normal behaviour. Normally using a new IndexSearcher should
  reflect the modified state of your index. Could you post a more
  informative bit of code?

 BTW Why can't Lucene care for it itself?


 Well, according to my logging it does create a new instance. I use only
 one instance of SearchFacade:

 public class SearchFacade extends Observable
 {
   protected class IndexObserver implements Observer
   {
   private final Log log = LogFactory.getLog(getClass());

   public Searcher indexSearcher;

   public IndexObserver()
   {
   newSearcher();  // init
   }

   public void update(Observable o, Object arg)
   {
    log.debug("Index has changed, creating new Searcher");
   newSearcher();
   }

   private void newSearcher()
   {
   try
   {
   indexSearcher = new
 IndexSearcher(IndexReader.open(Configuration.LuceneIndex.MAIN));
   }
   catch (IOException e)
   {
    log.error("Could not instantiate searcher: " + e);
   }
   }

   public Searcher getIndexSearcher()
   {
   return indexSearcher;
   }
   }

   private IndexObserver indexObserver;

   public SearchFacade()
   {
   addObserver(indexObserver = new IndexObserver());
   }

   public void createIndex()
   {
   ...
   setChanged();   // index has changed
   notifyObservers();
   }

   public Hits search(String query)
   {
   Searcher searcher = indexObserver.getIndexSearcher();
   }

 }

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: what web crawler work best with Lucene?

2004-04-21 Thread Stephane James Vaucher
How big is the site?

I mostly use an in-house solution, but I've used HttpUnit for web scraping
small sites (because of its high-level API).

Here is a hello world example:
http://wiki.apache.org/jakarta-lucene/HttpUnitExample

For a small/simple site, small modifications to this class could suffice.
IT WILL NOT function on large sites because of memory problems.

For larger sites, there are questions like:

- memory:
For example, spidering all links on every page can lead to visiting far
too many pages, and keeping all visited links in memory can be
problematic.
- noise
If you get every page on your web site, you might be adding noise to the
search engine. Spider navigation rules can help out, like saying that you
should only follow links/index documents of a specific form like
www.mysite.com/news/article.jsp?articleid=xxx

- speed:
Too much speed can be bad: doing 100 hits/sec on a site can hurt it
(especially if you are not the webmaster).
Too little speed can be bad if you want to make sure you quickly pick up
new pages.

- categorisation:
You might want to separate information in your index. For example, you
might want a user to do a search in the documentation section or in the
press release section. This categorisation can be done by specifying
sections to the site, or a subsequent analysis of available docs.

- up-to-date information:
You'll want to think of your update schedule, so that if you add a new
page, it gets indexed quickly. This problem also occurs when you modify an
existing page, you might want the modification to be detected rapidly.

HTH,
sv

On Thu, 22 Apr 2004, Tuan Jean Tee wrote:

 Have anyone implemented any open source web crawler with Lucene? I have
 a dynamic website and are looking at putting in a search tools. Your
 advice is very much appreciated.

 Thank you.


 IMPORTANT -

 This email and any attachments are confidential and may be privileged in
 which case neither is intended to be waived. If you have received this
 message in error, please notify us and remove it from your system. It is
 your responsibility to check any attachments for viruses and defects
 before opening or sending them on. Where applicable, liability is
 limited by the Solicitors Scheme approved under the Professional
 Standards Act 1994 (NSW). Minter Ellison collects personal information
 to provide and market our services. For more information about use,
 disclosure and access, see our privacy policy at www.minterellison.com.


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Using Runtime.exec to extract text [Was: Bridge with OO]

2004-04-20 Thread Stephane James Vaucher
In case you don't know: using Runtime.exec() on Windows, you need to
consume the output streams or the application will block. This is not the
case on Linux.

http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html

In short:
Because some native platforms only provide limited buffer size for 
standard input and output streams, failure to promptly write the input 
stream or read the output stream of the subprocess may cause the 
subprocess to block, and even deadlock.
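
The usual workaround is to drain both pipes on their own threads, roughly
like so (a sketch; cmd stands for whatever converter you invoke, exception
handling elided):

    class StreamGobbler extends Thread {
        private final InputStream in;
        StreamGobbler(InputStream in) { this.in = in; }
        public void run() {
            try {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // discard (or log) the subprocess output
                }
            } catch (IOException ignored) {
            }
        }
    }

    Process p = Runtime.getRuntime().exec(cmd);
    new StreamGobbler(p.getInputStream()).start();
    new StreamGobbler(p.getErrorStream()).start();
    int exitCode = p.waitFor(); // safe: both pipes are being consumed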

HTH,
sv

On Tue, 20 Apr 2004, Argyn wrote:

 I have the same requirement. I used antiword, xlhtml and ppthtml on
 win2k, calling them with Runtime.exec(). There are still problems: all
 three hang up sometimes. Otherwise, it worked. I indexed several hundred
 thousand files in development mode; I never got into production.
 
 Argyn
 
 
 On Mon, 19 Apr 2004 16:53:41 -0400 (EDT), Stephane James Vaucher 
 [EMAIL PROTECTED] wrote:
 
  Actually, the objective would be to use OO to extract text from MSOffice
  formats. If I read your code correctly, your code should only work with 
  OO
  as the docs are in xml.
 
  Thanks for the code for OO docs through,
  sv
 
  On Mon, 19 Apr 2004, Mario Ivankovits wrote:
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bridge with OpenOffice

2004-04-19 Thread Stephane James Vaucher
I'll make a copy of the code available on the wiki before it disappears 
off the Web.

Now for some info on using OO on a production system:

http://www.oooforum.org/forum/viewtopic.php?t=2913highlight=jurt

<summary src="Web, not my experience">OO works well (but is slow), but is
not multi-threaded (the communication bridge is).</summary>

Quotes from end of 2003:

Kai Sommerfeld from Sun wrote:

Quote:
The answer is not a simple 'yes' or 'no'. It's more: 'partly'. There
are parts of the OOo API that are threadsafe, others are not. Newer
components are generally threadsafe. Components thare are mainly
wrappers for old Office code are mostly not. A main problem is that we
cannot state for sure which components are actually thread safe and
which are not. It's as worse as I say it here.

We're trying to solve the multithreading issues for one of the next
major releases of OOo. But this is definitely not an easy task,
especially, since rewriting all non-threadaware code is simply not an
option because of missing developer resources.

Juergen Schmidt from Sun wrote:

Quote:
If you want to use OO in a safe way you shouldn't use it
multi-threaded. But we want to improve the server functionality of OO in
general so that your described scenario should be possible.

Sorry, but currently you have to workaround this in your own application
and you should use OO single threaded. But as i said we are working on
this feature.

Niklas Nebel from Sun, who seems to have had success with some code
running multithreaded, wrote:

Quote:
The document API functions use the SolarMutex, so you should be able to
use them from multiple threads without problems (with one call blocking
the next, of course). Listener callbacks might be a problem if handled
by different threads, but at least for the spreadsheet API I'm not aware
of any other problems.

Don't forget that every API call over a connection to a running office
is multi-threaded, as the connection is handled by a different thread
from office user interactions.

sv

On Mon, 19 Apr 2004, Magnus Johansson wrote:

 Yes I have tried it and it seems to work ok.
 I haven't really used it in a production environment
 however.
 
 There was some code here
 
 http://www.gzlinux.org/docs/category/dev/java/doc2txt.pdf
 
 it is, however, not there anymore; Google's HTML version is still
 available at
 
 http://66.102.9.104/search?q=cache:549doYEZTD4J:www.gzlinux.org/docs/category/dev/java/doc2txt.pdf+Appending+the+favoured+extension+to+the+origin+document+namehl=enie=UTF-8
 
 
 /magnus
 
 
  Anyone try what Joerg suggested here?
 
  http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6231
 
  sv
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bridge with OpenOffice

2004-04-19 Thread Stephane James Vaucher
Actually, the objective would be to use OO to extract text from MSOffice 
formats. If I read your code correctly, your code should only work with OO 
as the docs are in xml. 

Thanks for the code for OO docs through,
sv

On Mon, 19 Apr 2004, Mario Ivankovits wrote:

 Stephane James Vaucher wrote:
 
  Anyone try what Joerg suggested here?
  http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6231 
 
   
 
 Dont know what you would like to do, but if you simply would like to 
 extract text, you could simply try this sniplet:
 
 ---snip---
 JarFile jar = new JarFile(file, false);
 ZipEntry entry = jar.getEntry("content.xml");
 if (entry == null)
 {
     throw new IOException("content.xml missing in file: " + file);
 }
 InputStream is = jar.getInputStream(entry);

 XMLReader xr =
     XMLReaderFactory.createXMLReader("org.apache.crimson.parser.XMLReaderImpl");

 xr.setEntityResolver(new EntityResolver()
 {
     public InputSource resolveEntity(String publicId, String systemId)
         throws SAXException, IOException
     {
         if (systemId.toLowerCase().endsWith(".dtd"))
         {
             StringReader stringInput = new StringReader(" ");
             return new InputSource(stringInput);
         }
         else
         {
             return null;
         }
     }
 });

 final StringBuffer sbText = new StringBuffer(10240);
 xr.setContentHandler(new ContentHandler()
 {
     public void skippedEntity(String name) throws SAXException
     {
     }

     public void setDocumentLocator(Locator locator)
     {
     }

     public void ignorableWhitespace(char ch[], int start, int length)
         throws SAXException
     {
     }

     public void processingInstruction(String target, String data)
         throws SAXException
     {
     }

     public void startDocument() throws SAXException
     {
     }

     public void startElement(String namespaceURI, String localName,
         String qName, Attributes atts) throws SAXException
     {
         if (qName.equals("text:p"))
         {
             if (sbText.length() > 0 && sbText.charAt(sbText.length() - 1) != '\n')
             {
                 sbText.append('\n');
             }
         }
     }

     public void endPrefixMapping(String prefix) throws SAXException
     {
     }

     public void characters(char ch[], int start, int length)
         throws SAXException
     {
         sbText.append(ch, start, length);
     }

     public void endElement(String namespaceURI, String localName,
         String qName) throws SAXException
     {
     }

     public void endDocument() throws SAXException
     {
     }

     public void startPrefixMapping(String prefix, String uri)
         throws SAXException
     {
     }
 });

 InputSource source = new InputSource(is);
 source.setPublicId("");
 source.setSystemId("");
 xr.parse(source);

 System.err.println("TXT: " + sbText.toString());
 ---snip---
 
 Ciao,
 Mario
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Bad link contrib page

2004-04-16 Thread Stephane James Vaucher
Since the highlighter has been moved to the Sandbox, someone should remove
the term highlighter reference:

http://jakarta.apache.org/lucene/docs/contributions.html
Miscellaneous -> Term Highlighter

cheers,
sv


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Bridge with OpenOffice

2004-04-16 Thread Stephane James Vaucher
Anyone try what Joerg suggested here?

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6231

sv


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Presentation in Mtl

2004-04-15 Thread Stephane James Vaucher
Erik,

I'll add a wiki page tomorrow for Lucene training and presentations, 
unless you beat me to it.

sv
BTW, I visited nofluffjuststuff.com, and I've noticed that highlighting is 
misspelled again ;) Apologies if I'm poking fun at an Americanism.


On Thu, 15 Apr 2004, Erik Hatcher wrote:

 This is great to see!  I've been presenting Lucene at the No Fluff, 
 Just Stuff symposiums for a while and really enjoy doing so.  I also 
 presented it last year at the O'Reilly Open Source Conference with the 
 toughest attendee possible, Doug Cutting himself.
 
 I'm continuing my Lucene presentations this year still on the NFJS 
 tour, and also I will be presenting it at JavaOne.  The JavaOne 
 presentation is only one hour though, so it will be a very quick (yet 
 techie) pass through what Lucene offers.
 
 Let's create a wiki page that lists all the venues for Lucene 
 presentations and training.  If you are in the US and near a city 
 listed here http://www.nofluffjuststuff.com come on out!
 
   Erik
 
 On Apr 15, 2004, at 1:06 AM, Matt Quail wrote:
 
  I too gave a Lucene presentation to my local JUG (Canberra, Australia)
  last night.
 
  It also went over very well. Lucene totally rocks!
 
  =Matt
 
  Stephane James Vaucher wrote:
 
  Hi everyone,
  I did a presentation tonight in Montreal at a Java users' group
  meeting. I've got to say that there were maybe 4 companies present
  that use Lucene and find it very useful and simple to use. It led to
  the longest (and most positive) discussion I've had at the users'
  group.
  So I've got to tell the Lucene contributors GOOD JOB!
  I'll probably upload my ppt presentation (heavily based on existing
  tutorials) to the wiki, so you can comment it.
  cheers,
  sv
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Presentation in Mtl

2004-04-14 Thread Stephane James Vaucher
Hi everyone,

I did a presentation tonight in Montreal at a Java users' group meeting.
I've got to say that there were maybe 4 companies present that use Lucene
and find it very useful and simple to use. It led to the longest (and most
positive) discussion I've had at the users' group.

So I've got to tell the Lucene contributors GOOD JOB!

I'll probably upload my ppt presentation (heavily based on existing
tutorials) to the wiki, so you can comment it.

cheers,
sv


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Presentation in Mtl

2004-04-14 Thread Stephane James Vaucher
Wow, discussing Lucene in French for 2 1/2 hours has affected my English.
Please ignore the spelling mistakes ;), but don't ignore the spirit of the
message.

sv

On Thu, 15 Apr 2004, Stephane James Vaucher wrote:

 Hi everyone,

 I did a presentation tonight in Montreal at a Java users' group meeting.
 I've got to say that there were maybe 4 companies present that use Lucene
 and find it very useful and simple to use. It led to the longest (and
 most positive) discussion I've had at the users' group.

 So I've got to tell the Lucene contributors GOOD JOB!

 I'll probably upload my ppt presentation (heavily based on existing
 tutorials) to the wiki, so you can comment it.

 cheers,
 sv


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-13 Thread Stephane James Vaucher
I'm actually pretty lazy about index updates and haven't had the need for
efficiency, since my requirement is that new documents should be
available on a next-working-day basis.

I reindex everything from scratch every night (400,000 docs) and store it
in a timestamped index. When the reindexing is done, I alert a controller
about the new active index. I keep a few versions of the index in case of
a failure somewhere, and I can always send a message to the controller to
use an old index.

cheers,
sv

On Tue, 13 Apr 2004, petite_abeille wrote:

 
 On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:
 
  He mentioned that I might be able to squeeze 5-10% out of index merges 
  this way.
 
 Talking of which... what strategy(ies) do people use to minimize 
 downtime when updating an index?
 
 My current strategy is as follow:
 
 (1) use a temporary RAMDirectory for ongoing updates.
 (2) perform a copy on write when flushing the RAMDirectory into the 
 persistent index.
 
 The second step means that I create an offline copy of a live index 
 before invoking addIndexes() and then substitute the old index with the 
 new, updated one. While this effectively increases the time it takes to 
 update an index, it nonetheless reduces the *perceived* downtime.
 
 Thoughts? Alternative strategies?
 
 TIA.
 
 Cheers,
 
 PA.
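
For the record, PA's two steps look roughly like this (a sketch assuming
the usual org.apache.lucene imports; the path is made up, and the offline
copy/swap of the live index is elided):

// (1) Buffer ongoing updates in a RAMDirectory.
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
// ... ramWriter.addDocument(doc) for each incoming update ...
ramWriter.close();

// (2) Flush the buffered documents into an offline copy of the persistent
// index, then substitute the copy for the live one.
IndexWriter fsWriter = new IndexWriter("/indexes/index-copy",
                                       new StandardAnalyzer(), false);
fsWriter.addIndexes(new Directory[] { ramDir });
fsWriter.close();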
 
 
 
 



Simple spider demo

2004-04-13 Thread Stephane James Vaucher
I'm wondering if there is interest in a simple spider demo.

I've got an example of how to use HttpUnit to spider a web site and
index it on disk (HTML pages only for now). I can send it to the list
if anyone is interested (it's one class, < 200 LOC).

cheers,
sv






Re: ANN: Docco 0.3

2004-04-13 Thread Stephane James Vaucher
Looks cool, but I've got a question:

How do you handle symlinks on *nix? I think it gets stuck in a loop.

When indexing my home dir, I see it indexing: 
/home/vauchers/.Cirano-gnome/.gnome-desktop/Home directory/.Cirano-gnome/...

cheers,
sv

On Wed, 14 Apr 2004, Peter Becker wrote:

 Hello,
 
 we released Docco 0.3 along with two updates for its plugins.
 
 Docco is a personal document retrieval tool based on Apache's Lucene 
 indexing engine and Formal Concept Analysis. It allows you to create an 
 index for files on your file system which you can then search for 
 keywords. It can index plain text, HTML, XML and OpenOffice files and, 
 with the support of plugins, other formats like PDF, DOC and XLS.
 
 This new version of Docco features a number of small enhancements: the 
 diagram layout can be changed, printing and graphic export options have 
 been added and some plugins have been updated.
 
 The new POI plugin should be able to index MS Word documents again (the 
 old one broke with recent Java versions), and the PDFbox plugin gets all 
 the recent updates from the PDFbox project. Old plugins will still 
 continue to work, though.
 
 You can find the updated files here:
  http://sourceforge.net/project/showfiles.php?group_id=21448
 
 Note that you can now also use the export plugins to add more graphic 
 export options.
 
 Enjoy!
   Peter
 
 



Re: Simple spider demo

2004-04-13 Thread Stephane James Vaucher
I've uploaded it to the wiki:

http://wiki.apache.org/jakarta-lucene/HttpUnitExample

<disclaimer>
It's not anywhere close to production quality, especially since it's based 
on a unit test framework.
</disclaimer>

sv

On Tue, 13 Apr 2004, Stephane James Vaucher wrote:

 I'm wondering if there is interest in a simple spider demo.
 
 I've got an example of how to use HttpUnit to spider a web site and 
 index it on disk (HTML pages only for now). I can send it to the list 
 if anyone is interested (it's one class, < 200 LOC).
 
 cheers,
 sv
 
 
 



Re: suitability of lucene for project

2004-04-12 Thread Stephane James Vaucher
It could be part of your solution, but I don't think it can do all of it. 
Let me explain:

I've done something similar to what you describe a few times. I often use 
HttpUnit to get the information. How you process it is up to you: if you 
want it to be indexed (searchable), you can use Lucene; if you want to 
extract structured (or semi-structured) information, use wrapper-induction 
techniques (not Lucene).

cheers,
sv

On 13 Apr 2004, Sebastian Ho wrote:

 hi all
 
 i am investigating technologies to use for a project which basically
 retrieves html pages on a regular basis (or whenever there are changes)
 and allows html parsing to extract specific information, presenting
 them as links in a webpage. Note that this is not a general search
 engine kind of project; we are extracting clinical information from
 various websites and consolidating them.
 
 Please advise me whether Lucene can do the above; in areas where it
 cannot, suggestions for solutions will be appreciated.
 
 Thanks
 
 Sebastian Ho
 Bioinformatics Institute
 
 



Highlight package

2004-04-06 Thread Stephane James Vaucher
Hello all,

The link to Mark Harwood's highlight package is down; does anyone 
have any idea where his package might be available? 

cheers,
sv





Re: Numeric field data

2004-04-04 Thread Stephane James Vaucher
Added to the wiki; it can of course be removed if it's transferred to the
FAQs.

sv

On Sun, 4 Apr 2004, Kevin A. Burton wrote:

 Stephane James Vaucher wrote:

 Hi Tate,
 
 There is a solution by Erik that pads numbers in the index. That would
 allow you to search correctly. I'm not sure about decimals, but you could
 always add a multiplier.
 
 
 Wonder if that should go in the FAQ... wiki...







Re: Numeric field data

2004-04-02 Thread Stephane James Vaucher
Hi Tate,

There is a solution by Erik that pads numbers in the index. That would 
allow you to search correctly. I'm not sure about decimals, but you could 
always add a multiplier.
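
For what it's worth, a sketch of the padding trick (the width of 10 and
the multiplier of 100 are arbitrary choices, and negative numbers would
need extra care since they break the lexicographic order):

// Scale 1.5 -> 150, then left-pad with zeros to a fixed width.
public static String pad(double value) {
    String s = Long.toString(Math.round(value * 100));
    StringBuffer buf = new StringBuffer();
    while (buf.length() + s.length() < 10) buf.append('0');
    return buf.append(s).toString();
}

// doc.add(Field.Keyword("number", pad(1.5)));  // indexes "0000000150"
// number:[0000000100 TO 0000000200] then matches it as expected.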

HTH,
sv

On Fri, 2 Apr 2004, Tate Avery wrote:

 Hello,
 
 Is there a way (direct or indirect) to support a field with numeric data?
 
 More specifically, I would be interested in doing a range search on numeric
 data and having something like:
 
   number:[1 TO 2]
 
 ... and not have it return 11 or 103, etc., but return 1.5, for example.
 
 Is there any support in current and/or upcoming versions for this type of
 thing?
 Or, has anyone figured out a creative workaround to obtain the desired
 result?
 
 
 Thank you for any comments,
 Tate
 
 p.s.  Ideally, I would be able to do equal, greater than, less than and
 these in combination with each other (i.e. ranges, greater than or equal to,
 etc.).
 
 



RE: Nested category strategy

2004-04-01 Thread Stephane James Vaucher
Another possibility is to add all combinations in a single field.

addField("category", "/Science/");
addField("category", "/Science/Medicine");
addField("category", "/Science/Foo");
addField("category", "/Biology");

Your wildcard search should work, and you shouldn't have the problem with
a search "/Science/*".
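
In concrete Lucene terms, a sketch of the same idea (the field name is
made up):

// Index every ancestor path as an untokenized keyword.
Document doc = new Document();
doc.add(Field.Keyword("category", "/Science/Medicine/Serology/blood gas"));
doc.add(Field.Keyword("category", "/Science/Medicine/Serology/"));
doc.add(Field.Keyword("category", "/Science/Medicine/"));
doc.add(Field.Keyword("category", "/Science/"));

// A category search then becomes a simple prefix query:
Query q = new PrefixQuery(new Term("category", "/Science/Medicine/"));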

HTH,
sv

On Thu, 1 Apr 2004, Tate Avery wrote:


 Could you put them all into a tab-delimited string and store that as a
 single field, then use a TabTokenizer on the field to search?

 And, if you need to, do a .split("\t") on the field value in order to break
 them back up into individual categories.




 -Original Message-
 From: David Black [mailto:[EMAIL PROTECTED]
 Sent: Thursday, April 01, 2004 2:49 PM
 To: [EMAIL PROTECTED]
 Subject: Nested category strategy


 Hey All,

 I'm trying to figure out the best approach to something.

 Each document I index has an array of categories which looks like the
 following example

 /Science/Medicine/Serology/blood gas
 /Biology/Fluids/Blood/

 etc.

 Anyway, there's a couple things I'm trying to deal with.

 1. The fact that we have an undefined array size.  I can't just shove
 these into a single field.  I could explode them into multiple fields
 on the fly, like category_1, category_2, etc.

 2. The fact that a search like "category:/Science/Medicine/*" would
 need to return all items within that category.

 Thanks in advance to anyone who can give me some help here.

 Thanks





Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Stephane James Vaucher
I agree with you that a highlight package should be available directly 
from the Lucene website. For such a much-desired feature, having a 
dependency on a personal web site seems a little weird to me. Bringing it 
in would also mean the community supports this functionality, which seems 
appropriate.

cheers,
sv

On Tue, 30 Mar 2004, Kevin A. Burton wrote:

 I'm playing with this package:
 
 http://home.clara.net/markharwood/lucene/highlight.htm
 
 Trying to do hit highlighting.  This implementation uses another 
 Analyzer to find the positions for the result terms. 
 
 This seems very inefficient, since Lucene already knows the 
 frequency and position of given terms in the index.
 
 My question is whether it's hard to find a TermPosition for a given term 
 in a given document rather than the whole index.
 
 IndexReader.termPositions(Term term) is term-specific, not 
 term-and-document-specific.
 
 Also, it seems that after all this time Lucene should have efficient 
 hit highlighting as a standard package.  Is there any interest in seeing 
 a contribution in the sandbox for this if it uses the index positions?
 
 





Re: Demoting results

2004-03-29 Thread Stephane James Vaucher
Mark,

Thanks for the update. Since I contributed the page, I was going to modify
it myself (I don't want to force work on others).

sv

On Mon, 29 Mar 2004 [EMAIL PROTECTED] wrote:

 Hi Doug,
 Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally 
 useful than my
 implementation :-)
 Unless anyone has a particularly good reason I'll remove the link to my code that 
 Stephane put on the Wiki contributions page.
 I definitely find BoostingQuery very useful and would be happy to see it in Lucene 
 core, but I'm not sure it's popular
 enough to warrant adding special support to the query parser.

 BTW, I've had a thought about your suggestion for making the highlighter use some 
 form of RAM index of sentence fragments
 and then querying it to get the best fragments. This is nice in theory but could 
 fail to find anything if the query is of these forms:
 a AND b
 "a b"
 When the code that breaks a doc into sentence docs splits co-occurring "a" and "b" 
 terms into separate docs,
 this would produce no match. I don't think there's an easy way round that, so I'll 
 stick to the current approach of scoring
 fragments simply based on terms found in the query.


 Cheers
 Mark




Re: Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Stephane James Vaucher
I've added some information from this thread to the wiki.

http://wiki.apache.org/jakarta-lucene/DateRangeQueries

If you wish to add more information, go right ahead, but since I added
this info, I believe it's ultimately my responsibility to maintain it.

sv

On Mon, 29 Mar 2004, Kevin A. Burton wrote:

 Erik Hatcher wrote:

 
  One more point... caching is done by the IndexReader used for the
  search, so you will need to keep that instance (i.e. the
  IndexSearcher) around to benefit from the caching.
 
 Great... Damn... looked at the source of CachingWrapperFilter and it
 makes sense.  Thanks for the pointer.  The results were pretty amazing.
 Here are the results before and after. Times are in millis:

 Before caching the Field:

 Searching for Jakarta:
 2238
 1910
 1899
 1901
 1904
 1906

 After caching the field:
 2253
 10
 6
 8
 6
 6

 That's a HUGE difference :)

 I'm very happy :)
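
For anyone reading along, the pattern being discussed is roughly this
(a sketch; the "date" field name and the start/end dates are assumptions):

// Build the filter once and wrap it; the cache is keyed per IndexReader,
// so keep reusing the same IndexSearcher to benefit from it.
Filter dateFilter = new DateFilter("date", startDate, endDate);
Filter cachedFilter = new CachingWrapperFilter(dateFilter);
Hits hits = searcher.search(query, cachedFilter);

The first search pays the cost of building the filter's BitSet; later
searches against the same reader reuse it, hence the numbers above.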







Javadocs lucene 1.4

2004-03-29 Thread Stephane James Vaucher
Are the javadocs available on the site?

I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery)
somewhere on the lucene website. I've subscribed to the users mailing
list, but I've never got a feel for the new version. Is there any way
for this to happen, or should I await 1.4-rc1?

cheers,
sv





Re: Demoting results

2004-03-28 Thread Stephane James Vaucher
Mark,

I've added a section in the wiki called:

http://wiki.apache.org/jakarta-lucene/CommunityContributions

and have added an entry for your message. If you want to edit the
message, go for it. I believe that the wiki can support attached files if
you want to upload it there.

cheers,
sv

On Sun, 28 Mar 2004 [EMAIL PROTECTED] wrote:

 I've found an elegant way of doing this now for all types of search - a
 new NegatingQuery class that takes any Query object in its constructor,
 selects all documents that DON'T match, and gives them a user-definable boost.

 The code is here:
 http://www.inperspective.com/lucene/demote.zip

 Cheers
 Mark





Re: Lucene 1.4 - lobby for final release

2004-03-26 Thread Stephane James Vaucher
I'm personally a fan of a "release small but often" approach, but what are 
the new features available in 1.4 (a list would be nice, on the wiki 
perhaps)? Will there be interim builds available to try these new features 
out soon?

There seem to be no nightly builds on:

http://cvs.apache.org/builds/jakarta-lucene/nightly/

cheers,
sv

On Fri, 26 Mar 2004, Chad Small wrote:

 thanks Erik.  Ok this is my official lobby effort for the release of 1.4 to final 
 status.  Anyone else need/want a 1.4 release?
  
 Does anyone have any information on 1.4 release plans?
  
 thanks,
 chad.
 
   -Original Message- 
   From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
   Sent: Fri 3/26/2004 1:25 PM 
   To: Lucene Users List 
   Cc: 
   Subject: Re: too many files open error
   
   
 
   On Mar 26, 2004, at 1:33 PM, Chad Small wrote:
Is this :) serious?
   
   This is open-source.   I'm only as serious as it would take for someone
   to push it through.  I don't know what the timeline is, although lots
   of new features are available.
   
Because we have a need/interest in the new field sorting capabilities
and QueryParser keyword handling of dashes (-) that would be in 1.4,
I believe.  It's so much easier to explain that we'll use a final
release of Lucene instead of a dev build Lucene.
   
   Why explain it?!  Just show great results and let that be the
   explanation :)
   
   
If so, what would an expected release date be?
   
   *shrug* - feel free to lobby for it.  I don't know what else is planned
   before a release.
   
   Erik
   
   



Re: Lucene 1.4 - lobby for final release

2004-03-26 Thread Stephane James Vaucher
I hope nobody minds, but I've added a link on the wiki to the head of 
CHANGES.txt. I'm not sure if anyone is maintaining the wiki; if not, I 
can take a look at it. I could maybe rearrange things to look like this 
sample site:
http://wiki.apache.org/avalon

Any comments? I'll probably just go ahead and do it and await criticism ;) 

cheers,
sv

On Fri, 26 Mar 2004, Erik Hatcher wrote:

 On Mar 26, 2004, at 3:32 PM, Stephane James Vaucher wrote:
  I'm personally a fan of a "release small but often" approach, but what 
  are the new features available in 1.4 (a list would be nice, on the wiki
  perhaps)? Will there be interim builds available to try these new 
  features out soon?
 
 There is a CHANGES.txt in the root of the jakarta-lucene CVS repository 
 that stays pretty much current and accurate.  I'm pasting it below for 
 the 1.3 -> CVS HEAD changes.
 
 
  There seem to be no nightly builds on:
 
  http://cvs.apache.org/builds/jakarta-lucene/nightly/
 
 
 I guess at this time you will have to build it yourself from CVS.  
 There is one show-stopper before we can release an RC1.  We must fully 
 convert to ASL 2.0 (meaning every single source file needs the license 
 header as well as any other files that can be tagged with it).  I know 
 Otis has changed some files, but we need a full sweep.  There have been 
 some utilities posted in a committers area to facilitate this change 
 more automatically if we want to use them.
 
   Erik
 
 excerpt from CHANGES.txt
 
 1.4 RC1
 
   1. Changed the format of the .tis file, so that:
 
  - it has a format version number, which makes it easier to
back-compatibly change file formats in the future.
 
  - the term count is now stored as a long.  This was the one aspect
   of Lucene's file formats which limited index size.
 
  - a few internal index parameters are now stored in the index, so
that they can (in theory) now be changed from index to index,
although there is not yet an API to do so.
 
  These changes are back compatible.  The new code can read old
  indexes.  But old code will not be able to read new indexes. (cutting)
 
   2. Added an optimized implementation of TermDocs.skipTo().  A skip
  table is now stored for each term in the .frq file.  This only
  adds a percent or two to overall index size, but can substantially
  speedup many searches.  (cutting)
 
   3. Restructured the Scorer API and all Scorer implementations to take
  advantage of an optimized TermDocs.skipTo() implementation.  In
  particular, PhraseQuerys and conjunctive BooleanQuerys are
  faster when one clause has substantially fewer matches than the
  others.  (A conjunctive BooleanQuery is a BooleanQuery where all
  clauses are required.)  (cutting)
 
   4. Added new class ParallelMultiSearcher.  Combined with
  RemoteSearchable this makes it easy to implement distributed
  search systems.  (Jean-Francois Halleux via cutting)
 
   5. Added support for hit sorting.  Results may now be sorted by any
  indexed field.  For details see the javadoc for
  Searcher#search(Query, Sort).  (Tim Jones via Cutting)
 
   6. Changed FSDirectory to auto-create a full directory tree that it
  needs by using mkdirs() instead of mkdir().  (Mladen Turk via Otis)
 
   7. Added a new span-based query API.  This implements, among other
  things, nested phrases.  See javadocs for details.  (Doug Cutting)
 
   8. Added new method Query.getSimilarity(Searcher), and changed
  scorers to use it.  This permits one to subclass a Query class so
  that it can specify it's own Similarity implementation, perhaps
  one that delegates through that of the Searcher.  (Julien Nioche
  via Cutting)
 
   9. Added MultiReader, an IndexReader that combines multiple other
  IndexReaders.  (Cutting)
 
 10. Added support for term vectors.  See Field#isTermVectorStored().
   (Grant Ingersoll, Cutting & Dmitry)
 
 11. Fixed the old bug with escaping of special characters in query
  strings: http://issues.apache.org/bugzilla/show_bug.cgi?id=24665
  (Jean-Francois Halleux via Otis)
 
 12. Added support for overriding default values for the following,
  using system properties:
- default commit lock timeout
- default maxFieldLength
- default maxMergeDocs
- default mergeFactor
- default minMergeDocs
- default write lock timeout
  (Otis)
 
 13. Changed QueryParser.jj to allow '-' and '+' within tokens:
  http://issues.apache.org/bugzilla/show_bug.cgi?id=27491
  (Morus Walter via Otis)
 
 14. Changed so that the compound index format is used by default.
  This makes indexing a bit slower, but vastly reduces the chances
  of file handle problems.  (Cutting)
 
 

Re: Documentation and presentations

2004-03-26 Thread Stephane James Vaucher
Erik, maybe Otis and yourself should slow down on development. You 
wouldn't want your book to discuss lucene-1.3 if you release a version 1.5 
before it hits the stores... unless that's your master plan ;)

sv

On Fri, 26 Mar 2004, Erik Hatcher wrote:

 So far so good, Stephane, on the wiki changes - looks good!
 
 As for our book - at this point, early summer seems like when it'll 
 actually be on the shelves.  By the end of April we should have mostly 
 everything complete, reviewed, and entirely in the publisher's hands.  
 *ugh* - this process takes much longer than even exaggerated estimates.
 
   Erik
 
 
 On Mar 26, 2004, at 6:00 PM, Stephane James Vaucher wrote:
 
  Hello lucene community,
 
  I'll be presenting Lucene at the GUJM (Java Users Group of Montreal) in
  mid-April; could you send me references, articles, and presentations not
  readily available on the Lucene site (at
  http://jakarta.apache.org/lucene/docs/resources.html)?
 
  Otis or Erik, I'll mention that you have written a book on lucene. When
  will it be out?
 
  I'll also see if I can rearrange the wiki using the information you 
  send
  me, and I'll contribute my presentation (in french).
 
  cheers,
  sv
 
 



Wiki and news

2004-03-26 Thread Stephane James Vaucher
On the wiki, I've looked up some references for lucene community releases
to put under News (http://wiki.apache.org/jakarta-lucene/LatestNews); if
I've missed some, you can modify the page yourself (it's a wiki after 
all).

sv





Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-14 Thread Stephane James Vaucher
Just found the rest of the thread. I'll shut up now ;)

sv

On Sun, 14 Mar 2004, Stephane James Vaucher wrote:

 Back from a week's vacation, so this reply is a little late, maybe out of
 order as well ;). Comment inline:

 On Tue, 9 Mar 2004, Kevin A. Burton wrote:

  Doug Cutting wrote:
 
   Erik Hatcher wrote:
  
   Well, one issue you didn't consider is changing a public method
   signature.  I will make this change, but leave the Hashtable
   signature method there.  I suppose we could change the signature to
   use a Map instead, but I believe there are some issues with doing
   something like this if you do not recompile your own source code
   against a new Lucene JAR so I will simply provide another
   signature too.
  
  
   This would no longer compile with the change Kevin proposes.
  
   To make things back-compatible we must:
  
    1. Keep but deprecate StopFilter(Hashtable) constructor;
   2. Keep but deprecate StopFilter.makeStopTable(String[]);
   3. Add a new constructor: StopFilter(HashMap);
   4. Add a new method: StopFilter.makeStopMap(String[]);

 Why impose implementation details in the constructor? Shouldn't the
 constructor use a Map (not a HashMap), a Set, or a String array?

 sv

  
   Does that make sense?
  
  This patch and attachment take care of this problem...
 
  It does make this class more complex than it needs to be... but 1/2 of
  the methods are deprecated.
 
  Kevin
 
 





Re: Storing numbers

2004-03-05 Thread Stephane James Vaucher
Weird idea: how about transforming your long into a Date and using a
DateFilter to run a range query?
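
i.e. something like this (a sketch; it assumes your longs fit in the
millisecond range that DateField encodes):

// At index time: encode the long the same way Lucene encodes dates.
doc.add(Field.Keyword("id", DateField.timeToString(value)));

// At search time: filter on the encoded range.
Filter f = new DateFilter("id", new Date(lowValue), new Date(highValue));
Hits hits = searcher.search(query, f);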

sv

On Fri, 5 Mar 2004, Erik Hatcher wrote:

 Terms in Lucene are text.  If you want to deal with number ranges, you
 need to pad them.

 0001 for example.  Be sure all numbers have the same width
 and are zero-padded.

 Lucene uses lexicographic ordering, so you must be sure things collate
 in this way.

   Erik

 On Mar 5, 2004, at 11:46 AM, [EMAIL PROTECTED] wrote:

  On Friday 05 March 2004 15:42, Otis Gospodnetic wrote:
  Try with Field.Keyword.
 
  Ok, works.
 
  Another problem: Range searches don't work.
 
  id:(1 TO 1069421083284)
 
  does return only 1 hit - 1069421083284.
 



Re: Storing numbers

2004-03-05 Thread Stephane James Vaucher
On Fri, 5 Mar 2004 [EMAIL PROTECTED] wrote:

 On Friday 05 March 2004 18:01, Erik Hatcher wrote:
  0001 for example.  Be sure all numbers have the same width
  and zero padded.

 And what about a range like 100 TO 1000?

You mean "0100 TO 1000" or "0000100 TO 0001000" ;)

sv





Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-03 Thread Stephane James Vaucher
As I've stated in my earlier mail, I like this change. More importantly, 
could this become a standard way of changing configurations at runtime? 
For example, the default merge factor could also be set in this manner.

sv

On Wed, 3 Mar 2004, Michael Duval wrote:

 
 I agree with both the property name change and also making it static.
 
 Mike
 
 Doug Cutting wrote:
 
  Michael Duval wrote:
I've hacked the code for the time being by updating FSDirectory and
 
  replaced all System.getProperty("java.io.tmpdir")
  calls with a call to a new method getLockDir().   This method 
  checks for a "lucene.lockdir" prop before the
  "java.io.tmpdir" prop, giving the end user a bit more flexibility in 
  where locks are stored.
 
 
  In general, I support this change.
 
  Here is the method:
 
   /** Allow flexible locking directories - Michael R. Duval 3/02/04 */
   private String getLockDir() {
     String lockDir;
 
     if ((lockDir = System.getProperty("lucene.lockdir")) == null)
       return System.getProperty("java.io.tmpdir");
     else
       return lockDir;
   }
 
 
  In particular, I have some quibbles.  The property should be named 
  something like "org.apache.lucene.lockdir", not just "lucene.lockdir". 
  And there's no reason to look it up each time: it can just be a static.
 
  private static final String LOCK_DIR =
    System.getProperty("org.apache.lucene.lockdir",
                       System.getProperty("java.io.tmpdir"));
 
  Doug
 



Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-03 Thread Stephane James Vaucher
How about (looking big rather than small):

- MaxClause from BooleanQuery (I know there have been discussions on 
the dev list, but I haven't been following them)
- default commit_lock_name
- default commit_lock_timeout
- default maxFieldLength
- default maxMergeDocs
- default mergeFactor
- default minMergeDocs
- default write_lock_name
- default write_lock_timeout

I'm currently configuring parts of my app using sys properties, 
particularly the mergeFactor, because my prod system has 2GB of RAM and is 
Windows-based while my dev machine has 256MB and is Linux. If no one takes a 
crack at this, I'll see what I can do in 2 weeks, after my vacation.

Cheers,
sv

On Wed, 3 Mar 2004, Doug Cutting wrote:

 Stephane James Vaucher wrote:
  As I've stated in my earlier mail, I like this change. More importantly, 
  could this become a standard way of changing configurations at runtime? 
  For example, the default merge factor could also be set in this manner.
 
 Sure, that's reasonable, so this would be something like:
 
 private static final int DEFAULT_MERGE_FACTOR =
   Integer.parseInt(System.getProperty("org.apache.lucene.mergeFactor", "10"));
 
 In IndexWriter.java.
 
 What other candidates are there for this treatment?
 
 Doug
 



Re: java.io.tmpdir as lock dir .... once again

2004-03-02 Thread Stephane James Vaucher
I've done something similar to configure my merge factor (but it was
outside my code), and am planning on setting the limit on boolean queries
this way as well. I think it's pretty clean especially if you use
org.apache.lucene.xxx properties with decent default values.

Adding this feature could also be an occasion to better document the
hazards of using an index lock in a distributed system, considering many
people want to know the implications of running lucene in a web (and
potentially replicated) env.
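
e.g. something along these lines (a sketch; the property name is
illustrative, nothing reads it today, and it assumes the 1.4-style
BooleanQuery.setMaxClauseCount):

// Read an optional system property with a sane default, then apply it.
int maxClauses =
    Integer.getInteger("org.apache.lucene.maxClauseCount", 1024).intValue();
BooleanQuery.setMaxClauseCount(maxClauses);

You'd then launch with -Dorg.apache.lucene.maxClauseCount=4096 and touch
no code.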

my 2c,
sv

On Tue, 2 Mar 2004, Otis Gospodnetic wrote:

 This looks nice.
 However, what happens if you have two Java processes that work on the
 same index, and give it different lock directories?
 They'll mess up the index.

If you sell people coffee, they can always burn themselves. Might as well
warn them.

 Should we try to prevent this by not offering this option, or should we
 offer it, document it well, and leave it up to the user to play by the
 rules or not?

 I'm leaning towards the latter, but I think some Lucene developers
 would be more conservative.

 Otis


 --- Michael Duval [EMAIL PROTECTED] wrote:
 
  Hello All,
 
  I've come across my first gotcha with the system property
  "java.io.tmpdir" as the lock directory.
 
  Over here at APS we run lucene in two different servlet containers on
 
  two different servers for both performance
  and security reasons.  One container gives read access to the
  collection
  and the other is constantly updating the collection.
  The collection is NFS mounted from both servers.   This worked fine
  until the lucene update 1.3.   Now the lock files are being
  written to the temp dir's in each of the respective containers root
  dir's.   This of course breaks the locking scheme.
 
  I could have changed the tmpdir prop to write files back into the
  collection directory, but this would also pollute
  the tmpdir with other unrelated files.  My solution was as follows:
 
  I've hacked the code for the time being by updating FSDirectory and
  replaced all System.getProperty("java.io.tmpdir")
  calls with a call to a new method getLockDir().   This method checks
  for a "lucene.lockdir" prop before the
  "java.io.tmpdir" prop, giving the end user a bit more flexibility in
  where locks are stored.
 
  Here is the method:
 
   /** Allow flexible locking directories - Michael R. Duval 3/02/04 */
   private String getLockDir() {
     String lockDir;
 
     if ((lockDir = System.getProperty("lucene.lockdir")) == null)
       return System.getProperty("java.io.tmpdir");
     else
       return lockDir;
   }
 
  Hopefully a solution similar to this will make it in to one of the
  next
  distributions.
 
  Thanks and Cheers,
 
  Mike
 
  --
  Michael R. Duval [EMAIL PROTECTED] 
  E-Journal Programmer/Analyst
  The American Physical Society
  1 Research Road
  Ridge, NY 11961
 
  www.aps.org
  631 591 4127
 
 
 



Field boosting Was: Indexing multiple instances of the same field for each document

2004-02-27 Thread Stephane James Vaucher
Slightly off topic for this thread, but how would adding different fields 
with the same name deal with boosts? I've looked at the javadoc and FAQ, 
but I don't think it's a common use of this feature; any insight?

E.G.
Document doc = new Document();
Field f1 = Field.Keyword("fieldName", "foo");
f1.setBoost(1);
doc.add(f1);

Field f2 = Field.Keyword("fieldName", "bar");
f2.setBoost(2);
doc.add(f2);

Cheers,
sv

On Fri, 27 Feb 2004, Doug Cutting wrote:

 I think it's document.add().  Fields are pushed onto the front, rather 
 than added to the end.
 
 Doug
 
 Roy Klein wrote:
  I think it's got something to do with Document.invertDocument().
  
  When I reverse the words in the phrase, the other document matches the
  phrase query.
  
  Roy
  
 
  
  -Original Message-
  From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
  Sent: Friday, February 27, 2004 4:34 PM
  To: Lucene Users List
  Subject: Re: Indexing multiple instances of the same field for each
  document
  
  
  On Feb 27, 2004, at 4:10 PM, Roy Klein wrote:
  
 Hi Erik,
 
 While you might be right in this example (using Field.Keyword), I can 
 see how this would still be a problem in other cases. For instance, if
  
  
 I were adding more than one word at a time in the example I attached.
  
  
  I concur that it appears to be a bug.  It is unlikely folks use Lucene 
  like this too much though - there probably are not too many scenarios 
  where combining things into a single String or Reader is a burden.
  
  I'm interested to know where in the code this oddity occurs so I can 
  understand it more.  I did a brief bit of troubleshooting but haven't 
  figured it out yet.  Something in DocumentWriter I presume.
  
  Erik
  
  
  
  
 Roy
 
 
 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Friday, February 27, 2004 2:12 PM
 To: Lucene Users List
 Subject: Re: Indexing multiple instances of the same field for each 
 document
 
 
 Roy,
 
 On Feb 27, 2004, at 12:12 PM, Roy Klein wrote:
 
 Document doc = new Document();
 doc.add(Field.Text("contents", "the"));
 
 Changing these to Field.Keyword gets it to work.  I'm delving a little
  
  
 bit to understand why, but it seems if you are adding words 
 individually anyway you'd want them to be untokenized, right?
 
 Erik
 
 
 
 doc.add(Field.Text("contents", "quick"));
 doc.add(Field.Text("contents", "brown"));
 doc.add(Field.Text("contents", "fox"));
 doc.add(Field.Text("contents", "jumped"));
 doc.add(Field.Text("contents", "over"));
 doc.add(Field.Text("contents", "the"));
 doc.add(Field.Text("contents", "lazy"));
 doc.add(Field.Text("contents", "dogs"));
 doc.add(Field.Keyword("docnumber", "1"));
 writer.addDocument(doc);
 doc = new Document();
 doc.add(Field.Text("contents", "the quick brown fox jumped over the lazy dogs"));
 doc.add(Field.Keyword("docnumber", "2"));
 writer.addDocument(doc);
 writer.close();
 }

 public static void query(File indexDir) throws IOException
 {
     Query query = null;
     PhraseQuery pquery = new PhraseQuery();
     Hits hits = null;

     try {
         query = QueryParser.parse("quick brown", "contents", new StandardAnalyzer());
     } catch (Exception qe) { System.out.println(qe.toString()); }
     if (query == null) return;
     System.out.println("Query: " + query.toString());
     IndexReader reader = IndexReader.open(indexDir);
     IndexSearcher searcher = new IndexSearcher(reader);

     hits = searcher.search(query);
     System.out.println("Hits: " + hits.length());

     for (int i = 0; i < hits.length(); i++)
     {
         System.out.println(hits.doc(i).get("docnumber") + " ");
     }

     pquery.add(new Term("contents", "quick"));
     pquery.add(new Term("contents", "brown"));
     System.out.println("PQuery: " + pquery.toString());
     hits = searcher.search(pquery);
     System.out.println("Phrase Hits: " + hits.length());
     for (int i = 0; i < hits.length(); i++)
     {
         System.out.println(hits.doc(i).get("docnumber") + " ");
     }

     searcher.close();
     reader.close();
 }

 public static void main(String[] args) throws Exception {
     if (args.length != 1) {
         throw new Exception("Usage: " + test.class.getName() + " <index dir>");
     }
     File indexDir = new File(args[0]);
     test(indexDir);
     query(indexDir);
 }
 }
 
 ----------------------------------------------------------------
 My results:
 Query: contents:quick contents:brown
 Hits: 2
 1
 2
 PQuery: contents:"quick brown"
 Phrase Hits: 1
 2
 
 
 
 
 

Re: Field boosting Was: Indexing multiple instances of the same field for each document

2004-02-27 Thread Stephane James Vaucher
Cheers. I index information in chunks; the reason is that I have
an IR tool that returns information ordered by confidence rather than by
the fields I index. I just add fields as they come, but I would be
interested in knowing how other people deal with confidence. Following
your answer, I can't add my confidence as boosts to the terms as I index
them; do you have any suggestions? I'm guessing that I'll probably have to
add multiple copies of my fields to simulate boosting.
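
Something like this sketch is what I have in mind (purely illustrative;
whether term repetition approximates confidence well enough is another
question):

// Add a chunk once per unit of confidence: the repeated terms raise the
// term frequency, which raises the score of matches in that chunk.
void addChunk(Document doc, String text, int confidence) {
    for (int i = 0; i < confidence; i++) {
        doc.add(Field.Text("contents", text));
    }
}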

sv

On Fri, 27 Feb 2004, Erik Hatcher wrote:

 On Feb 27, 2004, at 6:26 PM, Stephane James Vaucher wrote:
  Slightly off topic to this thread, but how would adding different
  fields
  with the same name deal with boosts? I've looked at the javadoc and
  FAQ,
  but I think it's not a common use of this feature, any insight?

 There is only one boost per field name.  However, interestingly, the
 effect is the multiplication of them all.  So, in your example below,
 the boost of "fieldName" is 2.

   Erik

 
  E.G.
  Document doc = new Document();
  Field f1 = Field.Keyword("fieldName", "foo");
  f1.setBoost(1);
  doc.add(f1);
 
  Field f2 = Field.Keyword("fieldName", "bar");
  f2.setBoost(2);
  doc.add(f2);
 
  Cheers,
  sv
 
Re: Incrementally updating and monitoring the index

2004-02-13 Thread Stephane James Vaucher
On Fri, 13 Feb 2004 [EMAIL PROTECTED] wrote:

 Hi!
 
 Can Lucene incrementally update its index (i.e. reconcile it with a list 
 of docs and remove those that are no longer found)?

Incremental updates (additions and deletions) are possible, but I'm not 
sure I understand your question. Lucene holds its own instances of 
documents structured in text fields (not going into details here). These 
Lucene documents are created and updated programmatically, not 
automatically, because Lucene does not keep tabs on external documents.

 I'd like to monitor the index for certain queries/terms, i.e. I want to be 
 notified if there are (new) hits for a list of terms each time after I add a 
 document to the index - continuously.
 
 Is this possible? The index will contain several hundreds of thousands of 
 documents and will be frequently accessed concurrently.

Very possible: before adding a document, you can check (with the judicious 
use of an id) whether it has already been added. If it hasn't, do your 
notification; this requires programming, though.

For concurrent write access there is a lock, so you might want to use a 
singleton responsible for adding documents.
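
A sketch of the id check (the "uid" field and the notification hook are
made up; indexDir, writer, doc, and docId are assumed to exist):

// Look the id up before adding; zero hits means the document is new.
IndexSearcher searcher = new IndexSearcher(indexDir);
Hits hits = searcher.search(new TermQuery(new Term("uid", docId)));
boolean isNew = (hits.length() == 0);
searcher.close();

if (isNew) {
    notifyListeners(docId);   // your notification mechanism
    writer.addDocument(doc);  // the single writer holds the write lock
}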

 
 TIA
 Timo
 

HTH,
sv





Re: The First Parameter of the IndexWriter

2004-02-10 Thread Stephane James Vaucher
You should probably take a look at the javadoc:
http://jakarta.apache.org/lucene/docs/api/index.html

As for where to store the index, you'll want to put it somewhere where all
potential users can access it, as well as where there is enough space for
your index. In a nutshell, you need to think of:
- the amount of storage required
- permissions (e.g. if you need to access it from an app server with
security restrictions)
- access, on a shared HD or not
- deployment: if for a product, it should be included in your
installation strategy, so you might use c:/Program Files/.../MyApp/index
or /usr/local/MyApp/index.

On Windows, I personally use my D drive, in a path corresponding
to d:/app-name/index.
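
Concretely, the only change to the tutorial code is the first line (the
path is just an example):

String indexDir = "/usr/local/MyApp/index";  // or "d:/app-name/index"
IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);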

HTH,
sv

On Tue, 10 Feb 2004, Caroline Jen wrote:

 I am constructing a web site.  I am learning
 Lucene so that I can use it to search the database.  I
 started by reading the introduction in Text Indexing
 with Jakarta Apache Lucene at
 http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

 and in the example given, it looks like I have to
 specify a directory for the first parameter of the
 IndexWriter (see below).

    String indexDir = System.getProperty
    ("java.io.tmpdir", "tmp") + System.getProperty
    ("file.separator") + "index-1";

    Analyzer analyzer = new StandardAnalyzer();
    boolean createFlag = true;

    IndexWriter writer = new IndexWriter(indexDir,
 analyzer, createFlag);

 I have a record created and stored in a table in my
 database whenever a user submits his/her inputs.  And
 I want to index that record.  What should be the
 indexDir in my case?  Should I follow the above
 example and use "java.io.tmpdir"?  I sort of doubt it.
  Please advise.

