Re: Patches for RussianAnalyzer

2004-03-30 Thread Vladimir Yuryev
Erik,
Please look at my second letter, the one without the attachment. It has 
the texts in the body of the letter.
Vladimir.

On Mon, 29 Mar 2004 12:06:45 -0500
 Erik Hatcher [EMAIL PROTECTED] wrote:
Vladimir,

I have just taken a look at your submitted patches.  I have no 
objections to making Cp1251 the default charset used in the no-arg 
constructor to RussianAnalyzer, but all of your other changes are 
formatting along with the addition of some other constructors.

Could you please provide a functionality-only diff for your patches, 
preferably in a single file attached to a Bugzilla issue?

Thanks,
Erik
On Mar 17, 2004, at 8:25 AM, Vladimir Yuryev wrote:

Dear developers!

A user of RussianAnalyzer writes to you about Lucene. There is one 
problem when working with this Analyzer: the parameter for the Russian 
character encoding (as you know, the set of code tables for one 
language always inspires admiration). In Eastern Europe, people using 
applications in Russian work with the windows-1251 encoding as the 
basic one, since the widely widespread client platform is MS Windows. 
My proposal is to update the constructor without parameters to 
establish Cp1251 as the default.
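
As a sketch only, the proposed no-arg constructor might look like this 
(the charset table name follows the existing RussianCharsets class; 
treat the details as assumptions, not the final patch):

    public RussianAnalyzer() {
        // assumed sketch: default to the windows-1251 table instead of Unicode
        this(RussianCharsets.CP1251);
    }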

See attached file: RussianAnalyzerPatchs.tgz
RussianAnalyzer.java.patch
RussianLetterTokenizer.java.patch
RussianLowerCaseFilter.java.patch
RussianStemFilter.java.patch
TestRussianAnalyzer.java.patch
Such an update will remove confusion (for beginners in Lucene, or 
beginners with Russian) and will make it easier to use the Analyzers 
when switching languages in multilanguage search.
Regards,
Vladimir Yuryev.


Re: Patches for RussianAnalyzer

2004-03-30 Thread Erik Hatcher
On Mar 30, 2004, at 3:38 AM, Vladimir Yuryev wrote:
Erik,
Please look at my second letter, the one without the attachment. It has 
the texts in the body of the letter.
Vladimir.
I don't have that e-mail you refer to.  Please use the standard Jakarta 
Bugzilla issue tracking system, though.  You can place an attachment to 
an issue after you create it - e-mail ends up mangling in-line patches.

What I'm after is a clean patch that *only* changes the functionality 
you desire, not code formatting also.  We can clean up code formatting 
in another pass if needed - or I can just do that on my end after 
reviewing the functionality-only patch.

	Erik



Re: UNIX command-line indexing script?

2004-03-30 Thread Linto Joseph Mathew
Charlie,

I wrote this in Java.  Of course I am ready to share, but I have some 
problems when indexing large volumes of data.  I am still testing.

Linto


 


On Fri, 26 Mar 2004 Charlie Smith wrote :
So, Linto,

  Did you write this in Perl or Java?  Would you be willing to part with a 
copy of the source?



 Linto wrote on 3/16/04

 I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and
plain-text files. I wrote this based on the demo application, using other
open source components: POI by Apache (for DOC and Excel) and PDFBox. I
modified the client interface also; now it looks like Google. I still have
a couple of things to do:
   1) At present I'm using the UNIX 'file' command to check whether a file
is plain text (see the sketch after this list). This spawns a process and
takes more time. The advantage is that on UNIX-based machines the file
extension is not important (it uses magic numbers).
   2) The information such as index location, directory, URL, etc. should
be kept in an XML file, so that it can be dynamic.
   3) Category
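
 A sketch of that check (the 'file' flags vary by platform; treat them
as an assumption):

   import java.io.BufferedReader;
   import java.io.InputStreamReader;

   // Ask UNIX 'file' for a MIME type instead of trusting the extension.
   static boolean isPlainText(String path) throws Exception {
       Process p = Runtime.getRuntime().exec(
           new String[] {"file", "-b", "-i", path});
       BufferedReader in = new BufferedReader(
           new InputStreamReader(p.getInputStream()));
       String mime = in.readLine();   // e.g. "text/plain; charset=us-ascii"
       p.waitFor();                   // reap the spawned process
       return mime != null && mime.startsWith("text/");
   }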
 
 
 Since the Apache guys provided a good framework, everything was easy.
Thanks, guys!
 

 Linto




On Sat, 13 Mar 2004 Charlie Smith wrote :
 Anyone written a simple UNIX command-line indexing script which will read a
 bunch of different kinds of docs and index them?  I'd like to make a cron
 job out of this so as to be able to come back and read it later during a
 search.
 
 Perl or Java script would be fine.
 
 






Re: Patches for RussianAnalyzer

2004-03-30 Thread Vladimir Yuryev
Erik,
I filed Bug #28050.
Vladimir
On Tue, 30 Mar 2004 06:19:04 -0500
 Erik Hatcher [EMAIL PROTECTED] wrote:
On Mar 30, 2004, at 3:38 AM, Vladimir Yuryev wrote:
Erik,
Please look at my second letter, the one without the attachment. It has 
the texts in the body of the letter.
Vladimir.
I don't have that e-mail you refer to.  Please use the standard 
Jakarta Bugzilla issue tracking system, though.  You can place an 
attachment to an issue after you create it - e-mail ends up mangling 
in-line patches.

What I'm after is a clean patch that *only* changes the functionality 
you desire, not code formatting also.  We can clean up code 
formatting in another pass if needed - or I can just do that on my 
end after reviewing the functionality-only patch.

	Erik



Re: Lucene optimization with one large index and numerous small indexes.

2004-03-30 Thread Doug Cutting
Esmond Pitt wrote:
Don't want to start a buffer size war, but these have always seemed too
small to me. I'd recommend upping both InputStream and OutputStream buffer
sizes to at least 4k, as this is the cluster size on most disks these days,
and also a common VM page size.
Okay.

Reading and writing in smaller quantities
than these is definitely suboptimal.
This is not obvious to me.  Can you provide Lucene benchmarks which show 
this?  Modern filesystems have extensive caches, perform read-ahead and 
delay writes.  Thus file-based system calls do not have a close 
correspondence to physical operations.

To my thinking, the primary role of file buffering in Lucene is to 
minimize the overhead of the system call itself, not to minimize 
physical i/o operations.  Once the overhead of the system call is made 
insignificant, larger buffers offer little measurable improvement.

Also, we cannot increase the size of these blindly.  Buffers are the 
largest source of per-query memory allocation in Lucene, with one (or 
two for phrases and spans) allocated for every query term.  Folks whose 
applications perform wildcard queries have encountered out-of-memory 
exceptions with the current buffer size.
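
As a rough worked example (term count and buffer sizes here are 
illustrative assumptions, not Lucene's actual constants):

    // One buffer is allocated per query term, so a wildcard query that
    // expands to many terms multiplies the buffer size directly.
    int expandedTerms = 50000;             // a broad wildcard on a large index
    long withSmallBuffers = expandedTerms * 1024L;  // ~50 MB of buffers
    long withLargeBuffers = expandedTerms * 4096L;  // ~200 MB, 4x the memory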

Possibly one could implement a term wildcard mechanism which does not 
require a buffer per term, or perhaps one could allocate small buffers 
for infrequent terms (the vast majority).  If such changes were made 
then it might be feasible to bump up the buffer size somewhat.  But, 
back to my first point, one must first show that larger buffers offer 
significant performance improvements.

Doug



Re: too many files open error

2004-03-30 Thread Charlie Smith
Thanks for the information.  I downloaded 1.3-rc2 and put an
IndexReader.close() at the end of the search routine.  This seems to have
cleared up the problems.  I also modified the demo source code for
results.jsp to return a pointer to the IndexReader so that it could be
closed at the end of the search.  I.e.:

  IndexReader ir = IndexReader.open(indexName);
  searcher = new IndexSearcher(ir);   // create an IndexSearcher for our page
  ...
  ir.close();
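
For robustness, the same change can be shaped with try/finally (a sketch 
against the 1.3-era API; variable names as above):

  IndexReader ir = IndexReader.open(indexName);
  try {
      Searcher searcher = new IndexSearcher(ir);
      Hits hits = searcher.search(query);
      // ... render hits ...
  } finally {
      ir.close();   // releases the underlying file handles even on error
  }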


 erik 3/27/2004 4:44:28 AM 
On Mar 27, 2004, at 1:28 AM, Charlie Smith wrote:
 What would be the URL for the JUnit stuff?

Look in the src/test directory of where you checked out Lucene.  All
JUnit tests live there and below.

 BTW: I was able to build a new Index.class file, with the additional
 line
 iw.setUseCompoundFile(true) after extracting the
 lucene-1.4-rc1-dev.jar.

 Then reindexed.  Guess what - no worky.  :(

Maybe you'd care to share some *technical* details to elaborate on no
worky?!

Still get the too many files open error on invoking a modified results.jsp
(the one that comes with Lucene).  The index is created with a call to the
IndexWriter.class file.  The Index.class file calls IndexWriter, and I
modified it to call setUseCompoundFile(true).  Added lines 350 and 442 as
suggested.


What Index.class are you talking about?  The demo application?

 Can I get 1.3-RC2?  Could someone point me to the URL for this
 download please
 ;)

Use CVS :)

 I noticed the following entry in the mail archives:
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg06118.html

 along with 139 others that dealt with the too many files open
 problem.

 Looks like this is a high priority problem that might justify a new
 release in
 and of itself?

People have been using Lucene for years and managing the file handle
issue by setting ulimit and other tricks like optimizing to reduce the
number of segments.  So it is not as much a problem as it is a known
issue that can be managed.


My ulimit is set to unlimited.
From what I can tell, it is a stress-test issue that seems to work under
1.3-rc2.  Would anyone understand the differences well enough to know if it
will work as well under the next stable release of Lucene?


I'm not up to speed on what the issues with 1.3 final are - I've just
started hearing about it.  Is there a reproducible example that
demonstrates a problem?

   Erik


John Brown has made his source available.  Go to Google and search for
docSearcher.  He seems quite willing to help where needed.  Use the
results.jsp routine that comes with Lucene to test, with the following
changes:
<snip>
<  Analyzer analyzer = new StopAnalyzer();           // construct our usual analyzer
---
>  Analyzer analyzer = new StandardAnalyzer();       // construct our usual analyzer
68,69c54,56
<  query = QueryParser.parse(queryString, "contents", analyzer); // parse the
<  } catch (ParseException e) {                      // query and construct the Query
---
>  query = QueryParser.parse(queryString, "body", analyzer);     // parse the
>  //query = query.rewrite(reader);
>  } catch (ParseException e) {                      // query and construct the Query
87a75
>  <tr><td><font size=5>Search results for </font><font size=5 color=blue><%=queryString%></font></td></tr>
>  <tr><tr>
108a96,97
>  // cws: 2/25/04 added this to get format href link.
>  RE r = new RE("/path/to/site/root/");
111d99
<  <tr>
114,122c102,131
<  String doctitle = doc.get("title");               // get its title
<  String url = doc.get("url");                      // get its url field
<  if ((doctitle == null) || doctitle.equals(""))    // use the url if it has no title
<    doctitle = url;
<  // then output!
<  %>
<  <td><a href="<%=url%>"><%=doctitle%></a></td>
<  <td><%=doc.get("summary")%></td>
<  </tr>
---
>  String path = doc.get("path");
>  String type = doc.get("type");
>  String title = doc.get("title");
>
>  // cws: 2/25/04 added this to get format href link.
>  String path_part = r.subst(path, "/");
>
>  String summary = doc.get("summary");
>  String size = doc.get("size");
>  String date = doc.get("mod_date");
>
>  // date formatting
>  java.util.Date bd = DateField.stringToDate(date);
>  Calendar nowD = Calendar.getInstance();
>  nowD.setTime(bd);
>  int mon = nowD.get(nowD.MONTH) + 1;
>  int year = nowD.get(nowD.YEAR);
>  int day = nowD.get(nowD.DAY_OF_MONTH);
>  date = 

Re: Lucene 1.4 - lobby for final release

2004-03-30 Thread Charlie Smith
That is your opinion, of course, on the issue of too many files open not
being a bug.  I found it to be otherwise.

Thanks for the info on popular elections.   

Being a newbie to this list, I am finding that most others on the list are
a bit more pleasant.

But then, you're not up for a popular election, are you?  I appreciate all
you do to keep us from shooting ourselves in the foot.

Thank you. :(


 cutting 3/29/2004 11:27:58 AM 
Charlie Smith wrote:
 I'll vote yes  please release new version with too many files open fixed.

There is no too many files open bug, except perhaps in your 
application.  It is, however, an easy problem to encounter if you don't 
close indexes or if you change Lucene's default parameters.  It will be 
considerably harder to make happen in 1.4, to keep so many people from 
shooting themselves in the foot.

Also, releases are not made by popular election.  They are made by 
volunteer developers when deemed appropriate.  If you'd like to get more 
involved in Lucene's development, please contribute constructive efforts 
to the lucene-dev mailing list.

 Maybe default setUseCompoundFile to true on this go-around.

This was discussed at length on the developer mailing list a while back. 
  The change has been made and will be present in 1.4.

 Otherwise, how can I get 1.3-RC2?  I can't seem to locate it.

The second hit for a Google search on lucene 1.3RC2 reveals:

   http://www.apachenews.org/archives/000134.html 

These search engines sure are amazing, aren't they!

Doug




The Filter got called more than once

2004-03-30 Thread Ching-Pei Hsing
Hi,

We implemented a Filter that performs filtering based on some internal
pricing logic.  While testing we discovered that this filter got called
several times, not, as the FAQ says, exactly once.  The number of calls
made depended on how big the result set was.  I printed out the calling
stack and discovered that Hits.doc(n) also calls
IndexSearcher.search(Query, Filter) when more docs are needed.  I can
understand the lazy retrieval as an optimization, but it seems wrong to me
to call the search function again and again.  At the least, the filter
should not be invoked over and over again.

Logic in our filter is a little bit heavier than usual already. We
definitely want to reduce the number of calls to it. Is there any way we can
work around this?

Call to Searcher.search():
at com.comergent.reference.appservices.productService.search.query.PricingFilter.bits(PricingFilter.java:244)
at com.comergent.api.appservices.search.query.CmgtFilter.bits(CmgtFilter.java:108)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
at org.apache.lucene.search.Hits.<init>(Hits.java:80)
at org.apache.lucene.search.Searcher.search(Searcher.java:71)

Call to Hits.doc():
at com.comergent.reference.appservices.productService.search.query.PricingFilter.bits(PricingFilter.java:244)
at com.comergent.api.appservices.search.query.CmgtFilter.bits(CmgtFilter.java:108)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
at org.apache.lucene.search.Hits.hitDoc(Hits.java:153)
at org.apache.lucene.search.Hits.doc(Hits.java:118)

Thanks

Ching-pei




Re: The Filter got called more than once

2004-03-30 Thread Erik Hatcher
Use a caching mechanism for your filter, so the bitset is not  
regenerated.  CachingWrapperFilter is your friend :)
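
A minimal sketch of the wrapping (construction of the poster's own 
PricingFilter is assumed):

    // bits() is computed once and reused while the same reader is open
    Filter cached = new CachingWrapperFilter(new PricingFilter(/* ... */));
    Hits hits = searcher.search(query, cached);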

	Erik

On Mar 30, 2004, at 2:28 PM, Ching-Pei Hsing wrote:

Hi,

We implemented a Filter that performs filtering based on some internal
pricing logic.  While testing we discovered that this filter got called
several times, not, as the FAQ says, exactly once.  The number of calls
made depended on how big the result set was.  I printed out the calling
stack and discovered that Hits.doc(n) also calls
IndexSearcher.search(Query, Filter) when more docs are needed.  I can
understand the lazy retrieval as an optimization, but it seems wrong to me
to call the search function again and again.  At the least, the filter
should not be invoked over and over again.

Logic in our filter is a little bit heavier than usual already.  We
definitely want to reduce the number of calls to it.  Is there any way
we can work around this?

Call to Searcher.search():
at com.comergent.reference.appservices.productService.search.query.PricingFilter.bits(PricingFilter.java:244)
at com.comergent.api.appservices.search.query.CmgtFilter.bits(CmgtFilter.java:108)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
at org.apache.lucene.search.Hits.<init>(Hits.java:80)
at org.apache.lucene.search.Searcher.search(Searcher.java:71)

Call to Hits.doc():
at com.comergent.reference.appservices.productService.search.query.PricingFilter.bits(PricingFilter.java:244)
at com.comergent.api.appservices.search.query.CmgtFilter.bits(CmgtFilter.java:108)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
at org.apache.lucene.search.Hits.hitDoc(Hits.java:153)
at org.apache.lucene.search.Hits.doc(Hits.java:118)

Thanks

Ching-pei



[patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Kevin A. Burton
Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:

$ diff -uN MultiSearcher.java.bak MultiSearcher.java
--- MultiSearcher.java.bak  2004-03-30 14:57:41.660109642 -0800
+++ MultiSearcher.java  2004-03-30 14:57:46.530330183 -0800
@@ -208,4 +208,8 @@
     return searchables[i].explain(query,doc-starts[i]); // dispatch to searcher
   }

+  public Searchable[] getSearchables() {
+return searchables;
+  }
+
}
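
For illustration, one possible caller (index paths here are hypothetical):

    Searchable[] targets = {
        new IndexSearcher("/idx/mail"),   // hypothetical index directories
        new IndexSearcher("/idx/docs")
    };
    MultiSearcher multi = new MultiSearcher(targets);
    Searchable[] parts = multi.getSearchables();   // the accessor added above
    System.out.println("searching across " + parts.length + " indexes");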
--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster	





Near performance question

2004-03-30 Thread Joe Paulsen
Based on the nature of our documents, we sometimes experience extremely
long response times when executing NEAR operations against a document
(sometimes running for minutes, even though the operation is restricted
to a single document).

Our analysis of the code indicates that (we think) it:

- Looks up each of the terms in the word.dbx file.
- Intersects the occurrence lists.  (So far so good!)
- Takes each gid found in the occurrence list and finds its parent,
  right up to the root of the document (in dom.dbx).
- Traverses the tree depth-first until it finds the node text of interest.
- Does the expected scan to find out if the term distance requirement
  is satisfied.
We did some timings on our document (Rusticus).  It started off taking
1 second per occ and grew to 25 seconds.

If we changed the dom.dbx buffers, we got significant improvement, but
it was still relatively slow (343 occs).

QUESTION:
It seems to us the occs are ordered by gid (and we don't do any updating).
Is there a simple way to reuse the tree-level positioning information from
the prior occurrence when processing the current occurrence, so that we
don't have to start again from the document root?

Thanks,

Joe Paulsen






Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Kevin A. Burton
I'm playing with this package:

http://home.clara.net/markharwood/lucene/highlight.htm

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.

This seems very inefficient, since Lucene already knows the 
frequency and positions of the given terms in the index.

My question is whether it's hard to find a TermPosition for a given term 
in a given document rather than the whole index.

IndexReader.termPositions( Term term ) is term-specific, not term- and 
document-specific.
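
For reference, a sketch of scanning one term's postings down to a single 
document (1.3-era API; the field name, term, and targetDoc are made up) - 
note it still walks the whole postings list, which is exactly the cost 
in question:

    TermPositions tp = reader.termPositions(new Term("contents", "lucene"));
    try {
        while (tp.next()) {
            if (tp.doc() != targetDoc) continue;   // skip other documents
            for (int i = 0; i < tp.freq(); i++) {
                int position = tp.nextPosition(); // token position in targetDoc
                // ... map the position back into the stored text ...
            }
            break;
        }
    } finally {
        tp.close();
    }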

Also, it seems that after all this time Lucene should have efficient 
hit highlighting as a standard package.  Is there any interest in seeing 
a contribution in the sandbox for this if it uses the index positions?

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: [patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Erik Hatcher
On Mar 30, 2004, at 5:59 PM, Kevin A. Burton wrote:
Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:
Could you elaborate on why it makes sense?  What if the caller changed 
a Searchable in the array?  Would anything bad happen?  (I don't know, 
haven't looked at the code).



$ diff -uN MultiSearcher.java.bak MultiSearcher.java
--- MultiSearcher.java.bak  2004-03-30 14:57:41.660109642 -0800
+++ MultiSearcher.java  2004-03-30 14:57:46.530330183 -0800
@@ -208,4 +208,8 @@
     return searchables[i].explain(query,doc-starts[i]); // dispatch to searcher
   }
+  public Searchable[] getSearchables() {
+return searchables;
+  }
+
}

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc

   NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster	





Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Erik Hatcher
On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote:
Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.
This seems very inefficient, since Lucene already knows the 
frequency and positions of the given terms in the index.
What if the original analyzer removed stop words, stemmed, and 
injected synonyms?

Also it seems that after all this time that Lucene should have 
efficient hit highlighting as a standard package.  Is there any 
interest in seeing a contribution in the sandbox for this if it uses 
the index positions?
Big +1, regardless of the implementation details.  Hit highlighting is so 
commonly requested that having it available at least in the sandbox, or 
perhaps even in the core, makes a lot of sense.

	Erik



Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Stephane James Vaucher
I agree with you that a highlight package should be available directly 
from the Lucene website.  To offer this much-desired feature, having a 
dependency on a personal web site seems a little weird to me.  It would 
also force the community to support this functionality, which seems 
appropriate.

cheers,
sv

On Tue, 30 Mar 2004, Kevin A. Burton wrote:

 I'm playing with this package:
 
 http://home.clara.net/markharwood/lucene/highlight.htm
 
 Trying to do hit highlighting.  This implementation uses another 
 Analyzer to find the positions for the result terms. 
 
 This seems very inefficient, since Lucene already knows the 
 frequency and positions of the given terms in the index.
 
 My question is whether it's hard to find a TermPosition for a given term 
 in a given document rather than the whole index.
 
 IndexReader.termPositions( Term term ) is term-specific, not term- and 
 document-specific.
 
 Also, it seems that after all this time Lucene should have efficient 
 hit highlighting as a standard package.  Is there any interest in seeing 
 a contribution in the sandbox for this if it uses the index positions?
 
 





Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Kevin A. Burton
Erik Hatcher wrote:

On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote:

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.
This seems very inefficient, since Lucene already knows the 
frequency and positions of the given terms in the index.


What if the original analyzer removed stopped words, stemmed, and 
injected synonyms?
Just use the same analyzer :)...  I agree it's not the best approach, for 
this reason and for the CPU cost.

Also, it seems that after all this time Lucene should have 
efficient hit highlighting as a standard package.  Is there any 
interest in seeing a contribution in the sandbox for this if it uses 
the index positions?


Big +1, regardless of the implementation details.  Hit highlighting is 
so commonly requested that having it available at least in the 
sandbox, or perhaps even in the core, makes a lot of sense.
Well, if we could make it efficient by using the frequencies and positions 
of terms, we're all set :)... I just need to figure out how to do this 
efficiently per document.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: [patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Kevin A. Burton
Erik Hatcher wrote:

On Mar 30, 2004, at 5:59 PM, Kevin A. Burton wrote:

Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:


Could you elaborate on why it makes sense?  What if the caller changed 
a Searchable in the array?  Would anything bad happen?  (I don't know, 
haven't looked at the code).
Yes... something bad could happen... but that would be amazingly stupid 
... we should probably recommend that it be read-only.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Bruce Ritchie
Kevin A. Burton wrote:

I'm playing with this package:

http://home.clara.net/markharwood/lucene/highlight.htm

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.
This seems very inefficient, since Lucene already knows the 
frequency and positions of the given terms in the index.

My question is whether it's hard to find a TermPosition for a given term 
in a given document rather than the whole index.

IndexReader.termPositions( Term term ) is term-specific, not term- and 
document-specific.
As far as I know, it's not currently possible to get this information from a standard Lucene index.

Also, it seems that after all this time Lucene should have efficient 
hit highlighting as a standard package.  Is there any interest in seeing 
a contribution in the sandbox for this if it uses the index positions?
I've been meaning to look into good ways to store token offset information to allow for very 
efficient highlighting, and I believe Mark may also be looking into improving the highlighter via 
other means, such as temporary RAM indexes.  Search the archives for background on some of the 
ideas we've tossed around ('Dmitry's Term Vector stuff, plus some' and 'Demoting results' come to 
mind as threads that touch this topic).

Regards,

Bruce Ritchie
http://www.jivesoftware.com/




Re: [patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Erik Hatcher
On Mar 30, 2004, at 8:52 PM, Kevin A. Burton wrote:
Erik Hatcher wrote:

On Mar 30, 2004, at 5:59 PM, Kevin A. Burton wrote:

Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:


Could you elaborate on why it makes sense?  What if the caller 
changed a Searchable in the array?  Would anything bad happen?  (I 
don't know, haven't looked at the code).
Yes... something bad could happen... but that would be amazingly 
stupid ... we should probably recommend that it be read-only.
No question that it'd be unwise to do.  The same argument could be made 
for giving everything public access and saying it'd be stupid to misuse 
it.  I'd rather opt on the side of safety.
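
If the accessor does go in, a defensive copy is a cheap way to get that 
safety (a sketch, not tested):

    public Searchable[] getSearchables() {
        // hand back a clone so callers can't swap entries in our array
        return (Searchable[]) searchables.clone();
    }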

Besides, you haven't provided a use case for why you need to get the 
searchers back from a MultiSearcher :)

	Erik
