Re: SNOWBALL STEMMER + BOOSTING

2004-12-23 Thread Erik Hatcher
On Dec 23, 2004, at 1:17 AM, Karthik N S wrote:
Using Analysis Paralysis on the Snowball stemmer [ using
StandardAnalyzer.ENGLISH_STOP_WORDS and StopAnalyzer.ENGLISH_STOP_WORDS ] from
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html?page=last#thread

for the word   'jakarta^4 apache'
both cases return me something like this
=
org.apache.lucene.analysis.snowball.SnowballAnalyzer:
[JAKARTHA] [4] [APACHE]
=
I wonder what happened to the BOOSTING SYMBOL '^', and what happens if the same
word is used with QueryParser.parse()
Analyzing a query expression outside of QueryParser is _not_ doing the  
same thing that QueryParser does.  QueryParser picks out the pieces it  
knows about (parentheses, boost symbol, AND, OR, etc.) and only  
analyzes term text.  In your example it would analyze jakarta and  
apache separately.
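A small way to see the difference (untested sketch against the 1.4-era API; the
class name and the field name "contents" are arbitrary):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class BoostAnalysisDemo {
  public static void main(String[] args) throws Exception {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English");

    // Analyzing the whole expression: the tokenizer treats '^' as punctuation,
    // so "4" comes out as its own token and the boost is lost.
    TokenStream stream =
        analyzer.tokenStream("contents", new StringReader("jakarta^4 apache"));
    for (Token token = stream.next(); token != null; token = stream.next()) {
      System.out.print("[" + token.termText() + "] ");
    }
    System.out.println();

    // QueryParser strips out the boost syntax itself and analyzes only the term text.
    Query query = QueryParser.parse("jakarta^4 apache", "contents", analyzer);
    System.out.println(query.toString("contents"));   // something like: jakarta^4.0 apache
  }
}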

What would be the Hits returned?
That all depends on what you indexed and what analyzer you used at  
index time :)

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: addIndexes() Question

2004-12-23 Thread Sergiu Gordea
I think you should change your plans a little bit, and remember that
your goal is to
create a fast search engine, not a fast indexing engine.
When you plan to index a lot of documents it is possible to create
a lot of segments (if you don't optimize the index),
and the search will be very slow compared with the search on an
optimized index.
The problem is that the optimization of big indexes is a time-consuming
operation, and

addIndexes(Directory[] dirs) is, I think, also a time-consuming operation.
Therefore I suggest first thinking about how you can design the indices to have a fast search, and then 
you should design an offline indexing process. 
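For the offline merge itself, the basic shape is something like this (untested
sketch; the class name, output path and analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeSlices {
  public static void main(String[] args) throws Exception {
    // args: the directories holding the index slices built by the worker machines/threads
    Directory[] slices = new Directory[args.length];
    for (int i = 0; i < args.length; i++) {
      slices[i] = FSDirectory.getDirectory(args[i], false);
    }

    // Merge every slice into one index; addIndexes() leaves the result optimized.
    IndexWriter writer = new IndexWriter("/indexes/merged", new StandardAnalyzer(), true);
    writer.addIndexes(slices);
    writer.close();
  }
}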

That is my suggestion ... maybe it doesn't fit your requirements, maybe it does 
...
 All the best,
 Sergiu
Ryan Aslett wrote:
Hi there, I'm about to embark on a Lucene project of massive scale
(between 500 million and 2 billion documents).  I am currently working
on parallelizing the construction of the index(es). 

Rough summary of my plan:
I have many, many physical machines, each with multiple processors that
I wish to dedicate to the construction of a single index. 
I plan on having each machine gather its documents from a central
synchronized source (network, JMS, whatever). 
Within each machine I will have multiple threads, each responsible for
constructing an index slice.

When all machines and all threads are finished, I should have a slew of
index slices that I want to combine together to create one index.
My question is this:  Will it be more efficient to call
addIndexes(Directory[] dirs) on all the slices all at once? 

Or might it be better to continually merge small indexes into a larger
index, i.e. once an index slice reaches a particular size, merge it into
the main index and start building a new slice...
Any help would be appreciated.. 

Ryan Aslett
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Exception: cannot determine sort type

2004-12-23 Thread Erik Hatcher
On Dec 22, 2004, at 11:25 PM, Kauler, Leto S wrote:
java.lang.RuntimeException: no terms in field Title_Sort - cannot
determine sort type
Title_Sort is a sort-specific field (Store=false, Index=true,
Tokenise=false).  I do not have access to the actual Lucene-calling
code, but I do not believe that the creation of the SortField defines a
type (so just defaults to AUTO).
The issue occurs if the first field it accesses parses as a numeric 
value and then successive fields are String's.  If you are mixing and 
matching numeric and text information in this Title_Sort field you 
should specify the type.
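For example (sketch; assumes an open IndexSearcher named searcher and a Query
named query):

// Force the sort type instead of letting Lucene guess (SortField.AUTO):
Sort byTitle = new Sort(new SortField("Title_Sort", SortField.STRING));
Hits hits = searcher.search(query, byTitle);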

We could specify the sort type as String but we do have some Date 
fields
too.  Are dates actually indexed as strings?
You're putting dates into Title_Sort also?  The type is specific to a 
sort field, so you can sort by dates too but you'd use a different 
field and a different type.

*Everything* in Lucene is indexed as a string.  But how a date looks as 
a string is a topic unto itself.  I prefer to use YYYYMMDD as a date 
formatted as a string (but when sorting, this could be treated as a 
numeric).

I am wondering why this exception might occur when the server/index is
under load.  I do realise there are many 'variables in the equation', 
so
there probably is not an easy answer to this.
I'm at a loss on this one without further details, that's for sure.
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Relevance percentage

2004-12-23 Thread Chuck Williams
Gururaja,

If you want to score based solely on coord(), then Paul's approach looks
best.  However, based on your earlier messages, it looks to me like you
want to score based on all factors (with coord boosted as Paul
suggested, or lengthNorm flattened as I suggested -- either will get the
order you want in the example you posted), but you want to print the
(unboosted) coord percentage along with each result in the result list.

If this is the case, since the number of results per page on the result
list is presumably small, I think you are best off replicating the
explain() mechanism.  I don't have the source code, but you can look at
IndexSearcher.explain(), which recreates the weight with Query.weight(),
then calls what in this case will be
BooleanQuery.BooleanWeight.explain(), which has the code to recompute
coord on a result (specifically it computes overlap and maxoverlap and
then calls Similarity.coord()).  You could cut and paste this code to
just compute coord for your top-level BooleanQuery's.

I don't have source code handy for this, but the approach should work -- roughly along these lines:
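(Untested sketch; it assumes the top-level BooleanQuery's clause queries are
already available in an array, and it calls searcher.explain() per clause, which
is fine for the handful of hits on a result page.)

// needs: org.apache.lucene.search.IndexSearcher, Query, Similarity; java.io.IOException
static float coordPercent(IndexSearcher searcher, Query[] clauses, int docId)
    throws IOException {
  int overlap = 0;
  for (int i = 0; i < clauses.length; i++) {
    // a clause matched this document if it contributes a positive score
    if (searcher.explain(clauses[i], docId).getValue() > 0.0f) {
      overlap++;
    }
  }
  // the same computation BooleanWeight.explain() does: coord(overlap, maxOverlap)
  return Similarity.getDefault().coord(overlap, clauses.length);
}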
Good luck,

Chuck

   -Original Message-
   From: Paul Elschot [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, December 22, 2004 11:59 PM
   To: lucene-user@jakarta.apache.org
   Subject: Re: Relevance percentage
   
   On Thursday 23 December 2004 08:13, Gururaja H wrote:
Hi Chuck Williams,
   
Thanks much for the reply.
   
If your queries are all BooleanQuery's of
TermQuery's, then this is very simple. Iterate down the list of
BooleanClause's and count the number whose score is > 0, then
divide
this by the total number of clauses. Take a look at
BooleanQuery.BooleanWeight.explain() as it does this (along with
generating the rest of the explanation). If you support the full
   Lucene
query language, then you need to look at all the query types and
   decide
what exactly you want to compute (as coord is not always well-
   defined).
   
We are supporting full Lucene query language.
   
My request is, assuming queries are all BooleanQuery please
post the implementation source code for the same.  ie to calculate
the
   coord() method input parameters overlap and maxOverlap.
   
   I don't have the code, but I can give an overview of possible
   steps:
   
   First inherit from BooleanScorer to implement a score() method that
   returns only the coord() value (preferably a precomputed one).
   Then inherit from BooleanQuery.BooleanWeight to return the above
   Scorer.
   Then inherit from BooleanQuery to use the above Weight in
createWeight().
   Then inherit from QueryParser to use the above Query in
   getBooleanQuery().
   Finally use such a query in a search: the document scores will be
   the coord() values.
   
   Regards,
   Paul Elschot.
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Exception: cannot determine sort type

2004-12-23 Thread Chris Hostetter
: The issue occurs if the first field it accesses parses as a numeric
: value and then successive fields are String's.  If you are mixing and

:  I am wondering why this exception might occur when the server/index is
:  under load.  I do realise there are many 'variables in the equation',
:  so
:  there probably is not an easy answer to this.

Knowing what i know about stress testing environments, i'm guessing you're
using some sort of automated load generating application, which is
generating random input from a dictionary of some kind -- possibly from
access logs of an existing system?  I'm also guessing that in some
configurations your load generator picks a random sort order independent
of the search terms it picks.

I'm also guessing that the issue has nothing to do with load ... if you
picked a single search term which you have manually tested once (sorting
by title) and know for a fact it works fine, and then you tell your load
generator to hit the index as hard as it can with that one query over and
over, it would probably work fine.

I think the problem is just that when it deals with random input and
random sort orders it (frequently) gets a result set in which the
first document has a numeric title field.


PS: I could be wrong, but if i remember right, the code that AUTO uses to
determine what sort type to use will treat it as a number if it *starts*
with something that looks like a number ... so look for titles like "1000
year plan" in your data.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: analyzer affecting phrases?

2004-12-23 Thread Chris Hostetter
: Therefore I turned back to the standard analyzer and now do some replacing
: of the underscores in my ID string to avoid my original problem. This solved

maybe i'm missing something, but if you've got a field in your doc that
represents an ID, why not create that field as NonTokenized so you don't
have to worry about what characters the analyzer you're using thinks are
special?
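For example (sketch; the field name and value are made up):

// Indexed as a single untokenized term, so the analyzer never touches it:
doc.add(Field.Keyword("id", "ABC_123"));
// ...and looked up with an exact TermQuery instead of going through QueryParser:
Hits hits = searcher.search(new TermQuery(new Term("id", "ABC_123")));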


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: (Offtopic) The unicode name for a character

2004-12-23 Thread Chris Hostetter
: However, I don't think that the names are consistent enough to permit a
: generic use of regular expressions. What Daniel is trying to achieve
: looks interesting anyway,

I'm not sure that that really matters in the long run ... I think the OP
was asking if there was a way to get the name in java because he figured
that way he could programmatically determine what the base character was in
his application.  But, that doesn't mean he needs to do this
programmatically every time his indexing/searching code sees a character
outside of LATIN-1.

it would probably make more sense to write a little one-off program that
could read in this file, and then spit out all of the non latin-1
characters with a guess as to which latin-1 character could act as a
substitution (if any) based on the name of the character, and a blank for
the user to override.  This program could be run once to generate a nice,
small, efficient mapping table that could be (committed to CVS and) reused
over and over.
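A sketch of such a one-off generator, assuming the input is a UnicodeData.txt-style
file of codepoint;NAME;... lines; the guess just peeks at LATIN ... LETTER X names
and leaves everything else blank for the user to fill in:

import java.io.BufferedReader;
import java.io.FileReader;

public class BuildCharMap {
  public static void main(String[] args) throws Exception {
    // args[0]: a UnicodeData.txt-style file of "codepoint;NAME;..." lines
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split(";");
      int codePoint = Integer.parseInt(fields[0], 16);
      if (codePoint <= 0xFF) continue;              // already latin-1, skip
      String name = fields[1];
      String guess = "";                            // blank for the user to override
      int letter = name.indexOf("LETTER ");
      if (name.startsWith("LATIN") && letter >= 0) {
        char base = name.charAt(letter + "LETTER ".length());
        guess = String.valueOf(name.indexOf("SMALL") >= 0
                               ? Character.toLowerCase(base) : base);
      }
      System.out.println(fields[0] + "\t" + name + "\t" + guess);
    }
    in.close();
  }
}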

-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sorting on a field that can have null values

2004-12-23 Thread Chris Hostetter

: I thought of putting empty strings instead of null values but I think
: empty strings are put first in the list while sorting which is the
: reverse of what anyone would want.

instead of adding a field with a null value, or value of an empty string,
why not just leave the field out for that/those doc(s)?

there's no requirement that every doc in your index has to have the exact
same set of fields.

If i remember correctly (you'll have to test this) sorting on a field
which doesn't exist for every doc does what you would want (docs with
values are listed before docs without).
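A quick way to test it (untested sketch using a RAMDirectory; the field names are
made up):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SortMissingFieldTest {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);

    Document withAuthor = new Document();
    withAuthor.add(Field.Keyword("author", "smith"));
    withAuthor.add(Field.Text("body", "lucene"));
    writer.addDocument(withAuthor);

    Document withoutAuthor = new Document();          // "author" simply left out
    withoutAuthor.add(Field.Text("body", "lucene"));
    writer.addDocument(withoutAuthor);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir);
    Hits hits = searcher.search(new TermQuery(new Term("body", "lucene")),
                                new Sort(new SortField("author", SortField.STRING)));
    for (int i = 0; i < hits.length(); i++) {
      System.out.println(i + ": author=" + hits.doc(i).get("author"));
    }
  }
}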



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



What fields do you use in your indexes

2004-12-23 Thread Daniel Cortes
I'm building a searcher over files of different formats for a web site. I want to 
know, in your cases, which fields and attributes you use for this kind of search 
(tokenized, stored, etc.).
I'm thinking of creating the fields title, filename, contents, and 
date_of_modification (I'm indexing the body of HTML files as the contents).
Do you add anything more?
Thanks to all

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: addIndexes() Question

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 00:45, Ryan Aslett wrote:

 When all machines and all threads are finished, I should have a slew of
 index slices that I want to combine together to create one index.

You should simply skip this step and instead search the small indices with 
a ParallelMultiSearcher. This should scale much better than one huge index 
(note that ranking is currently messed up with (Parallel)MultiSearcher, see 
the bug reports for a proposed fix).
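Roughly (sketch; assumes paths is a String[] of slice directories and query is
your Query):

Searchable[] slices = new Searchable[paths.length];
for (int i = 0; i < paths.length; i++) {
  slices[i] = new IndexSearcher(paths[i]);      // one searcher per slice directory
}
// Searches all slices concurrently (a thread per slice) and merges the results:
Searcher searcher = new ParallelMultiSearcher(slices);
Hits hits = searcher.search(query);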

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Change LuceneDocument

2004-12-23 Thread Michael Scholz
Hello,

is it possible to change the content of a field of an indexed document?

Regards,
Michael Scholz

Re: Exception: cannot determine sort type

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 05:25, Kauler, Leto S wrote:

 java.lang.RuntimeException: no terms in field Title_Sort - cannot
 determine sort type

Is it a certain query that causes this? Does it really only happen under 
load or does the same query also give this without load?

 We could specify the sort type as String but we do have some Date fields
 too.  Are dates actually indexed as strings?

If you're using DateField: yes. But you don't have to use that class, you 
can save dates however you want.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: CFS file and file formats

2004-12-23 Thread Steve Rajavuori
I think there are several problems. 

1) First of all, there are both CFS files and standard (non-compound) files
in this directory, and all of them have recent update dates, so I assume
they are all being used. My code never explicitly sets the compound file
flag, so I don't know how this happened.

2) Is there a way to force all files into compound mode? For example, if I
set the compound setting, then call optimize, will that recreate everything
into the CFS format?

3) There are several other large .CFS files in this directory that I think
have somehow become detached from the index. They have recent update dates
-- however, the last time I ran optimize these were not touched, and they
are not being updated now. I know these segments have valid data, because
now when I search I am missing large chunks of data -- which I assume is in
these detached segments. So my thought is to edit the 'segments' file to
make Lucene recognize these again -- but I need to know the correct segment
size in order to do this. So how do I determine what the correct segment
size should be?

Steve

-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 22, 2004 4:50 PM
To: Lucene Users List
Subject: Re: CFS file and file formats


On Wednesday 22 December 2004 23:41, Steve Rajavuori wrote:

 Thanks. I am trying to repair a corrupted 'segments' file.

Why are you sure it's corrupted? Are the *.cfs files and the other file 
types mixed in one directory? Then that's the problem: if you have *.cfs, 
segments, and deletable, nothing else should exist in that directory or 
Lucene will get confused.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search question

2004-12-23 Thread roy-lucene-user
Erik,

They both use the StandardAnalyzer... however, looking at the toString() makes
everything clearer.  In the case where a string contains the following email address:
[EMAIL PROTECTED], it gets split like so: first.last domain.com

However in 1.4 it does not get split.

So now we just check to see if an index was built using 1.2 or 1.4 and have
some checks thrown in.

Thanks for the guidance.

Roy.

On Wed, 22 Dec 2004 18:41:44 -0500, Erik Hatcher wrote
 What does toString() return for each of those queries?  Are you 
 using the same analyzer in both cases?
 
   Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Multiple collections

2004-12-23 Thread Jim Lynch
I'm investigating search engines and have started to look at Lucene.  I 
have a couple of questions, however.  The faq seems to indicate we can't 
do searches and indexing at the same time.  Is that still true, given 
that the faq is a few years old now?  If so is there locking going on or 
do I have to do it myself?

We have currently about 4 million documents comprised of  about 16 
million terms.  This is currently broken up into about 50 different 
collections which are separate databases.  Some of these collections 
are produced by a web crawler, some are produced by indexing a static 
file tree and some are produced via a feed from another system, which 
either adds new documents to a collection or replaces a document.  There 
are really 2 questions.  Is this too much data for Lucene?  And is there 
a way to keep separate collections (probably indexes) and search all 
(usually just a subset) of them at once?  I see the MultiSearcher object 
that may be the ticket, but IMHO javadocs leave a lot to be desired in 
the way of documentation.  They seem to completely leave out the glue 
and examples.

Thanks for any advice.
Jim.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: CFS file and file formats

2004-12-23 Thread Doug Cutting
Steve Rajavuori wrote:
1) First of all, there are both CFS files and standard (non-compound) files
in this directory, and all of them have recent update dates, so I assume
they are all being used. My code never explicitly sets the compound file
flag, so I don't know how this happened.
This can happen if your application crashes while the index was being 
updated.  In this case these were never entered into the segments file 
and may be partially written.

2) Is there a way to force all files into compound mode? For example, if I
set the compound setting, then call optimize, will that recreate everything
into the CFS format?
It should.  Except, on Windows not all old CFS files will be deleted 
immediately, but may instead be listed in the 'deletable' file for a while.
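i.e. something like this (sketch; the path is a placeholder, and the analyzer
only matters for documents added afterwards):

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.setUseCompoundFile(true);   // segments written from now on use the .cfs format
writer.optimize();                 // rewrites the whole index into a single compound segment
writer.close();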

3) There are several other large .CFS files in this directory that I think
have somehow become detached from the index. They have recent update dates
-- however, the last time I ran optimize these were not touched, and they
are not being updated now. I know these segments have valid data, because
now when I search I am missing large chunks of data -- which I assume is in
these detached segments. So my thought is to edit the 'segments' file to
make Lucene recognize these again -- but I need to know the correct segment
size in order to do this. So how do I determine what the correct segment
size should be?
These could also be the result of crashes.  In this case they may be 
partially written.

The safest approach is to remove files not mentioned in the segments 
file and update the index with the missing documents.  How does your 
application recover if it crashes during an update?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Multiple collections

2004-12-23 Thread Erik Hatcher
On Dec 23, 2004, at 2:18 PM, Jim Lynch wrote:
I'm investigating search engines and have started to look at Lucene.  
I have a couple of questions, however.  The faq seems to indicate we 
can't do searches and indexing at the same time.
Where in the FAQ does it indicate this?  This is incorrect.  And I 
don't think this has ever been the case for Lucene.  Indexing and 
searching can most definitely occur at the same time.

We have currently about 4 million documents comprised of  about 16 
million terms.  This is currently broken up into about 50 different 
collections which are separate databases.  Some of these collections 
are produced by a web crawler, some are produced by indexing a static 
file tree and some are produced via a feed from another system, which 
either adds new documents to a collection or replaces a document.  
There are really 2 questions.  Is this too much data for Lucene?
It is not too much data for Lucene.  Your architecture around Lucene is 
the more important aspect.

  And is there a way to keep separate collections (probably indexes) 
and search all (usually just a subset) of them at once?  I see the 
MultiSearcher object that may be the ticket, but IMHO javadocs leave a 
lot to be desired in the way of documentation.  They seem to 
completely leave out the glue and examples.
MultiSearcher is pretty trivial to use.  There is an example in Lucene 
in Action's source code (ant SearchServer) and I'm using a 
MultiSearcher for the upcoming lucenebook.com site like this:

Searchable[] searchables = new Searchable[indexes.length];
for (int i = 0; i < indexes.length; i++) {
  searchables[i] = new IndexSearcher(indexes[i]);
}
searcher = new MultiSearcher(searchables);
Use MultiSearcher in the same manner as you would IndexSearcher.  You 
can also find out which index a particular hit was from using the 
subSearcher method.

As for your comment about the javadocs, allow me to refer you to 
Lucene's test suite.  TestMultiSearcher.java in this case.  This is the 
best documentation there is!  (besides Lucene in Action, of course :)

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: CFS file and file formats

2004-12-23 Thread Steve Rajavuori
Doug wrote:
The safest approach is to remove files not mentioned in the segments 
file and update the index with the missing documents.  How does your 
application recover if it crashes during an update?

There are around 20 million documents in the orphaned segments, so it would
take a very long time to update the index. Is there an unsafe way to edit
the segments file to add these back? It seems like the missing piece of
information I need to do this is the correct segment size -- where can I
find that?

My application doesn't really have any recovery method if it crashes. Can
you tell me what the proper error handling procedure is? If, in fact, these
segments were corrupted because the application crashed, what could I have
done programmatically to recover once that had happened?

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 23, 2004 1:34 PM
To: Lucene Users List
Subject: Re: CFS file and file formats


Steve Rajavuori wrote:
 1) First of all, there are both CFS files and standard (non-compound)
files
 in this directory, and all of them have recent update dates, so I assume
 they are all being used. My code never explicitly sets the compound file
 flag, so I don't know how this happened.

This can happen if your application crashes while the index was being 
updated.  In this case these were never entered into the segments file 
and may be partially written.

 2) Is there a way to force all files into compound mode? For example, if I
 set the compound setting, then call optimize, will that recreate
everything
 into the CFS format?

It should.  Except, on Windows not all old CFS files will be deleted 
immediately, but may instead be listed in the 'deletable' file for a while.

 3) There are several other large .CFS files in this directory that I think
 have somehow become detached from the index. They have recent update dates
 -- however, the last time I ran optimize these were not touched, and they
 are not being updated now. I know these segments have valid data, because
 now when I search I am missing large chunks of data -- which I assume is
in
 these detached segments. So my thought is to edit the 'segments' file to
 make Lucene recognize these again -- but I need to know the correct
segment
 size in order to do this. So how do I determine what the correct segment
 size should be?

These could also be the result of crashes.  In this case they may be 
partially written.

The safest approach is to remove files not mentioned in the segments 
file and update the index with the missing documents.  How does your 
application recover if it crashes during an update?

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: CFS file and file formats

2004-12-23 Thread Doug Cutting
Steve Rajavuori wrote:
There are around 20 million documents in the orphaned segments, so it would
take a very long time to update the index. Is there an unsafe way to edit
the segments file to add these back? It seems like the missing piece of
information I need to do this is the correct segment size -- where can I
find that?
Do the CFS and non-CFS segment names correspond?  If so, then it 
probably crashed after the segment was complete, but perhaps before it 
was packed into a CFS file.  So I'd trust the non-CFS stuff first.  And 
it's easy to see the size of a non-CFS segment: it's just the number of 
bytes in each of the .f* files.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Multiple collections

2004-12-23 Thread Jim Lynch
Hi, Erik,
I've been perusing the mail list today and see your name often.  As well 
as visiting the web site advertising your book.  If we decide to go this 
way, I'll be sure to pick up a copy.

The FAQ number 41 on page 
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq 
implies a problem with searching and indexing at the same time, unless 
I'm misunderstanding what it says.

So it is kosher to download the source code before buying the book?  I 
tend not to do that for a couple of reasons: it doesn't seem right, and 
frequently authors go out of their way to make sure it's not very useful 
without the book.   Not that I consider that unfair, mind you.  It's 
just a common practice from my experience.

Any way thanks for the info. 

So what you are saying, if I can read between the lines and extrapolate 
from what I've read, is that I can create an index for each of my 
collections as I see fit, putting them in separate directories and when 
I need to search I can select a subset of the directories with the 
MultiSearcher.  Since the user selects which collections he wants to 
search from via checkboxes, I can build a list of searchables to pass to 
MultiSearcher.  However, looking at the javadocs I see Searchable is an 
interface.  Hm, I'll have to look at some code to see how that works.

Thanks, you've given me something to chew on.
Jim.
At the risk of  being politically incorrect, Merry Christmas to you 
all.  Not that I care a whit about political correctness.  8)

Erik Hatcher wrote:
On Dec 23, 2004, at 2:18 PM, Jim Lynch wrote:
I'm investigating search engines and have started to look at Lucene.  
I have a couple of questions, however.  The faq seems to indicate we 
can't do searches and indexing at the same time.

Where in the FAQ does it indicate this?  This is incorrect.  And I 
don't think this has ever been the case for Lucene.  Indexing and 
searching can most definitely occur at the same time.

We have currently about 4 million documents comprised of  about 16 
million terms.  This is currently broken up into about 50 different 
collections which are separate databases.  Some of these 
collections are produced by a web crawler, some are produced by 
indexing a static file tree and some are produced via a feed from 
another system, which either adds new documents to a collection or 
replaces a document.  There are really 2 questions.  Is this too much 
data for Lucene?

It is not too much data for Lucene.  Your architecture around Lucene 
is the more important aspect.

  And is there a way to keep separate collections (probably indexes) 
and search all (usually just a subset) of them at once?  I see the 
MultiSearcher object that may be the ticket, but IMHO javadocs leave 
a lot to be desired in the way of documentation.  They seem to 
completely leave out the glue and examples.

MultiSearcher is pretty trivial to use.  There is an example in Lucene 
in Action's source code (ant SearchServer) and I'm using a 
MultiSearcher for the upcoming lucenebook.com site like this:

Searchable[] searchables = new Searchable[indexes.length];
for (int i = 0; i < indexes.length; i++) {
  searchables[i] = new IndexSearcher(indexes[i]);
}
searcher = new MultiSearcher(searchables);
Use MultiSearcher in the same manner as you would IndexSearcher.  You 
can also find out which index a particular hit was from using the 
subSearcher method.

As for your comment about the javadocs, allow me to refer you to 
Lucene's test suite.  TestMultiSearcher.java in this case.  This is 
the best documentation there is!  (besides Lucene in Action, of 
course :)

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Unable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Kevin A. Burton
What in the world is up with this exception?
We've migrated to using pre-compiled JSPs in Tomcat 5.5 for performance reasons but if 
I try to start with a FRESH webapp or try to update any of the JSPs and in-place and 
recompile I'll get this error:

Any idea?
I thought maybe the .jar files were corrupt but if I md5sum them they are identical to 
production and the Tomcat standard dist.

Thoughts?
org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1) /init.jsp(2,0) Unable to read TLD 
META-INF/c.tld from JAR file 
file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/standard.jar: 
org.apache.jasper.JasperException: Failed to load or instantiate TagLibraryValidator class: 
org.apache.taglibs.standard.tlv.JstlCoreTLV

org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:39)

org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:405)

org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:86)

org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:339)
org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:372)
org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
org.apache.jasper.compiler.Parser.parse(Parser.java:126)

org.apache.jasper.compiler.ParserController.doParse(ParserController.java:211)

org.apache.jasper.compiler.ParserController.parse(ParserController.java:100)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)

org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:556)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:296)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Multiple collections

2004-12-23 Thread Erik Hatcher
On Dec 23, 2004, at 3:29 PM, Jim Lynch wrote:
I've been perusing the mail list today and see your name often.
I really should get out more often.
The FAQ number 41 on page  
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq 
implies a problem with searching and  
indexing at the same time, unless I'm misunderstanding what it says.
Ah... the issue is that an IndexReader/IndexSearcher that was  
constructed *before* documents were added will not see the new  
documents.  Searching still works successfully.  After you add  
documents, to find them you must use a new instance of  
IndexSearcher/IndexReader.  That FAQ is somewhat misleading I suppose.   
This FAQ will be deprecated in favor of the wiki, I hope:

http://wiki.apache.org/jakarta-lucene/LuceneFAQ
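In code the pattern is simply (sketch; the path, analyzer and document are
placeholders):

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.addDocument(newDoc);
writer.close();

// A searcher opened before the add keeps seeing the old snapshot;
// open a fresh one to pick up the new document:
searcher.close();
searcher = new IndexSearcher("/path/to/index");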
So it is kosher to download the source code before buying the book?
No objections from me.  It's made freely available to help sell the  
book of course, but here's one well known secret about writing books,  
authors don't make (much) money on them.  Basically Otis and I are  
completely insane.  http://www.blogscene.org/erik/Writing/demons.txt

  I tend not to do that for a couple of reasons, it doesn't seem right  
and frequently authors go out of their way to make sure it's not very  
useful without the book.   Not that I consider that unfair, mind you.   
It's just a common practice from my experience.
I've not heard of anyone making the code less useful on purpose.  After  
spending 14 months writing a book, though, it is hard to muster up the  
desire to polish off some source code.  You've hacked examples  
throughout, and there is nothing coherent to give away other than  
5-line snippets.  The most difficult thing to do when writing a book is  
try to come up with some theme or application for all the code.  For  
the Ant book it worked out nicely (a Lucene-based document/image search  
engine).  For Lucene in Action, there are so many variations that need  
to be shown that a single application to cover all the cases would be  
far too contrived.  I have a personal distaste for most of the code I  
see in books myself, and Otis and I worked hard to keep the examples  
relevant and useful.  The examples are mostly JUnit test cases.  When  
we broke the code we knew pretty quickly.  I'm proud of the code, and  
also quite proud of the way I packaged it.  I got to show off my Ant  
skillz to launch it all.  Fire it up and enjoy (or at least report back  
any suggestions or problems you have).

How useful the examples are without the book itself, though, is a tough  
one.  It is hard to package up 421 pages (I have physical copies of LIA  
in my hands as we speak!) of meaningful discourse into some example  
code.  It certainly isn't intentional to make the code less than  
useful, but there are certainly lots of explanations to go along with  
that code.

If anyone finds the example code difficult to understand without the  
book, though, by all means let me know.  I'd be happy to explain it  
here or post it to the blog I'll have live at lucenebook.com soon.

So what you are saying if I can read between the lines and extrapolate  
from what I've read, is that I can create an index for each of my  
collections as I see fit, putting them in separate directories and  
when I need to search I can select a subset of the directories with  
the MultiSearcher.  Since the user selects which collections he wants  
to search from via checkboxes, I can build a list of searchables to  
pass to MultiSearcher.  However, looking at the javadocs I see  
Searchable is an interface.  Hm, I'll have to look at some code to see  
how that works.
No need to extrapolate too much.  IndexSearcher is an instanceof  
Searchable.  Here's the code :)


Searchable[] searchables = new Searchable[indexes.length];
for (int i = 0; i < indexes.length; i++) {
  searchables[i] = new IndexSearcher(indexes[i]);
}
searcher = new MultiSearcher(searchables);
Erik
p.s. There is an issue with MultiSearcher and how it scores documents  
across multiple indexes that has recently been discussed.  It is in  
Bugzilla issue tracking system and e-mail archives.  I don't find it  
much of a problem (yet) as I've only just begun to use MultiSearcher in  
a production environment.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Unable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Erik Hatcher
Wrong list.
Though perhaps you should be using Jetty ;)
Erik
On Dec 23, 2004, at 4:17 PM, Kevin A. Burton wrote:
What in the world is up with this exception?
We've migrated to using pre-compiled JSPs in Tomcat 5.5 for  
performance reasons but if I try to start with a FRESH webapp or try  
to update any of the JSPs and in-place and recompile I'll get this  
error:

Any idea?
I thought maybe the .jar files were corrupt but if I md5sum them they  
are identical to production and the Tomcat standard dist.

Thoughts?
org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1)  
/init.jsp(2,0) Unable to read TLD META-INF/c.tld from JAR file  
file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/ 
standard.jar: org.apache.jasper.JasperException: Failed to load or  
instantiate TagLibraryValidator class:  
org.apache.taglibs.standard.tlv.JstlCoreTLV
	 
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHan 
dler.java:39)
	 
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.jav 
a:405)
	 
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.jav 
a:86)
	 
org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java: 
339)
	org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java: 
372)
	org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
	org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
	org.apache.jasper.compiler.Parser.parse(Parser.java:126)
	 
org.apache.jasper.compiler.ParserController.doParse(ParserController.ja 
va:211)
	 
org.apache.jasper.compiler.ParserController.parse(ParserController.java 
:100)
	org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
	org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
	org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
	org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)
	 
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.j 
ava:556)
	 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.j 
ava:296)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java: 
295)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an  
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then  
you should work for Rojo!  If you recommend someone and we hire them  
you'll get a free iPod!
   Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Unable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Otis Gospodnetic
Most definitely Jetty.  I can't believe you're using Tomcat for Rojo!
;)

Otis

--- Erik Hatcher [EMAIL PROTECTED] wrote:

 Wrong list.
 
 Though perhaps you should be using Jetty ;)
 
   Erik
 
 
 On Dec 23, 2004, at 4:17 PM, Kevin A. Burton wrote:
 
  What in the world is up with this exception?
 
  We've migrated to using pre-compiled JSPs in Tomcat 5.5 for  
  performance reasons but if I try to start with a FRESH webapp or
 try  
  to update any of the JSPs and in-place and recompile I'll get this 
 
  error:
 
  Any idea?
 
  I thought maybe the .jar files were corrupt but if I md5sum them
 they  
  are identical to production and the Tomcat standard dist.
 
  Thoughts?
 
  org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1)  
  /init.jsp(2,0) Unable to read TLD META-INF/c.tld from JAR file  
 
 file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/ 
  standard.jar: org.apache.jasper.JasperException: Failed to load or
  
  instantiate TagLibraryValidator class:  
  org.apache.taglibs.standard.tlv.JstlCoreTLV
   
 

org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHan
 
  dler.java:39)
   
 

org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.jav
 
  a:405)
   
 

org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.jav
 
  a:86)
   
 

org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:
 
  339)
  
 org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java: 
  372)
  org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
  org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
  org.apache.jasper.compiler.Parser.parse(Parser.java:126)
   
 

org.apache.jasper.compiler.ParserController.doParse(ParserController.ja
 
  va:211)
   
 

org.apache.jasper.compiler.ParserController.parse(ParserController.java
 
  :100)
  
 org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
  org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
  org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
  org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)
   
 

org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.j
 
  ava:556)
   
 

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.j
 
  ava:296)
  
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java: 
  295)
  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 
 
  -- 
 
  Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for
 an  
  invite!  Also see irc.freenode.net #rojo if you want to chat.
 
  Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
 
  If you're interested in RSS, Weblogs, Social Networking, etc...
 then  
  you should work for Rojo!  If you recommend someone and we hire
 them  
  you'll get a free iPod!
 Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator,  Web - http://peerfear.org/
  GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 
 
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Exception: cannot determine sort type

2004-12-23 Thread Kauler, Leto S
Thanks for the replies!  It would seem best for us to move to specifying
the sort type--good practice anyway and prevents possible field
problems.  I plan to run the stress testing again today but turning off
the sorting (just using default SCORE) and see how that goes.

Seasons greetings to you all.
--Leto


Daniel Naber wrote:
 Is it a certain query that causes this? Does it really only 
 happen under 
 load or does the same query also give this without load?

Each page on our website gathers content from Lucene using predefined
queries, kind of like a database.  The odd thing: I cannot replicate
the problem if I browse the site casually.  It's only under this stress
testing that the problem occurs.  It does not happen on specific
pages/queries, but more randomly--about every second to fourth query has
the exception.

Makes me wonder if our code is crossing over somewhere when multiple
queries are performed at the same time.


Erik Hatcher wrote:
 The issue occurs if the first field it accesses parses as a numeric 
 value and then successive fields are String's.  If you are mixing and 
 matching numeric and text information in this Title_Sort field you 
 should specify the type.

Chris Hostetter wrote:
 I could be wrong, but if i remember right, the code that AUTO uses 
 to determine what sort type to use will treat it as a number if it 
 *starts* with something that looks like a number ... so look for
titles like "1000 year plan" in your data.

That makes sense. Our titles would sometimes contain, even start with,
numbers.


Erik Hatcher wrote:
 *Everything* in Lucene is indexed as a string.  But how a 
 date looks as 
 a string is a topic unto itself.  I prefer to use YYYYMMDD as a date 
 formatted as a string (but when sorting, this could be treated as a 
 numeric).

Will RangeQuery still work with that?  We do have separate date fields
which are indexed like the following code, but a move to the YYYYMMDD
format might be good as then we could apply a blanket String-type sort.

public static Date parseDate( String s ) throws ParseException {
   DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
   return dateFormat.parse(s);
}
doc[0].add(Field.Keyword(field, parseDate( dateInString )));



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Word co-occurrences counts

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:

 1.To be able to return the number of times the word appears in all
 the documents (which it looks like lucene can do through IndexReader)

If you're referring to docFreq(Term t) , that will only return the number 
of documents that contain the term, ignoring how often the term occurs in 
these documents.
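If it's the total number of occurrences you're after, one way is to walk a
TermDocs enumeration and sum freq() (sketch; the path, field and term are
placeholders, and the term must be in its analyzed form):

IndexReader reader = IndexReader.open("/path/to/index");
Term term = new Term("contents", "computer");

int totalOccurrences = 0;
TermDocs termDocs = reader.termDocs(term);
while (termDocs.next()) {
  totalOccurrences += termDocs.freq();   // occurrences of the term within this document
}
termDocs.close();

System.out.println("docFreq           = " + reader.docFreq(term));
System.out.println("total occurrences = " + totalOccurrences);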

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
Ah, so is it possible to return the number of times a term appears?
Daniel Naber wrote:
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
 

1.  To be able to return the number of times the word appears in all
the documents (which it looks like lucene can do through IndexReader)
   

If you're referring to docFreq(Term t) , that will only return the number 
of documents that contain the term, ignoring how often the term occurs in 
these documents.

Regards
Daniel
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
"computer dog"~50 looks like what I'm after - now is there some way I can 
call this and pull
out the number of total occurrences, not just the number of document 
hits? (say if computer
and dog occur near each other several times in the same document).

Paul Elschot wrote:
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
 

Hi all,
I have a curious problem, and initial poking around with Lucene looks
like it may only be able to half-handle the problem.

The problem requires two abilities:
1.	To be able to return the number of times the word appears in all
the documents (which it looks like lucene can do through IndexReader) 
2.	To be able to return the number of word co-occurrences within
the document set (ie. How many times does computer appear within 50
words of  dog) 


Is the second point possible?
   

You can use the standard query parser with a query like this:
"dog computer"~50
This query is not completely symmetric in the distance computation:
when computer occurs before dog, the allowed distance is 49, iirc.
There is also a SpanNearQuery for more generalized and flexible
distance queries, but this is not supported by the query parser,
so you'll have to construct these queries in your own program code.
In case you have non standard retrieval requirements, eg. you only
need the number of hits and no further information from the matching
documents, you may consider using your own HitCollector on the
lower level search methods.
Regards,
Paul Elschot
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Andrew Cunningham wrote:
"computer dog"~50 looks like what I'm after - now is there some way I can 
call this and pull
out the number of total occurrences, not just the number of document 
hits? (say if computer
and dog occur near each other several times in the same document).
You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).
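A sketch of such a Similarity (1.4-era method signatures; install it with
searcher.setSimilarity() before searching, and still divide out the norm as
above):

import org.apache.lucene.search.DefaultSimilarity;

public class OccurrenceCountSimilarity extends DefaultSimilarity {
  public float tf(float freq)                       { return freq; } // raw frequency
  public float sloppyFreq(int distance)             { return 1.0f; } // each sloppy match counts 1
  public float idf(int docFreq, int numDocs)        { return 1.0f; }
  public float coord(int overlap, int maxOverlap)   { return 1.0f; }
  public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
}

// usage, before searching:
//   searcher.setSimilarity(new OccurrenceCountSimilarity());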

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Doug Cutting wrote:
You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).
Much simpler would be to build a SpanNearQuery, call getSpans(), then 
loop, counting how many times Spans.next() returns true.
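i.e. something along these lines (sketch; assumes an open IndexReader named
reader, and the field name, terms and slop are placeholders):

SpanQuery[] words = new SpanQuery[] {
    new SpanTermQuery(new Term("contents", "computer")),
    new SpanTermQuery(new Term("contents", "dog"))
};
SpanNearQuery near = new SpanNearQuery(words, 50, false);  // within 50 positions, either order

Spans spans = near.getSpans(reader);
int count = 0;
while (spans.next()) {   // one span per co-occurrence, possibly several per document
  count++;
}
System.out.println("total co-occurrences: " + count);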

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Unable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Kevin A. Burton
Otis Gospodnetic wrote:
Most definitely Jetty.  I can't believe you're using Tomcat for Rojo!
;)
 

I never said we were using Tomcat for Rojo ;)
Sorry about that btw... wrong list!
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
Thanks Doug and all,
I'm intending to use Lucene to grab a lot of word co-occurrence 
statistics out of a large corpus
to perform word disambiguation. Lucene's looking like a great option, 
but I appear to have hit
a snag. Here's my understanding:

1) Create a Similarity implementation, where:
   tf() returns freq
   sloppyFreq, idf, coord return 1 (because we only need freq to score)
2) Perform the query
3) and then:
   word in document count = 
hits.score(k)/Similarity.decodeNorm(reader.norms("contents")[k])
4) A query call such as
   "computer dog"~50
   will return a count of 2 (I assume because the match occurs 
backwards and forwards).

My problem occurs when I have the following in a text file:
   computer ...(some words)... dog ...(some words)... computer
and I duplicate the text file several times over. Performing the above 
query will return different
phrase counts per document?

Note: I'm just working with some modified demo code at the moment.
Thanks again,
Andrew
Doug Cutting wrote:
Andrew Cunningham wrote:
"computer dog"~50 looks like what I'm after - now is there some way I 
can call this and pull
out the number of total occurrences, not just the number of document 
hits? (say if computer
and dog occur near each other several times in the same document).

You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to 
get rid of the lengthNorm() and field boost (if any).

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Shui Cheung Yip/JerseyCity/iNautix is out of the office.

2004-12-23 Thread syip
I will be out of the office starting  12/23/2004 and will not return until
12/28/2004.

 I will respond to your message when I return. For CashEdge Dev issues,
please contact  Aravind Ravi Subramania or  Subramaniam Sundaram,  In case
of emergency, I can be reached through OSS.

Thank you.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Exception: cannot determine sort type

2004-12-23 Thread Erik Hatcher
On Dec 23, 2004, at 6:15 PM, Kauler, Leto S wrote:
Erik Hatcher wrote:
*Everything* in Lucene is indexed as a string.  But how a
date looks as
a string is a topic unto itself.  I prefer to use YYYYMMDD as a date
formatted as a string (but when sorting, this could be treated as a
numeric).
Will RangeQuery still work with that?  We do have separate date fields
which are indexed like the following code, but a move to the YYYYMMDD
format might be good as then we could apply a blanket String-type sort.
public static Date parseDate( String s ) throws ParseException {
   DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
   return dateFormat.parse(s);
}
doc[0].add(Field.Keyword(field, parseDate( dateInString )));
Using YYYYMMDD works better for RangeQuery than Field.Keyword(String, 
Date) does.  Using the built-in Date field goes down to the millisecond 
level.  If you have lots of documents on the same day, but different 
milliseconds, you end up with lots of terms.  RangeQuery expands into a 
BooleanQuery OR'd with all the matching terms.  BooleanQuery has a 
built-in default of 1,024 allowed clauses, otherwise you get a 
TooManyClauses exception.

YYYYMMDD is a numeric, and to sort by that field I'd recommend you use 
a numeric type as it'll use much less memory.  But certainly doing some 
tests between using a numeric vs. a String sorting type is advisable, to 
see how it performs with each.
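Putting that together (sketch; the field name and dates are made up, and doc and
searcher are assumed to exist):

// index time: the date as a YYYYMMDD keyword
doc.add(Field.Keyword("pub_date", "20041223"));

// search time: a RangeQuery over those terms...
Query dateRange = new RangeQuery(new Term("pub_date", "20040101"),
                                 new Term("pub_date", "20041231"), true);

// ...and a numeric sort on the same field, which uses less memory than a String sort
Hits hits = searcher.search(dateRange, new Sort(new SortField("pub_date", SortField.INT)));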

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Word co-occurrences counts

2004-12-23 Thread Erik Hatcher
On Dec 24, 2004, at 12:40 AM, Andrew Cunningham wrote:
3) and then:
   word in document count = 
hits.score(k)/Similarity.decodeNorm(reader.norms("contents")[k])
You should use hits.id(k), not k, as the index into 
reader.norms("contents").
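i.e. (sketch):

int doc = hits.id(k);   // the real document number, not the position in the Hits list
float count = hits.score(k)
              / Similarity.decodeNorm(reader.norms("contents")[doc]);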

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]