Re: Boost doesn't work
Claude Libois writes: Hello. I'm using Lucene for an application and I want to boost the title of my documents. To do that I apply the setBoost method to the title field. However, when I look with Luke (1.6) I don't see any boost on this field, and when I do a search the score isn't changed. What's wrong? How do you search? I guess you cannot see a change unless you combine searches in different fields, since scores are normalized. Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boost doesn't work
Claude Libois writes: The explanation given by the IndexSearcher indicates that the boost of my title is 1.0 where it should be 10.0. I really don't understand what's wrong. AFAIK you cannot get the boost of a field from the index because it's not stored as such. It's folded into the field's length norm, or something like that, during indexing. Search the list archives for details. Morus
Re: Search performance with one index vs. many indexes
Jochen Franke writes: Topic: Search performance with large numbers of indexes vs. one large index. My questions are: - Is the size of the wordlist the problem? - Would we be a lot faster, when we have a smaller number of files per index? Sure. Look: index lookup of a word is O(ln(n)), where n is the number of words. Lookup of a word in k indexes of m words each is O(k * ln(m)). In the best case all word lists are distinct (purely theoretical), that is n = k*m, or m = n/k. For n = 15 million and k = 800: ln(n) = 16.5, while k*ln(n/k) = 7871. In a realistic case m is much bigger, since the word lists won't be distinct. But it's the linear factor k that bites you. In the worst case (all words in all indices) you get k*ln(n) = 13218.8. HTH Morus
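The arithmetic above can be checked with a few lines of Java (the class and method names here are mine, just for illustration):

```java
public class IndexLookupCost {
    // Relative cost of one term lookup: binary search over a sorted
    // term list is O(ln n) in the number of distinct terms n.
    static double oneIndex(long n) {
        return Math.log(n);
    }

    // Searching k separate indexes of m distinct terms each
    // costs k * ln(m), since every index must be consulted.
    static double manyIndexes(int k, long m) {
        return k * Math.log(m);
    }

    public static void main(String[] args) {
        long n = 15_000_000L; // distinct terms overall
        int k = 800;          // number of separate indexes
        System.out.printf("one index:             %.1f%n", oneIndex(n));           // 16.5
        System.out.printf("k indexes, best case:  %.1f%n", manyIndexes(k, n / k)); // ~7871
        System.out.printf("k indexes, worst case: %.1f%n", manyIndexes(k, n));     // ~13218.8
    }
}
```

The linear factor k dominates either way: even in the unrealistically favourable case of disjoint word lists, 800 small indexes cost roughly 475 times as much per lookup as one merged index.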
Re: help with boolean expression
Omar Didi writes: I have a problem understanding how Lucene would interpret this boolean expression: A AND B OR C. It neither returns the same count as (A AND B) OR C nor as A AND (B OR C). If anyone knows how it is interpreted I would be thankful. Thanks. A AND B OR C creates a query that requires A and B. C influences the score, but is neither sufficient nor required for a match. IMO query parser is broken for queries mixing AND and OR without explicit braces. My favorite example is `a AND b OR c AND d', which equals `a AND b AND c AND d' in query parser. I suggested a patch some time ago, but it's still pending in bugzilla. http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 Don't know if it's still usable with current sources. Morus
Re: Sorting date stored in milliseconds time
Ben writes: I store my date in milliseconds; how can I sort on it? SortField has INT, FLOAT and STRING. Do I need to create a new sort class to sort the long value? Why do you need that precision? Remember: there's a price to pay. The memory required for sorting and the time to set up the sort cache depend on the number of different terms - dates, in your case. I can hardly think of an application where seconds are relevant; what do you need milliseconds for? Morus
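One way to avoid that cost is to index the date at a coarser resolution. A minimal sketch (class and method names are mine): day-resolution strings sort lexicographically in chronological order and collapse millions of distinct millisecond values into a few thousand terms, which keeps the sort cache small.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DayResolution {
    // Render a millisecond timestamp as a yyyyMMdd string.
    // Lexicographic order of these strings equals chronological order.
    static String toDayTerm(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        System.out.println(toDayTerm(0L));          // 19700101
        System.out.println(toDayTerm(86_400_000L)); // 19700102
    }
}
```

Index the resulting string in a keyword field and sort on it with SortField.STRING; only one term per distinct day ends up in the sort cache.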
Re: select where from query type in lucene
Miles Barr writes: On Fri, 2005-02-18 at 03:58 +0100, Miro Max wrote: how can I search for content where type=document or (type=document OR type=view)? Actually I can do it with (type:document OR type:entry) AND queryText as the query string, but does a better way exist to realize this? [...] Another alternative is to put each type in its own index and use a MultiSearcher to pull in the types you want. If the change rate of the index and the number of commonly used type combinations aren't too large, cached filters might be another alternative. Of course the filter would have to be recreated whenever the index changes. The advantage is that, wherever the filter is reused, you save searching for the types on each query, while keeping all documents within one index. Morus
RE: Concurrent searching re-indexing
Paul Mellor writes: 1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)? On Unix, open files are still there even if they are deleted (that is, there is no link (filename) to the file anymore, but the file's content still exists). On Windows you cannot delete open files, so Lucene AFAIK (I don't use Windows) postpones the deletion to a time when the file is closed. 2. Does a similar potential problem exist when optimising an index, if this combines all the segments into a single file? AFAIK optimising creates new files. The only problem that might occur is opening a reader during an index change, but that's handled by a lock. HTH Morus
Re: sounds like spellcheck
Aad Nales writes: Steps 2 and 3 have been discussed at length in this forum and have even made it to the sandbox. What I am left with is 1. My thinking is processing a series of replacement statements that go like: -- 'g' sounds like 'ch' if the immediate predecessor is an 's'; 'o' sounds like 'oo' if the immediate predecessor is a consonant -- But before I take this to the next step I am wondering if anybody has created or thought up alternative solutions? An implementation of a rule-based system to create such a pronunciation form can be found in a library called makelib that is part of an editor named leanedit. Unfortunately the website seems to be down. The lib is LGPL. If you're interested, I can send you a copy of the sources. The only ruleset available is German, though. Morus
Re: Disk space used by optimize
Bernhard Messer writes: However, three times the space sounds a bit too much, or I make a mistake in the book. :) There already was a discussion about disk usage during index optimize. Please have a look at the developers list at: http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1797569 where I made some measurements of the disk usage within Lucene. At that time I proposed a patch which reduced the total used disk space from 3 times to a little more than 2 times the final index size. Together with Christoph we implemented some improvements to the optimization patch and finally committed the changes. Hmm. In the case that the index is in use (open reader), I doubt your patch makes a difference. In that case the disk space used by the non-optimized index will still be occupied even if the files are deleted (on Unix/Linux). What happens if disk space runs out during creation of the compound index? Will the non-compound files be a usable index? Otherwise you risk losing the index. Morus
Re: document numbers
Hi Jonathan, Yet another burning question :-). Can someone explain how the document numbers in Lucene documents work? For example, the TermDocs.doc() method returns the current doc number. How can I get this doc number if I just have a Document? I don't think you can. A document does not even have to be indexed yet. So either you're dealing with some document found in the index, in which case you should have the document number already, or you have a document independent of the index, in which case you have to analyze the document's content and count yourself. Note that term vector support might be useful if you're interested in more than one term (but that requires the document number again). Morus
RE: closing an IndexSearcher
Hi Cocula, And now here is code that works: the only difference from the previous version is the QueryParser call before new IndexWriter. The QueryParser.parse statement seems to close the IndexReader but I really can't figure out how. I rather suspect your OS/filesystem delays the effect of the close. QueryParser does not even know about your searcher. What OS are you using? Morus
Re: English and French documents together / analysis, indexing, searching
[EMAIL PROTECTED] writes: you could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem? Would that mean I would have to actually conduct two searches (one in English and one in French), then merge the results and display them to the user? No. You could do a ( ( french-query ) OR ( english-query ) ) construct using one query. So query construction would be a bit more complex, but querying itself wouldn't change. The first thing I'd do in your case would be to look at the differences in the output of the English and French snowball stemmers. I don't speak any French, but you might even be able to use both stemmers on all texts. Morus
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Owen Densmore writes: 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems, i.e. not really human readable. (Example: generate, generates, generated, generating - generat) Although in typical queries this is not important, because the result of the search is a document list, it *would* be important if we used the stems within a graphical navigation interface. So the question is: is there a way to have the stemmer produce English base forms of the words being stemmed? Rule-based stemmers such as Porter/Snowball cannot do that. But there are (commercial) dictionary-based tools that can, e.g. the Canoo lemmatizer. You might also have a look at egothor's stemmers, which are word-list based. HTH Morus
Re: Best way to find if a document exists, using Reader ...
Praveen Peddi writes: Does it make sense to call docFreq or termDocs (whichever is faster) before calling delete? IMO no. Calling termDocs is what Reader.delete(Term) does:

public final int delete(Term term) throws IOException {
  TermDocs docs = termDocs(term);
  if (docs == null) return 0;
  int n = 0;
  try {
    while (docs.next()) {
      delete(docs.doc());
      n++;
    }
  } finally {
    docs.close();
  }
  return n;
}

(The advantage of OSS is that you can look into its sources.) So it already uses termDocs to see if there's anything to do. I doubt that using docFreq would be much faster. In both cases the term is searched and -- if you don't have to delete anything -- not found. If it's found, docFreq might be faster, but in that case you have to delete and use termDocs anyway. Morus
Re: IndexSearcher and number of occurence
Bertrand VENZAL writes: I'm quite new to this mailing list. I have many difficulties finding the number of occurrences of a word in a document. I need to use IndexSearcher because of the query, but the score returned is not what I'm looking for. I found in the mailing list the class TermDocs, but it seems to work only with IndexReader. The use of a searcher does not prevent the use of a reader (in fact the searcher relies on a reader). So I'd use the searcher to find the document and a reader to get the frequency using IndexReader.termDocs. Depending on how many frequencies you're interested in, the term vector support might be of interest. HTH Morus
Re: HELP! Directory is NOT getting closed!
Joseph Ottinger writes: According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir is set, it's supposed to close the directory. That's fine - but that leads me to believe that for some reason closeDir is *not* set. Why? Under what circumstances would this not be true, and under what circumstances would you NOT want to close the Directory? From the sources you can see that it is true only if the directory was created by the IndexWriter itself. If you provide a directory to the IndexWriter, you have to close it yourself. HTH Morus
Re: Check to see if index is optimized
Crump, Michael writes: Is there a simple way to check and see if an index is already optimized? What happens if optimize is called on an already optimized index - does the call basically do a noop? Or is it still an expensive call? Why don't you just try that? E.g. using Luke. Or three lines of code... You will find that calling optimize on an optimized index does not change the index. (Optimized means just one segment and no deleted documents.) So I guess the answer to your first question can be found in the sources of optimize:

public synchronized void optimize() throws IOException {
  flushRamSegments();
  while (segmentInfos.size() > 1 ||
         (segmentInfos.size() == 1 &&
          (SegmentReader.hasDeletions(segmentInfos.info(0)) ||
           segmentInfos.info(0).dir != directory ||
           (useCompoundFile &&
            (!SegmentReader.usesCompoundFile(segmentInfos.info(0)) ||
             SegmentReader.hasSeparateNorms(segmentInfos.info(0))))))) {
    int minSegment = segmentInfos.size() - mergeFactor;
    mergeSegments(minSegment < 0 ? 0 : minSegment);
  }
}

segmentInfos is private in IndexWriter, so I suspect you cannot check that without modifying Lucene. HTH Morus
Re: Deleting index for DB indexing
mahaveer jain writes: I am using Lucene for my DB indexing. I have 2 columns which are Keyword fields. Now I want to delete index entries based on these 2 keywords. Is it possible? If not, what is the alternative? You can delete documents based on document number from an index reader, and you can get document numbers from searches. So if you can search for the documents to be deleted based on your keywords, there should be no problem deleting them... HTH Morus
Re: QueryParser, default operator
Paul writes: the following code

QueryParser qp = new QueryParser(itemContent, analyzer);
qp.setOperator(org.apache.lucene.queryParser.QueryParser.DEFAULT_OPERATOR_AND);
Query query = qp.parse(line, itemContent, analyzer);

doesn't produce the expected result, because a query foo bar results in: itemContent:foo itemContent:bar whereas foo AND bar results in +itemContent:foo +itemContent:bar If I understand the default operator correctly, the first query should have been expanded to the same as the latter one, shouldn't it? Try qp.parse(line). parse(String query, String field, Analyzer analyzer) is a static method that creates its own instance of QueryParser, which does not know anything about the settings of your qp object. Morus
Re: (Offtopic) The unicode name for a character
Hi Peter, The Question: In Java generally, is there an easy way to get the Unicode name of a character? (e.g. LATIN SMALL LETTER A from 'a') ... I'm considering taking the Unicode name for each character I encounter and regexping it against something like: ^LATIN .* LETTER (.) WITH .*$ ... to try and extract the single A-Z|a-z character. There used to be a list (ASCII) on some ftp server at unicode.org. I have a version 'UnicodeData.txt' here. It lists ~12000 characters in the form 01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;N;LATIN CAPITAL LETTER P HOOK;;;01A5; 01A5;LATIN SMALL LETTER P WITH HOOK;Ll;0;L;N;LATIN SMALL LETTER P HOOK;;01A4;;01A4 If you cannot find that list somewhere, I can mail you a copy. It would be a nice contribution if you could add your filter to Lucene's sandbox once it's finished. Morus
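A sketch of the approach, using the two UnicodeData.txt records quoted above (the class name and the case handling are my additions; the regex is adapted from the one proposed in the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeNameFold {
    // The second semicolon-separated field of a UnicodeData.txt record
    // is the character name; the regex pulls the base letter out of
    // names like "LATIN CAPITAL LETTER P WITH HOOK".
    private static final Pattern BASE =
        Pattern.compile("^LATIN (CAPITAL|SMALL) LETTER (.) WITH .*$");

    // Returns the plain A-Z/a-z letter for a record, or null if the
    // name doesn't match the LATIN ... WITH ... pattern.
    static String baseLetter(String dataLine) {
        String name = dataLine.split(";")[1];
        Matcher m = BASE.matcher(name);
        if (!m.matches()) return null;
        String letter = m.group(2);
        return m.group(1).equals("SMALL") ? letter.toLowerCase() : letter;
    }

    public static void main(String[] args) {
        // Sample records quoted in the mail above.
        System.out.println(baseLetter(
            "01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;N;LATIN CAPITAL LETTER P HOOK;;;01A5;")); // P
        System.out.println(baseLetter(
            "01A5;LATIN SMALL LETTER P WITH HOOK;Ll;0;L;N;LATIN SMALL LETTER P HOOK;;01A4;;01A4")); // p
    }
}
```

Parsing the full file is just a matter of applying baseLetter to each line and keeping the code-point-to-letter pairs that match.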
Re: Synonyms for AND/OR/NOT operators
Erik Hatcher writes: On Dec 21, 2004, at 3:04 AM, Sanyi wrote: What is the simplest way to add synonyms for AND/OR/NOT operators? I'd like to support two sets of operator words, so people can use either the original English operators or my custom ones for our local language. There are two options that I know of: 1) add synonyms during indexing and 2) add synonyms during querying. Generally this would be done using a custom analyzer. I guess you misunderstood the question. I think he wants to know how to create a query parser understanding something like 'a UND b' as well as 'a AND b', to support localized operator names (German in this case). AFAIK that can only be done by copying query parser's javacc source and adding the operators there. Shouldn't be difficult, though it's a bit ugly since it implies code duplication. And there will be no way of choosing the operators dynamically at runtime; one will need different query parsers for different languages. Morus
Re: Synonyms for AND/OR/NOT operators
Sanyi writes: Well, I guess I'd better recognize and replace the operator synonyms with their original format before passing them to QueryParser. I don't feel comfortable tampering with Lucene's source code. Apart from knowing how to compile Lucene (including the javacc code generation) you should only need to change

<DEFAULT> TOKEN : {
  <AND: ("AND" | "&&")>
| <OR: ("OR" | "||")>
| <NOT: ("NOT" | "!")>

to

<DEFAULT> TOKEN : {
  <AND: ("AND" | "your version of AND" | "&&")>
| <OR: ("OR" | "your version of OR" | "||")>
| <NOT: ("NOT" | "your version of NOT" | "!")>

in jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj Replacing the operators before parsing might be hard to do if you want to handle cases like "a AND b" OR c -- a query for the phrase "a AND b" or the token c -- correctly. Morus
Re: Queries difference
Alex Kiselevski writes: Hello, I want to know if there is a difference between the queries: +city(+London Amsterdam) +address(1_street 2_street) and +city(+London) +city(Amsterdam) +address(1_street) +address(2_street) I guess you mean city:(... and so on. The first query searches documents containing 'London' in city (scoring results that also contain 'Amsterdam' higher) and containing '1_street' or '2_street' in address. The second query searches for documents containing both 'London' and 'Amsterdam' in city and both '1_street' and '2_street' in address. Note that the '+' before 'London' in the second query doesn't mean anything. HTH Morus
Re: NUMERIC RANGE BOOLEAN
Erik Hatcher writes: TooManyClauses exception occurs when a query such as a RangeQuery expands to more than 1024 terms. I don't see how this could be the case in the query you provided - are you certain that is the query that generated the error? Why not? The terms might be 0003 0003.1 0003.11 ... So the question is, what do his terms look like... Morus
Re: Unexpected TermEnum behavior
Chris Hostetter writes: I thought it was documented in the TermEnum interface, but looking at it now I realize that not only does the TermEnum javadoc not explain it very well, but the class FilteredTermEnum (which implements TermEnum) actually documents the opposite behavior... public Term term() Returns the current Term in the enumeration. Initially invalid, valid after next() called for the first time. That's a documentation bug. Fixed in CVS. http://issues.apache.org/bugzilla/show_bug.cgi?id=32353 Morus
Re: hits.length() changes during delete process.
David Townsend writes: So the short question is: should the hits object be changing, and what is the best way to delete all the results of a search (it's a range query, so I can't use delete(Term term))? The hits object caches only part of the hits (initially the first 100 or so). This cache is extended, by repeating the search, when further hits are accessed. Since you deleted part of the hits at this point, your hits object changes. You should be able to get around this by either scanning the hits object from end to start instead of start to end, or by deleting with a different index reader; in the latter case the searcher should not see the deletions. Reversing the order might be preferable, since it implies only one search repetition. (Both suggestions untested.) The best way would probably be to avoid a Hits object entirely and delete the documents at the level where the hits object is created. Have a look at the sources for details. (Also untested; I never needed more than term-based deletions.) Morus
Re: indexReader close method
Helen Warren writes: //close the IndexReader object myReader.close(); //return results return hits; The myReader.close() line causes the IOException to be thrown. Are you sure it's the myReader.close() that fails? I'd suspect that to fail as soon as you want to do anything meaningful with the hits object you return. You need an open searcher/reader for that, and in general it should be the one you used during the search. This is assuming hits is an instance of class org.apache.lucene.search.Hits; its method Document doc(int n) relies on the searcher used for the search not being closed. So I'd suspect the IOException is thrown later. Of course removing the myReader.close(); will prevent the exception. You cannot close the reader as long as you want to access search results. In this case, the reader appears to close without error but even after I've called myReader.close() I can execute the maxDoc() method on that object and return results. Anybody shed any light? Yes: the source ;-) maxDoc does not access the index files but returns an integer stored in the class itself. Morus
Re: Numeric Range Restrictions: Queries vs Filters
Hoss writes: (c) Filtering. Filters in general make a lot of sense to me. They are a way to specify (at query time) that only a certain subset of the index should be considered for results. The Filter class has a very straightforward API that seems very easy to subclass to get the behavior I want. The Query API on the other hand ... I freely admit that I can't make heads or tails out of it. I don't even know where I would begin to try and write a new subclass of Query if I wanted to. I would think that most people who want to do a numeric range restriction on their data probably don't care about the scoring benefits of RangeQuery. Looking at the code base, the way DateFilter works seems like it provides an ideal solution to any sort of range restriction (not just dates) that *should* be more efficient than using RangeQuery when dealing with an unbounded value set. (Both approaches need to iterate over all of the terms in the specified field using TermEnum, but RangeQuery has to build up a set of BooleanQuery objects for each matching term, and then each of those queries has to help score the documents -- DateFilter on the other hand only has to maintain a single BitSet of documents that it finds as it iterates.) IMO there's another option, at least as long as the number of your documents isn't too high. Sorting already creates a list of all field values for some field that will be used during the search for sorting. Nothing prevents you from using that approach for search restriction also. The advantage is that you can create that list once and use it for different ranges until the index is changed, whereas a filter can only represent one range. The disadvantage is that you have to keep one value for each document in memory instead of one bit in a filter. I did that (before the sort code was introduced) for date queries in order to be able to sort and restrict searches on dates.
But I haven't thought about what a general API for such a solution might look like so far. Of course it depends on a number of questions which way is preferable: how often is the index modified, are range queries usually done for the same or different ranges, how many documents are indexed, and so on. Morus
Re: Help on the Query Parser
Terence Lai writes: Looks like the wildcard query disappeared. In fact, I am expecting text:"java* developer" to be returned. It seems to me that the QueryParser cannot handle a wildcard within a quoted string. That's not just QueryParser. Lucene itself doesn't handle wildcards within phrases. You could have a query text:"java* developer" if '*' isn't removed by the analyzer, but it would only search for the token 'java*', not any expansion of it. I guess this is not what you want. Morus
Re: Using multiple analysers within a query
Kauler, Leto S writes: Would anyone have any suggestions on how this could be done? I was thinking maybe the QueryParser would have to be changed/extended to accept a separator other than colon ':', something like '=' for example, to indicate that a clause is not to be tokenised. I suggested that in a recent discussion and Erik Hatcher objected that it isn't a good idea to require that users know which field to query in which way. I guess he is right. If your query isn't entered by users, you shouldn't use query parser in most cases anyway. Or perhaps this can all be done using a single analyser? Look at PerFieldAnalyzerWrapper. You will probably have to write a keyword analyzer (unless you can use the whitespace analyzer in your case). HTH Morus
Re: Using multiple analysers within a query
Erik Hatcher writes: If your query isn't entered by users, you shouldn't use query parser in most cases anyway. I'd go even further and say in all cases. If you use Lucene as a search server you have to provide the query somehow. E.g. we have a PHP application that sends queries to a Lucene search servlet. In this case it's justifiable to serialize the query into query parser syntax on the client side and have query parser read the query again on the server side. I don't recall any problems with the approach, since we clean up the user input before constructing the query. Morus
Re: WildcardTermEnum skipping terms containing numbers?!
Sanyi writes: If there's a bug, it should be tracked down, not worked around... Sure, but I'm working with 20 million records and it takes about 25 hours to re-index, so I'm looking for a way that doesn't require reindexing. Why reindex? My code was:

WildcardTermEnum wcenum = new WildcardTermEnum(reader, term);
while (wcenum.next()) {
  terms.add(new WeightedTerm(termgroup, wcenum.term().text()));
  //System.out.println(wcenum.term().text());
}

And it skipped lots of things it shouldn't have skipped. As stated at the end of my mail, I'd expect that to skip the first term in the enum. Is that what you miss, or do you lose more than one term? Morus
Re: WildcardTermEnum skipping terms containing numbers?!
Sanyi writes: Enumerating the terms using WildcardTermEnum and an IndexReader seems to be too buggy to use. If there's a bug, it should be tracked down, not worked around... But it looks OK to me:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;

public class LuceneTest {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("foo", "blabla etc.. etc... c0la c0ca caca ccca", true, true, true));
    writer.addDocument(doc);
    writer.close();
    IndexReader reader = IndexReader.open(dir);
    WildcardTermEnum enum = new WildcardTermEnum(reader, new Term("foo", "c??a"));
    do {
      System.out.println(enum.term().text());
    } while ( enum.next() );
    WildcardQuery wq = new WildcardQuery(new Term("foo", "c??a"));
    Query q = wq.rewrite(reader);
    System.out.println(q.toString());
    reader.close();
  }
}

gives

c0ca
c0la
caca
ccca
foo:c0ca foo:c0la foo:caca foo:ccca

The only bug I see is in the docs, which claim enum.term() to be invalid before the first call to next(); that does not seem to be the case. So if you use while ( enum.next() ) { ... } you will lose the first term, whatever it is. Looking at the sources I find that this behaviour is shared by FuzzyTermEnum. Both implementations of the abstract FilteredTermEnum class call setEnum at the end of the constructor, which prepares the first result. Morus
Re: problems search number range
[EMAIL PROTECTED] writes: I need to solve this search: number: -10, range: -50 TO 5. I need help.. I can't find anything using Google. If your numbers are in the interval MIN..MAX and MIN < 0, you can shift that to a positive interval 0..(MAX-MIN) by subtracting MIN from each number. Alternatively you have to find a string representation providing the correct order for signed integers. E.g. -0010 -0001 0 1 00020 should work (in the range -..9), since '0' has a higher ASCII (Unicode) code than '-'. Of course the analyzer has to preserve the '-', and the '-' should not be eaten by the query parser in case you use it. I don't know if there are problems with that, but I suspect there are, at least for the query parser. Morus
Re: problems search number range
[EMAIL PROTECTED] writes: This solution was the first that I tried, but it does not run correctly, because when we try to sort these numbers in alphanumeric order we find that -0010 is higher than -0001. Right, I failed to see that. So you would have to use a complement for negative numbers as well, e.g. using -9989 for -10, -9998 for -1, ... But shifting the interval is easier, of course. Morus
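The interval-shift approach can be sketched as follows (the class name and the MIN/MAX bounds are mine; pick bounds that cover your data). After shifting, every encoded term is non-negative and zero-padded, so plain lexicographic term order equals numeric order and an ordinary range query works:

```java
public class SortableNumbers {
    static final int MIN = -50;  // smallest value we expect to index (assumption)
    static final int MAX = 500;  // largest value we expect to index (assumption)

    // Shift the value into 0..(MAX-MIN) and zero-pad it, so that
    // lexicographic order of the encoded terms equals numeric order.
    static String encode(int value) {
        if (value < MIN || value > MAX)
            throw new IllegalArgumentException("out of range: " + value);
        int width = Integer.toString(MAX - MIN).length();
        return String.format("%0" + width + "d", value - MIN);
    }

    public static void main(String[] args) {
        System.out.println(encode(-50)); // 000
        System.out.println(encode(-10)); // 040
        System.out.println(encode(5));   // 055
        System.out.println(encode(500)); // 550
    }
}
```

The range "-50 TO 5" from the question then becomes a query on the encoded terms 000 TO 055, with no '-' character left for the analyzer or query parser to mishandle.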
Re: Searching and indexing from different processes (applications)
K Kim writes: I just started to play around with Lucene. I was wondering if searching and indexing can be done simultaneously from different processes (two different processes). For example, searching is serviced from a web application, while indexing is done periodically from a stand-alone application. What would be the best way to implement this? Simply do it. The only things you have to keep in mind are: a) you cannot have more than one process/thread writing to Lucene; b) an index reader/searcher will not see updates unless it's closed and reopened. So all you need is your web app, your indexing process, and some way to inform the web app after indexing that it should reopen the index. Morus
Re: Phrase search for more than 4 words throws exception in QueryParser
Sanyi writes: How to perform phrase searches for more than four words? This works well with 1.4.2: "aa bb cc dd" I pass the query as a command line parameter on XP: \"aa bb cc dd\" QueryParser translates it to: text:aa text:bb text:cc text:dd Runs, searches, finds proper matches. This throws an exception in QueryParser: "aa bb cc dd ee" I pass the query as a command line parameter on XP: \"aa bb cc dd ee\" The exception's text is: org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 13. Encountered: <EOF> after : "aa bb cc dd Works for me on linux: java -cp lucene.jar org.apache.lucene.queryParser.QueryParser 'a b c d e f g h i j k l m n o p q r s t u v w x y z' a b c d e f g h i j k l m n o p q r s t u v w x y z Must be an XP command line problem. HTH Morus
Re: stopword AND validword throws exception
Sanyi writes: This query works as expected: validword AND stopword (throws out the stopword part and searches for validword) This query seems to crash: stopword AND validword (java.lang.ArrayIndexOutOfBoundsException: -1) Maybe it can't handle the case if it had to remove the very first part of the query?! Can anyone else test this for me? How can I overcome this problem? see bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=9110 Morus
Re: stopword AND validword throws exception
Sanyi writes: Thanx for your replies guys. Now, I was trying to locate the latest patch for this problem group, and the last thread I've read about this is: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 It ends with an open question from Morus: If you want me to change the patch, let me know. That's no big deal. Did you change the patch since then? No. But this is an independent issue from the `stopword AND word' problem. The `stopword AND word' problem just has to be taken care of in that context also. Bug 25820 basically is about better handling of AND and OR in a query. Currently `a AND b OR c AND d' equals `a AND b AND c AND d' in query parser. Can I simply download the latest compiled development version of lucene.jar and will it fix my problem? If there are no current nightly builds, I guess you will have to get the sources from cvs directly. But the fix seems to be included in 1.4.2. see http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4 item 5 Morus
Re: A TokenFilter to split words and numbers
william.sporrong writes: Does it have something to do with the QueryParser guessing what kind of query it is by examining the string and thus presuming that the first string should not be parsed into a PhraseQuery? QueryParser creates a PhraseQuery for words that are tokenized to more than one token. You should see that in the serialized query. Anyway, if there is a correct way to accomplish what I want could anyone please give me a hint? One way I thought about is pre-parsing the query and constructing several subqueries, i.e. PhraseQuerys and so on, and then combining them in a BooleanQuery, but I guess there is a nicer solution? I guess you could override the getFieldQuery method of query parser and change the way queries are generated. I have a similar problem with another Filter I'm trying to implement that should remove certain suffixes and replace them with a wildcard (bilar -> bil*). If you expect bil* to be executed as a wildcard/prefix query, this cannot work. The query parser parses the query, not the analyzer output. Again you might introduce such behaviour in getFieldQuery. Morus
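A sketch of the getFieldQuery override, assuming the 1.4-era QueryParser signature (the class name, field names, and the prefix-rewrite logic are made up; check the method signature in your Lucene version):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Hypothetical parser that could turn a token the analyzer rewrote to a
// suffix-stripped form (e.g. "bilar" -> "bil") into a PrefixQuery instead
// of an ordinary term/phrase query.
public class SuffixWildcardQueryParser extends QueryParser {
    public SuffixWildcardQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText)
            throws ParseException {
        Query q = super.getFieldQuery(field, analyzer, queryText);
        // ... inspect q (or re-run the analyzer on queryText) and, where a
        // token was reduced to a stem, build a prefix query yourself, e.g.:
        // return new PrefixQuery(new Term(field, "bil"));
        return q;
    }
}
```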
Re: jaspq: dashed numerical values tokenized differently
Daniel Taurat writes: Hi, I have just another stupid parser question: There seems to be a special handling of the dash sign - different from Lucene 1.2 at least in Lucene 1.4.RC3 StandardAnalyzer. Examples (1.4RC3): A document containing the string dash-test is matched by the following search expressions: dash test, dash*, dash-test. It is _not_ matched by the following search expressions: dash-*, dash-t*. If the string after the dash consists of digits, the behavior is different. E.g., a document containing the string dash-123 is matched by: dash*, dash-*, dash-123. It is not matched by: dash 123. Question: Is this, esp. the different behavior when parsing digits and characters, intentional and how can it be explained? Regards, Query parser was changed to treat '-' within words as part of the word. Before that change a query 'dash-test' was parsed as 'dash AND NOT test'. Now QP reads one word 'dash-test' which is analyzed. If the analyzer splits that into more than one token (standard analyzer does) a phrase query is created. The difference you see comes from standard analyzer, which tokenizes dash-test to the tokens dash and test, but keeps dash-123 as one token. Prefix queries aren't analyzed. Morus
Re: Locks and Readers and Writers
[EMAIL PROTECTED] writes: Hi Christoph, That's what I thought. But what I'm seeing is this: - open reader for searching (the reader is opening an index on a remote machine (via UNC) which takes a couple seconds) - meanwhile the other service opens an IndexWriter and adds a document (the index writer determines that it needs to merge so it tries to get a lock. since the reader is still opening, the IO exception is thrown) I believe that increasing the merge factor will reduce the opportunity for this to occur. But it will still occur at some point. I'm not sure what you mean by `opening an index on a remote machine (via UNC)' but have you made sure that lock files are put in the same directory for both processes (see the mailing list archive for details)? Also note that lucene's locking is known not to work on NFS (also see the list archive). I don't know if it works on SMB mounts. Morus
Re: Searching for a phrase that contains quote character
Daniel Naber writes: On Thursday 28 October 2004 19:03, Justin Swanhart wrote: Have you tried making a term query by hand and testing to see if it works? Term t = new Term(field, "this is a \"test\""); PhraseQuery pq = new PhraseQuery(t); That's not a proper PhraseQuery, it searches for *one* term `this is a "test"' which is probably not what one wants. You have to add the terms one by one to a PhraseQuery. Will spoke of a keyword field, in which case he would want to search for one term. Using a TermQuery makes more sense, though. Morus
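For reference, a proper PhraseQuery adds its terms one by one, all in the same field (field and words are made up; the 1.4-era API is assumed):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseExample {
    public static void main(String[] args) {
        PhraseQuery pq = new PhraseQuery();
        // one Term per word, not one Term holding the whole phrase
        pq.add(new Term("contents", "this"));
        pq.add(new Term("contents", "is"));
        pq.add(new Term("contents", "a"));
        pq.add(new Term("contents", "test"));
        System.out.println(pq.toString("contents")); // prints the phrase in quotes
    }
}
```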
Re: new version of NewMultiFieldQueryParser
Bill Janssen writes: Try to see the behavior if you want to have a single term query, just something like: robust .. and print out the query string ... Sure, that works fine. For instance, if you have the three default fields title, authors, and contents, the one-word search robust expands to title:robust authors:robust contents:robust just as it should. Try to see what is happening with Prefix, Wild, and Fuzzy searches ... Good point. My older version (see below) found these, but the new one doesn't. Oh, well, back to the working version. I knew there was some reason getFieldQuery wasn't sufficient. Wouldn't it be better to go on and override the methods creating these types of queries too? Morus
Re: Locks and Readers and Writers
Christoph Kiehl writes: AFAIK you should never open an IndexWriter and an IndexReader at the same time. You should use only one of them at a time but you may open as many IndexSearchers as you like for searching. You cannot open an IndexSearcher without opening an IndexReader (explicitly or implicitly). Morus
Re: Null or no analyzer
Erik Hatcher writes: however perhaps it should be. Or perhaps there are other options to solve this recurring dilemma folks have with Field.Keyword indexed fields and QueryParser? I think one could introduce a special syntax in query parser for keyword fields. Query parser wouldn't analyze them at all in this case. Something like field#Keyword or field#"keyword containing blanks". I haven't thought through all consequences for field#(keywordA keywordB otherfield:noKeyword) but I think it should be doable. Doesn't make query parser simpler, on the other hand. Morus
RE: Null or no analyzer
Aviran writes: You can use WhitespaceAnalyzer Can he? If `Elections 2004' is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize that to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless there is one provided (which doesn't seem to be the case). The only alternatives are not using query parser or extending query parser with a keyword syntax, as far as I can see. Morus -Original Message- From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 19, 2004 11:23 AM To: Lucene Users List Subject: Null or no analyzer Hi All I have a question regarding selection of Analyzers during query parsing. I have three fields in my index: db_id, full_text, subject. All three are indexed; however, while indexing I specified to lucene to index db_id and subject but not tokenize them. I want to give a single search box in my application to enable searching for documents. Some query can look like motor cross rally; this will get fed to QueryParser to do the relevant parsing. However if the user enters Jhon Kerry subject:Elections 2004 I want to make sure that no analyzer is used for the subject field. How can that be done? This is because I expect the users to know the subject from a list of controlled vocabularies and also I am searching for documents that have the exact subject. I tried using the PerFieldAnalyzerWrapper, but how do I get hold of an Analyzer that does nothing but pass the text through to the Searcher?
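An identity ("keyword") analyzer is short to write. A sketch against the 1.4-era analysis API (the class name is made up), combined with PerFieldAnalyzerWrapper as the original poster intended:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Passes the whole field value through as a single token.
public class IdentityAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;
            public Token next() throws IOException {
                if (done) return null;
                done = true;
                // read the entire input into one token
                StringBuffer sb = new StringBuffer();
                char[] buf = new char[256];
                for (int n = reader.read(buf); n > 0; n = reader.read(buf))
                    sb.append(buf, 0, n);
                return new Token(sb.toString(), 0, sb.length());
            }
        };
    }
}

// Usage sketch: analyze everything normally, leave 'subject' untouched:
// PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
// wrapper.addAnalyzer("subject", new IdentityAnalyzer());
```

Note this only controls analysis; the query parser will still split the query on whitespace unless the subject value is quoted, which is the syntax problem discussed in the rest of the thread.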
RE: QueryParsing
Rupinder Singh Mazara writes: hi erik and everyone else ok i will buy the book ;) but this still does not solve the problem of why String x = "\"jakarta apache\"~100"; is being translated as a PhraseQuery FULL_TEXT:"jakarta apache"~100 is the correct query being formed? or is there something wrong with the Proximity Search topic in the URL http://jakarta.apache.org/lucene/docs/queryparsersyntax.html A proximity search is done by a PhraseQuery with a slop. The slop makes the PhraseQuery perform a proximity search (so you can argue that the name is problematic). That's what query parser creates. SpanQueries were introduced later. Maybe you can get the effect of a proximity search with SpanQueries also, but that's not handled by the query parser. Morus
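In API terms, the "jakarta apache"~100 query the parser builds corresponds to a PhraseQuery with a slop of 100 (field name taken from the thread; 1.4-era API assumed):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class ProximityExample {
    public static void main(String[] args) {
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("FULL_TEXT", "jakarta"));
        pq.add(new Term("FULL_TEXT", "apache"));
        pq.setSlop(100); // allow up to 100 position moves between the terms
        System.out.println(pq.toString("FULL_TEXT")); // "jakarta apache"~100
    }
}
```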
Re: StopWord elimination pls. HELP
Miro Max writes: String cont = rs.getString(x); d.add(Field.Text(cont, cont)); writer.addDocument(d); to get results from a database into a lucene index. but when i check println(d) i can see the german stopwords too. how can i eliminate this? Stopwords in an analyzer don't make the stopwords disappear from the document, they only prevent them from being indexed. So you will always see stopwords in the document (before indexing and, if the field is stored, when the document is retrieved from the index). A meaningful check whether stopwords are recognized would be to search for a stopword. You shouldn't find anything... HTH Morus
Re: How extract a Field.Text(String, String) field to process it with a Stylesheet?
Otis Gospodnetic writes: That's likely because you used an Analyzer that stripped the XML tags from the original text. If you want to preserve the original text, use an Analyzer that doesn't throw your XML away. You can write your own Analyzer that doesn't discard anything, for instance. An analyzer doesn't change the stored content, only the indexed tokens. So if something threw away the tags (or just the special characters) it must have happened before Field.Text(String, String) was called. This of course wouldn't be surprising, since indexing xml often means to extract the text from an xml document and index that text. Morus
Re: WildCardQuery
Robinson Raju writes: The way i have done it is: if there is a wildcard, use WildCardQuery, else other. Here searchFields is an array which contains the column names; searchString is the value to be searched. if ((searchString.indexOf(IOOSConstants.ASTERISK) > -1) || (searchString.indexOf(IOOSConstants.QUESTION_MARK) > -1)) { WildcardQuery wQuery = new WildcardQuery(new Term( searchFields[0], searchString)); booleanQuery.add(wQuery, true, false); if (searchFields.length > 1) { WildcardQuery wQuery2 = new WildcardQuery(new Term( searchFields[1], searchString)); booleanQuery.add(wQuery2, true, false); } } else { Query query = MultiFieldQueryParser.parse(searchString, searchFields, flags, analyzer); booleanQuery.add(query, true, false); } Query queryfilter = MultiFieldQueryParser.parse(filterString, filterFields, flags, analyzer); QueryFilter queryFilter = new QueryFilter(queryfilter); hits = parallelMultiSearcher.search(booleanQuery, queryFilter); In the meanwhile, i thought i would tokenize the string based on space if the input contains spaces and then add the tokens one by one into booleanQuery. But this gave a StringIndexOutOfBoundsException. So am still trying... Thanks for your help. Would appreciate greatly if you could give me more pointers. Did you look at the output of query.toString(defaultfield)? That's usually the best way to see if a constructed query is what you expect it to be. Why isn't creating wildcard queries left to the query parser? Morus
Re: BooleanQuery - Too Many Clases on date range.
Chris Fraschetti writes: So i decided to move my epoch date to the 20040608 date format which fixed my boolean query problem in regards to my current data size (approx 600,000) but now as soon as I do a query like ... a* I get the boolean error again. Google obviously can handle this query, and I'm pretty sure jguru.com can handle it too.. any ideas? With or without a date range specified i still get the TooManyClauses error. I tried cranking the maxclauses up to Integer.MaxInt, but java gave me an out of memory error. Is this b/c the boolean search tried to allocate that many clauses by default or because my query actually needed that many clauses? boolean search allocates clauses for all tokens having the prefix or matching the wildcard expression. Why does it work on small indexes but not large? Because there are fewer tokens starting with a. Is there any way to have the parser create as many clauses as it can and then search with what it has? w/o recompiling the source? You need to create your own version of Wildcard- and Prefix-Query that takes a maximum term number and ignores further clauses. And you need a variant of the query parser that uses these queries. This can be done, even without recompiling lucene, but you will have to do some programming at the level of lucene queries. Shouldn't be hard, since you can use the sources as a starting point. I guess this does not exist because the lucene developers decided to prefer a query error rather than incomplete results. Morus
Re: different analyzer all produce the same index?
sergiu gordea writes: Daan Hoogland wrote: H all, I try to create different indices using different Analyzer-classes. I tried standard, german, russian, and cjk. They all produce exactly the same index file (md5-wise). There are over 280 pages so I expected at least some differences. Take a look in the lucene source code... Maybe you will find the answer ... I asume that all the pages you indexed were written in English, therefore is normal that german, russian and cjk analyzers to create identic indexex, but htey should be different than english one (StandardAnalyzer) german analyzer definitely won't leave english text as it is, since it does algorithmic stemming. E.g. your text gets tak a look in the luc sourc cod mayb you will find the answ i asum tha all the pag you indexed wer writt in english therefor is normal tha germa russia and cjk analyx to crea identic indexex but htey should be diff tha english one standardanalyx while std analyzer does not stem at all and gives take a look in the lucene source code maybe you will find the answer i asume that all the pages you indexed were written in english therefore is normal that german russian and cjk analyzers to create identic indexex but htey should be different than english one standardanalyzer I'd rather suspect some problem with the indexing code. So my advice is to check what the analyzer produces. Morus
Re: Seraching in Keyword Field
Bernhard Messer wrote Hi, try that query: MyKeywordField:(ABC) Why should that help? foo:(bla) and foo:bla create the same query: java -classpath lucene-1.4.1/lucene-1.4.1.jar org.apache.lucene.queryParser.QueryParser 'foo:(bla)' foo:bla java -classpath lucene-1.4.1/lucene-1.4.1.jar org.apache.lucene.queryParser.QueryParser 'foo:bla' foo:bla As often, the necessary step is to look at what query parser produced, using query.toString(). I guess SimpleAnalyzer lowercases the term and prevents entries 'ABC' from being found. Using an appropriate PerFieldAnalyzerWrapper might help. Morus
Re: list of removed stop words
Chris Fraschetti writes: Is there a way to via the parser or the query retrieve a list of the stop words removed by the analyzer? or should i just check my query against .STOPWORDS and do it myself? Query parser does not provide that info. Of course you might consider adding this inside query parser. Doing the check yourself outside QP means that you have to parse a second time... Morus
Re: online and offline Directory
Ernesto De Santis writes: Hi Aviran Thanks for the response. I forgot important information for you to understand my issue. My process does something like this: The index has contents from different sources, identified by a special field 'source'. So the index has documents with source: S1 or source: S2 ... etc. When I reindex the source S1, I first delete all documents with source: S1, otherwise the index would contain repeated content. Then I add the new index result. In the middle of the process the IndexSearcher uses an incomplete index. Is it possible to do this like a database transaction? It's not like a database transaction, but any index reader/searcher that was opened before the changes won't see them until it's closed and reopened. AFAIK that also applies to deletions, though I never checked that. So you have two options: a) use a second index for indexing, move the indexes after the indexing is done and make sure index reader/searcher are closed and reopened after the move. b) use one index and make sure that you do not open any index reader/searcher during the update. Searches may only use already opened reader/searcher. I guess it depends on index size, update frequency and so on, which scenario is easier to handle. Given that the index isn't too large and update frequency is rather low, I'd use a second index. But you'll need to copy that index and should consider the time and disk IO needed for that. Morus
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: Daniel Naber wrote: AND always refers to the terms on both sides, +/- only refers to the term on the right. So a AND b -> +a +b is correct. *slap forehead* - you're right. Wasn't there something about operator precedence way back when ;-) Yes. January. And it's still in bugzilla. :-( But it would not make a difference in this case, since AND has higher precedence, so a OR b AND c is a OR (b AND c) which is correctly done as a (+b +c) in boolean queries. a +b +c is different, since it won't find documents containing only a. Occurrences of a only modify the score in this case. Morus
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: Hi all, first, here's how to reproduce the problem: Go to http://www.denic.de/en/special/index.jsp and enter obscure service in the search field. You'll get 132 hits. Now enter obscure service* - and you only get 1 hit. The above website is running Lucene 1.3rc3, but I was able to reproduce this locally with 1.4.1. Here are my local results with controlled pseudo documents, perhaps you can see a pattern: searching for 00700* gets two documents: 007001 action and 007002 handle searching for handle gets two documents: 007002 handle and 011010 handle searching for 00700* handle gets two documents: 007002 handle and 011010 handle But where is 007001 action? searching for handle 00700* gets two documents: 007001 action and 007002 handle But where is 011010 handle? We're using the MultiFieldQueryParser and the Snowball Stemmers, if that makes any difference. Your number/handle samples look ok to me if the default operator is AND. Note that wildcard expressions are not analyzed, so if service is stemmed to anything different from service, it's not surprising that service* doesn't find it. I think you should look at a) what's the analyzed form of your terms and b) how does the rewritten query look (there's a rewrite method for query that expands wildcard queries into basic queries). HTH Morus
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: Will do, thank you very much. However, how do I get at the analyzed form of my terms? Instantiate the analyzer, create a token stream feeding your input, loop over the tokens, output the results. Morus
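Those steps as a sketch against the 1.4-era API (analyzer, field name, and input are placeholders; substitute your Snowball analyzer):

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDebug {
    public static void main(String[] args) throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer(); // or your analyzer
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("obscure service"));
        // print each token the analyzer produces for the input
        for (Token t = ts.next(); t != null; t = ts.next())
            System.out.println(t.termText());
    }
}
```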
Re: Combining Lucene and database functionality
Marco Schmidt writes: I'm trying to find out whether Lucene is an option for a project of mine. I have texts which also have a date and a list of numbers associated with each of them. These numbers are ID values which connect the article to certain categories. So a particular article X might belong to categories 17, 49 and 112. A search for all articles containing foo bar and belonging to categories 100 to 140 should return X (because it also contains foo bar). Is it possible to do this with Lucene and if it is, how? I've read about the concept of fields in Lucene, but it seems to me that you can only store text in them, not integers, let alone lists of integers. None of the tutorials I've seen deals with more complex queries like that. Basically what I want to accomplish could be done nicely with databases with full text search capability, if that full text search wasn't so awful. Where's the problem? 100 is a text as well as an integer (one has to keep in mind that treating it as text changes the sort order, which may require leading 0s to compensate). Lucene does not understand the words you index anyway. So if a document has a field `category' with content '017 049 112' and some `text' field with content 'bla fasel foo bar' and you do a range query 100 - 140 on category (search all documents containing any word that is alphanumerically sorted between 100 and 140) and an appropriate query on text, it will find what you want. There are some caveats like choosing an appropriate analyzer or considering the maximum number of terms the range query covers, but in principle there is no difference between a text field containing words and a category field containing categories. Morus
Re: range and content query
Chris Fraschetti writes: can someone assist me in building or deny the possibility of combining a range query and a standard query? say for instance i have two fields i'm searching on... one being a field with an epoch date associated with the entry, and the content so how can I make a query to select a range of those epochs, as well as search through the content? can it be done in one query, or do I have to perform a query upon a query, and if so, what might the syntax look like? if you create the query using the API, use a boolean query to combine the two basic queries. If you use query parser, use AND or OR. Note that range queries are expanded into boolean queries (OR combined) which may be a problem if the number of terms matching the range is too big. Depends on your date entries and especially how precise they are. Alternatively you might consider using a filter. HTH Morus
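Via the API the boolean combination might look like this (field names and the string-encoded date bounds are made up; the 1.4-era API, where BooleanQuery.add takes required/prohibited flags, is assumed):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.TermQuery;

public class RangePlusContent {
    public static void main(String[] args) {
        // range on the string-encoded (e.g. zero-padded) epoch field
        RangeQuery range = new RangeQuery(
                new Term("date_field", "0001000000"),
                new Term("date_field", "0002000000"),
                true); // inclusive bounds
        TermQuery content = new TermQuery(new Term("content_field", "lucene"));

        BooleanQuery bq = new BooleanQuery();
        bq.add(range, true, false);   // required, not prohibited
        bq.add(content, true, false); // required, not prohibited
        System.out.println(bq.toString("content_field"));
    }
}
```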
Re: range and content query
Chris Fraschetti writes: I've more or less figured out the query string required to get a range of docs.. say date:[0 TO 10] assuming my dates are from 1 to 10 (for the sake of this example) ... my query has results that I don't understand. if i do from 0 TO 10, then I only get results matching 0,1,10 ... if i do 0 TO 8, i get all results ... from 0 to 10... if i do 1 TO 5 ... then i get results 1,2,3,4,5,10 ... very strange. that's not strange. Lucene indexes strings and compares strings. Not numbers. So the order is 1 10 101 11 2 20 21 3 4 and so on It's up to you to format your numbers in a way that works, e.g. use leading '0' to get 001 002 003 004 010 011 020 021 ... I think there's a page in the wiki about these issues. here is how my query looks... query: +date_field:[1 TO 5] here is how the date was added... Document doc = new Document(); doc.add(Field.UnIndexed(arcpath_field, filename)); doc.add(Field.Keyword(date_field, date)); doc.add(Field.Text(content_field, content)); writer.addDocument(doc); I tried Field.Text for the date and also received the same results. Essentially I have a loop to add 11 strings... indexes 0 to 10... and add "doc0", "0", "some text" for each.. and the results i get are as explained above... any ideas? Here is my simple searching code.. i'm currently not searching for any text... i just want to test the range feature right now query_string = "+(" + DATE_FIELD + ":[" + start_date + " TO " + end_date + "])"; Searcher searcher = new IndexSearcher(index_path); QueryParser parser = new QueryParser(CONTENT_FIELD, new StandardAnalyzer()); parser.setOperator(QueryParser.DEFAULT_OPERATOR_OR); Query query = parser.parse(query_string); System.out.println("query: " + query.toString()); Hits hits = searcher.search(query); It's bad practice to create search strings that have to be decomposed by query parser again if you have the parts already at hand. At least in most cases.
I don't know the details of how and when query parser will call the analyzer and what standard analyzer does with numbers. What does query.toString() output? But the main problem seems to be your misunderstanding of searching numbers in lucene. They are just strings and are treated by their lexical representation, not their numeric value. Morus
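The lexicographic-vs-numeric point is easy to verify in plain Java (the padding width of 3 is arbitrary):

```java
import java.util.Arrays;

public class LexicographicOrder {
    public static void main(String[] args) {
        int[] numbers = { 1, 10, 2, 101, 11, 20 };

        // unpadded: string order differs from numeric order
        String[] plain = new String[numbers.length];
        for (int i = 0; i < numbers.length; i++)
            plain[i] = Integer.toString(numbers[i]);
        Arrays.sort(plain); // as Lucene orders terms
        System.out.println(Arrays.toString(plain));

        // zero-padded: string order equals numeric order
        String[] padded = new String[numbers.length];
        for (int i = 0; i < numbers.length; i++)
            padded[i] = String.format("%03d", numbers[i]);
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
    }
}
```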
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David Spencer writes: could you put the current version of your code on that website as a java Weblog entry updated: http://searchmorph.com/weblog/index.php?id=23 thanks Great suggestion and thanks for that idiom - I should know such things by now. To clarify the issue, it's just a performance one, not other functionality...anyway I put in the code - and to be scientific I benchmarked it two times before the change and two times after - and the results were surprisingly the same both times (1:45 to 1:50 with an index that takes up 200MB). Probably there are cases where this will run faster, and the code seems more correct now so it's in. Ahh, I see, you check the field later. The logging made me think you index all fields you loop over, in which case one might get unwanted words into the ngram index. An interesting application of this might be an ngram-index enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches. I was also thinking of reviewing the list to see if anyone had done a Jaro Winkler fuzzy query yet and doing that I went in another direction, and changed the ngram index and search to use a similarity that computes m * m / (n1 * n2) where m is the number of matches, n1 is the number of ngrams in the query and n2 is the number of ngrams in the word. (At least if I got that right; I'm not sure if I understand all parts of the similarity class correctly.) After removing the document boost in the ngram index based on the word frequency in the original index I find the results pretty good. My data is a number of encyclopedias and dictionaries and I only use the headwords for the ngram index. Term frequency doesn't seem relevant in this case. I still use the levenshtein distance to modify the score and sort according to score / distance, but in most cases this does not make a difference. So I'll probably drop the distance calculation completely.
I also see little difference between using 2- and 3-grams on the one hand and only using 2-grams on the other. So I'll presumably drop the 3-grams. I'm not sure if the similarity I use is useful in general, but I attached it to this message in case someone is interested. Note that you need to set the similarity for the index writer and searcher and thus have to reindex in case you want to give it a try. Morus
RE: QueryParser.parse() and Lucene1.4.1
Polina Litvak writes: Hi Daniel, I just downloaded the latest version of Lucene and tried the whole thing again: I ran my code first with lucene-1.3-final.jar, getting the query Field:(A AND -(B)) parsed into +Field:A -Field:B, and then I ran exactly the same code with lucene-1.4.1.jar and got the output parsed into Field:A Field:- Field:B. I also read Lucene's documentation (http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85), and it does mention a change to the + and - operators: 13. Changed QueryParser.jj to allow '-' and '+' within tokens: http://issues.apache.org/bugzilla/show_bug.cgi?id=27491 (Morus Walter via Otis) This change is unlikely to introduce the behaviour you describe, since it affects '-' within words only, not at the start. So there is a change for a-b between 1.3 and 1.4: 1.3 gives a -b, 1.4 gives a b or one token a-b (depending on the analyzer) as it treats the - as part of a word. So is this behaviour a bug, or is Lucene 1.4 not backwards compatible? Your behaviour cannot be reproduced with the test code (as Daniel already said): java -cp lucene-1.3-final/lucene-1.3-final.jar org.apache.lucene.queryParser.QueryParser 'Field:(A AND -(B))' +Field:a -Field:b java -cp lucene-1.4-final/lucene-1.4-final.jar org.apache.lucene.queryParser.QueryParser 'Field:(A AND -(B))' +Field:a -Field:b java -cp lucene-1.4.1/lucene-1.4.1.jar org.apache.lucene.queryParser.QueryParser 'Field:(A AND -(B))' +Field:a -Field:b So either you have a different query or something in your code is responsible for the problem. Morus
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Hi David, Based on this mail I wrote a ngram speller for Lucene. It runs in 2 phases. First you build a fast lookup index as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached and I'd like to contribute it to the sandbox, esp. if someone can validate that what it's doing is reasonable and useful. great :-) [4] Here's the source in HTML: http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152 Could you put the current version of your code on that website as plain java source as well? At least until it's in the Lucene sandbox. I created an ngram index on one of my indexes and think I found an issue in the indexing code: there is an option -f to specify the field on which the ngram index will be created, but there is no code to restrict the term enumeration to this field. So instead of

final TermEnum te = r.terms();

I'd suggest

final TermEnum te = r.terms(new Term(field, ""));

and a check within the loop over the terms whether the enumerated term still has field name `field', e.g.

Term t = te.term();
if (!t.field().equals(field)) {
    break;
}

Otherwise you loop over all terms in all fields. An interesting application of this might be an ngram-index enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches. Morus
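To sketch the ngram decomposition step described above (my own illustration, not the NGramSpeller code): a misspelled word is broken into overlapping character n-grams, which then become the query terms against the lookup index.

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Split a word into overlapping character n-grams of length n.
    // e.g. "lucene" with n=2 -> [lu, uc, ce, en, ne]
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 2)); // [lu, uc, ce, en, ne]
        System.out.println(ngrams("lucene", 3)); // [luc, uce, cen, ene]
    }
}
```

With grams like these as terms, an ordinary boolean OR query against the ngram field retrieves candidate corrections ranked roughly by how many grams they share with the misspelled word.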
Re: (n00b) Meaning of Hits.id (int)
Peter Pimley writes: My documents are not stored in their original form by Lucene, but in a separate database. My Lucene docs do however store the primary key, so that I can fetch the original version from the database to show the user (does that sound sane?) yes. I see that the 'Hits' class has an id(int) method, which sounds interesting. The javadoc says Returns the id for the nth document in this set. However, I can't find any mention anywhere else of document ids. Could anybody explain what this is? It's Lucene's internal id or document number, which allows you to access the document and its stored fields. See IndexSearcher.doc(int i) or IndexReader.document(int n); the docs just don't name the parameter 'id'. Morus
Re: *term search
sergiu gordea writes: Hi all, I want to discuss a little problem: Lucene doesn't support *Term-like queries. I know that this can bring a lot of results into memory and therefore it is restricted. That's not the reason for the restriction; that's possible with a* as well. The problem is that Lucene has to check all terms to see if they end with Term. That makes the performance pretty poor. A prefix allows restricting the search to words with that prefix efficiently, since the word list is ordered. So my question is if there is a simple solution for implementing the functionality mentioned above. Sure. Just follow the way wildcard query is implemented. Actually I'm not sure if the restriction you mention is in the wildcard query itself or only in the query parser. In the latter case, you might just create the query yourself. A better way for postfix queries is to create an additional search field where all words are reversed and search for mreT* on that field. It depends on the size of your index how important such an optimization is. Morus
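A minimal sketch of the reversed-field idea (my own code; index and field handling omitted): reverse every token at index time, and the expensive leading-wildcard search *Term on the normal field becomes an efficient prefix search mreT* on the reversed field.

```java
public class ReverseField {
    // Reverse each whitespace-separated token, so that a postfix
    // search "*term" can be run as a prefix search "mret*" against
    // a second field holding the reversed tokens.
    static String reverseTokens(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(new StringBuilder(token).reverse());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseTokens("flat monitor")); // talf rotinom
    }
}
```

The prefix search on the reversed field can use the ordered word list just like any other prefix query, which is exactly what the plain *Term query cannot do.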
Re: Negative Boost
Daniel Naber writes: On Wednesday 04 August 2004 13:19, Terry Steichen wrote: I can't get negative boosts to work with QueryParser. Is it possible to do so? Isn't that the same as using a boost < 1, e.g. 0.1? That should be possible. No. Take `a^-1 OR b': a boost of -1 means that the score gets smaller if a document contains the term carrying that boost. So it's somewhat similar to NOT a, though less strict. A boost of 0.1 means that the score is increased less for an occurrence of a. Usually one just wants the latter, but it's not the same. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Negative Boost
Terry Steichen writes: I can't get negative boosts to work with QueryParser. Is it possible to do so? If you change QueryParser ;-) Morus
Re: Misbehaving query string
Bill Tschumy writes: I would think the following strings passed to the QueryParser should yield the same results: #1: +telescope AND !operate #2: (+telescope) AND (!operate) However the first string seems to give the correct results while the second gives zero hits. Am I misunderstanding something or is there a bug? The first query creates a boolean query with a required and a prohibited term. The second one creates a boolean query for the !operate term, containing only one prohibited clause, and combines this with a query for telescope, where both subqueries are required (don't ask me whether telescope makes a term query or a boolean query; I suspect the former). But Lucene doesn't match boolean queries containing only prohibited terms. So the !operate boolean query gives you an empty result, which leads to the empty result of the whole query. I don't know if there's a reason why the boolean query doesn't throw an exception in this case; silently not working doesn't seem a good way of handling this. Morus
Re: ArrayIndexOutOfBoundsException if stopword on left of bool clause w/ StandardAnalyzer
Claude Devarenne writes: My question is: should the queryParser catch that there is no term before trying to add a clause when using a StandardAnalyzer? Is this even possible? Should the burden be on the application to either catch the exception or parse the query before handing it out to the queryParser? Yes. Yes. No. There are fixes in bugzilla that would make the query parser read that query as title:bla and simply drop the stop word. See http://issues.apache.org/bugzilla/show_bug.cgi?id=9110 http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 Morus
Re: Tool for analyzing analyzers
Hi Mark, I've had this running OK from the command line and in Eclipse on XP. I suspect it might be because you're running a different OS? The Classfinder tries to split the system property java.class.path on the ; character but I forgot different OSes have different separators. Let me know your setup details and I'll try to fix the classloader issue. I have the same problem and am running on Linux, which uses ':' to separate the class path... BTW: I tried to compile your sources but you left out the thinlet part. 2928 Sun Oct 12 19:47:56 CEST 2003 thinlet/AppletLauncher.class 2643 Sun Oct 12 19:47:56 CEST 2003 thinlet/FrameLauncher.class 74823 Sun Oct 12 19:47:56 CEST 2003 thinlet/Thinlet.class Was that intentional? Morus
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Claude Devarenne writes: Hi, I have over 60,000 documents in my index, which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string, because the dates are in YYYYMMDD format. When I do range queries things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231]. Given these requirements, I am thinking of doing a query without the date range, bringing the unique ids back from the hits and then doing a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code and I confess I did not have time to look at it yet. Is there a simpler, easier way to do this? I think it would be worth taking a look at the sorting code. The idea of the sorting code is to have an array of the dates for each doc in memory and to access this array for sorting. Now sorting isn't the only thing one might use this array for; doing a range check is another. So you might extend the sorting code by a range selection. There is no code for this in Lucene and you have to create your own searcher, but it gives you a fast way to search and sort by date. I did this independently from the new sorting code (I just started a little too early) and it works quite well. The only drawback of this (and the new sorting code) is that it requires an array of field values that must be rebuilt each time the index changes. That shouldn't be a problem for 60,000 documents. Morus
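A hedged sketch of the range-selection idea (my own illustration, not the sorting code itself): with one int-encoded date per document held in an array, a range check reduces to a comparison per document, and the same array serves arbitrary ranges without hitting the boolean-clause limit.

```java
import java.util.BitSet;

public class DateRangeSelect {
    // dates[doc] holds the document's date encoded as an int, e.g. 19790101.
    // Returns the set of internal doc ids whose date lies in [lo, hi].
    static BitSet select(int[] dates, int lo, int hi) {
        BitSet bits = new BitSet(dates.length);
        for (int doc = 0; doc < dates.length; doc++) {
            if (dates[doc] >= lo && dates[doc] <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] dates = {19790101, 19850615, 19991231, 20040101};
        // A twenty-year range is just two comparisons per document:
        System.out.println(select(dates, 19790101, 19991231)); // {0, 1, 2}
    }
}
```

The resulting BitSet is the same shape a Filter produces, which is why this plugs naturally into a custom searcher; the array itself has to be rebuilt whenever the index changes, as noted above.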
Re: Internal full content store within Lucene
Kevin Burton writes: How much interest is there for this? I have to do this for work and will certainly take the extra effort into making this a standard Lucene feature. Sounds interesting. How would you handle deletions? Morus
RE: multivalue fields
Alex McManus writes: Maybe your fields are too long so that only part of it gets indexed (look at IndexWriter.maxFieldLength). This is interesting, I've had a look at the JavaDoc and I think I understand. The maximum field length describes the maximum number of unique terms, not the maximum number of words/tokens. Therefore, even if I have a 4Gb field, I could quite safely have a maxFieldLength of, say, 100k words which should safely handle the maximum number of unique words, rather than 800 million which would be needed to handle every token. Is this correct? A short look at the source says no. maxFieldLength is handed to DocumentWriter, where one finds

TokenStream stream = analyzer.tokenStream(fieldName, reader);
try {
    for (Token t = stream.next(); t != null; t = stream.next()) {
        position += (t.getPositionIncrement() - 1);
        addPosition(fieldName, t.termText(), position++);
        if (++length > maxFieldLength) break;
    }
} finally {
    stream.close();
}

so it's the total number of tokens, not the number of unique terms. Is 100k a worrying maxFieldLength, in terms of how much memory this would consume? Depends on the size of your documents ;-) I use 25 without problems, but my documents are not as big (4 tokens). I just want to make sure not to lose any text for indexing. Does Lucene issue a warning if this limit is exceeded during indexing (it would be quite worrying if it was silently discarding terms)? No. I guess the idea behind this limit is that the relevant terms should occur in the first n words, and indexing the rest just increases index size. Morus
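To illustrate the consequence of that loop (a simplified stand-in, not Lucene's DocumentWriter): every token counts against maxFieldLength, duplicates included, and whatever exceeds the limit is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxFieldLength {
    // Keep at most maxFieldLength tokens. Every token counts,
    // not just distinct ones -- duplicates use up the budget too.
    static List<String> truncate(String[] tokens, int maxFieldLength) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (kept.size() >= maxFieldLength) break; // rest is silently discarded
            kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "a", "a", "b"};
        // Three repeated tokens exhaust a limit of 3, and "b" is lost
        // even though only two terms are unique:
        System.out.println(truncate(tokens, 3)); // [a, a, a]
    }
}
```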
RE: multivalue fields
Ryan Sonnek writes: using lucene 1.3-final, it appears to only search the first field with that name. here's the code i'm using to construct the index, and I'm using Luke to check that the index is created correctly. Everything looks fine, but my search returns empty. do i have to use a special query to work with multivalue fields? is there a testcase in the source that performs this kind of work that I could look at? Don't know what goes wrong on your side, but this works just fine. Maybe your fields are too long so that only part of it gets indexed (look at IndexWriter.maxFieldLength). A test program:

import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.QueryParser;

class LuceneTest {
    static String[] docs = { "a c", "b d", "c e", "d f", };
    static String[] queries = { "a", "b", "c", "d", "b OR c" };

    public static void main(String argv[]) throws Exception {
        Directory dir = new RAMDirectory();
        String[] stop = {};
        Analyzer analyzer = new StandardAnalyzer(stop);
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        // index documents (2 fields "text" each)
        for (int i = 0; i < docs.length; i += 2) {
            Document doc = new Document();
            doc.add(Field.Text("text", docs[i]));
            doc.add(Field.Text("text", docs[i+1]));
            writer.addDocument(doc);
        }
        writer.close();
        Searcher searcher = new IndexSearcher(dir);
        for (int i = 0; i < queries.length; i++) {
            Query query = QueryParser.parse(queries[i], "text", analyzer);
            Hits hits = searcher.search(query);
            System.out.println("Query: " + query.toString("text"));
            System.out.println(hits.length() + " documents found");
            for (int j = 0; j < hits.length(); j++) {
                Document doc = hits.doc(j);
                System.out.println("\t" + hits.id(j) + ": " + doc.get("text") + "\t" + hits.score(j));
                //System.out.println(searcher.explain(query, hits.id(j)));
            }
        }
    }
}

shows that search takes place in both fields.
Query: a
1 documents found
	0: b d	0.5
Query: b
1 documents found
	0: b d	0.5
Query: c
2 documents found
	0: b d	0.2972674
	1: d f	0.2972674
Query: d
2 documents found
	0: b d	0.2972674
	1: d f	0.2972674
Query: b c
2 documents found
	0: b d	0.581694
	1: d f	0.0759574

But note that this affects scoring as concatenation would. So I think Otis' answer is a bit misleading. If you don't want the effects on scoring you AFAIK need to use different documents or fields. Morus
Re: query
Rosen Marinov writes: Short answer: it depends. Questions for you to answer: What field type and analyzer did you use during indexing? What analyzer was used with QueryParser? What does the generated Query.toString return? in both cases SimpleAnalyzer. QueryParser.parse("\"abc\"") throws an exception and I can't see what Query.toString returns in this case. what analyzer should i use if i want to execute the following queries: simple keyword search (+bush -president, etc.) range queries including characters in searching values The problem is that phrases are defined as

| <QUOTED: "\"" (~["\""])+ "\"">

in the query parser. So you cannot have a " inside a phrase (not even escaped). I guess that's a bug. It should read something like

| <QUOTED: "\"" (~["\""] | "\\\"")+ "\"">

(untested). But that shouldn't apply when parsing "abc", only when the phrase itself contains an (escaped) quote character. If you used SimpleAnalyzer (same for StandardAnalyzer) quotes got stripped anyway. Since you cannot search for things that didn't get indexed, searching for a title with quotes and the same title without quotes will be the same. The answer to your second question Is there a more sly way to get the doc exactly matching this title? (for info: my titles are unique) is to skip the query parser and create the query as a phrase query yourself. But this requires tokenization in the same way as was done when indexing; otherwise you might end up with no results. If you have a lot of exact title queries, it might be worth considering a keyword field (that means no tokenization) for this data (in that case, you won't have to care about tokenizers and can create the query as a single TermQuery). There's no support for keyword queries in the query parser though. HTH Morus
Re: How to order search results by Field value?
Erik Hatcher writes: Why not do the unique sequential number replacement at index time rather than query time? How would you do that? This requires knowing the ids that will be added in the future. Let's say you start with strings 'a' and 'b'. Later you add a document with 'aa'. How do you know that you should make 'a' 1 and 'b' 3 to be prepared for 'aa'? To me Eric's suggestion makes sense. The problem might be, however: you have to sort all values, while keeping the strings means that you sort only the hits. And you should be aware that you have to rebuild the array each time the index changes. Morus
RE: Query syntax on Keyword field question
Hi Chad, But I assume this fix won't come out for some time. Is there a way I can get this fix sooner? I'm up against a deadline and would very much like this functionality. Just get Lucene's sources, change the line and recompile. The difficult part is to get a copy of JavaCC 2 (3 won't do), but I think this can be found in the archives. And to go one more step with the KeywordAnalyzer that I wrote, changing this method to skip the escape:

protected boolean isTokenChar(char c) {
    if (c == '\\') {
        return false;
    } else {
        return true;
    }
}

The test then returns with a space: healthecare.domain.lucenesearch.KeywordAnalyzer: [HW-NCI_TOPICS] query.ToString = +category:HW -NCI_TOPICS +space junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is Expected: +category:HW\-NCI_TOPICS +space Actual: +category:HW -NCI_TOPICS +space (note the space where the escape was). Sure. If \ isn't a token char, it ends the token. So you will have to look for a different way of implementing the analyzer. Shouldn't be that difficult since you have only one token. Maybe it should be the job of the query parser to remove the escape character (would make more sense to me at least) but that would be another change to the query parser... Morus
RE: Query syntax on Keyword field question
Chad Small writes: I'm getting this with 3.2: javacc-check: BUILD FAILED file:D:/applications/lucene-1.3-final/build.xml:97: ## JavaCC not found. JavaCC Home: /applications/javacc-3.2/bin JavaCC JAR: D:\applications\javacc-3.2\bin\bin\lib\javacc.jar Please download and install JavaCC from: http://javacc.dev.java.net Then, create a build.properties file either in your home directory, or within the Lucene directory and set the javacc.home property to the path where JavaCC is installed. For example, if you installed JavaCC in /usr/local/java/javacc-3.2, then set the javacc.home property to: javacc.home=/usr/local/java/javacc-3.2 If you get an error like the one below, then you have not installed things correctly. Please check all your paths and try again. java.lang.NoClassDefFoundError: org.javacc.parser.Main ## even though I put a build.properties file in my root lucene directory with this in it: javacc.home=/applications/javacc-3.2/bin I never tried JavaCC 3.2 but I thought there were issues with the query parser and/or standard analyzer. Seems I'm wrong or outdated. In your case the problem seems to be the installation of JavaCC. I guess the /bin directory should not be part of javacc.home. Morus
RE: Query syntax on Keyword field question
Chad Small writes: Here is my attempt at a KeywordAnalyzer - although it is not working? Excuse the length of the message, but I wanted to give actual code. With this output:

Analyzing HW-NCI_TOPICS
org.apache.lucene.analysis.WhitespaceAnalyzer: [HW-NCI_TOPICS]
org.apache.lucene.analysis.SimpleAnalyzer: [hw] [nci] [topics]
org.apache.lucene.analysis.StopAnalyzer: [hw] [nci] [topics]
org.apache.lucene.analysis.standard.StandardAnalyzer: [hw] [nci] [topics]
healthecare.domain.lucenesearch.KeywordAnalyzer: [HW-NCI_TOPICS]
query.ToString = category:HW -nci topics +space
junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is Expected: +category:HW-NCI_TOPICS +space Actual: category:HW -nci topics +space

Well, the query parser does not allow `-' within words currently. So before your analyzer is called, the query parser reads one word HW, a `-' operator, and one word NCI_TOPICS. The latter is analyzed as nci topics because it's not in field category anymore, I guess. I suggested changing this; see http://issues.apache.org/bugzilla/show_bug.cgi?id=27491 Either you escape the - using category:HW\-NCI_TOPICS in your query (untested, and I don't know where the escape character will be removed) or you apply my suggested change. Another option for using keywords with the query parser might be adding a keyword syntax to the query parser, something like category:key(HW-NCI_TOPICS) or category=HW-NCI_TOPICS. HTH Morus
Re: Problem with search results
Doug Cutting writes: Morus Walter wrote: Now I think this can be fixed in the query parser alone by simply allowing '-' within words. That is, change

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >

to

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

As a result, the query parser will read '-' within words (such as tft-monitor or Sysh1-1) as one word, which will be tokenized by the used analyzer and end up in a term query or phrase query depending on whether it creates one or more tokens. Other characters which are also candidates for this sort of treatment include /, @, ., ', and +. _TERM_START_CHAR is

<#_TERM_START_CHAR: ( ~[ " ", "\t", "\n", "\r", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"", "{", "}", "~", "*", "?" ] ) >

so /, @, . and ' are already allowed in terms. (:, ^, ~, * and ? cannot be added, parentheses don't make sense.) So I end up with

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) >

The regression tests show no error, so I entered that in bugzilla. Morus
Re: Storing numbers
[EMAIL PROTECTED] writes: Hi! I want to store numbers (id) in my index:

long id = 1069421083284;
doc.add(Field.UnStored("in", String.valueOf(id)));

But searching for id:1069421083284 doesn't return any hits. If your field is named 'in' you shouldn't search in 'id'. Right? Well, did I misunderstand something? UnStored means the number is stored but not indexed (analyzed), isn't it? Anyway, Field.Text doesn't work either. Well, indexing and analyzing are different things. UnStored means the number is not stored (as the name says) but indexed. And IIRC it's analyzed before indexing; that shouldn't make a difference for a single number. What I'd use in this case is an unstored keyword (given that you really don't want to have the id returned from Lucene, which is the consequence of not storing). I'm not sure if there's a method to create such a field, but you can do it by setting the flags directly. HTH Morus
Re: Best Practices for indexing in Web application
Michael Steiger writes: Depends on your application, but if you can, it's better to keep the IndexSearcher open until the index changes. Otherwise you will have to open all the index files for each search. Good tip. So I have to synchronize (logically) my search routine with any updates, and if the index changes I have to close the Searcher and reopen it. Right. The hard part is that you shouldn't close the searcher while there is still access to that searcher. E.g. if you have a scenario - do search - index changes - access search results you cannot close the searcher until you have accessed all search results. But that can be done with a little bit of reference counting. Morus
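A hedged sketch of that reference counting (my own illustration, not a Lucene class): the owner holds one reference and each in-flight search another; the searcher is closed only when the last reference is released.

```java
public class RefCountedSearcher {
    interface Searcher { void close(); }  // stand-in for IndexSearcher

    private final Searcher searcher;
    private int refs = 1;  // the owner's reference

    RefCountedSearcher(Searcher searcher) { this.searcher = searcher; }

    // Called at the start of a search (and held while reading results).
    synchronized Searcher acquire() {
        refs++;
        return searcher;
    }

    // Called when a search has finished reading all its results.
    synchronized void release() {
        if (--refs == 0) searcher.close();
    }

    // Called by the owner once the index has changed and a fresh
    // searcher has been opened; drops the owner's reference.
    synchronized void retire() {
        release();
    }

    public static void main(String[] args) {
        final boolean[] closed = { false };
        RefCountedSearcher rc = new RefCountedSearcher(new Searcher() {
            public void close() { closed[0] = true; }
        });
        rc.acquire();                  // a search begins
        rc.retire();                   // index changes; owner lets go
        System.out.println(closed[0]); // false: results still being read
        rc.release();                  // the search finishes
        System.out.println(closed[0]); // true: now it was safe to close
    }
}
```

The scenario from the mail maps directly onto this: the search acquires before querying, the update retires the old searcher after reopening, and the close only actually happens once the results have been fully read.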
Re: Best Practices for indexing in Web application
Michael Steiger writes: I am using an IndexSearcher for querying the index, but for deletions I need to use the IndexReader. I now know that I can have Readers and a Writer open concurrently, but IndexReader.delete can only be used if no Writer is open. You should be aware that an IndexSearcher uses a read-only IndexReader, so you can't ignore it in your considerations. I want to open the IndexSearcher only while searching and close it afterwards. Depends on your application, but if you can, it's better to keep the IndexSearcher open until the index changes. Otherwise you will have to open all the index files for each search. Morus
Re: java.io.tmpdir as lock dir .... once again
Otis Gospodnetic writes: This looks nice. However, what happens if you have two Java processes that work on the same index, and give it different lock directories? They'll mess up the index. Is that different from having two Java processes using different java.io.tmpdir? I had that problem (one running in Tomcat and one from the command line). I don't think that making the need to choose the same lock directory more explicit will increase the problems. Morus
Re: Problem with search results
Otis Gospodnetic writes: And if you do not use QueryParser, then things work? If so, then this is likely caused by the fact that your Term contains a 'special' character, '-'. Actually I was going to suggest a fix for '-' within words in the query parser. There was a suggested fix that changed both StandardAnalyzer and QueryParser, which was rejected, I guess because of the StandardAnalyzer change. Now I think this can be fixed in the query parser alone by simply allowing '-' within words. That is, change

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >

to

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

As a result, the query parser will read '-' within words (such as tft-monitor or Sysh1-1) as one word, which will be tokenized by the used analyzer and end up in a term query or phrase query depending on whether it creates one or more tokens. So with StandardAnalyzer a query tft-monitor would give a phrase query "tft monitor" and Sysh1-1 a term query for Sysh1-1. Searching tft-monitor as a phrase "tft monitor" is not exact, but it is the best approximation possible once you have indexed tft-monitor as the tokens tft and monitor. The effect of '-' not occurring within a word is not changed, so tft -monitor will still search for 'tft AND NOT monitor'. Is that a change that would be acceptable? I didn't find the time to look at the regression tests though. Morus
Re: Re:can't delete from an index using IndexReader.delete()
Dhruba Borthakur writes: Hi folks, I am using the latest and greatest Lucene jar file and am facing a problem with deleting documents from the index. Browsing the mail archive, I found that the following email (June 2003) listed the exact problem that I am encountering. In short: I am using Field.Text("id", value) to mark a document. Then I use reader.delete(new Term("id", value)) to remove the document: this call returns 0 and fails to delete the document. The attached sample program shows this behaviour. You don't tell us what your ids look like, but Field.Text("id", value) tokenizes value, that is, it splits value into whatever the analyzer considers to be a token, and creates a term for each token, whereas new Term("id", value) creates one term containing the whole of value. So I guess your ids are considered several tokens by the analyzer you use and therefore they won't be matched by the term you construct for the delete. Using keyword fields instead of text fields for the id should help. Morus
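To illustrate the mismatch (using a crude stand-in for an analyzer, not Lucene code): the tokenized field stores several small terms, so the single exact term handed to delete() never matches any of them.

```java
import java.util.Arrays;
import java.util.List;

public class TokenizedIdDemo {
    // Crude stand-in for an analyzer that lowercases and splits on
    // non-alphanumerics, roughly what StandardAnalyzer would do to
    // an id like "DOC-2003-42". The id value is hypothetical.
    static List<String> analyze(String value) {
        return Arrays.asList(value.toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        String id = "DOC-2003-42";
        List<String> indexedTerms = analyze(id);
        System.out.println(indexedTerms);              // [doc, 2003, 42]
        // delete(new Term("id", id)) looks for ONE term equal to the
        // whole id -- which was never indexed as such:
        System.out.println(indexedTerms.contains(id)); // false -> delete() returns 0
    }
}
```

A keyword field stores the id as exactly one untokenized term, which is why the single-Term delete then works.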
open files under linux
Rasik Pandey writes: As a side note, regarding the Too many open files issue, has anyone noticed that this could be related to the JVM? For instance, I have a coworker who tried to run a number of optimized indexes in a JVM instance and received the Too many open files error. With the same number of available file descriptors (on Linux, ulimit = unlimited), he split the indices over two JVM instances and his problem disappeared. He also tested the problem by increasing the available memory to the JVM instance, via the -Xmx parameter, with all indices running in one JVM instance, and again the problem disappeared. I think the issue deserves more testing to pinpoint the exact problem, but I was just wondering if anyone has already experienced anything similar or if this information could be of use to anyone, in which case we should probably start a new thread dedicated to this issue. The limit is per process. Two JVMs make two processes. (There's a per-system limit too, but it's much higher; I think you find it in /proc/sys/fs/file-max and its default value depends on the amount of memory the system has.) AFAIK there's no way of setting openfiles to unlimited; at least neither bash nor tcsh accepts that. But it should not be a problem to set it to very high values. And you should be able to increase the system-wide limit by writing to /proc/sys/fs/file-max as long as you have enough memory. I never used this, though. Morus
Re: Limiting hit count
[EMAIL PROTECTED] writes: On Friday 13 February 2004 12:18, Julien Nioche wrote: If you want to limit the set of Documents you're querying, you should consider using Filter objects and send one to the searcher along with your Query. Hm, hard to find information about Filters... I actually only want the first hit:

public BitSet bits(IndexReader reader) throws IOException {
    BitSet bs = new BitSet(1);
    bs.set(1);
    return bs;
}

...doesn't work (i.e. returns nothing rather than all hits). Well, that means that you only want the document with document id 1, given that it matches the query. A filter provides a means to restrict a *query* to certain documents, not results. And it won't have influence on the performance (except for the time it takes to create the filter, and that it slows things down a little bit). As far as results are concerned, Lucene's Hits object will only hold a limited number of results (IIRC 200) and repeat the query if you access more (look at the search implementation for details), as Julien already stated. What's the reason for your question? Usually Lucene executes queries very fast; I typically have a few ms. So there's little reason to speed this up. Accessing results is much slower, especially if there are a lot of results and you access them all. E.g. query: 1 ms, reading three fields for 50 results: 22 ms. The index is smaller than the machine's memory (~3/4 GB index size, 1 GB RAM). Morus
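A hedged illustration of what those bits mean (my own sketch): the BitSet is indexed by internal document id and marks which documents are allowed to match at all, so the snippet above admits only the document with id 1.

```java
import java.util.BitSet;

public class FilterBitsDemo {
    public static void main(String[] args) {
        int numDocs = 5;

        // The filter from the question: only bit 1 is set.
        // (BitSet grows on demand, so the initial size of 1 is harmless.)
        BitSet bs = new BitSet(1);
        bs.set(1);

        // A filter restricts which documents MAY match the query;
        // every document whose bit is clear is excluded up front.
        for (int doc = 0; doc < numDocs; doc++) {
            System.out.println("doc " + doc + " allowed: " + bs.get(doc));
        }
        // Only doc 1 is allowed. If doc 1 doesn't match the query,
        // the search returns nothing -- which is what was observed.
    }
}
```

So the filter cannot express "give me only the first hit"; it can only say which documents are eligible before scoring happens.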
Re: a search like Google
Nicolas Maisonneuve writes: hy, i have an index with the fields: title, author, content. i would like to make the same search type as Google (a form with a textfield). When the user searches for i love lucene (it's not a phrase query, just the text in the textfield), i would like to search in all the index fields but with a specific weight boost for each field. In this example title weight=2, author=1, content=1, the results would be (i suppose the default operator is AND): +(title:i^2 author:i content:i) +(title:love^2 author:love content:love) +(title:lucene^2 author:lucene content:lucene) but must i modify the QueryParser or is there a different way to do this? (because i modified the QueryParser and it works, but if there is a cleaner way to do this, i'll take it!) If you want to use the query parser, you can parse the query with different default fields, set boost factors on the resulting queries and join them with a boolean query. This will give you (+title:i +title:love +title:lucene)^2 (+author:i +author:love +author:lucene) (+content:i +content:love +content:lucene) I don't know if there are subtle differences between your query and this one, but it should be basically the same. Apart from the boost factors, that's AFAIK what the multi field query parser does. Maybe it would be useful to extend the multi field query parser to handle different boost factors. If you just want to allow search terms and none of the other constructs the query parser handles, I would use David Spencer's suggestion though. Morus
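A hedged sketch of the first, per-term expansion (plain string building for illustration only; the field names and boosts are the ones from the example above): each user term becomes one required clause that searches all fields, with the title boosted.

```java
public class MultiFieldBoost {
    // Expand user terms into a query string of the form
    // +(title:t^2 author:t content:t) per term. Boosts of 1 are omitted,
    // matching the example in the mail.
    static String expand(String[] terms, String[] fields, float[] boosts) {
        StringBuilder q = new StringBuilder();
        for (String term : terms) {
            q.append("+(");
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) q.append(' ');
                q.append(fields[f]).append(':').append(term);
                if (boosts[f] != 1.0f) q.append('^').append((int) boosts[f]);
            }
            q.append(") ");
        }
        return q.toString().trim();
    }

    public static void main(String[] args) {
        String[] terms = {"i", "love", "lucene"};
        String[] fields = {"title", "author", "content"};
        float[] boosts = {2f, 1f, 1f};
        System.out.println(expand(terms, fields, boosts));
        // +(title:i^2 author:i content:i)
        //   +(title:love^2 author:love content:love)
        //   +(title:lucene^2 author:lucene content:lucene)
    }
}
```

The alternative described in the reply boosts whole per-field subqueries instead of individual terms; both require every term to occur in at least one field, they just attach the boost at different levels.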
Re: Date Range support
tom wa writes: From: Erik Hatcher On Jan 29, 2004, at 5:08 AM, tom wa wrote: I'm trying to create an index which can also be searched with date ranges. My first attempt using the Lucene date format ran into trouble after my index grew and I couldn't search over more than a few days. The suggestion seemed to be to use strings of the format yyyyMMdd. Using that format worked great until I remembered that my search needs to be able to support different timezones. Adding the hour to my field causes the same problem as above, and my queries stop working when using a range of about 2 months. When you say you couldn't search and that it stopped working, do you mean it was just unacceptably slow? (Sorry it's taken me a while to reply.) It wasn't slow; my timeout is far greater than the time it takes to come back with no hits. A small example of a query would be (date: [200306081900 TO 200306201200]) AND (text: sometext) and this will return zero hits. The index contains about 1000 items for each 24hr period, and the total number of documents was about 150k. I had the same results when using Lucene's built-in date format too. If you think it should be able to cope with what I am trying to do then I'll take another look. An alternative to using date ranges or date filters is to use an approach similar to the recently introduced sort on an integer field (CVS only, so far). That is: - create an array of the dates of all documents - extend the low-level search so that it uses this array and an upper and lower limit to do an additional selection (that's similar to what the filter does). The advantage over a filter is that you can use the same array for arbitrary date ranges, while a filter is specific to one date range. OTOH the array needs to be newly created whenever the index changes. The cost depends on the number of different dates and the array size, of course. 
I did some tests and found that it takes less than .1 seconds on a P4 2400 MHz to create such an array for ~ 10 documents, ~ 1 different dates. So it depends a bit on how often your index changes whether that's a good way. Another disadvantage is that you will have to dig a little deeper into Lucene's search classes. And memory usage might become a problem once you exceed a few million documents. Morus
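The array-based selection Morus sketches can be illustrated without Lucene internals. A minimal sketch, assuming dates are packed into a long per document in yyyyMMddHHmm form (the encoding and names are assumptions, not the thread's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class DateRangeSelect {
    // dates[doc] holds the date of document 'doc' encoded as yyyyMMddHHmm.
    // One linear pass selects every document whose date falls inside [lo, hi].
    // The same array serves arbitrary ranges, unlike a per-range filter,
    // but it must be rebuilt whenever the index changes.
    public static List<Integer> select(long[] dates, long lo, long hi) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < dates.length; doc++) {
            if (dates[doc] >= lo && dates[doc] <= hi) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        long[] dates = {200306081900L, 200306100000L, 200306201300L};
        // tom's example range: 2003-06-08 19:00 to 2003-06-20 12:00
        System.out.println(select(dates, 200306081900L, 200306201200L)); // [0, 1]
    }
}
```

In a real integration this check would run inside the low-level hit collection, intersecting the date test with the query's matches rather than scanning all documents.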
Re: What is the status of Query Parser AND / OR ?
Daniel B. Davis writes: There was a lot of correspondence during December about this. Is there any further resolution? There's a patch and I hope it will find its way into the Lucene sources. See: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 Seems I missed the mail about Otis' latest comment. Sorry about that, I'll take a look at these issues ASAP. Morus
Re: Query madness with NOTs...
Otis Gospodnetic writes: Redirecting to lucene-user --- Jim Hargrave [EMAIL PROTECTED] wrote: Can anyone tell me why these two queries would produce different results: +A -B A -(-B) A and +A are not the same thing when you have multiple terms in a query. Hmm. As far as I understand boolean queries, 'a -b' and '+a -b' should be the same (while 'a b -c' and '+a +b -c' are different, of course). 'a -(-b)', on the other hand, contains a nested boolean query only searching for '-b'. Lucene cannot handle this type of query. I'm not sure what happens in this case, but AFAIK you should never use a boolean query containing only prohibited terms in a query. If I test this, I don't get any results for 'a -(-b)', and the same result for 'a' and 'a +(-b)'. The query parser patch I added yesterday to bugzilla drops such queries. Also, we are having a hard time understanding why the query parser takes this query: A AND NOT B and returns this: +A +(-B). Shouldn't this be +A -B? 'a AND NOT b' IS parsed to '+a -b' by Lucene's standard query parser. I don't know where you found +a +(-b); +a +(-b) would be wrong in the above sense. Morus
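The clause semantics discussed here can be modeled in a few lines. This is a toy model, not Lucene's actual scorer: '+' marks a required clause, '-' a prohibited one, and a document must additionally match at least one non-prohibited clause. Under this model, a query containing only prohibited clauses can never match, which mirrors the reply's point:

```java
import java.util.Set;

public class BoolSemantics {
    // Each clause is a string like "+a", "-b", or "c" (occur prefix + term).
    // Required terms must all be present, prohibited terms must be absent,
    // and at least one non-prohibited clause must match.
    static boolean matches(Set<String> doc, String... clauses) {
        boolean anyMatch = false;
        for (String c : clauses) {
            char occur = (c.charAt(0) == '+' || c.charAt(0) == '-') ? c.charAt(0) : ' ';
            String term = (occur == ' ') ? c : c.substring(1);
            boolean present = doc.contains(term);
            if (occur == '+' && !present) return false;
            if (occur == '-' && present) return false;
            if (occur != '-' && present) anyMatch = true;
        }
        return anyMatch;
    }

    public static void main(String[] args) {
        System.out.println(matches(Set.of("a"), "+a", "-b"));      // true
        System.out.println(matches(Set.of("a", "b"), "+a", "-b")); // false
        // Only prohibited clauses: anyMatch can never become true.
        System.out.println(matches(Set.of("a"), "-b"));            // false
    }
}
```

This also shows why 'a -b' and '+a -b' coincide for a single positive term: with one non-prohibited clause, "optional" and "required" collapse into the same condition.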
Re: Lucene search result no stable
Ardor Wei writes: What might be the problem? How to solve it? Any suggestion or idea will be appreciated. The only problem with locking I've seen so far is that you have to make sure that the temp dir is the same for all applications. Lucene 1.3 stores its lock in the directory defined by the system property java.io.tmpdir. I had one component running under Tomcat and one from the shell, and they used different temp dirs, which is fatal in this case. Apart from this, it depends pretty much on your environment. I'm using Lucene on Linux on local filesystems. Other operating systems or network filesystems may influence locking. Morus
Re: Query Term Questions
Erik Hatcher writes: I've not been able to get negative boosting to work at all. Maybe there's a problem with my syntax. If, for example, I do a search with green beret^10, it works just fine. But green beret^-2 gives me a ParseException showing a lexical error. Have you tried it without using QueryParser, boosting a Query using setBoost on it? QueryParser is a double-edged sword, and it looks like it only allows numeric characters (plus . followed by numeric characters). So QueryParser has the problem with negative boosts, but not Query itself. He said he wants to have one term less important than the others (at least that's what I understood). That's done with positive boost factors smaller than 1.0 (e.g. 0.5 or 0.1), which might be called 'negative boosting' (just as braking is a form of negative acceleration). If you use negative boost factors you would actually decrease the score of a match (not just increase it less) and risk ending up with a negative score. I don't think that would be a good idea. Morus
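The difference between a fractional boost and a truly negative one can be shown with a toy scoring function (each matching term simply contributes its boost; Lucene's real scoring also involves tf, idf, and norms):

```java
import java.util.Map;
import java.util.Set;

public class BoostDemo {
    // Toy scoring: each term present in the document adds its boost factor.
    // A boost in (0, 1) de-emphasizes a term but keeps the score positive;
    // a negative boost subtracts and can drive the score below zero.
    static double score(Set<String> doc, Map<String, Double> termBoosts) {
        double s = 0;
        for (Map.Entry<String, Double> e : termBoosts.entrySet())
            if (doc.contains(e.getKey())) s += e.getValue();
        return s;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("green", "beret");
        // de-emphasize 'beret' with a boost < 1.0: still a positive score
        System.out.println(score(doc, Map.of("green", 1.0, "beret", 0.5)));  // 1.5
        // a negative boost actively penalizes the match
        System.out.println(score(doc, Map.of("green", 1.0, "beret", -2.0))); // -1.0
    }
}
```

In real Lucene code the fractional boost would be set programmatically, e.g. termQuery.setBoost(0.5f), since QueryParser rejects ^-2 syntax anyway.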
QueryParser and stopwords
Hi, I'm currently trying to get rid of query parser problems with stopwords (depending on the query, there are ArrayIndexOutOfBoundsExceptions, e.g. for stop AND nonstop where stop is a stopword and nonstop is not). While this isn't hard to fix (I'll enter a bug and patch in bugzilla), there's one issue left that I'm not sure how to deal with: what should the query parser return for a query string containing only stopwords? And when I think about this, there's another one: stop AND NOT nonstop creates a boolean query containing only prohibited terms, which AFAIK cannot be used in a search. How to deal with this? Currently it returns an empty BooleanQuery. I think it would be more useful to return null in this case. Morus
RE: Indexing of deep structured XML
Goulish, Michael writes: To really preserve the relationships in arbitrarily structured XML, you pretty much need to use a database that directly supports an XML query language like XQuery or XPath. If searching within regions is enough (something e.g. sgrep (http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html) or OpenText/PAT does), I think this can be done on top of Lucene. Basically you need to index region start and region end markers. In order to search for a term within a region, you can use TermPositions to loop over all matches of the term and all start and end markers of the region, and check whether a match falls within the region. Of course the search logic for region search is quite different from Lucene's document queries. There are two types of results (match points and regions), and the basic operations include match points/regions in a region, regions containing match points/regions, and joins and intersections of match points or regions. I don't know if and how this could be integrated with Lucene's normal queries, but of course one could get a list of matching documents from the results of region searches. If you (ab)use Lucene's token position to store the character position of the token, you could also extract the region's text from a stored copy. I'm currently doing some experiments with this kind of query using Lucene and find it performs quite well. You won't be able to distinguish between parents and other ancestors, though, and there won't be any support for searching siblings. Morus
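The core "match point in region" operation Morus describes can be sketched over plain position lists (in Lucene these would come from TermPositions for the term and for the region markers; here they are just arrays, and the example positions are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class RegionSearch {
    // Regions are [start, end] token-position pairs, e.g. derived from
    // indexed region-start and region-end marker tokens. Returns the term
    // positions that fall inside some region.
    static List<Integer> inRegion(int[] termPositions, int[][] regions) {
        List<Integer> hits = new ArrayList<>();
        for (int p : termPositions)
            for (int[] r : regions)
                if (r[0] <= p && p <= r[1]) { hits.add(p); break; }
        return hits;
    }

    public static void main(String[] args) {
        int[] positions = {3, 10, 25};         // occurrences of the search term
        int[][] regions = {{1, 5}, {20, 30}};  // e.g. two <title>...</title> spans
        System.out.println(inRegion(positions, regions)); // [3, 25]
    }
}
```

The nested loop is for clarity only; since both TermPositions streams come back in sorted order, a real implementation would merge the two lists in a single pass.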
Re: Ordening documents
Peter Keegan writes: What is the returned order for documents with identical scores? Have a look at the source of the lessThan method in org.apache.lucene.search.HitQueue: protected final boolean lessThan(Object a, Object b) { ScoreDoc hitA = (ScoreDoc)a; ScoreDoc hitB = (ScoreDoc)b; if (hitA.score == hitB.score) return hitA.doc > hitB.doc; else return hitA.score < hitB.score; } Sorting is done by this method. HTH Morus
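The ordering this lessThan produces (descending score, ties broken by ascending document id) can be reproduced with a plain comparator, no Lucene required. A small sketch with a stand-in ScoreDoc class:

```java
import java.util.Arrays;
import java.util.Comparator;

public class HitOrder {
    static final class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
        public String toString() { return "doc" + doc; }
    }

    // Equivalent ordering to HitQueue.lessThan: higher scores rank first,
    // and documents with equal scores are ordered by ascending doc id.
    static ScoreDoc[] rank(ScoreDoc[] hits) {
        ScoreDoc[] out = hits.clone();
        Arrays.sort(out, Comparator
            .comparingDouble((ScoreDoc h) -> h.score).reversed()
            .thenComparingInt(h -> h.doc));
        return out;
    }

    public static void main(String[] args) {
        ScoreDoc[] hits = {
            new ScoreDoc(7, 0.5f), new ScoreDoc(2, 0.9f), new ScoreDoc(3, 0.5f)
        };
        System.out.println(Arrays.toString(rank(hits))); // [doc2, doc3, doc7]
    }
}
```

So for identical scores the tie-break is deterministic: the document with the smaller internal id comes first.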
Re: Philosophy(??) question
Scott Smith writes: I have some documents I'm indexing which have multiple languages in them (i.e., some fields in the document are always English; other fields may be in other languages). Now, I understand why a query against a certain field must use the same analyzer as was used when that field was indexed (stemming, stop words, etc.). It seems like different fields could use different analyzers and the world would still be a happy place. However, since the analyzer is passed in as part of the IndexWriter, that can't happen. Is there a way to do this (other than having multiple indexes, which is a problem when trying to do combined searches)? Or am I missing something more subtle? Sorry if I'm plowing old ground. AFAIK you need to write one analyzer that acts differently based on the 'fieldName' parameter in the tokenStream method. I haven't done that though. HTH Morus
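The dispatch-on-field-name idea can be illustrated without the Analyzer API: one entry point receives the field name (as tokenStream(fieldName, reader) does) and picks the per-field behavior. The field names and the trivial "analysis" below are made-up placeholders, not real Lucene analysis:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PerFieldTokenizer {
    // Choose a normalization per field, the way a single Analyzer can
    // branch on fieldName inside tokenStream(): the hypothetical English
    // field gets lowercasing, any other field keeps its tokens untouched.
    static List<String> tokenize(String fieldName, String text) {
        Function<String, String> norm = fieldName.equals("title_en")
            ? s -> s.toLowerCase(Locale.ROOT)
            : Function.identity();
        return Arrays.stream(text.split("\\s+")).map(norm).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("title_en", "Happy Place")); // [happy, place]
        System.out.println(tokenize("body_other", "Keep Case")); // [Keep, Case]
    }
}
```

In a real Analyzer subclass the branch would return a different TokenStream pipeline (different stemmer, stop list) per field, and the same analyzer instance would be used for both indexing and query parsing so the two stay consistent.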