Re: sorted search

2005-02-24 Thread Daniel Naber
On Thursday 24 February 2005 19:01, Yura Smolsky wrote:

       sort.setSort( new SortField[] { new SortField ("modified",
 SortField.STRING, true) } );

You should store the date as a number, e.g. days since 1970 (or weeks if 
that is precise enough), and then tell the sort that it's an integer. 
DateField always stores the date in milliseconds, which leads to a large 
number of terms, and it turns the date into a string; both make searching 
and especially sorting slower.
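
For illustration, a minimal sketch of that approach (the field name "modified" is taken from the quoted code; everything else is an assumption):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// indexing: store the date as days since 1970 instead of DateField's millisecond string
int days = (int) (System.currentTimeMillis() / (24L * 60 * 60 * 1000));
Document doc = new Document();
doc.add(Field.Keyword("modified", Integer.toString(days)));

// searching: tell the sort to treat the field as an integer, newest first
Sort sort = new Sort(new SortField[] {
    new SortField("modified", SortField.INT, true)
});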

Regards
 Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Daniel Naber
On Saturday 19 February 2005 15:26, Ben wrote:

 When I try to search for phrases using the MultiFieldQueryParser v1.8
 from CVS, it gives me NullPointerException.

This has just been fixed in SVN (I assume you mean SVN, CVS still exists 
but is read only and probably not updated anymore).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using the highlighter from the sandbox with a prefix query.

2005-02-17 Thread Daniel Naber
On Thursday 17 February 2005 08:37, lucuser4851 wrote:

  We have been using the highlighter from the lucene sandbox, which works
 very nicely most of the time. However when we try and use it with a
 prefix query (which is what you get having parsed a wild-card query), it
 doesn't return any highlighted sections. Has anyone else experienced
 this problem, or found a way around it?

You need to call rewrite() on the query before you pass it to the highlighter.
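
For example, a sketch along these lines (query, reader, analyzer, the "contents" field and the text variable are all assumptions, and the exact Highlighter calls may differ slightly depending on the sandbox version you use):

import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// rewrite() expands the prefix query into the concrete terms of the index,
// which is what the highlighter needs to find matching fragments
Query rewritten = query.rewrite(reader);
Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));
String fragment = highlighter.getBestFragment(analyzer, "contents", text);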

Regards
 Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene cuts the search results ?

2005-02-15 Thread Daniel Naber
On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:

          String fragment = highlighter.getBestFragment(stream,
 introduction);

The highlighter breaks the text up into same-size chunks (100 characters by 
default). If the matching term appears right at the end or at the start of 
such a chunk you'll get no context, and it looks as if the text was cut off.

Regards
 Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and multiple languages

2005-01-20 Thread Daniel Naber
On Thursday 20 January 2005 21:08, aurora wrote:

 Now let's said I have an index with documents in multiple languages and
  analyzed by an assortment of analyzers. When user enter a query, what
 analyzer should be used?

Use q1 OR q2, where q1 is the query parsed with the analyzer for language 
1, q2 is the query parsed with the analyzer for language 2 (and so on). If 
there are conflicts you could also add a required term query to each 
subquery, like language:en^0 so that, for example, the English analyzer 
query only searches on documents that have been identified as English.
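
A rough sketch of how such a query could be built for two languages (the field names "contents" and "language" and the choice of analyzers are just assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

String userInput = "some query";   // assumed user input

// q1: parsed with the English analyzer, restricted to documents tagged language:en
Query q1 = QueryParser.parse(userInput, "contents", new StandardAnalyzer());
Query langEn = new TermQuery(new Term("language", "en"));
langEn.setBoost(0);                // ^0 so the restriction doesn't affect scoring
BooleanQuery sub1 = new BooleanQuery();
sub1.add(q1, true, false);         // required
sub1.add(langEn, true, false);     // required

// q2: the same with the German analyzer and language:de
Query q2 = QueryParser.parse(userInput, "contents", new GermanAnalyzer());
Query langDe = new TermQuery(new Term("language", "de"));
langDe.setBoost(0);
BooleanQuery sub2 = new BooleanQuery();
sub2.add(q2, true, false);
sub2.add(langDe, true, false);

// final query: sub1 OR sub2
BooleanQuery query = new BooleanQuery();
query.add(sub1, false, false);
query.add(sub2, false, false);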

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene2.0 and transaction support

2005-01-20 Thread Daniel Naber
On Thursday 20 January 2005 22:39, John Wang wrote:

  When is lucene 2.0 scheduled to be released? Is there a javadoc
 somewhere so we can check out the new APIs?

There's no release date. You'll need to check out the CVS and run ant 
javadocs to see the API.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Demo webapp + pdf

2005-01-19 Thread Daniel Naber
On Wednesday 19 January 2005 15:54, Vlachogiannis Evangelos wrote:

 I can
 search for html and txt files. I would like to ask how can I make that
 looking also for pdf ?

This is answered in the FAQ:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-f3f5e8305b63cf17373953a50d6460e731bf2cfb

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-12 Thread Daniel Naber
On Wednesday 12 January 2005 01:47, David Spencer wrote:

 Amusingly then, documents with the terms "liberal wienerwurst" match
 "big dog"! :)

There's something like frequency information in WordNet, it could probably 
be used to ignore the uncommon meanings.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SQL Distinct syntax in Lucene

2005-01-11 Thread Daniel Naber
On Tuesday 11 January 2005 23:05, Carlos Franco Robles wrote:

 I'm starting to use lucene and I wonder if it is possible to make a
 query syntax to ask for one string which can be in two different fields
 and filter duplicated results like with distinct in SQL syntax.

Lucene only knows documents and doesn't know what duplicate could mean. 
The easiest thing is to iterate over the result set and do the filtering 
yourself.

regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: 1.4.3 breaks 1.4.1 QueryParser functionality

2005-01-04 Thread Daniel Naber
On Tuesday 04 January 2005 23:53, Bill Janssen wrote:

   protected Query getFieldQuery (String field,
  Analyzer a,
  String queryText)
 throws ParseException

You're right, the problem is that we should call the deprecated method for 
example in getFieldQuery(String field, String queryText, int slop). 
However, there's a simple workaround: just remove the analyzer parameter 
from your method.

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: addIndexes() Question

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 00:45, Ryan Aslett wrote:

 When all machines and all threads are finished, I should have a slew of
 index slices that I want to combine together to create one index.

You should simply skip this step and instead search the small indices with 
a ParallelMultiSearcher. This should scale much better than one huge index 
(note that ranking is currently messed up with (Parallel)MultiSearcher, see 
the bug reports for a proposed fix).
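
For example, a minimal sketch (the slice paths and the query are made up):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

// paths of the index slices produced by the individual machines/threads
String[] slicePaths = { "/indexes/slice1", "/indexes/slice2", "/indexes/slice3" };

Searchable[] searchables = new Searchable[slicePaths.length];
for (int i = 0; i < slicePaths.length; i++) {
  searchables[i] = new IndexSearcher(slicePaths[i]);
}

// searches all slices in parallel and merges the results
ParallelMultiSearcher searcher = new ParallelMultiSearcher(searchables);
Hits hits = searcher.search(query);   // query is assumed to exist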

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Exception: cannot determine sort type

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 05:25, Kauler, Leto S wrote:

 java.lang.RuntimeException: no terms in field Title_Sort - cannot
 determine sort type

Is it a certain query that causes this? Does it really only happen under 
load or does the same query also give this without load?

 We could specify the sort type as String but we do have some Date fields
 too.  Are dates actually indexed as strings?

If you're using DateField: yes. But you don't have to use that class, you 
can save dates however you want.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Word co-occurrences counts

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:

 1.To be able to return the number of times the word appears in all
 the documents (which it looks like lucene can do through IndexReader)

If you're referring to docFreq(Term t) , that will only return the number 
of documents that contain the term, ignoring how often the term occurs in 
these documents.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: CFS file and file formats

2004-12-22 Thread Daniel Naber
On Wednesday 22 December 2004 23:41, Steve Rajavuori wrote:

 Thanks. I am trying to repair a corrupted 'segments' file.

Why are you sure it's corrupted? Are the *.cfs files and the other file 
types mixed in one directory? Then that's the problem: if you have *.cfs, 
segments, and deletable, nothing else should exist in that directory or 
Lucene will get confused.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Aramorph Analyzer

2004-12-16 Thread Daniel Naber
On Thursday 16 December 2004 11:59, Safarnejad, Ali (AFIS) wrote:

 Actually, one thing worth mentioning about the search, is when searching
 for whole phrases, if there is any ambiguous words in the phrase, then the
 Search fails to find the document, even if the phrase was copied and pasted
 from the original document.

Analyzers that provide ambiguous terms (i.e. a token with more than one term 
at the same position) don't work in Lucene 1.4. This feature has only 
recently been added to CVS. The workaround would be to backport that change.

Regards
 Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why does the StandardTokenizer split hyphenated words?

2004-12-16 Thread Daniel Naber
On Thursday 16 December 2004 13:46, Mike Snare wrote:

  Maybe for a-b, but what about English words like half-baked?

 Perhaps that's the difference in thinking, then. I would imagine that
 you would want to search on half-baked and not half AND baked.

A search for half-baked will find both half-baked and half baked (the 
phrase). The only thing you won't find is halfbaked.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 19:29, Mike Snare wrote:

 In my case, the words are keywords that must remain as is, searchable
 with the hyphen in place. It was easy enough to modify the tokenizer
 to do what I need, so I'm not really asking for help there. I'm
 really just curious as to why it is that a-1 is considered a single
 token, but a-b is split.

a-1 is considered a typical product name that needs to be kept unchanged 
(there's a comment in the source that mentions this). Indexing 
hyphen-word as two tokens has the advantage that it can then be found 
with the following queries:
hyphen-word (will be turned into a phrase query internally)
hyphen word (phrase query)
(it cannot be found by searching for hyphenword, however).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 21:14, Mike Snare wrote:

 Also, the phrase query
 would place the same value on a doc that simply had the two words as a
 doc that had the hyphenated version, wouldn't it? This seems odd.

Not if these words are spelling variations of the same concept, which 
doesn't seem unlikely.

 In addition, why do we assume that a-1 is a typical product name but
 a-b isn't?

Maybe for a-b, but what about English words like half-baked?

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Daniel Naber
On Tuesday 14 December 2004 20:13, Monsur Hossain wrote:

 My concern is that this just shifts the scaling issue to Lucene, and I
 haven't found much info on how to scale Lucene vertically. 

You can easily use MultiSearcher to search over several indices. If you 
want the distribution to be more transparent, have a look at Nutch.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


[ANNOUNCE] Lucene 1.4.3 released

2004-12-07 Thread Daniel Naber

I'd like to officially announce Lucene 1.4.3. This release fixes two bugs; 
the list of changes is so short that I will simply paste it here:

 1. The JSP demo page (src/jsp/results.jsp) now properly escapes error
messages which might contain user input (e.g. error messages about 
query parsing). If you used that page as a starting point for your
own code please make sure your code also properly escapes HTML
characters from user input in order to avoid so-called cross site
scripting attacks.
  
  2. QueryParser changes in 1.4.2 broke the QueryParser API. Now the old 
 API is supported again.

The source code and binaries can be downloaded from
http://www.apache.org/dyn/closer.cgi/jakarta/lucene/

Project website:
http://jakarta.apache.org/lucene/docs/index.html

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help on the Query Parser

2004-11-24 Thread Daniel Naber
On Wednesday 24 November 2004 08:16, Morus Walter wrote:

 Lucene itself doesn't handle wildcards within phrases.

This can be added using PhrasePrefixQuery (which is slightly misnamed):
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/PhrasePrefixQuery.html
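
As an illustration, a sketch that builds a query for the phrase "apache luc*" by expanding the prefix by hand (field name, index path and prefix are assumptions):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.PhrasePrefixQuery;

IndexReader reader = IndexReader.open("/path/to/index");
PhrasePrefixQuery query = new PhrasePrefixQuery();
query.add(new Term("body", "apache"));          // fixed first word of the phrase

// expand the prefix "luc" into all matching terms of the index
List terms = new ArrayList();
TermEnum te = reader.terms(new Term("body", "luc"));
do {
  Term t = te.term();
  if (t == null || !t.field().equals("body") || !t.text().startsWith("luc")) {
    break;
  }
  terms.add(t);
} while (te.next());
te.close();
query.add((Term[]) terms.toArray(new Term[terms.size()]));  // any of these as second word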

Regards
 Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: JDBCDirectory to prevent optimize()?

2004-11-23 Thread Daniel Naber
On Tuesday 23 November 2004 00:06, Kevin A. Burton wrote:

 I'm wondering about the potential for a generic JDBCDirectory for
 keeping the lucene index within a database.

Such a thing already exists: http://ppinew.mnis.com/jdbcdirectory/, but I 
don't know about its scalability.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser: [stopword] AND something throws Exception

2004-11-12 Thread Daniel Naber
On Friday 12 November 2004 17:52, Peter Pimley wrote:

 [this is using lucene-1.4-final]

Please try 1.4.2.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Daniel Naber
On Friday 12 November 2004 21:28, Luke Francl wrote:

  That's the point: there is no query optimizer in Lucene.

 Would it be possible to write one? I would be very interested in this
 feature.

There are two different issues: first, reordering the query so that the 
terms with fewer matches appear first, because as soon as the first term 
with 0 matches occurs, the search stops. A not-so-difficult implementation 
of that is probably possible, but I guess it would add more overhead than 
it saves.

The other thing is that prefix queries get expanded first, and only then does 
the search happen. The TooManyClauses exception is thrown when expanding the 
query, not during the search. I'm not sure, but I think that's difficult to 
change, at least in a clean way.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Daniel Naber
On Thursday 11 November 2004 20:57, Sanyi wrote:

 What I'm saying is that there is no reason for the optimizer to expand
 wild* to more than 1024 variations

That's the point: there is no query optimizer in Lucene.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: stopword AND validword throws exception

2004-11-10 Thread Daniel Naber
On Wednesday 10 November 2004 10:46, Sanyi wrote:

 This query seems to crash:
 stopword AND validword
 (java.lang.ArrayIndexOutOfBoundsException: -1)

I think this has been fixed in the development version (which will become 
Lucene 1.9).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Filters for Openoffice File Indexing available (Java)

2004-11-10 Thread Daniel Naber
On Monday 08 November 2004 11:30, Joachim Arrasz wrote:

 So now we are looking for search and index Filters for Lucene, that
 were able to integrate out OpenOffice Files also into search result.

I don't know of any existing solutions, but it's not so difficult to write 
one: extract the ZIP file using Java's built-in ZIP classes and parse 
content.xml and meta.xml. I'm not sure if whitespace issues might become 
tricky, e.g. two paragraphs could appear in the file as 
<p>one</p><p>two</p>, but for indexing a whitespace needs to be inserted 
between them (<p> is just an example, I don't know what OpenOffice.org 
actually uses).
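
A rough sketch of the extraction part (the file name and element handling are assumptions; real code would also read meta.xml and handle exceptions properly):

import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

static String extractText(String path) throws Exception {
  ZipFile zip = new ZipFile(path);                 // e.g. "document.sxw"
  ZipEntry entry = zip.getEntry("content.xml");
  InputStream in = zip.getInputStream(entry);
  final StringBuffer text = new StringBuffer();
  DefaultHandler handler = new DefaultHandler() {
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
      text.append(' ');   // add whitespace so adjacent elements don't run together
    }
  };
  SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(in), handler);
  zip.close();
  return text.toString(); // index this, e.g. with Field.Text("contents", ...)
}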

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Filters for Openoffice File Indexing available (Java)

2004-11-10 Thread Daniel Naber
On Wednesday 10 November 2004 15:18, Joachim Arrasz wrote:

  Why should I parse
 meta.xml? I thought content.xml should be enough.

It contains the file's title, keywords, and author etc (those are not in 
content.xml).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-05 Thread Daniel Naber
On Friday 05 November 2004 18:03, Chuck Williams wrote:

 The Lucene index is not in CVS -- neither the directory nor the files.
 But it is a subdirectory of a directory that is in CVS,

Does this patch help? 
http://issues.apache.org/bugzilla/show_bug.cgi?id=31747

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sorting by score and an additional field

2004-11-04 Thread Daniel Naber
On Thursday 04 November 2004 03:52, Chris Fraschetti wrote:

 I can only get it to sort by one or the other... but when it does one,
 it does sort correctly, but together in {score, custom_field} only the
 first sort seems to apply.

Do you use real documents for that test? The score is a float value and 
it's hardly ever the same for two documents (unless you use very short 
test documents), so that's why the second field may not be used for 
sorting.

regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: new version of NewMultiFieldQueryParser

2004-10-29 Thread Daniel Naber
On Friday 29 October 2004 20:42, Bill Janssen wrote:

 Try running the program at
 ftp://ftp.parc.xerox.com/transient/janssen/SearchTest.java, and see
 how that works for you. Seems to work fine with Java 1.4.2 and Lucene
 1.4.1, for me.

That seems to be the old version that doesn't implement getFieldQuery. The 
new version (which you pasted in an email on 2004-10-27) doesn't seem to 
work with e.g. prefix queries. I call it like this:

String[] fields = new String[2];
fields[0] = "title";
fields[1] = "body";
NewMultiFieldQueryParser qp = new NewMultiFieldQueryParser(fields, new
WhitespaceAnalyzer());
Query q = qp.parse("+gut* +test");
System.out.println(q);

I get:

+%%:gut* +(title:test body:test)

Or am I mixing up the versions?

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching for a phrase that contains quote character

2004-10-28 Thread Daniel Naber
On Thursday 28 October 2004 19:03, Justin Swanhart wrote:

 Have you tried making a term query by hand and testing to see if it
 works? 

 Term t = new Term("field", "this is a \"test\"");
 PhraseQuery pq = new PhraseQuery(t);

That's not a proper PhraseQuery: it searches for *one* 
term "this is a test", which is probably not what one wants. You 
have to add the terms one by one to a PhraseQuery.
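
For example (the field name is made up, and the terms have to match what the analyzer produced at indexing time, so the quote characters are gone):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

PhraseQuery pq = new PhraseQuery();
pq.add(new Term("field", "this"));
pq.add(new Term("field", "is"));
pq.add(new Term("field", "a"));
pq.add(new Term("field", "test"));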

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search.jhtml ?

2004-10-28 Thread Daniel Naber
On Thursday 28 October 2004 15:01, Willy De Waele wrote:

 Executing the demos as a bat file (Windows) is working fine, but
 using lucene as a web 'application' is not working ...

I think that Search.jhtml is totally outdated, please try src/jsp instead.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching for a path

2004-10-28 Thread Daniel Naber
On Friday 29 October 2004 00:22, Bill Tschumy wrote:

 I get zero hits. Why are these not equivalent? I think it has
 something to do with the fact that the url needs to be quoted so I
 search for an exact match.

When you manually build the query there's no need to have quotes around it. 
Can you try without the quotes?

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Daniel Naber
On Wednesday 27 October 2004 20:20, Kevin A. Burton wrote:

 http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/

(Kevin complains about shorter documents ranked higher)

This is something that can easily be fixed. Just use a Similarity 
implementation that extends DefaultSimilarity and overrides 
lengthNorm to simply return 1.0f there. You need to use that Similarity for 
both indexing and searching, i.e. it requires reindexing.
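
A minimal sketch of that (index path and analyzer are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Similarity;

// a Similarity that ignores the field length, so short documents aren't favoured
Similarity sim = new DefaultSimilarity() {
  public float lengthNorm(String fieldName, int numTokens) {
    return 1.0f;
  }
};

// use it for indexing (requires reindexing) ...
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.setSimilarity(sim);
// ... addDocument() calls ...
writer.close();

// ... and for searching
IndexSearcher searcher = new IndexSearcher("/path/to/index");
searcher.setSimilarity(sim);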

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Daniel Naber
On Wednesday 27 October 2004 22:47, Kevin A. Burton wrote:

 If the current behavior is all that happens this is fine... this way I
 can just get this behavior for new documents that are added.

You'll have to try it out, I'm not sure what exactly will happen.

 Also... why isn't this the default?

You'll probably end up with many documents having exactly the same ranking. 
And those documents will then be sorted in a seemingly random order (not 
really random: they will be sorted by internal ID, I think, but that's no 
useful order for most use cases).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Aliasing problem

2004-10-26 Thread Daniel Naber
On Tuesday 26 October 2004 19:22, Abhay Saswade wrote:

 I tried following but no luck
 I have written alias filter which returns 2 more tokens for doom3 as 3
 and doom

 I construct query +GAME:doom3
 QueryParser returns +GAME:doom3 3 doom

Your approach is correct, but QueryParser doesn't yet support analyzers 
which return more than one token at a position. There's already a patch 
about this in the bug tracking system.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sorting on multiple fields

2004-10-18 Thread Daniel Naber
On Monday 18 October 2004 21:25, Angelov, Rossen wrote:

 The
 first one represents date in format mmddMMHHSS and the second one
 are the article headlines.

The headlines are probably tokenized, right? Sorting on tokenized fields 
won't work; I think the API documentation contains some details about this.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sorting on multiple fields

2004-10-18 Thread Daniel Naber
On Monday 18 October 2004 23:39, Angelov, Rossen wrote:

 Is there any workaround for sorting on tokenized fields?

Just save the field a second time under a different name and use 
Field.Keyword() for that. Then you can use it for sorting, and still use 
the original field for searching.
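
For example (the field names are made up, "headline" stands for whatever tokenized field is searched, and searcher/query are assumed to exist):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;

Document doc = new Document();
doc.add(Field.Text("headline", headline));          // tokenized, used for searching
doc.add(Field.Keyword("headline_sort", headline));  // untokenized copy, used for sorting

// later, at search time:
Hits hits = searcher.search(query, new Sort("headline_sort"));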

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StopWord elimination pls. HELP

2004-10-17 Thread Daniel Naber
On Sunday 17 October 2004 05:23, Miro Max wrote:

 d.add(Field.Text("cont", cont));
 writer.addDocument(d);

 to get results from a database into lucene index. but
 when i check println(d) i can see the german stopwords
 too. how can i eliminate this?

Field.Text("field", cont), where cont is a String, will also store the 
original text in addition to indexing it. toString() will then show the 
stored text. In the index you won't have any stopwords.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Special field values

2004-10-13 Thread Daniel Naber
On Wednesday 13 October 2004 08:45, Michael Hartmann wrote:

 The field should store a vector of values that
 indicate whether or not a term exists in a document or not.

You can just add more than one field with the same name but different 
values per document, then searching for single values should work.
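
For example (field name and values are made up); a search for keyword:red, keyword:green or keyword:blue will then match this document:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(Field.Keyword("keyword", "red"));
doc.add(Field.Keyword("keyword", "green"));
doc.add(Field.Keyword("keyword", "blue"));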

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sorting and score ordering

2004-10-13 Thread Daniel Naber
On Wednesday 13 October 2004 19:53, Chris Fraschetti wrote:

 Is there a way I can (without recompiling) ... make the score have
 priority and then my sort take affect when two results have the same
 rank?

You can just (explicitly) sort by score and use some other field as a 
second sort key.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sorting and score ordering

2004-10-13 Thread Daniel Naber
On Wednesday 13 October 2004 20:44, Chris Fraschetti wrote:

 I haven't seen an example on how to apply two sorts to a search.. can
 you help me out with that?

Check out the documentation for Sort(SortField[] fields) and SortField.
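
For example, a sketch with the score as the primary key and an (assumed, untokenized) date field as the tie-breaker:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

Sort sort = new Sort(new SortField[] {
    SortField.FIELD_SCORE,                          // primary: relevance
    new SortField("date", SortField.STRING, true)   // secondary: newest first
});
Hits hits = searcher.search(query, sort);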

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory leak in Lucene 1.4?

2004-10-06 Thread Daniel Naber
On Wednesday 06 October 2004 19:48, Steve Rajavuori wrote:

 Is anyone aware of memory leaks in Lucene 1.4? I have an application
 that has been running fine with Lucene 1.2. I can write indexes for days
 and it never consumes more than 14 Meg of memory.

There was a leak with searching that got fixed in 1.4.2. I'm not aware of 
leaks in indexing. Please try to build a test case that demonstrates the 
problem, then it's usually quite easy to fix.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question regarding using Lucene or not

2004-10-04 Thread Daniel Naber
On Monday 04 October 2004 22:22, you wrote:

 1. How difficult it is to implement our own Similarity class that can do
 the things we want ?

It should be very easy. The API is described here: 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
I think in your case all methods (except one) that return a float can just 
return 1.0f. The one that doesn't return 1 then returns a value that 
represents the difference to the perfect value (well, more like 
1/difference).

 2. If there are more than one field that are percentage match like HP,
 can we also specify which field gets the preference while search.

If you implement the method mentioned above so that it always ranks some 
field higher than another, that should be possible.

But if you've only got 1000 documents (and that number won't increase) you 
could also just search for HP:cargo, put all matches in your own Match 
objects and then sort these via your own implementation of Java's 
compareTo().

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question regarding using Lucene or not

2004-10-02 Thread Daniel Naber
On Saturday 02 October 2004 02:06, [EMAIL PROTECTED] wrote:

 The parameters are both string and numeric. For example, the model
 should be Cargo and its HP value should be 55,000 or near it . If we
 specify tolerance value of 5000 then it should search for all the data
 files where model node is Cargo (definitive match) and HP value is
 between 50,000 to 60,000 with the one having 55,000 coming as the 100%
 match.

That's possible with Lucene, you'll need to parse the XML files and put the 
required data into the Lucene index. Then you can search with a query like 
this:

+model:cargo^0 +hp:[50000 TO 60000] hp:55000^10

This will match all documents which contain cargo in the model field and a 
value of 50000 to 60000 in the hp field. Matches with hp 55000 will be 
boosted so they appear on top. However, matches from 50000 to 54999 and 55001 
to 60000 will have the same ranking. To change that you will need to 
implement your own variation of Lucene's Similarity class.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using Proximity for Ranking

2004-09-24 Thread Daniel Naber
On Friday 24 September 2004 15:27, Olena Medelyan wrote:

 I know that I can
 use the slop operator for phrase search ("red fox"~3), but what I need
 should work for partial matching as well.

You can use the value of Integer.MAX_VALUE instead of 3 in your example, 
something like:

+red +fox +"red fox"~2147483647^10

Nutch does that I think (you might want to search the archives), but I have no 
clue how fast/slow it is.

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: problem with get/setBoost of document fields

2004-09-22 Thread Daniel Naber
On Wednesday 22 September 2004 18:44, Bastian Grimm [Eastbeam GmbH] wrote:

 if i set the d1 and f1 boost to 1.0f (default) the score returned by
 the HitCollector is 0.3xxx - shouldn't it be exactly 1.0 ?

See the documentation for getBoost:

Note: this value is not stored directly with the document in the index. 
Documents returned from IndexReader.document(int) and Hits.doc(int) may 
thus not have the same value present as when this field was indexed.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: range query problems

2004-09-17 Thread Daniel Naber
On Friday 17 September 2004 19:37, Derek Baker wrote:

 However, if I create a range query that I would expect to find that
 value, I get nothing. The range query string is: adzer:[# TO 0] (minus
 the quotes). As far as I can tell, this query string should find any
 value in the adzer fields that starts with a -.

Did you try building that query manually? Maybe even starting from null 
instead of #.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser.parse() and Lucene1.4.1

2004-09-16 Thread Daniel Naber
On Thursday 16 September 2004 19:38, Polina Litvak wrote:

  I ran my code first with lucene-1.3-final.jar, getting the query
 Field:(A AND -(B)) parsed into +Field:A -Field:B

This code:
Query query = QueryParser.parse("Field:(AAA AND -(BBB))", "field", new 
StandardAnalyzer());
System.out.println(query);

Will print this with Lucene 1.4.1:
+Field:aaa -Field:bbb

It will not work with A instead of AAA because A is a stopword. In 
other words, I still cannot reproduce your problem.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: boosting fields in MultiFieldQueryParser with different factors

2004-09-15 Thread Daniel Naber
On Wednesday 15 September 2004 18:06, Fiebig, Swen (init) wrote:

 is there a way to boost the different fields of a MultiFieldQueryParser
 with different factors? Or at least in the resulting Query?

The easiest way is probably to subclass MultiFieldQueryParser and implement 
a method that modifies the boosts (you can copy most of it from 
MultiFieldQueryParser.parse() and then call setBoost() on the queries).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser.parse() and Lucene1.4.1

2004-09-15 Thread Daniel Naber
On Wednesday 15 September 2004 21:58, Polina Litvak wrote:

 Does anyone know how to work around this new feature ?

I can't remember any changes in this area, but I just tried with the 
current version from CVS and the output is the one which you want.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: OutOfMemory example

2004-09-14 Thread Daniel Naber
On Tuesday 14 September 2004 08:32, Ji Kuhn wrote:

 The error is thrown in exactly the same point as before. This morning I
 downloaded Lucene from CVS, now the jar is lucene-1.5-rc1-dev.jar, JVM
 is 1.4.2_05-b04, both Linux and Windows.

Now I can reproduce the problem. I first tried running the code inside 
Eclipse, but the Exception doesn't occur there. It does occur on the 
command line.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: OutOfMemory example

2004-09-13 Thread Daniel Naber
On Monday 13 September 2004 15:06, Ji Kuhn wrote:

 I think I can reproduce memory leaking problem while reopening
 an index. Lucene version tested is 1.4.1, version 1.4 final works OK. My
 JVM is:

Could you try with the latest Lucene version from CVS? I cannot reproduce 
your problem with that version (Sun's Java 1.4.2_03, Linux).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Addition to contributions page

2004-09-13 Thread Daniel Naber
On Friday 10 September 2004 15:48, Chas Emerick wrote:

 PDFTextStream should be added to the 'Document Converters' section,
 with this URL  http://snowtide.com , and perhaps this heading:
 'PDFTextStream -- PDF text and metadata extraction'. The 'Author'
 field should probably be left blank, since there's no single creator.

I just added it.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Daniel Naber
On Thursday 09 September 2004 18:52, Doug Cutting wrote:

 I have not been
 able to construct a two-word query that returns a page without both
 words in either the content, the title, the url or in a single anchor.
 Can you?

Like this one?

konvens leitseite 

Leitseite is only in the title of the first match (www.gldv.org), konvens 
is only in the body.

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-09 Thread Daniel Naber
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

 I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files 
not being deleted after 1.4.1. Not sure if that could cause the problems 
you're experiencing.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Spam:too many open files

2004-09-07 Thread Daniel Naber
On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:

 A note to developers, the code checked into lucene CVS ~Aug 15th, post
 1.4.1, was causing frequent index corruptions. When I reverted back to
 version 1.4 I no longer am getting the corruptions.

Here are some changes from around that day:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java

Could you check which of those might have caused the problem? I guess 
there's not much the developers can do without the problem being 
reproducible.

regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Range query problem

2004-08-26 Thread Daniel Naber
On Thursday 26 August 2004 11:02, Alex Kiselevski wrote:

 I have a strange problem with range query PERIOD:[1 TO 9]
 It works only if the second parameter is equals or less than 9
 If it's greater than 9 , it finds no documents

You have to store your numbers so that they will appear in the right order 
when sorted lexicographically, e.g. save 1 as 01 if you save numbers up to 
99, or as 0001 if you save numbers up to 9999. You also have to use this 
format for searching, I think.
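
For example, a sketch with four-digit padding (the field name comes from the quoted query, doc is assumed to exist):

import java.text.DecimalFormat;
import org.apache.lucene.document.Field;

// pad to a fixed width at indexing time so lexicographic order equals numeric order
DecimalFormat df = new DecimalFormat("0000");      // enough for values up to 9999
doc.add(Field.Keyword("PERIOD", df.format(42)));   // stored as "0042"

// the query then has to use the same format, e.g. PERIOD:[0001 TO 0099]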

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: OutOfMemoryError

2004-08-17 Thread Daniel Naber
On Wednesday 18 August 2004 00:30, Terence Lai wrote:

   if (fsDir != null) {
 try {
   is.close();
 } catch (Exception ex) {
 }
   }

You close "is" here again, not fsDir. Also, it's a good idea to never ignore 
exceptions; you should at least print them out, even if it's just a 
close() that fails.

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: wildcard uppercase

2004-08-12 Thread Daniel Naber
On Thursday 12 August 2004 22:30, Kipping, Peter wrote:

 As you can see it's been lower cased and I get no hits. Looks like
 something is lowercasing the wildcard query. How can I make it not do
 that?

Try QueryParser's setLowercaseWildcardTerms(boolean).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Benchmark of filesystem cache for index vs RAMDirectory...

2004-08-08 Thread Daniel Naber
On Sunday 08 August 2004 03:40, Kevin A. Burton wrote:

 Would a HashMap implementation of RAMDirectory beat out a cached
 FSDirectory?

It's easy to test, so it's worth a try. Please check whether the attached 
patch makes any difference for you compared to the current implementation of 
RAMDirectory.

Regards
 Daniel

-- 
http://www.danielnaber.de
Index: RAMDirectory.java
===
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/store/RAMDirectory.java,v
retrieving revision 1.16
diff -u -r1.16 RAMDirectory.java
--- RAMDirectory.java	7 Aug 2004 11:19:28 -	1.16
+++ RAMDirectory.java	8 Aug 2004 09:01:19 -
@@ -18,8 +18,11 @@
 
 import java.io.IOException;
 import java.io.File;
+import java.util.HashMap;
 import java.util.Hashtable;
 import java.util.Enumeration;
+import java.util.Iterator;
+import java.util.Set;
 
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.InputStream;
@@ -31,7 +34,7 @@
  * @version $Id: RAMDirectory.java,v 1.16 2004/08/07 11:19:28 dnaber Exp $
  */
 public final class RAMDirectory extends Directory {
-  Hashtable files = new Hashtable();
+  HashMap files = new HashMap();
 
   /** Constructs an empty {@link Directory}. */
   public RAMDirectory() {
@@ -93,9 +96,11 @@
   public final String[] list() {
 String[] result = new String[files.size()];
 int i = 0;
-Enumeration names = files.keys();
-while (names.hasMoreElements())
-  result[i++] = (String)names.nextElement();
+Set names = files.keySet();
+for (Iterator iter = names.iterator(); iter.hasNext();) {
+  String element = (String) iter.next();
+  result[i++] = element;
+}
 return result;
   }
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Analyzing and Querying

2004-08-06 Thread Daniel Naber
On Friday 06 August 2004 08:37, Tino Schöllhorn wrote:

 I am aware that the Lucene Query-Api supports wildcards, but as far as I
 know I cannot add a * in front of a query-term.

That should be possible, but it will be slow if you have many terms. Another 
idea is to additionally index the word in reverse order: bergbahn -> 
nhabgreb. Then a query for nhab* will find all words that end with bahn (you 
can use a prefix query then, which is not as slow as a WildcardQuery but 
still slow).
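
A small sketch of the reversed-field idea (field names are made up, doc is assumed to exist):

import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// at indexing time, store each word a second time in reverse order
String word = "bergbahn";
String reversed = new StringBuffer(word).reverse().toString();   // "nhabgreb"
doc.add(Field.Text("contents", word));
doc.add(Field.Text("contents_reversed", reversed));

// a search for words ending in "bahn" becomes a prefix query on the reversed field
Query q = new PrefixQuery(new Term("contents_reversed", "nhab"));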

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Analyzing and Querying

2004-08-06 Thread Daniel Naber
On Friday 06 August 2004 13:28, Magnus Johansson wrote:

 Splitting compound words can be done quite effectively simply by using
 a large wordlist. I have done this for swedish.

It is, however, difficult to get right for German. On the one hand there are 
compounds in German with more than two parts, on the other hand there are 
extra characters in the middle of some compound words (e.g. Arbeit + Aufwand 
= ArbeitSaufwand). Also, the compounds have their inflectional endings, e.g. 
the plural of Bergbahn is Bergbahnen. At http://lemmi.intrafind.org you can 
see a demo that deals with almost all cases, even things like dazugekauftes 
(but it's not freely available).

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weighted queries

2004-08-06 Thread Daniel Naber
On Friday 06 August 2004 16:54, Eric Jain wrote:

(title:foo^4 OR abstract:foo^2 OR content:foo) AND
(title:bar^4 OR abstract:bar^2 OR content:bar)

That's not the way MultiFieldQueryParser will rewrite your query. To get this 
kind of query you have to parse it with QueryParser and then iterate 
recursively (in case of a BooleanQuery) over it, using Java's instanceof. Each 
term needs to be replaced with a BooleanQuery over all the fields you want to 
search in.
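
A sketch of such a rewrite for the three fields above (the method and boost values are assumptions, and query types other than TermQuery and BooleanQuery are simply left alone here):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

static Query expand(Query q) {
  if (q instanceof TermQuery) {
    // replace a single term with a boosted OR over the three fields
    String text = ((TermQuery) q).getTerm().text();
    BooleanQuery expanded = new BooleanQuery();
    TermQuery title = new TermQuery(new Term("title", text));
    title.setBoost(4);
    TermQuery abstr = new TermQuery(new Term("abstract", text));
    abstr.setBoost(2);
    expanded.add(title, false, false);
    expanded.add(abstr, false, false);
    expanded.add(new TermQuery(new Term("content", text)), false, false);
    return expanded;
  } else if (q instanceof BooleanQuery) {
    // recurse into boolean queries, keeping the required/prohibited flags
    BooleanClause[] clauses = ((BooleanQuery) q).getClauses();
    for (int i = 0; i < clauses.length; i++) {
      clauses[i].query = expand(clauses[i].query);
    }
  }
  return q;
}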

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Negative Boost

2004-08-04 Thread Daniel Naber
On Wednesday 04 August 2004 13:19, Terry Steichen wrote:

 I can't get negative boosts to work with QueryParser.  Is it possible to do
 so?

Isn't that the same as using a boost < 1, e.g. 0.1? That should be possible.

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TermFreqVector Beginner Question

2004-07-29 Thread Daniel Naber
On Thursday 29 July 2004 17:31, Matt Galloway wrote:

  Field.Text(String name, Reader value, boolean storeTermVector)
Field.UnStored(String name, String value, boolean storeTermVector)

DO NOT store the contents of the field

This part of the API is known to be difficult and will be fixed for Lucene 2.0 
(which is the next version). Till then, I'll try to remember to extend the 
documentation.

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: updating the index created for database search

2004-07-26 Thread Daniel Naber
On Monday 26 July 2004 11:37, lingaraju wrote:

 2) When I fetch the rows from the database in order to update or insert in
 index how to know which record is modified in database and which record is
 not present is index

Your database will need a last modified column. Then you can select those 
rows that have been modified since the last update and for each row check if 
it's in the Lucene index. If it is, delete it there and re-add the new 
version. If it's not, add it. To delete documents you will probably need to 
iterate over all your IDs in the Lucene index and check if they are still in 
the database. If that's too inefficient you could check if you can do it the 
way the file system indexer (IndexHTML in Lucene's demo) does it.
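
A rough sketch of such an update run (table, column and field names are made up, conn and lastRun are assumed to exist; note that IndexReader.delete(Term) simply deletes nothing if the id isn't in the index yet, which covers the "is it there already" check):

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// collect the rows modified since the previous index run
List rows = new ArrayList();
PreparedStatement stmt = conn.prepareStatement(
    "SELECT id, content FROM docs WHERE last_modified > ?");
stmt.setTimestamp(1, lastRun);
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
  rows.add(new String[] { rs.getString("id"), rs.getString("content") });
}
rs.close();

// 1) delete the old versions (a no-op for rows that aren't in the index yet)
IndexReader reader = IndexReader.open("/path/to/index");
for (int i = 0; i < rows.size(); i++) {
  reader.delete(new Term("id", ((String[]) rows.get(i))[0]));
}
reader.close();

// 2) re-add the current versions
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
for (int i = 0; i < rows.size(); i++) {
  String[] row = (String[]) rows.get(i);
  Document doc = new Document();
  doc.add(Field.Keyword("id", row[0]));
  doc.add(Field.Text("content", row[1]));
  writer.addDocument(doc);
}
writer.close();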

BTW, please don't cross-post to both lists.

Regards
 Daniel
 
-- 
Daniel Naber, IntraFind Software AG, Tel. 089-8906 9700


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: updating the index created for database search

2004-07-26 Thread Daniel Naber
On Monday 26 July 2004 13:31, lingaraju wrote:

 If it is new record  through which class we have to check that record is
 present in the index

Just search for the id with a TermQuery. If you get a hit, the record is in 
the index already.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: RangeQuery on Numeric values

2004-07-23 Thread Daniel Naber
On Friday 23 July 2004 16:58, Terence Lai wrote:

 I am currently using Lucene 1.4 Final. I want to construct a query that
 matches a numeric range. I believe that the RangeQuery defined in Lucene
 API uses the string comparision. It does not work for numeric contents.
 Does anyone know how to create a numeric range query?

RangeQuery works for numbers, too, but you have to store them in the index so 
they are sorted correctly when sorted alphabetically. For example, if your 
numbers are between 0 and 1000 you have to store them as strings from 0000 
to 1000.

Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene cutomized indexing

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 17:28, John Wang wrote:

I have asked to make the Lucene API less restrictive many many many
 times but got no replies.

I suggest you just change it in your source and see if it works. Then you can 
still explain what exactly you did and why it's useful. From the developers 
point-of-view having things non-final means more stuff is exposed and making 
changes is more difficult (unless one accepts that derived classes may break 
with the next update).

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene cutomized indexing

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 18:12, John Wang wrote:

 They make sure during deployment their versions
 gets loaded before the same classes in the lucene .jar.

I don't see why people cannot just make their own lucene.jar. Just remove 
the "final" keywords and recompile. After all, Lucene is Open Source.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: join two indexes

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 19:19, Sergio wrote:

 i want to join two lucene indexes but i dont know how to do that.

There are two addIndexes methods in IndexWriter which you can use to 
write your own small merge tool (a ready-to-use tool for index merging 
doesn't exist AFAIK).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene vs. MySQL Full-Text

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 21:29, Tim Brennan wrote:

 Does anyone out there have
 anything more concrete they can add?

Stemming is still on the MySQL TODO list: 
http://dev.mysql.com/doc/mysql/en/Fulltext_TODO.html

Also, for most people it's easier to extend Lucene than MySQL (as MySQL is 
written in C(++?)) and there are more powerful queries in Lucene, e.g. 
fuzzy phrase search.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query across multiple fields scenario not handled by MultiFieldQueryParser

2004-07-19 Thread Daniel Naber
On Sunday 18 July 2004 18:39, Thomas Plümpe wrote:

 Does anybody here know which changes I
 would have to make to QueryParser.jj to get the functionality described?

I haven't tried it but I guess you need to change the getXXXQuery() methods so 
they return a BooleanQuery. For example, getFieldQuery currently might return 
a TermQuery; you'll need to change that so it returns a BooleanQuery with two 
TermQuerys. These two queries would have the same term, but a different 
field.

Another approach is to leave QueryParser alone and modify the query after it 
has been parsed by recursively iterating over the parsed query, replacing 
e.g. TermQuerys with BooleanQuerys (just like described above).

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene shouldn't use java.io.tmpdir

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 09:04, Morus Walter wrote:

 Lucene might work around this by creating a directory in java.io.tmpdir
 setting apropriate permission (can that be done with java os
 independently?) and put the lock there.

But if everybody can delete your lock files, that would be a security 
problem. Deleting stale locks isn't a problem, but how would one decide if 
a lock is stale?

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Browse by Letter within a Category

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 17:48, O'Hare, Thomas wrote:

 Does Lucene have a beginning of line query syntax, like the regular
 expression ^ symbol? For example,
 
 title:^A*

If your title isn't tokenized the ^ is implicit, I think. As usual, if 
your title is tokenized you can easily add another field with the same 
value as title, but in untokenized form.

 What is the best way to sort by a date? I currently have a date field
 that is used for searching in the format MMDD as a Field.Keyword. 

Lucene 1.4 added an IndexSearcher.search() method that takes a Sort() 
object which lets you sort by any field. Your date field can be used for 
that, as it has the correct format (because sorting it alphabetically will 
give you the right order already).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Exact match search

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 21:17, [EMAIL PROTECTED] wrote:

 I want to match documents that exactly equal a certain value, not just
 contain it.

Just don't tokenize your fields, and make sure that the query also doesn't 
get tokenized (the easiest way to ensure that is probably to not use 
QueryParser but just build a TermQuery directly from the user's input).
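
For example (field name, value, userInput and the searcher are assumptions):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// indexing: Field.Keyword() stores the value as one untokenized term
Document doc = new Document();
doc.add(Field.Keyword("title", "Annual Report 2004"));

// searching: build the TermQuery directly from the user's input, bypassing
// QueryParser, so only documents whose title is exactly that string will match
Query query = new TermQuery(new Term("title", userInput));
Hits hits = searcher.search(query);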

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Browse by Letter within a Category

2004-07-09 Thread Daniel Naber
On Friday 09 July 2004 04:27, O'Hare, Thomas wrote:

 Searcher.search("category:\"Products\" AND title:\"A*\"", new
 Sort("title"));

You can only sort on fields which are not tokenized I think. So add an extra 
field with the title, but untokenized, just for sorting. Also, A* might 
slow down the query execution so you might want to add another field which 
just contains the first letter so there's no need for the asterisk.

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Daniel Naber
On Friday 09 July 2004 16:15, Armbrust, Daniel C. wrote:

 (since no full path was given with the error - has this been fixed?) and C)

That's fixed in Lucene 1.4.

 I think the locks should go back in the index, and we should fall back or
 give an option to put them elsewhere for the case of the read-only index.

There's already a Java system property that lets you specify the lock 
directory.

Regards
 Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: demo indexing problems on linux

2004-06-18 Thread Daniel Naber
On Thursday 17 June 2004 21:10, Morris Mizrahi wrote:

 When I run org.apache.lucene.demo.IndexHTML on Linux the indexer works
 fine when I am creating a new index (e.g. using -create -index option).
 But when I run the indexer again (-index without the -create option) for
 updates it does not properly update the index.

Morris,

what exactly happens when you run the update? Does it miss files that have 
been modified? I just tried it on Linux and it works fine. Files that have 
been modified (according to their file date) are deleted and then added 
again to the index.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: a search like Google

2004-02-13 Thread Daniel Naber
On Friday 13 February 2004 07:43, Morus Walter wrote:

 If you want to use query parser you can parse the query with different
 default fields, set boost factors on the resulting queries and join them
 with a boolean query.
 This will give you
 (+title:i +title:love +title:lucene)^2 (+author:i +author:love
 +author:lucene) (+content:i +content:love +content:lucene)

This will not match documents that contain only "I love" in the title and 
"lucene" in the body. Doesn't seem like a problem for this example, as it 
looks like a phrase query, but I don't think that this is Google's 
behavior. I worked around that by prepending the title to the body field. 
Let me know if someone has a better solution.

The correct query would have to look something like this (leaving out the 
boosting):

+(title:i body:i) +(title:love body:love) +(title:lucene body:lucene)

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Matching a single document

2004-02-09 Thread Daniel Naber
On Sunday 08 February 2004 20:07, Chris Kimm wrote:

  Two possible
 approaches, both of which seem ungainly, are 1) creating a temporary
 index for each document being indexed 

You can use a RAMDirectory for indexing so nothing needs to be written to 
disk. This should be quite fast.
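
A minimal sketch of that approach (doc and query are assumed to exist already):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

// index the single document in memory, then run the query against it
RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(dir);
boolean matches = searcher.search(query).length() > 0;
searcher.close();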

 or 2) Writing a class that matches 
 document Fields with Query Terms.  This second approach would require a
 way to extract individual Terms from Queries.  Is that possible?

Yes, you need to recursively iterate over all parts of the query. For 
example, a boolean query may consist of other boolean queries. You need to 
go down until you've got a TermQuery, which holds the term itself (or you 
could use rewrite() to make this easier).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search DB data

2004-01-07 Thread Daniel Naber
On Wednesday 07 January 2004 17:08, Kevin Price wrote:

 I will have users enter in their search text.  The data that needs to be
 searched is stored in a DB.   So I would like to do a query against the
 Database, and then somehow use Lucene to parse those results for all of
 the searching options (Boolean operators, multiple words,
 wildcards,fuzzy, etc).

Lucene works on its own index files. So the usual way is to index the data 
in the database with Lucene and then query Lucene. Parsing the database 
results during search time with Lucene doesn't seem to be possible/useful. 
The only thing one might want to use at runtime when the database does the 
actual query is maybe the query parser (in order to build an SQL query 
from the user's input).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexHTML example on Jakarta Site

2004-01-02 Thread Daniel Naber
On Friday 02 January 2004 20:50, Leo Galambos wrote:

 IMHO Lucene is library/API and unless you are
 a JAVA developer, it does not fit your needs.

One reason for the confusion might be that the homepage states that Lucene 
is "a full-featured text search engine". IMHO this should be replaced by 
"a powerful Java library for full-text indexing" or something like that.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]