Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
Many times I've written ad-hoc code that pulls in data from an RDBMS and 
builds a Lucene index. The use case is a typical database-driven dynamic 
website which would be a hassle to spider (say, due to tricky 
authentication).

I had a feeling this had been done in a general manner but didn't see 
any code in the sandbox, nor did any searches turn it up.

I've spent a few mins thinking this thru - what I'd expect to be able 
to configure is:

1. JDBC Driver + conn params
2. Query to do a 1 time full index
3. Query to show new records
4. Query to show changed records
5. Query to show deleted records
6. Query columns to Lucene Field name mapping
7. Type of each field name (e.g. the equivalent of the args to the 
Field ctor)

So a simple example, taking item 2, is:

query: select url, name, body from foo

(now the column to field mapping)
col 1 = url
col 2 = title
col 3 = contents

(now the field types for each named field)
url      = Field( ...store=true, index=false)
title    = Field( ...store=true, index=true)
contents = Field( ...store=false, index=true)

And voilà: nice, elegant, data-driven indexing.
Does it exist?
Should it? :)
PS
I know in the more general form, "query" needs to be replaced by 
"queries" above, the updated query may need some timestamp variable 
expansion, and the queries possibly need paging to deal w/ lamo DBs 
like mysql that don't have cursors for large result sets...
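To make this concrete, here's a rough sketch of what the core of such a
batch indexer could look like (all class names and the index path are
made up; the Field constructor is the Lucene 1.4-era one):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Hypothetical data-driven batch indexer: one SQL query (item 2),
// a column-to-field mapping (item 6), and per-field flags (item 7).
public class RdbmsIndexer {
    public static void fullIndex(Connection conn, String sql,
                                 String[] fieldNames,  // e.g. {"url", "title", "contents"}
                                 boolean[] store, boolean[] index)
            throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(sql);
        while (rs.next()) {
            Document doc = new Document();
            for (int col = 0; col < fieldNames.length; col++) {
                // JDBC columns are 1-based; tokenize only the indexed fields
                doc.add(new Field(fieldNames[col], rs.getString(col + 1),
                                  store[col], index[col], index[col]));
            }
            writer.addDocument(doc);
        }
        rs.close();
        stmt.close();
        writer.optimize();
        writer.close();
    }
}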




Re: PHP-Lucene Integration

2005-02-07 Thread Sanyi
Hi!

Can you please explain how you implemented the Java and PHP parts to let 
them communicate through this bridge?
The bridge's project summary talks about a java application-server or a 
dedicated java process, and I'm not into Java that much.
Currently I'm using a self-written command-line search program that 
outputs its results to standard output.
I guess your solution must be better ;)

If the communication parts of your code aren't top secret, can you please 
share them with me/us?

Regards,
Sanyi








Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread Aad Nales
Yep,
This is how we do it.
We have a search.xml that maps database fields to search fields, and a 
parameter part that describes the 'click for detailed result url' and 
the parameter names (based on the search fields). In this xml we also 
describe how the different fields should be stored; for instance, for a 
number of large text fields we use the unstored option.

The framework that we have built around this has an element that we call 
a detailer. This detailer creates a lucene Document with the fields as 
specified in the search.xml.

To illustrate here is the code that specifies the detailer for a forum.
-- XML -
<documenttype id="FORUM" index="general" defaultfield="body">
  <fields>
    <field property="messageid" searchfield="messageid" type="unindexed"
           key="true"/>
    <field property="instanceid" searchfield="instanceid" type="unindexed"/>
    <field property="subject" searchfield="title" type="split" maxwords="8"/>
    <field property="body" searchfield="default" type="split" maxwords="20"/>
    <field property="aka_username" searchfield="username" type="keyword"/>
    <field property="modifiedDateAsDate" searchfield="modifieddate"
           type="keyword"/>
  </fields>
  <action uri="/forum/viewMessage.do"
          image="/htmlarea/images/cops_insert_threadlink.gif">
    <parameter property="messageid" name="messageid"/>
    <parameter property="instanceid" name="instanceid"/>
  </action>
  <analyzer
      classname="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</documenttype>
 END XML ---

Please note:
Messageid is the key field here; when we search the index we use a 
combined TYPE + KEY id to filter out double hits on the same document 
(not unusual in, for instance, a long forum thread).

Per type of document we also specify what picture to show in the 
result (image), in what index the result should be written, and what 
the general search field is (if the user submits a query without 
"search all" and without a field specified).

We have added the 'split' keyword, which makes it possible to search a 
long text but only store a bit of it in the resulting hit.

The reindex is pretty straightforward: we build a series of detailers for 
all possible document types, run through the database, and call the 
right detailer from a HashMap.
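To make the detailer idea concrete, here is a minimal sketch (the
Detailer interface and Reindexer class are hypothetical names for what
Aad describes; the Field factory calls are Lucene 1.4-era API):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Builds a lucene Document for one database record of a given type.
interface Detailer {
    Document detail(Map record);
}

// Detailer for the FORUM documenttype shown in the XML above.
class ForumDetailer implements Detailer {
    public Document detail(Map record) {
        Document doc = new Document();
        // type="unindexed": stored but not indexed (the key field)
        doc.add(Field.UnIndexed("messageid", (String) record.get("messageid")));
        // type="split": long text, indexed but not stored (simplified here)
        doc.add(Field.UnStored("default", (String) record.get("body")));
        // type="keyword": stored and indexed as a single untokenized term
        doc.add(Field.Keyword("username", (String) record.get("aka_username")));
        return doc;
    }
}

// The reindex dispatch: one detailer per document type, looked up by name.
class Reindexer {
    private final Map detailers = new HashMap();

    Reindexer() {
        detailers.put("FORUM", new ForumDetailer());
        // ...one entry per document type...
    }

    Document toDocument(String type, Map record) {
        return ((Detailer) detailers.get(type)).detail(record);
    }
}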

We have not included the JDBC stuff, since the application is always 
running in Tomcat-Struts and we cache most of the database reads 
(a completely different story).

Queries on new and changed records seem to only make sense if asked in a 
context of time (right?). We have not needed them yet. The mapping can be 
queried from a singleton java class (SearchConfiguration).

We are currently adding functionality to store 'user structured data', 
best imagined as user-defined input forms that are described in XML and 
are then stored as XML in the database. We query these documents using 
Lucene. These documents end up in the same index, but this is quite 
manageable by using specialized detailers. For these documents the type 
is more important than for the 'normally' stored documents. For the 
latter situation the search logic assumes that the query is 
appropriately configured by the application.

I am not sure if this is the kind of solution that you are looking for, 
but everything we produce is 100% open source.

Cheers,
Aad

Retrieve all documents - possible?

2005-02-07 Thread Karl Koch
Hi,

is it possible to retrieve ALL documents from a Lucene index? This should
then actually not be a search...

Karl




Re: Starts With x and Ends With x Queries

2005-02-07 Thread Erik Hatcher
On Feb 7, 2005, at 2:07 AM, sergiu gordea wrote:
Hi Erick,

In order to prevent extremely slow WildcardQueries, a Wildcard term 
must not start with one of the wildcards * or ?.

I don't read that as saying you cannot use an initial wildcard 
character, but rather that if you use a leading wildcard character you 
risk performance issues.  I'm going to change "must" to "should".
Will this change be available in the next release of Lucene? How do you 
plan to implement this? Will this be available as an attribute of 
QueryParser?
I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  All 
I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

Erik
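For illustration, a tiny sketch of that difference (the field name and
term are made up; Lucene 1.4-era API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LeadingWildcardDemo {
    public static void main(String[] args) {
        // Building the query directly works (at the risk of performance).
        Query direct = new WildcardQuery(new Term("name", "*ith"));
        System.out.println(direct);

        // QueryParser rejects the same leading wildcard.
        try {
            QueryParser.parse("name:*ith", "name", new StandardAnalyzer());
        } catch (ParseException expected) {
            System.out.println("QueryParser refused: " + expected.getMessage());
        }
    }
}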


Re: Retrieve all documents - possible?

2005-02-07 Thread Bernhard Messer
you could use something like:

int maxDoc = reader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
    Document doc = reader.document(i);
}
Bernhard




Re: Retrieve all documents - possible?

2005-02-07 Thread Kelvin Tan
Don't forget to test if a document is deleted with reader.isDeleted(i)
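Putting the two together, the loop might look like this (the index path
is made up; Lucene 1.4-era API):

IndexReader reader = IndexReader.open("/path/to/index");
int maxDoc = reader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
    if (reader.isDeleted(i)) {
        continue;  // skip documents that are only marked as deleted
    }
    Document doc = reader.document(i);
    // ...use doc...
}
reader.close();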







Re: Retrieve all documents - possible?

2005-02-07 Thread Andrzej Bialecki
Karl Koch wrote:
Hi,
is it possible to retrieve ALL documents from a Lucene index? This should
then actually not be a search...
You are right. Just use the IndexReader.document(int).
--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Similarity coord,lengthNorm

2005-02-07 Thread Michael Celona
I have varying length text fields which I am searching on.  I would like
relevancy to be dictated predominantly by the number of terms in my query
that match.  Right now I am seeing a high relevancy for a single word
matching in a small document even though all the terms in my query don't
match.  Does anyone have an example of a custom Similarity subclass which
overrides the coord and lengthNorm methods?

 

Thanks..

Michael 



RE: Similarity coord,lengthNorm

2005-02-07 Thread Michael Celona
Would fixing the lengthNorm to 1 fix this problem?

Michael




Re: Similarity coord,lengthNorm

2005-02-07 Thread Erik Hatcher
On Feb 7, 2005, at 8:53 AM, Michael Celona wrote:
Would fixing the lengthNorm to 1 fix this problem?
Yes, it would eliminate the length of a field as a factor.
Your best bet is to set up a test harness where you can try out various 
tweaks to Similarity, but setting the length normalization factor to 
1.0 may be all you need to do, as the coord() takes care of the other 
factor you're after.

Erik
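For reference, a minimal sketch of such a subclass (assuming the Lucene
1.4 Similarity API; everything not overridden is inherited from
DefaultSimilarity):

import org.apache.lucene.search.DefaultSimilarity;

// Flatten length normalization so short fields don't dominate scoring;
// the inherited coord() still rewards matching more of the query terms.
public class FlatLengthSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}

Set it with writer.setSimilarity(...) and searcher.setSimilarity(...),
and reindex: lengthNorm() is baked into the norms at indexing time.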


RE: Similarity coord,lengthNorm

2005-02-07 Thread Chuck Williams
Hi Michael,

I'd suggest first using the explain() mechanism to figure out what's
going on.  Besides lengthNorm(), another factor that is likely skewing
your results in my experience is idf(), which Lucene typically makes
very large by squaring the intrinsic value.  I've found it helpful to
flatten lengthNorm(), tf() and idf() relative to what is used in
DefaultSimilarity.  There is a comparative evaluation of Similarity's
going on now.  You might consider looking at these:

Bug 32674 has a WikipediaSimilarity posted that you might want to try.
You might want to flatten lengthNorm() even further (e.g. all the way to
1.0), but I'd suggest trying it as is first.  If you try it, please post
your assessment.  Here's the link:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674

You also might find it interesting to read the thread entitled "RE:
Scoring benchmark evaluation.  Was RE: How to proceed with Bug 31841 -
MultiSearcher problems with Similarity.docFreq() ?" on lucene-dev, as
this contains a discussion of many of the issues.

Good luck,

Chuck




Query Analyzer

2005-02-07 Thread Ravi
How do I set the analyzer when I build the query in my code instead of
using a query parser?

Thanks in advance
Ravi. 






Re: Query Analyzer

2005-02-07 Thread Erik Hatcher
On Feb 7, 2005, at 11:29 AM, Ravi wrote:
How do I set the analyzer when I build the query in my code instead of
using a query parser ?
You don't.  All terms you use for any Query subclasses you instantiate 
must match exactly the terms in the index.  If you need an analyzer to 
do this then you're responsible for doing it yourself, just as 
QueryParser does underneath.  I do this myself in my current 
application like this:

private Query createPhraseQuery(String fieldName, String string,
                                boolean lowercase) {
    RossettiAnalyzer analyzer = new RossettiAnalyzer(lowercase);
    TokenStream stream = analyzer.tokenStream(fieldName,
                                              new StringReader(string));

    PhraseQuery pq = new PhraseQuery();
    Token token;
    try {
        while ((token = stream.next()) != null) {
            pq.add(new Term(fieldName, token.termText()));
        }
    } catch (IOException ignored) {
        // ignore - shouldn't get an IOException on a StringReader
    }

    if (pq.getTerms().length == 1) {
        // optimize single term phrase to TermQuery
        return new TermQuery(pq.getTerms()[0]);
    }

    return pq;
}
Hope that helps.
Erik


Re: PHP-Lucene Integration

2005-02-07 Thread [EMAIL PROTECTED]
Howdy,
For starters, compile and install the java bridge (and if necessary 
recompile PHP and Apache2) and make sure it works (there's a test php 
file supplied).

Then, here's a simplified part of my code, just to give you an example 
how it works. This is the part that does the searching, indexing is done 
in a similar way.

PHP:
...some code here for HTML page setup etc...

$lucene_dir = $GLOBALS["lucene_dir"];
java_set_library_path("/path/to/your/custom/lucene-classes.jar");
$obj = new Java("searcher"); // "searcher" is the custom written class
                             // that does the actual searching and data output
$writer = new Java("java.io.StringWriter");
$obj->setWriter($writer);
$obj->initSearch($lucene_dir);
$obj->getQuery($query); // $query is the user supplied query from the
                        // HTML form, not visible here

// get the last exception
$e = java_last_exception_get();
if ($e) {
    // print error
    echo $e->toString();
} else {
    echo $writer->toString();
    $writer->flush();
    $writer->close();
}
// clear the exception
java_last_exception_clear();
-
JAVA (custom written class located in the 
/path/to/your/custom/lucene-classes.jar):

import ...whatever is needed here for the class...

public class searcher {
    IndexReader reader  = null;
    IndexSearcher s     = null;  // the searcher used to open/search the index
    Query q             = null;  // the Query created by the QueryParser
    BooleanQuery query  = new BooleanQuery();
    Hits hits           = null;  // the search results

    public Writer out;

    public void setWriter(Writer out) {
        this.out = out;
    }

    public void initSearch(String indexName) throws Exception {
        try {
            File indexFile = new File(indexName);
            Directory activeDir = FSDirectory.getDirectory(indexFile, false);
            if (IndexReader.isLocked(activeDir)) {  // isLocked() is static
                //out.write("Lucene index is locked, waiting 5 sec.");
                Thread.sleep(5000);
            }
            reader = IndexReader.open(indexName);
            s = new IndexSearcher(reader);
            //out.write("Index opened");
        } catch (Exception e) {
            throw new Exception(e.getMessage());
        }
    }

    public void getQuery(String queryString) throws Exception {
        int totalhits = 0;
        Analyzer analyzer = new StandardAnalyzer();

        String[] queryFields = {"field1", "field2", "field3", "field4", "field5"};
        float[] boostFields  = {10, 6, 2, 1, 1};

        try {
            for (int i = 0; i < queryFields.length; i++) {
                q = QueryParser.parse(queryString, queryFields[i], analyzer);
                if (boostFields[i] > 1)
                    q.setBoost(boostFields[i]);
                query.add(q, false, false);
            }
        } catch (ParseException e) {
            throw new Exception(e.getMessage());
        }

        try {
            hits = s.search(query);
        } catch (Exception e) {
            throw new Exception(e.getMessage());
        }

        totalhits = hits.length();

        if (totalhits == 0) {  // if we find no hits, tell the user
            out.write("<br>I'm sorry I couldn't find your query: " + queryString);
        } else {
            for (int i = 0; i < totalhits; i++) {
                Document doc = hits.doc(i);
                String field1 = doc.get("field1");
                String field2 = doc.get("field2");
                String field3 = doc.get("field3");
                String field4 = doc.get("field4");
                String field5 = doc.get("field5");
                out.write("Field1: " + field1 + ", Field2: " + field2 +
                          ", Field3: " + field3 + ", Field4: " + field4 +
                          ", Field5: " + field5 + "<br>");
            }
        }
    }
}




RE: Query Analyzer

2005-02-07 Thread Ravi
That worked. Thanks a lot. 




Re: PHP-Lucene Integration

2005-02-07 Thread Owen Densmore
Wow, thanks all for the great spectrum of possibilities.  We'll be 
doing a design review in a week or two with the client and we'll find 
out what way would be best for their site.  I'll report back then.

Thanks again, what a group!
Owen


Re: Reconstruct segments file?

2005-02-07 Thread Ian Soboroff
Doug Cutting [EMAIL PROTECTED] writes:

 Ian Soboroff wrote:
 I've looked over the file formats web page, and poked at a known-good
 segments file from a separate, similar index using od(1) and such.  I
 guess what I'm not sure how to do is to recover the SegSize from the
 segment I have.

 The SegSize should be the same as the length in bytes of any of the 
 .f[0-9]+ files in the segment.  If your segment is in compound format 
 then you can use IndexReader.main() in the current SVN version to list 
 the files and sizes in the .cfs file, including its contained .f[0-9]+ 
 files.

Thanks, Doug, that is a huge help.

BTW, the fileformats.html page on the Lucene web site is incorrect
with regards to the segments file.  The description should read:

Segments --> Format, Version, Counter, SegCount, 
 <SegName, SegSize>^SegCount

That is, the Counter field is missing.  The Counter field is a UInt32.
Counter is used to generate the next segment name (see
IndexWriter.newSegmentName()).

Speaking of Counter, I have a dumb question.  If the segments are
named using an integer counter which is incremented, what is the point
in converting that counter into a string for the segment filename?
Why not just name the segments e.g. 1.frq, etc.?

Ian







Re: Reconstruct segments file?

2005-02-07 Thread Doug Cutting
Ian Soboroff wrote:
Speaking of Counter, I have a dumb question.  If the segments are
named using an integer counter which is incremented, what is the point
in converting that counter into a string for the segment filename?
Why not just name the segments e.g. 1.frq, etc.?
The names are prefixed with an underscore, since it turns out that some 
filesystems have trouble (DOS?) with certain all-digit names.  Other 
than that, they are integers, just with a large radix.

Doug
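In code terms, roughly (a sketch; the real logic lives in
IndexWriter.newSegmentName()):

// counter 12345 becomes segment name "_9ix" (radix 36)
String segmentName = "_" + Integer.toString(counter, Character.MAX_RADIX);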


Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
Nice, very similar to what I was thinking of, where the most significant 
difference is probably just that I was thinking of a batch indexer, not 
one embedded in a web container. Probably a worthwhile contribution to 
the sandbox.


Fwd: SearchBean?

2005-02-07 Thread Erik Hatcher
I want to double-check with the user community now that I've run this 
past the lucene-dev list.

Anyone using SearchBean from the Sandbox?  If so, please speak up and 
let me know what it offers that the sort feature does not.  If this is 
now essentially deprecated, I'd like to remove it.

Thanks,
Erik
Begin forwarded message:
From: Erik Hatcher [EMAIL PROTECTED]
Date: February 6, 2005 10:02:37 AM EST
To: Lucene List lucene-dev@jakarta.apache.org
Subject: SearchBean?
Reply-To: Lucene Developers List lucene-dev@jakarta.apache.org
Is the SearchBean code in the Sandbox still useful now that we have 
sorting in Lucene 1.4?  If so, what does it offer that the core does 
not provide now?

As I'm cleaning up the sandbox and migrating it to a contrib area, 
I'm evaluating the pieces and making sure it makes sense to keep or if 
it is no longer useful or should be reorganized in some way.

Erik


Storage Cost of Indexed, Untokenized Fields

2005-02-07 Thread Todd VanderVeen
Is there an additional storage cost to flagging an untokenized, indexed 
field as stored? Is the flag just for indicating that it be returned in 
result sets? I assume storage for tokenized fields is managed 
separately, but am curious if untokenized fields are resolved from the 
native index structure?

Thanks,
Todd VanderVeen


RangeQuery With Date

2005-02-07 Thread Luke Shannon
Hi;

I am working on a set of queries that allow you to find modification dates
before, after and equal to a given date.

Here are some of the before queries I have been playing with. I want a query
that pulls up dates modified before Nov 11 2004:

Query query = new RangeQuery(null, new Term("modified", "11/11/04"), false);

This one doesn't work. It turns up all the documents in the index.

Query query = QueryParser.parse("modified:[1/1/00 TO 11/11/04]", "subject",
    new StandardAnalyzer());

This works, but I don't like having to specify the begin date like this.

Query query = QueryParser.parse("modified:[null TO 11/11/04]", "subject",
    new StandardAnalyzer());

This throws an exception.

How are others doing a query like this?

Thanks,

Luke






Re: Similarity coord,lengthNorm

2005-02-07 Thread Andrzej Bialecki
I'm releasing next week a new version of Luke, which includes a custom 
Similarity designer (using Rhino JavaScript engine) - it makes 
experimenting with Similarity super-easy.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: RangeQuery With Date

2005-02-07 Thread Luke Francl
Your dates need to be stored in lexicographical order for the RangeQuery
to work.

Index them using this date format: YYYYMMDD.

Also, I'm not sure if the QueryParser can handle range queries with only
one end point. You may need to create this query programmatically.

Regards,
Luke Francl
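A sketch of both halves (the field name and date value are made up;
Lucene 1.4-era API):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeQuery;

// Index time: format the date so lexicographic order equals
// chronological order.
SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
String modified = fmt.format(new Date());           // e.g. "20041111"
Field dateField = Field.Keyword("modified", modified);

// Search time: an open-ended "before" range built programmatically
// (a null lower term means no lower bound; false makes the bounds
// exclusive).
Query before = new RangeQuery(null, new Term("modified", "20041111"), false);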


Re: RangeQuery With Date

2005-02-07 Thread Chris Hostetter
: Your dates need to be stored in lexicographical order for the RangeQuery
: to work.
:
: Index them using this date format: YYYYMMDD.
:
: Also, I'm not sure if the QueryParser can handle range queries with only
: one end point. You may need to create this query programmatically.

and when creating them programmatically, you need to use the exact same
format they were indexed in.  Assuming I've correctly guessed what your
indexing code looks like, you probably want...

Query query = new RangeQuery(null, new Term("modified", "20041111"), false);




-Hoss





Re: RangeQuery With Date

2005-02-07 Thread Luke Shannon
Bingo. Thanks!

Luke



Re: When are deletions permanent?

2005-02-07 Thread yahootintin . 1247688
When you close the IndexReader you know the delete is committed to disk.  I
believe calling the commit method will also guarantee that all changes are
written to disk.
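In code, the delete-then-add update cycle looks roughly like this (the
path, field name, and document are made up; Lucene 1.4-era API):

// Delete the old version; IndexReader.close() commits the deletion to disk.
IndexReader reader = IndexReader.open("/path/to/index");
reader.delete(new Term("id", "42"));
reader.close();

// Add the new version; IndexWriter.close() flushes and commits the add.
IndexWriter writer = new IndexWriter("/path/to/index",
                                     new StandardAnalyzer(), false);
writer.addDocument(updatedDoc);
writer.close();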



--- Lucene Users List <lucene-user@jakarta.apache.org> wrote:

 Hi everyone,

 I need to update a document in Lucene. I already know that for that
 I need to do a delete (IndexReader) and then an add (IndexWriter). I
 also know that the deletion means being marked as deleted, until
 optimize().

 My question is, when am I SURE that the mark is committed to disk? I
 mean, suppose that there is a crash while I'm doing a deletion. Could
 it be that when I recover and check Lucene, the item is still there?
 At which point am I 100% sure the deletion is permanent?

 What about adds?

 Thanks,
 Christian




Re: Starts With x and Ends With x Queries

2005-02-07 Thread Luke Shannon
I implemented this concept for my ends with query. It works very well!

- Original Message - 
From: Chris Hostetter [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Friday, February 04, 2005 9:37 PM
Subject: Re: Starts With x and Ends With x Queries



 : Also keep in mind that QueryParser only allows a trailing asterisk,
 : creating a PrefixQuery.  However, if you use a WildcardQuery directly,
 : you can use an asterisk as the starting character (at the risk of
 : performance).

 On the issue of "ends with" wildcard queries, I wanted to throw out an
 idea that I've seen used to deal with matches like this in other systems.
 I've never actually tried this with Lucene, but I've seen it used
 effectively with other systems where the goal is to sort strings by the
 least significant (ie: right most) characters first.  I think it could
 apply nicely to people who have compelling needs for efficient 'ends with'
 queries.



 Imagine you have a field called "name", which you can already do efficient
 prefix matching on using the PrefixQuery class.  Your docs and query may
 look something like this...

   D1  name:"Adam Smith"  age:13  state:CA ...
   D2  name:"Joe Bob"     age:42  state:WA ...
   D3  name:"John Adams"  age:35  state:NV ...
   D4  name:"Sue Smith"   age:33  state:CA ...

 ...and your queries may look something like...

   Query q1 = new PrefixQuery(new Term("name", "J*"));
   Query q2 = new PrefixQuery(new Term("name", "Sue*"));

 If you want to start doing suffix queries (ie: all names ending with
 "s", or all names ending with "Smith") one approach would be to use
 WildcardQuery, which as Erik mentioned, will allow you to use a query Term
 that starts with a "*". ie...

   Query q3 = new WildcardQuery(new Term("name", "*s"));
   Query q4 = new WildcardQuery(new Term("name", "*Smith"));

 (NOTE: Erik says you can do this, but the docs for WildcardQuery say you
 can't.  I'll assume the docs are wrong and Erik is correct.)

 The problem is that this is horrendously inefficient.  In order to find
 the docs that contain Terms which match your suffix, WildcardQuery must
 first identify what all of those Terms are, by iterating over every Term
 in your index to see if they match the suffix.  This is much slower than a
 PrefixQuery, or even a WildcardQuery that has just 1 initial character
 before a "*" (ie: "s*foobar"), because it can then seek directly to the
 first Term that starts with that character, and also stop iterating as
 soon as it encounters a Term that no longer begins with that character.

 Which leads me to my point: if you denormalize your data so that you store
 both the Term you want, and the *reverse* of the term you want, then a
 Suffix query is just a Prefix query on a reversed field -- by sacrificing
 space, you can get all the speed efficiencies of a PrefixQuery when doing
 a SuffixQuery...

   D1  name:"Adam Smith"  rname:"htimS madA"  age:13  state:CA ...
   D2  name:"Joe Bob"     rname:"boB eoJ"     age:42  state:WA ...
   D3  name:"John Adams"  rname:"smadA nhoJ"  age:35  state:NV ...
   D4  name:"Sue Smith"   rname:"htimS euS"   age:33  state:CA ...

   Query q1 = new PrefixQuery(new Term("name", "J*"));
   Query q2 = new PrefixQuery(new Term("name", "Sue*"));
   Query q3 = new PrefixQuery(new Term("rname", "s*"));
   Query q4 = new PrefixQuery(new Term("rname", "htimS*"));


 (If anyone sees a flaw in my theory, please chime in)


 -Hoss
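To round out the sketch, the index-time half might look like this (doc
is an org.apache.lucene.document.Document; note that PrefixQuery itself
takes the bare prefix, without the trailing *):

// Index both the term and its reverse.
String name = "Adam Smith";
String rname = new StringBuffer(name).reverse().toString();  // "htimS madA"
doc.add(Field.Keyword("name", name));
doc.add(Field.Keyword("rname", rname));

// "name ends with Smith" is "rname starts with htimS"
Query endsWithSmith = new PrefixQuery(new Term("rname", "htimS"));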










Re: Document Clustering

2005-02-07 Thread Owen Densmore
I would like to be able to analyze my document collection (~1200 
documents) and discover good buckets of categories for them.  I'm 
pretty sure this is termed Document Clustering .. finding the emergent 
clumps the documents fall naturally into judging from their term 
vectors.

Looking at the discussion that flared roughly a year ago (last message 
2003-11-12) with the subject "Document Clustering", it seems Lucene 
should be able to help with this.  Has anyone had success with this 
recently?

Last year it was suggested Carrot2 could help, and it would even 
produce good labels for the clusters.  Has this proven to be true?  Our 
goal is to use clustering to build a nifty graphic interface, probably 
using Flash.

Thanks for any pointers.
Owen


Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread Aad Nales
If that is the general feeling, then I will plan some time to put this 
into action.

Cheers,
Aad
David Spencer wrote:
Nice, very similar to what I was thinking of, where the most 
significant difference is probably just that I was thinking of a batch 
indexer, not one embedded in a web container. Probably a worthwhile 
contribution to the sandbox.


