Some questions about index...

2005-02-05 Thread Karl Koch
Hi all,

for simplicity reasons I would like to use the index as my data storage
whilst using the advantage of the highly optimised Lucene index structure.

1) Can I store all the information of the text file, but also apply a
analyser. E.g. I use the StopAnalyzer. After finding the document, I want to
extract the original text also from the index. Does this require that I
store the information twice in two different fields (one indexed and one
unindexed) ?

2) I would like to extract information from the index which can found in a
boolean way. I know that Lucene is a VSM which provides Boolean operators.
This however does not change its functioning. For example, I have a field
with contains an ID number and I want to use the search like a database
operatation (e.g. to find the document with id=1). I can solve the problem
by searching with query id:1. However, this does not ensure that I will
only get one result. Usually the first result is the document I want. But it
could happen, that this sometimes does not work. What happens if I should
get no results? I guess if I search for id=5 and 5 did not exist I would
probably get 50, 51, .. just because the contain 5. Did somebody work with
this and can suggest a stable solution?

A good solution for these two questions would help me avoiding a database
which would need to replicate most the data which I already have in my
Lucene index...

Kind Regards,
Karl


-- 
DSL Komplett von GMX +++ Supergünstig und stressfrei einsteigen!
AKTION Kein Einrichtungspreis nutzen: http://www.gmx.net/de/go/dsl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Some questions about index...

2005-02-05 Thread Erik Hatcher
On Feb 5, 2005, at 10:04 AM, Karl Koch wrote:
1) Can I store all the information of the text file, but also apply a
analyser. E.g. I use the StopAnalyzer. After finding the document, I 
want to
extract the original text also from the index. Does this require that I
store the information twice in two different fields (one indexed and 
one
unindexed) ?
You should use a single stored, tokenized, and indexed field for this 
purpose.  Be cautious of how you construct the Field object to achieve 
this.

2) I would like to extract information from the index which can found 
in a
boolean way. I know that Lucene is a VSM which provides Boolean 
operators.
This however does not change its functioning. For example, I have a 
field
with contains an ID number and I want to use the search like a database
operatation (e.g. to find the document with id=1). I can solve the 
problem
by searching with query id:1. However, this does not ensure that I 
will
only get one result. Usually the first result is the document I want. 
But it
could happen, that this sometimes does not work.
Why wouldn't it work?  For ID-type fields, use a Field.Keyword (stored, 
indexed, but not tokenized).  Search for a specific ID using a 
TermQuery (don't use QueryParser for this, please).  If the ID values 
are unique, you'll either get zero or one result.

 What happens if I should
get no results? I guess if I search for id=5 and 5 did not exist I 
would
probably get 50, 51, .. just because the contain 5. Did somebody work 
with
this and can suggest a stable solution?
No, this would not be the case, unless you're analyzing the ID field 
with some strange character-by-character analyzer or doing a wildcard 
*5* type query.

A good solution for these two questions would help me avoiding a 
database
which would need to replicate most the data which I already have in my
Lucene index...
You're on the right track and avoiding a database when it is overkill 
or duplicative is commendable :)

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Some questions about index...

2005-02-05 Thread Karl Koch
Thank you four fast, straight and usefull comments. Keeping in mind what was
said, did anybody actually think about implementing a kind of database layer
on top of a lucene index. A database would be an index, collumns would be
fields and entries documents. At least everything which would only require a
single table could be done. A SELECT would be search ...

:-)

Karl

 
 On Feb 5, 2005, at 10:04 AM, Karl Koch wrote:
  1) Can I store all the information of the text file, but also apply a
  analyser. E.g. I use the StopAnalyzer. After finding the document, I 
  want to
  extract the original text also from the index. Does this require that I
  store the information twice in two different fields (one indexed and 
  one
  unindexed) ?
 
 You should use a single stored, tokenized, and indexed field for this 
 purpose.  Be cautious of how you construct the Field object to achieve 
 this.
 
  2) I would like to extract information from the index which can found 
  in a
  boolean way. I know that Lucene is a VSM which provides Boolean 
  operators.
  This however does not change its functioning. For example, I have a 
  field
  with contains an ID number and I want to use the search like a database
  operatation (e.g. to find the document with id=1). I can solve the 
  problem
  by searching with query id:1. However, this does not ensure that I 
  will
  only get one result. Usually the first result is the document I want. 
  But it
  could happen, that this sometimes does not work.
 
 Why wouldn't it work?  For ID-type fields, use a Field.Keyword (stored, 
 indexed, but not tokenized).  Search for a specific ID using a 
 TermQuery (don't use QueryParser for this, please).  If the ID values 
 are unique, you'll either get zero or one result.
 
   What happens if I should
  get no results? I guess if I search for id=5 and 5 did not exist I 
  would
  probably get 50, 51, .. just because the contain 5. Did somebody work 
  with
  this and can suggest a stable solution?
 
 No, this would not be the case, unless you're analyzing the ID field 
 with some strange character-by-character analyzer or doing a wildcard 
 *5* type query.
 
  A good solution for these two questions would help me avoiding a 
  database
  which would need to replicate most the data which I already have in my
  Lucene index...
 
 You're on the right track and avoiding a database when it is overkill 
 or duplicative is commendable :)
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
Lassen Sie Ihren Gedanken freien Lauf... z.B. per FreeSMS
GMX bietet bis zu 100 FreeSMS/Monat: http://www.gmx.net/de/go/mail

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]