How to fire a query ?

2006-10-09 Thread Bhavin Pandya
Hi guys,

How do I fire a query for "digital camera" when someone searches for "digital cam"?
Do I need to make a manual list of such items and look them up at search time, or
is there a better way to do this?

-Bhavin pandya

Re: lucene link database

2006-10-09 Thread mark harwood
If you search the archive for "database" you'll get a bunch of threads.

This was a hybrid implementation I did which worked with HSQLDB and Derby:

http://www.mail-archive.com/java-user@lucene.apache.org/msg02953.html


Cheers
Mark

- Original Message 
From: Erick Erickson [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Sunday, 8 October, 2006 8:33:59 PM
Subject: Re: lucene link database

A quick word of caution about doc IDs. Lucene assigns a document id at index
time, but that ID is *not* guaranteed to remain the same for a given
document. For instance... you index docs A, B, and C. They get Lucene IDs 1,
2, 3. Then you remove doc B and optimize the index. As I understand it, doc
C will get re-assigned ID 2, and ID 3 won't exist.

In reality, I don't think that the algorithm is quite as simplistic as that,
but that's the idea. So be sure to assign your own unique identifiers that
you add to your docs as a field value.
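
For illustration only (a sketch, not from the original message): indexing an
application-supplied key in a hypothetical "uid" field and later finding the
document by it, so you never depend on Lucene's internal doc ID.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

class UidExample {
    // Index time: store your own stable identifier as an untokenized field.
    static void addWithUid(IndexWriter writer, String uid, String text) throws Exception {
        Document doc = new Document();
        doc.add(new Field("uid", uid, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }

    // Search time: look the document up by your own key, not by Lucene's doc ID.
    static Document findByUid(IndexSearcher searcher, String uid) throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("uid", uid)));
        return hits.length() > 0 ? hits.doc(0) : null;
    }
}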

Others on this list have talked about a hybrid solution. That is, have *both*
lucene and a database, each doing what they do best. It's more complicated,
especially keeping the two in sync. Some tools have been mentioned; I think
if you search the archive for "database" you'll get a bunch of threads. But I
thought I'd mention it.

Best of luck
Erick

On 10/8/06, Cam Bazz [EMAIL PROTECTED] wrote:

 Dear Erick;

 Thank you for your detailed insight. I have been trying to code a graph
 object database for some time. I have prototyped on relational as well as
 object-oriented databases, including open-source and commercial
 implementations (so far I have tried Hibernate, Objectivity/DB, and db4o).
 While object databases excel at traversing links, they are poor at searching.

 Lucene so far solves the problem of searching. I am thinking of a document
 as a list of tuples (a sequence of fields), and I can do searches with
 Lucene; it is really nice.

 now I have to solve the problem of linking. if I keep the nodes with a
 lucene index, and I can fetch documents with a doc_id, or some sort of
 surrogate identifier, and
 use those identifiers as node_id in an object graph, that will be what I
 want. but in order to do that I need to be able to query the lucene
 index by document_id.

 I was referring to the link db of Nutch. They do have some sort of link db
 implementation that runs with Hadoop, but I have not understood the full
 code. I am trying to understand the structure of this link database. I was
 thinking of using documents with "src" and "dst" fields that have document
 ids as values. (One idea; I will try it tomorrow.)

 Again thanks a bunch.

 Best Regards,
 C.B.

 Erick Erickson wrote:
  Approach it in whatever way you want, as long as it solves your problem
  G.
 
  My first question is why use lucene? Would a database suit your needs
  better? Of course, I can't say. Lucene shines at full-text searching, so
  it's a closer call if you aren't searching on parts of text. By that I
  mean
  that if you're not searching on *parts* of your links, you may want to
  consider a DB solution.
 
  That said, and if I understand your requirement, you have a pretty simple
  design. Each document has two fields, "incominglinks" and "outgoinglinks".
  But see the note below. Lucene indexes what you give it, so the fact that
  some of the links aren't hypertext links is immaterial to Lucene. Since you
  control both the indexer and searcher, these conform to whatever your
  requirements are. It's up to you to map semantics onto these entities.
 
  One common trap DB-savvy people have is that they think of documents as
  entries in a table, all with the same fields. There is nothing
  requiring you
  to have the *same* fields in each document in an index. You could have
 an
  index for which no two documents shared *any* common field if you
 choose.
 
  So if you want to find out, say, which documents have link X as an incoming
  link, just search on incominglinks:X. If you want to find the documents
  whose incoming links X, Y, or Z match an outgoing link in another document,
  just search the OR of these in outgoinglinks.
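
  For illustration only (a sketch using the field names above, not from the
  original message), those two searches might look like:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  class LinkQueries {
      // Which documents have the given link as an incoming link?
      static Query incoming(String link) {
          return new TermQuery(new Term("incominglinks", link));
      }

      // Which documents have any of the given links as an outgoing link (the OR case)?
      static Query anyOutgoing(String[] links) {
          BooleanQuery q = new BooleanQuery();
          for (int i = 0; i < links.length; i++) {
              q.add(new TermQuery(new Term("outgoinglinks", links[i])), BooleanClause.Occur.SHOULD);
          }
          return q;
      }
  }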
 
  If you want some kind of map of the whole web of links, you'll have to
  write some iterative loop and keep track. There's nothing built in that I
  know of that lets you answer "Given link X, show me all the documents no
  more than 3 hops away". Lucene is an *engine*, designed to have apps built
  on top of it. Lucene doesn't deal with relations between documents, just
  with searching what you've indexed.
 
  It's easy enough to store a variable number of links in your
  incominglinks
  or outgoinglinks field. Just be sure they're tokenized appropriately.
 You
  can add them any way you choose, either concatenate them all into a big
  string and index that, or index them into the same field, e.g.
  Document doc = new Document();
  doc.add(new Field("incoming", link1, Field.Store.YES, Field.Index.TOKENIZED));
  doc.add(new Field("incoming", link2, Field.Store.YES, Field.Index.TOKENIZED));
  ...
  writer.addDocument(doc);
 
  According to a 

Incremental updates / slow searches.

2006-10-09 Thread Rickard Bäckman

Hi,

we are using a search system based on Lucene and have recently tried to add
incremental updating of the index instead of building a new index every now
and then. However, we now run into problems as our searches start to take a
very long time to complete.

Our index is about 8-9 GB and we are sending lots of updates per second
(we are probably merging in 200-300 within a few seconds). Today we buffer a
bunch of updates and then merge them into the existing index as a batch,
first doing the deletes and then the inserts.

We are currently not using any special tuning of Lucene.

Does anyone have similar experiences with Lucene, or advice on how to
reduce the time it takes to perform a search? In particular, what
would be an optimal combination of update size, merge factor, and max
buffered docs?

/Rickard
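
For readers unfamiliar with those settings, the knobs in question live on
IndexWriter. A rough sketch only (not Rickard's actual configuration; the path
and values are made up, and the right values depend entirely on the workload):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

class TuningSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        writer.setMergeFactor(10);       // how many segments accumulate before a merge (default 10)
        writer.setMaxBufferedDocs(100);  // how many added docs are buffered in RAM before a flush (default 10)
        // ... addDocument() the batch of updates here ...
        writer.close();
    }
}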


Re: Performing a like query

2006-10-09 Thread Rahil

Hi Steve

Thanks for your response. I was just wondering whether there is a
difference between the regular expression you sent me, i.e.

(i)   \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

   and
  
(ii)   \\b


as they lead to the same output. For example, searching the string "testing
a-new string=3/4" results in the same output:

Item is :
Item is : testing
Item is : 
Item is : a

Item is : -
Item is : new
Item is : 
Item is : string

Item is : =
Item is : 3
Item is : /
Item is : 4

What I'd like to do though is remove the splits over space characters, so
that the output is:

Item is : testing
Item is : a
Item is : -
Item is : new
Item is : string
Item is : =
Item is : 3
Item is : /
Item is : 4

I'm not great at regular expressions, so I would really appreciate it if you
could provide me with some insight into expression (i).


Thanks for all your help
Rahil

Steven Rowe wrote:


Hi Rahil,

Rahil wrote:
 

I couldn't figure out a valid regular expression for
Pattern.compile(String regex) which can tokenise a string like "O/E -
visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual", "acuity",
"R", "-", "eye", "=", "6", "/", "24".
   



The following regular expression should match boundaries between word
and non-word, or between space and non-space, in either order, and
includes contiguous whitespace:

  \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

Note that with the above regex, the "(%$#!)" in "some (%$#!) text" will
be tokenized as a single token.

Hope it helps,
Steve

 


Erick Erickson wrote:

   


Well, I'm not the greatest expert, but a quick look doesn't show me
anything
obvious. But I have to ask, wouldn't WhiteSpaceAnalyzer work for you?
Although I don't remember whether WhiteSpaceAnalyzer lowercases or not.

It sure looks like you're getting reasonable results given how you're
tokenizing.

If not that, you might want to think about PatternAnalyzer. It's in the
memory contribution section, see import
org.apache.lucene.index.memory.PatternAnalyzer. One note of caution, the
regex identifies what is NOT a token, rather than what is. This threw
me for
a bit.

I still claim that you could break the tokens up like "6", "/", "12", and
make SpanNearQuery work with a span of 0 (or 1, I don't remember right now),
but that may well be more trouble than it's worth; it's up to you of course.
What you get out of this is, essentially, a query that's only satisfied
if the terms you specify are right next to each other. So you'd find both
documents in your example, since you would have tokenized "6", "/", "12"
in, say, positions 0, 1, 2 in doc1 and 4, 5, 6 in the second doc. But
since they're tokens that are next to each other in each doc, searching with
a SpanNearQuery for "6", "/", and "12" that are right next to each other,
which you specify with a slop of 0 as I remember, should get you both.

Alternatively, if you tokenize it this way, a PhraseQuery might work as
well. Thus, searching for "6 / 12" (as a phrase query, and note the spaces)
might be just what you want. You'd have to tokenize the query, but that's
relatively easy. This is probably much simpler than a SpanNearQuery
now that
I think about it.

Be aware that if you use the *TermEnums we've been talking about, you'll
probably wind up wrapping them in a ConstantScoreQuery. And if you
have no
*other* terms, you won't get any relevancy out of your search. This
may be
important.

Anyway, that's as creative as I can be Sunday night G. Best of luck

Erick

On 10/1/06, Rahil [EMAIL PROTECTED] wrote:

 


Hi Erick

Thanks for your response. There's a lot to chew on in your reply and I'm
looking at the suggestions you've made.

Yeah, I have Luke installed and have queried my index, but there isn't any
great explanation I'm getting out of it.  A query for "6/12" is sent as
TERM:6/12, which is quite straight-forward. I did an explanation of the
query in my code and got some more information, but that wasn't of much
help either.
--
Explanation explain = searcher.explain(query,0);

OUTPUT:
query: +TERM:6/12
explain.getDescription() : weight(TERM:6/12 in 0), product of:
Detail 0 : 0.9994 = queryWeight(TERM:6/12), product of:
 2.0986123 = idf(docFreq=1)
 0.47650534 = queryNorm

Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
 0.0 = tf(termFreq(TERM:6/12)=0)
 2.0986123 = idf(docFreq=1)
 0.5 = fieldNorm(field=TERM, doc=0)

Number of results returned: 1
SampleLucene.displayIndexResults
SCOREDESCRIPTIONSTATUSCONCEPTIDTERM
1.002602780076/12 (finding)
--

My tokeniser, called BaseAnalyzer, extends Analyzer. Since I wanted to
retain all non-whitespace characters and not just letters and digits, I
introduced the following block of code in the overridden tokenStream():

--
public TokenStream tokenStream(String fieldName, Reader reader) {

    return new CharTokenizer(reader) {

        protected char normalize(char c) {
            return Character.toLowerCase(c);
        }
        protected boolean isTokenChar(char c) {
            // assumed completion (the original message was cut off here):
            // keep every non-whitespace character as part of a token
            return !Character.isWhitespace(c);
        }
    };
}

Re: highlight optimization

2006-10-09 Thread Erick Erickson

The fastest way to see if opening/closing your searcher is a problem would
be to write a tiny little program that opens the index, fires off a few
queries, and times each one. The queries can be canned, of course. I'm
thinking this is, say, less than 20 lines (including imports). If you're
familiar with junit, think of it in those terms.
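
A rough sketch of such a timing program (the index path, field name, and
canned queries are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class TimeQueries {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // opened once, reused
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        String[] canned = { "lucene", "index merge", "highlighter fragment" };
        for (int i = 0; i < canned.length; i++) {
            long start = System.currentTimeMillis();
            Hits hits = searcher.search(parser.parse(canned[i]));
            System.out.println(canned[i] + ": " + hits.length() + " hits in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
        searcher.close();
    }
}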

Once you've proved that is a bottleneck you want to work on, you probably
need a search server. We've used XmlRpc, which has server code built-in, and
has worked like a charm for us. It'll add a bit of complexity, but it'll
keep your searchers open.

That said, I suspect that someone will chime in with another solution, since
this is already implemented G

Erick

On 10/9/06, Stelios Eliakis [EMAIL PROTECTED] wrote:


Hi,
I have a collection of 500 txt documents and I have implemented a web
application (JSP) for searching them.
In addition, the application shows the BestFragment of each result and
highlights the query terms.
My application is quite slow (about 2.5-3 seconds per query) even if I
run it from my own computer (it's not published yet).
Do you suggest anything to improve speed?
I have read that you should keep the IndexSearcher open. Is that right? And
how could I do that (and when must I close it)?

For highlighting I use the following:
 String result = highlighter.getBestFragment(tokenStream, text);

The text parameter must be a String, so I open the document and convert it
to a String. Of course this is time consuming. Is there a different way?

Thanks in advance,

--
Stelios Eliakis




How to search with empty content

2006-10-09 Thread Kumar, Samala Santhosh (TPKM)
 
I want to search without giving any input: when I search leaving the
search text box blank, it should give me all the documents present in the
index. Please give me some solutions or pointers.

regards
Santhosh


 

 


 

 


Re: Performing a like query

2006-10-09 Thread Steven Rowe
Hi Rahil,

Rahil wrote:
 I was just wondering whether there is a
 difference between the regular expression you sent me, i.e.
 (i)   \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

 and
 (ii)   \\b

 as they lead to the same output. For example, searching the string "testing
 a-new string=3/4" results in the same output: [...]

There is a difference for strings like "testing a- -new string=3/4" --
with (ii), you will get:

   ..., "a", "- -", "new", ...

but with (i), you will get:

   ..., "a", "-", "-", "new", ...

 What I'd like to do though is remove the splits over space characters [...]

From my reading of org.apache.lucene.index.memory.PatternAnalyzer
(assuming you're using this class), I don't think this is necessary,
since it just throws away zero-length tokens.  Actually, given the
below-discussed algorithm for PatternAnalyzer, I don't think it's even
possible to do what you want.

Here's the PatternAnalyzer.next() method definition (from
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/PatternAnalyzer.java?revision=450725&view=markup):

public Token next() {
  if (matcher == null) return null;

  while (true) { // loop takes care of leading and trailing boundary cases
int start = pos;
int end;
boolean isMatch = matcher.find();
if (isMatch) {
  end = matcher.start();
  pos = matcher.end();
} else {
  end = str.length();
  matcher = null; // we're finished
}

if (start != end) { // non-empty match (header/trailer)
  String text = str.substring(start, end);
  if (toLowerCase) text = text.toLowerCase(locale);
  return new Token(text, start, end);
}
if (!isMatch) return null;
  }
}

This method finds token breakpoints, remembering the end of the previous
breakpoint (in the instance field pos), then compares the beginning of the
current breakpoint with the end of the previous one (if (start != end)),
creating a Token *only* if the text between breakpoints is longer than
zero length.

If you're familiar with Perl, this class emulates a Perl regex idiom:

(iii) @tokens = grep { length > 0 } split /my-regex/, $text;

That is, return a list of tokens generated by breaking text on a regex,
filtering out zero-length tokens.

Actually, the way I usually write this in Perl is:

(iv) @tokens = grep { /\S/ } split /my-regex/, $text;

In the above version, tokens are kept only if they contain at least one
non-space character (this also filters out zero-length tokens).
PatternAnalyzer, OTOH, *will* emit whitespace-only tokens - it
implements (iii), not (iv).
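
For what it's worth, here is a small Java sketch (mine, not part of
PatternAnalyzer) of idiom (iv): split on the boundary regex and keep only
tokens that contain at least one non-space character:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitAndFilter {
    static List tokenize(String text) {
        Pattern boundary =
            Pattern.compile("\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*");
        List tokens = new ArrayList();
        String[] parts = boundary.split(text);
        for (int i = 0; i < parts.length; i++) {
            if (parts[i].trim().length() > 0) {  // drop empty and whitespace-only tokens
                tokens.add(parts[i]);
            }
        }
        return tokens;
    }
}
// tokenize("testing a-new string=3/4") -> [testing, a, -, new, string, =, 3, /, 4]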

Hope it helps,
Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to search with empty content

2006-10-09 Thread Scott

You can get all documents by using MatchAllDocsQuery.
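
For example (a sketch; the index path is made up, and in a web app you would
typically substitute this query whenever the search box is empty):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

class MatchAllExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(new MatchAllDocsQuery()); // every document in the index
        System.out.println("total docs: " + hits.length());
        searcher.close();
    }
}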

Kumar, Samala Santhosh (TPKM) wrote:
 
I want to search without giving any input, when I search leaving blank

the search text box it should give me all the documents present in the
index. please give me some solution or pointers. 


regards
Santhosh


 

 



 

 



--
Scott

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TermQuery and PhraseQuery..problem with word with space

2006-10-09 Thread Ismail Siddiqui

I am using StandardAnalyzer while indexing the field.
I am also creating a field called full_text to which I am adding all
these individual fields as TOKENIZED.


here is the code

while (choiceIt.hasNext()) {
    PersonProfileAnswer pa = (PersonProfileAnswer) choiceIt.next();
    if (pa.getPersonProfileChoice() != null) {
        doc.add(new Field(FULL_TEXT,
            pa.getPersonProfileChoice().getChoice(),
            Field.Store.NO, Field.Index.TOKENIZED));

        LuceneProfileQuestion lpf = this.getLuceneProfileQuestion(
            pa.getPersonProfileChoice().getPersonProfileQuestion().getId());

        doc.add(new Field(lpf.getLuceneFieldName(),
            pa.getPersonProfileChoice().getChoice(),
            Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
}

When I use Luke I can see the term is there, e.g. for a Lucene field
called fav_stores the UN_TOKENIZED terms "Ann Taylor" and "Banana Republic"
are there.



If I make a search on full_text and type "banana" or "republic" or
"banana republic" I get the document as a result.  In my Java class I am
using a PhraseQuery for full_text and a TermQuery for each individual field.

e.g. TermQuery subjectQuery = new TermQuery(new Term("fav_stores", favStores));


In Luke I do not see any option to select the query type, but when I make a
search on fav_stores with the term "Banana Republic" there is no result.


On 10/9/06, Doron Cohen [EMAIL PROTECTED] wrote:


 I am trying to index a field which has more than one word with a space,
 e.g. "My Word".
 I am indexing it UN_TOKENIZED, but when I use TermQuery to query "My Word"
 it's not yielding any result.

Seems that it should work.

Few things to check:
- make sure you are indexing with UN_TOKENIZED.
- check that either both field and query text are lower-cased or both are
not lower-cased.
- use Luke to examine the content of the index (when adding as
un-tokenized);
print the query (toString);
- do they match each other? match your expectation?


 Is TermQuery limited to one word? I mean, if we index a word with a space
 and index it UN_TOKENIZED,
 shouldn't TermQuery yield a result for "My Word"?


 Ismail

There is no such limitation.

Hope this helps,
Doron



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




deleteDocuments being ignored

2006-10-09 Thread cfowler
Hello,

I'm brand new to this, so hopefully you can help me. I'm
attempting to use the IndexReader object in Lucene v2 to delete and re-add
documents. I very easily set up an index and my documents are added. Now
I'm trying to update the same index by deleting each document before
re-adding it. The problem is that it appears that my deleteDocument()
instruction is being ignored. I've tried using the IndexModifier object
and the IndexReader, and both have the same behavior. If anyone can point
out my error, or help me debug this, I'll be forever in your debt. Here is
the gist of the code.

This is the main section:
IndexWriter writer = new IndexWriter(indexDir,new 
StandardAnalyzer(), false);
writer.setUseCompoundFile(false);
indexDirectory(writer, dataDir);
int numIndexed = writer.docCount();
writer.optimize();
writer.close();

Down at the point just before readding my document I have the following 
code (i know batch is better, just doing it this for now):
IndexReader ir = IndexReader.open(indexDir);
System.out.println("" + ir.numDocs());
ir.delete(new Term("filename", f.getAbsolutePath()));
System.out.println("deletes? " + ir.hasDeletions());
ir.close();
if (deleted > 0) {
    System.out.println("deleted old index of " + f.getAbsolutePath());
}
System.out.println("Indexing " + f.getAbsolutePath());
Document doc = new Document();
doc.add(new Field("contents", loadContents(doc), Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("filename", f.getAbsolutePath(), Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);

Thanks,

Chris

Re: deleteDocuments being ignored

2006-10-09 Thread cfowler
My apologies, the IndexReader code I included was a commented out trial. 
Here is the active version. Sorry for the error:

IndexReader ir = IndexReader.open(indexDir);
System.out.println("" + ir.numDocs());
int deleted = ir.deleteDocuments(new Term("filename", f.getAbsolutePath()));
System.out.println("deletes? " + ir.hasDeletions());
ir.close();

if (deleted > 0) {
    System.out.println("deleted old index of " + f.getAbsolutePath());
}

Re: deleteDocuments being ignored

2006-10-09 Thread Simon Willnauer

System.out.println(Indexing  + f.getAbsolutePath());
Document doc = new Document();
doc.add(new Field(contents,loadContents
(doc),Field.Store.NO,Field.Index.TOKENIZED));
doc.add(new Field(filename,
f.getAbsolutePath(),Field.Store.YES,Field.Index.TOKENIZED));
writer.addDocument(doc);



Hi Chris,

Do you open the writer to add your update before you close the
IndexReader that deletes the outdated document?
Another question comes to mind: do you open a new IndexReader for your
searches after the update has been written to the index?

You have to follow these steps:
1. Add your document
2. close writer
3. open reader
4. delete the outdated stuff
5. close reader
6. open writer
7. add the update
8. close writer
9. release new searcher / reader
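
A rough sketch of steps 3-8 (illustration only; it assumes a "filename" key
field as in the snippets above):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class UpdateSketch {
    static void update(String indexDir, String filename, Document newDoc) throws Exception {
        // 3-5: open a reader, delete the outdated document, close the reader
        IndexReader reader = IndexReader.open(indexDir);
        reader.deleteDocuments(new Term("filename", filename));
        reader.close();

        // 6-8: open a writer (create=false so the index is not wiped), add the update, close it
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(newDoc);
        writer.close();
        // 9: afterwards, open a fresh IndexReader/IndexSearcher so searches see the change
    }
}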

hope that gives you a little help.

best regards simon


Re: TermQuery and PhraseQuery..problem with word with space

2006-10-09 Thread Erick Erickson

OK, when you look in the fav_stores field in Luke, what do you see?

And, are you searching on Banana Republic with the capitals? If so, and
your index has the letters in lower case, that's your problem.

Erick

On 10/9/06, Ismail Siddiqui [EMAIL PROTECTED] wrote:


I am using StandardAnalyzer while indexing the field..
I  am also a creatign a field called full_text in which i am adding all
these individual  fields as TOKENIZED.


here is the code

while(choiceIt.hasNext()){
  PersonProfileAnswer pa=(PersonProfileAnswer)choiceIt.next();
if(pa.getPersonProfileChoice()!=null)
{
doc.add(new Field(FULL_TEXT,
pa.getPersonProfileChoice().getChoice(),Field.Store.NO,
Field.Index.TOKENIZED
));
 LuceneProfileQuestion lpf=this.getLuceneProfileQuestion(
pa.getPersonProfileChoice().getPersonProfileQuestion().getId());

  doc.add(new Field(lpf.getLuceneFieldName(),
pa.getPersonProfileChoice().getChoice(),Field.Store.NO,
Field.Index.UN_TOKENIZED));

}
 }

when i use luke i can see the term is there.. e.g.  for a lucence field
called fav_stores UN_TOKENIZED terms Ann Taylor and Banana Republic
are there..



If i make a search on full_text.. and type banana or republic or
banana republic i get the doucment as result..  In my java class i am
using phrasequery for full_text and termquery for each individual filed..

e.g. TermQuery subjectQuery=new TermQuery(new
Term(fav_stores,favStores));


In luke i do not  see any option to select query type but when I make
search
on fav_stores with term Banana Republic  there is no result.


On 10/9/06, Doron Cohen [EMAIL PROTECTED] wrote:

  I am trying to index a field which has more than one word with space
e.g
 .
  My Word
  i am indexng it UN_TOKENIZED .. but when i use TermQuery to query My
 Word
  its not yielding any result..

 Seems that it should work.

 Few things to check:
 - make sure you are indexing with UN_TOKENIZED.
 - check that either both field and query text are lower-cased or both
are
 not lower-cased.
 - use Luke to examine the content of the index (when adding as
 un-tokenized);
 print the query (toString);
 - do they match each other? match your expectation?

 
  Is term qurey limited to one word? i mean if we index a word with
space
 and
  index it UN_TOKENIZED..
  shouldnt TermQuery yeild result to My Word.
 
 
  Ismail

 There is no such limitation.

 Hope this helps,
 Doron



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]






Re: TermQuery and PhraseQuery..problem with word with space

2006-10-09 Thread Doron Cohen
I would guess that one of your assumptions is wrong...
The assumptions to check are:

At indexing:
- lpf.getLuceneFieldName() == "fav_stores"
- pa.getPersonProfileChoice().getChoice() == "Banana Republic"

At search:
- the query is created like this:
   new TermQuery(new Term("fav_stores", "Banana Republic"))
- the searcher is opened after closing the writer that added that doc.

Best to check this by writing a tiny stand-alone program that demonstrates
this behavior.
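
Something like this little stand-alone check (a sketch only; it uses a
RAMDirectory so it needs no files on disk):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class UntokenizedCheck {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("fav_stores", "Banana Republic", Field.Store.NO, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.close();                      // close the writer before opening the searcher

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("fav_stores", "Banana Republic")));
        System.out.println("hits: " + hits.length());   // expect 1
        searcher.close();
    }
}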

Ismail Siddiqui [EMAIL PROTECTED] wrote on 09/10/2006 08:59:39:
 I am using StandardAnalyzer while indexing the field..
 I  am also a creatign a field called full_text in which i am adding all
 these individual  fields as TOKENIZED.


 here is the code

 while(choiceIt.hasNext()){
   PersonProfileAnswer pa=(PersonProfileAnswer)choiceIt.next();
 if(pa.getPersonProfileChoice()!=null)
 {
 doc.add(new Field(FULL_TEXT,

pa.getPersonProfileChoice().getChoice(),Field.Store.NO,Field.Index.TOKENIZED

 ));
  LuceneProfileQuestion lpf=this.getLuceneProfileQuestion(
 pa.getPersonProfileChoice().getPersonProfileQuestion().getId());

   doc.add(new Field(lpf.getLuceneFieldName(),
 pa.getPersonProfileChoice().getChoice(),Field.Store.NO,
 Field.Index.UN_TOKENIZED));

 }
  }

 when i use luke i can see the term is there.. e.g.  for a lucence field
 called fav_stores UN_TOKENIZED terms Ann Taylor and Banana Republic
 are there..



 If i make a search on full_text.. and type banana or republic or
 banana republic i get the doucment as result..  In my java class i am
 using phrasequery for full_text and termquery for each individual filed..

 e.g. TermQuery subjectQuery=new TermQuery(new
Term(fav_stores,favStores));


 In luke i do not  see any option to select query type but when I make
search
 on fav_stores with term Banana Republic  there is no result.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TermQuery and PhraseQuery..problem with word with space

2006-10-09 Thread Ismail Siddiqui

In fav_stores I see "Banana Republic" and "Ann Taylor" there, and I am
searching with the capitals.


On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote:


OK, when you look in the fav_stores field in Luke, what do you see?

And, are you searching on Banana Republic with the capitals? If so, and
your index has the letters in lower case, that's your problem.

Erick



Re: Incremental updates / slow searches.

2006-10-09 Thread Yonik Seeley

The biggest thing would be to limit how often you open a new
IndexSearcher, and when you do, warm up the new searcher in the
background while you continue serving searches with the existing
searcher.  This is the strategy that Solr uses.

There is also the question of whether you are analyzing/merging docs on the
same servers that you are executing searches on.  You can use a
separate box to build the index and distribute changes to the boxes used
for searching.
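
A very rough sketch of the swap-after-warm idea (not Solr's code; the field
name and warm-up query are made up, and real code needs reference counting
before closing the old searcher):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

class SearcherHolder {
    private volatile IndexSearcher current;

    IndexSearcher get() { return current; }    // serve queries from the current searcher

    // Called after a batch of updates has been committed to the index.
    void reopen(String indexDir) throws Exception {
        IndexSearcher fresh = new IndexSearcher(indexDir);
        fresh.search(new TermQuery(new Term("contents", "warmup")));  // warm it up first
        IndexSearcher old = current;
        current = fresh;                       // swap in the warmed searcher
        if (old != null) old.close();          // unsafe if a request still holds 'old'; see note above
    }
}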

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 10/9/06, Rickard Bäckman [EMAIL PROTECTED] wrote:

Hi,

we are using a search system based on Lucene and have recently tried to add
incremental updating of the index instead of building a new index every now
and then. However we now run into problems as our searches starts to take
very long time to complete.

Our index is about 8-9GB large and we are sending lots of updates / second
(we are probably merging in 200 - 300 in a few seconds). Today we buffer a
bunch of updates and then merge them into the existing index like a batch,
first doing deletes and then inserts.

We are currently not using any special tuning of Lucene.

Does anyone have any similiar experiences from Lucene or advices on how to
reduce the amount of times it takes to perform a search? In particular what
would be an optimal combination of update size, merge factor, max buffered
docs?

/Rickard




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: threadsafe QueryParser?

2006-10-09 Thread Yonik Seeley

On 10/9/06, Stanislav Jordanov [EMAIL PROTECTED] wrote:

The method
static public Query parse(String query, String field, Analyzer analyzer)
in class QueryParser is deprecated in 1.9.1 and the suggestion is: "Use
an instance of QueryParser and the {@link #parse(String)} method
instead."
My question is: in the context of multi threaded app, is it safe that
distinct threads utilize the same instance of QueryParser for parsing
their queries?

ps. After writing this letter, I incidentally ran into the answer at the
end of the class comment of QueryParser:
  <p>Note that QueryParser is <em>not</em> thread-safe.</p>

So, is this it?


Yes.  A single QueryParser object should not be used from multiple threads.
It's unclear why one would want to do so anyway.
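
In other words, the cheap, safe pattern is simply to construct a QueryParser
per parse (or per thread); a sketch, with an assumed "contents" default field:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

class ParseHelper {
    static Query parse(String userQuery) throws Exception {
        // QueryParser construction is cheap; avoid sharing one instance across threads.
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        return parser.parse(userQuery);
    }
}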

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene searching algorithm

2006-10-09 Thread Grant Ingersoll

Hi Michael,

I think there are a number of good resources on this:

1.  http://lucene.apache.org/java/scoring.html covers the basics of  
searching.  The bottom has some pseudo code as well.


2. Lucene In Action

3.  Search this list and other places for information on the Vector  
Space Model.  The Wiki also has a number of links, etc. that may  
prove useful, including a variety of talks and articles.


4.  Last of all, and probably best of all, the code!  Have a look at  
how TermQuery and BooleanQuery work, as well as the Searchers, etc.


Hope this helps,
Grant

On Oct 8, 2006, at 6:57 AM, Michael Chan wrote:


Hi,

Does anyone know where I can find descriptions of Lucene's searching
algorithm, besides the lecture at University of Pisa 2004? Has it been
published? I'm trying to find a reference to the algorithm.

Thanks,
Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: wildcard and span queries

2006-10-09 Thread Erick Erickson

OK, I'm using the surround code, and it seems to be working... with the
following questions (always, more questions)...

I'm getting an exception sometimes of TooManyBasicQueries. I can control
this by initializing BasicQueryFactory with a larger number. Do you have any
cautions about upping this number?


There's a hard-coded value minimumPrefixLength set to 3 down in the
Surround query parser code (allowedSuffix). I see no method to change this. I
assume that this is to prevent using up too much memory/time. What should I
know about this value? I'm mostly interested in a justification for the
product manager for why allowing, say, two-character (or one-character)
prefixes is a bad idea G.


I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal to
Surround queries. That is, trying RegexSpanQuery doesn't want to work at all
with the same search clause, as it runs out of memory pretty quickly...

However, working with three-letter prefixes is blazingly fast.

Thanks again...

Erick

On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote:


Mark,

On Friday 06 October 2006 22:46, Mark Miller wrote:
 Paul's parser is beyond my feeble comprehension...but I would start by
 looking at SrndTruncQuery. It looks to me like this enumerates each
 possible match just like a SpanRegexQuery does...I am too lazy to figure
 out what the visitor pattern is doing so I don't know if they then get
 added to a boolean query, but I don't know what else would happen. If

They can also be added to a SpanOrQuery as SpanTermQuery,
this depends on the context of the query (distance query or not).
The visitor pattern is used to have the same code for distance queries
and other queries as far as possible.

 this is the case, I am wondering if it is any more efficient than the
 SpanRegex implementation...which could be changed to a SpanWildcard

I don't think the surround implementation of expanding terms is more
efficient that the Lucene implementation.
Surround does have the functionality of a SpanWildCard, but
the implementation of the expansion is shared, see above.

 implementation. How exactly is this better at avoiding a toomanyclauses
 exception or ram fillup. Is it just the fact that the (lets say) three
 wildcard terms are anded so this should dramatically reduce the matches?

The limitation in BasicQueryFactory works for a complete surround query,
which can be nested.
In Lucene only the max nr of clauses for a single level BooleanQuery
can be controlled.

...

Regards,
Paul Elschot


 - Mark

 Erick Erickson wrote:
  Paul:
 
  Splendid! Now if I just understood a single thing about the SrndQuery
  family
  G.
 
  I followed your link, and took a look at the text file. That should
  give me
  enough to get started.
 
  But if you wanted to e-mail me any sample code or long explanations of
  what
  this all does, I would forever be your lackey G
 
  I should also fairly easily be able to run a few of these against the
  partial index I already have to get some sense of now it'll all work
  out in
  my problem space. I suspect that the actual number of distinct terms
  won't
  grow too much after the first 4,000 books, so it'll probably be pretty
  safe
  to get this running in the worst case, find out if/where things blow
  up,
  and put in some safeguards. Or perhaps discover that it's completely
and
  entirely perfect G.
 
  Thanks again
  Erick
 
  On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote:
 
  On Friday 06 October 2006 14:37, Erick Erickson wrote:
  ...
   Fortunately, the PM agrees that it's silly to think about span
queries
   involving OR or NOT for this app. So I'm left with something like
Jo*n
  AND
   sm*th AND jon?es WITHIN 6.
 
  OR works much the same as term expansion for wildcards.
 
   The only approach that's occurred to me is to create a filter on
  for the
   terms, giving me a subset of my docs that have any terms satisfying
  the
   above. For each doc in the filter, get creative with
  TermPositionVector
  for
   determining whether the document matches. It seems that this would
  involve
   creating a list of all positions in each doc in my filter that
match
  jo*n,
   another for sm*th, and another for jon?es and seeing if the
distance
   (however I define that) between any triple of terms (one from each
  list)
  is
   less than 6.
 
   My gut feel is that this explodes time-wise based upon the number
of
  terms
   that match. In this particular application, we are indexing 20K
books.
  Based
   on indexing 4K of them, this amounts to about a 4G index (although
I
   acutally expect this to be somewhat larger since I haven't indexed
all
  the
   fields, just the text so far). I can't imagine that comparing the
  expanded
   terms for, say, 10,000 docs will be fast. I'm putting together an
  experiment
   to test this though.
  
   But someone could save me a lot of work by telling me that this is
  solved
   already. This is your chance G..
 
  It's solved :) here:
 

Re: wildcard and span queries

2006-10-09 Thread Erick Erickson

OK, forget the stuff about TooManyBooleanClauses. I finally figured out
that if I specify the surround to have the same semantics as a SpanRegex (
i.e, and(eri*, mal*)) it blows up with TooManyBooleanClauses. So that makes
more sense to me now.

Specifying 20w(eri*, mal*) is what I was using before.

Erick

On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote:


OK, I'm using the surround code, and it seems to be working...with the
following questions (always, more questions)...

 I'm gettng an exception sometimes of TooManyBasicQueries. I can control
this by initializing BasicQueryFactory with a larger number. Do you have any
cautions about upping this number?

 There's a hard-coded value minimumPrefixLength set to 3 down in the code
Surround query parser (allowedSuffix). I see no method to change this. I
assume that this is to prevent using up too much memory/time. What should I
know about this value? I'm mostly interested in a justification for the
product manager why allowing, say, two character (or one character) prefixes
is a bad idea G.

 I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal
to Surround queries. That is, trying RegexSpanQuery doesn't want to work at
all with the same search clause, as it runs out of memory pretty
quickly..

However, working with three-letter prefixes is blazingly fast.

Thanks again...

Erick

On 10/6/06, Paul Elschot  [EMAIL PROTECTED] wrote:

 Mark,

 On Friday 06 October 2006 22:46, Mark Miller wrote:
  Paul's parser is beyond my feeble comprehension...but I would start by
  looking at SrndTruncQuery. It looks to me like this enumerates each
  possible match just like a SpanRegexQuery does...I am too lazy to
 figure
  out what the visitor pattern is doing so I don't know if they then get
  added to a boolean query, but I don't know what else would happen. If

 They can also be added to a SpanOrQuery as SpanTermQuery,
 this depends on the context of the query (distance query or not).
 The visitor pattern is used to have the same code for distance queries
 and other queries as far as possible.

  this is the case, I am wondering if it is any more efficient than the
  SpanRegex implementation...which could be changed to a SpanWildcard

 I don't think the surround implementation of expanding terms is more
 efficient that the Lucene implementation.
 Surround does have the functionality of a SpanWildCard, but
 the implementation of the expansion is shared, see above.

  implementation. How exactly is this better at avoiding a
 toomanyclauses
  exception or ram fillup. Is it just the fact that the (lets say) three

  wildcard terms are anded so this should dramatically reduce the
 matches?

 The limitation in BasicQueryFactory works for a complete surround query,
 which can be nested.
 In Lucene only the max nr of clauses for a single level BooleanQuery
 can be controlled.

 ...

 Regards,
 Paul Elschot


  - Mark
 
  Erick Erickson wrote:
   Paul:
  
   Splendid! Now if I just understood a single thing about the
 SrndQuery
   family
   G.
  
   I followed your link, and took a look at the text file. That should
   give me
   enough to get started.
  
   But if you wanted to e-mail me any sample code or long explanations
 of
   what
   this all does, I would forever be your lackey G
  
   I should also fairly easily be able to run a few of these against
 the
   partial index I already have to get some sense of now it'll all work

   out in
   my problem space. I suspect that the actual number of distinct terms
   won't
   grow too much after the first 4,000 books, so it'll probably be
 pretty
   safe
   to get this running in the worst case, find out if/where things
 blow
   up,
   and put in some safeguards. Or perhaps discover that it's completely
 and
   entirely perfect G.
  
   Thanks again
   Erick
  
   On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote:
  
   On Friday 06 October 2006 14:37, Erick Erickson wrote:
   ...
Fortunately, the PM agrees that it's silly to think about span
 queries
involving OR or NOT for this app. So I'm left with something like
 Jo*n
   AND
sm*th AND jon?es WITHIN 6.
  
   OR works much the same as term expansion for wildcards.
  
The only approach that's occurred to me is to create a filter on
   for the
terms, giving me a subset of my docs that have any terms
 satisfying
   the
above. For each doc in the filter, get creative with
   TermPositionVector
   for
determining whether the document matches. It seems that this
 would
   involve
creating a list of all positions in each doc in my filter that
 match
   jo*n,
another for sm*th, and another for jon?es and seeing if the
 distance
(however I define that) between any triple of terms (one from
 each
   list)
   is
less than 6.
  
My gut feel is that this explodes time-wise based upon the number
 of
   terms
that match. In this particular application, we are indexing 20K
 books.
   Based
on 

Re: wildcard and span queries

2006-10-09 Thread Paul Elschot
Erick,

On Monday 09 October 2006 21:20, Erick Erickson wrote:
 OK, forget the stuff about TooManyBooleanClauses. I finally figured out
 that if I specify the surround to have the same semantics as a SpanRegex (
 i.e, and(eri*, mal*)) it blows up with TooManyBooleanClauses. So that makes
 more sense to me now.
 
 Specifying 20w(eri*, mal*) is what I was using before.
 
 Erick
 
 On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote:
 
  OK, I'm using the surround code, and it seems to be working...with the
  following questions (always, more questions)...
 
   I'm gettng an exception sometimes of TooManyBasicQueries. I can control
  this by initializing BasicQueryFactory with a larger number. Do you have 
any
  cautions about upping this number?
 
   There's a hard-coded value minimumPrefixLength set to 3 down in the code
  Surround query parser (allowedSuffix). I see no method to change this. I
  assume that this is to prevent using up too much memory/time. What should 
I
  know about this value? I'm mostly interested in a justification for the
  product manager why allowing, say, two character (or one character) 
prefixes
  is a bad idea G.

Once BasicQueryFactory has a satisfactory limitation, that is, one that
a user can understand when the exception for too many basic queries
is thrown, there is no need to keep this minimum prefix length at 3;
1 or 2 will also do. When using many thousands as the max. number of basic
queries, the term expansion itself might take some time to reach that maximum.

You might want to ask the PM for a reasonable query involving such short
prefixes, though. In most western languages, they do not make much sense.

 
   I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal
  to Surround queries. That is, trying RegexSpanQuery doesn't want to work 
at
  all with the same search clause, as it runs out of memory pretty
  quickly..
 
  However, working with three-letter prefixes is blazingly fast.

Your index is probably not very large (yet). Make sure to reevaluate
the max. number of basic queries as it grows.

Did you try nesting like this:
20d( 4w(lucene, action), 5d(hatch*, gospod*))
?

Could you tell a bit more about the target grammar?

Regards,
Paul Elschot


 
  Thanks again...
 
  Erick
 
  On 10/6/06, Paul Elschot  [EMAIL PROTECTED] wrote:
  
   Mark,
  
   On Friday 06 October 2006 22:46, Mark Miller wrote:
Paul's parser is beyond my feeble comprehension...but I would start by
looking at SrndTruncQuery. It looks to me like this enumerates each
possible match just like a SpanRegexQuery does...I am too lazy to
   figure
out what the visitor pattern is doing so I don't know if they then get
added to a boolean query, but I don't know what else would happen. If
  
   They can also be added to a SpanOrQuery as SpanTermQuery,
   this depends on the context of the query (distance query or not).
   The visitor pattern is used to have the same code for distance queries
   and other queries as far as possible.
  
this is the case, I am wondering if it is any more efficient than the
SpanRegex implementation...which could be changed to a SpanWildcard
  
   I don't think the surround implementation of expanding terms is more
   efficient that the Lucene implementation.
   Surround does have the functionality of a SpanWildCard, but
   the implementation of the expansion is shared, see above.
  
implementation. How exactly is this better at avoiding a
   toomanyclauses
exception or ram fillup. Is it just the fact that the (lets say) three
  
wildcard terms are anded so this should dramatically reduce the
   matches?
  
   The limitation in BasicQueryFactory works for a complete surround query,
   which can be nested.
   In Lucene only the max nr of clauses for a single level BooleanQuery
   can be controlled.
  
   ...
  
   Regards,
   Paul Elschot
  
  
- Mark
   
Erick Erickson wrote:
 Paul:

 Splendid! Now if I just understood a single thing about the
   SrndQuery
 family
 G.

 I followed your link, and took a look at the text file. That should
 give me
 enough to get started.

 But if you wanted to e-mail me any sample code or long explanations
   of
 what
 this all does, I would forever be your lackey G

 I should also fairly easily be able to run a few of these against
   the
 partial index I already have to get some sense of now it'll all work
  
 out in
 my problem space. I suspect that the actual number of distinct terms
 won't
 grow too much after the first 4,000 books, so it'll probably be
   pretty
 safe
 to get this running in the worst case, find out if/where things
   blow
 up,
 and put in some safeguards. Or perhaps discover that it's completely
   and
 entirely perfect G.

 Thanks again
 Erick

 On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote:

 On Friday 06 October 2006 

Re: wildcard and span queries

2006-10-09 Thread Erick Erickson

I've already started that conversation with the PM, I'm just trying to get a
better idea of what's possible. I'll whimper tooth and nail to keep from
having to do a lot of work to add a feature to a product that nobody in
their right mind would ever use G.

As far as the grammar, we don't actually have one yet. That's part of what
this exploration is all about. The kicker is that what we are indexing is
OCR data, some of which is pretty trashy. So you wind up with interesting
words in your index, things like rtyHrS. So the whole question of allowing
very specific queries on detailed wildcards (combined with spans) is under
discussion. It's not at all clear to me that there's any value to the end
users in the capability of, say, two character prefixes. And, it's an easy
rule that prefix queries must specify at least 3 non-wildcard
characters

Thanks for your advice. You're quite correct that the index isn't very large
yet. My task tonight is to index about 4K books. I suspect that the number
of terms won't increase dramatically after that many books, but that's an
assumption on my part.

Thanks again
Erick

On 10/9/06, Paul Elschot [EMAIL PROTECTED] wrote:


Erick,

On Monday 09 October 2006 21:20, Erick Erickson wrote:
 OK, forget the stuff about TooManyBooleanClauses. I finally figured
out
 that if I specify the surround to have the same semantics as a SpanRegex
(
 i.e, and(eri*, mal*)) it blows up with TooManyBooleanClauses. So that
makes
 more sense to me now.

 Specifying 20w(eri*, mal*) is what I was using before.

 Erick

 On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote:
 
  OK, I'm using the surround code, and it seems to be working...with the
  following questions (always, more questions)...
 
   I'm gettng an exception sometimes of TooManyBasicQueries. I can
control
  this by initializing BasicQueryFactory with a larger number. Do you
have
any
  cautions about upping this number?
 
   There's a hard-coded value minimumPrefixLength set to 3 down in the
code
  Surround query parser (allowedSuffix). I see no method to change this.
I
  assume that this is to prevent using up too much memory/time. What
should
I
  know about this value? I'm mostly interested in a justification for
the
  product manager why allowing, say, two character (or one character)
prefixes
  is a bad idea G.

Once BasicQueryFactory has a satisfactory limitation, that is one that
a user can understand when the exception for too many basic queries
is thrown, there is no need to keep this minimim prefix length at 3,
1 or 2 will also do. When using many thousands as the max. basic queries,
the term expansion itself might take some time to reach that maximum.

You might want to ask the PM for a reasonable query involving such short
prefixes, though. In most western languages, they do not make much sense.

 
   I'm a bit confused. It appears that TooManyBooleanClauses is
orthogonal
  to Surround queries. That is, trying RegexSpanQuery doesn't want to
work
at
  all with the same search clause, as it runs out of memory pretty
  quickly..
 
  However, working with three-letter prefixes is blazingly fast.

Your index is probably not very large (yet). Make sure to reevaluate
the max. number of basic queries as it grows.

Did you try nesting like this:
20d( 4w(lucene, action), 5d(hatch*, gospod*))
?

Could you tell a bit more about the target grammar?

Regards,
Paul Elschot


 
  Thanks again...
 
  Erick
 
  On 10/6/06, Paul Elschot  [EMAIL PROTECTED] wrote:
  
   Mark,
  
   On Friday 06 October 2006 22:46, Mark Miller wrote:
Paul's parser is beyond my feeble comprehension...but I would
start by
looking at SrndTruncQuery. It looks to me like this enumerates
each
possible match just like a SpanRegexQuery does...I am too lazy to
   figure
out what the visitor pattern is doing so I don't know if they then
get
added to a boolean query, but I don't know what else would happen.
If
  
   They can also be added to a SpanOrQuery as SpanTermQuery,
   this depends on the context of the query (distance query or not).
   The visitor pattern is used to have the same code for distance
queries
   and other queries as far as possible.
  
this is the case, I am wondering if it is any more efficient than
the
SpanRegex implementation...which could be changed to a
SpanWildcard
  
   I don't think the surround implementation of expanding terms is more
   efficient that the Lucene implementation.
   Surround does have the functionality of a SpanWildCard, but
   the implementation of the expansion is shared, see above.
  
implementation. How exactly is this better at avoiding a
   toomanyclauses
exception or ram fillup. Is it just the fact that the (lets say)
three
  
wildcard terms are anded so this should dramatically reduce the
   matches?
  
   The limitation in BasicQueryFactory works for a complete surround
query,
   which can be nested.
   In Lucene only the max nr of clauses for a single level 

Re: wildcard and span queries

2006-10-09 Thread Doron Cohen
Erick Erickson [EMAIL PROTECTED] wrote on 09/10/2006 13:09:21:
 ... The kicker is that what we are indexing is
 OCR data, some of which is pretty trashy. So you wind up with
interesting
 words in your index, things like rtyHrS. So the whole question of
allowing
 very specific queries on detailed wildcards (combined with spans) is
under
 discussion. It's not at all clear to me that there's any value to the end
 users in the capability of, say, two character prefixes. And, it's an
easy
 rule that prefix queries must specify at least 3 non-wildcard
 characters

Erick, I may be out of course here, but, fwiw, have you considered n-gram
indexing/search for a degree of fuzziness to compensate for OCR errors..?

For a four-word query you would probably get ~20 tokens (bigrams?) - no
matter what the index size is. You would then probably want to score higher
by LA (lexical affinity - query terms appearing close to each other in the
document) - and I am not sure to what degree a span query (made of n-gram
terms) would serve that, because (1) all terms in the span need to be there
(well, I think :-); and (2) you would like to increase the doc score for
close-by terms only for close-by query n-grams.

So there might not be a ready-to-use solution in Lucene for this, but
perhaps this is a more robust direction to try than the wildcard approach
- I mean, if users want to type a wildcard query, it is their right to do
so, but as application logic this does not seem the best choice.
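
For illustration (a sketch, not a Lucene class from that era): generating
character trigrams from a token, which could then be indexed in a separate
field to provide the fuzziness described above:

import java.util.ArrayList;
import java.util.List;

class NGrams {
    // All character n-grams of the given size, e.g. ngrams("lucene", 3) -> [luc, uce, cen, ene]
    static List ngrams(String token, int n) {
        List grams = new ArrayList();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }
}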


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FieldSelectorResult instance descriptions?

2006-10-09 Thread Chris Hostetter

: If you read the entire source as I did, it becomes clear! :)
: The interesting code is in FieldsReader.

Not necessarily.  There can be differences between how constants are
used and how they are supposed to be used (depending on whether or not the
code using them has any bugs in it).


: NO_LOAD : skip the field, its value won't be available

Should the client expectation for NO_LOAD fields be that
Document.getField/getFieldable will return null, and that
the List returned by getFields() will not contain anything for these
fields, or should clients assume there may be an empty Fieldable object
returned by any of these methods (or included in the list)?

: LAZY_LOAD : do not load the field value, but if you request it later, it will
: be loaded on request. Note that it can be lazy-loaded only if the reader is
: still open.

What should clients expect to happen if the reader has already been
closed?

: LOAD_FOR_MERGE : internal use when merging segments: it avoids uncompressing
: and recompressing data; the data is merged binarily.

This seems like a second-class citizen then, correct? Not intended for
client code to use in their FieldSelector? ... So what if they do use
it? ... Can they expect the data in the Field object to be uncompressed on
the fly if they attempt to access it later?


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: QueryParser syntax French Operator

2006-10-09 Thread Patrick Turcotte

Hi,

I was thinking of something along those lines.

Last week, I was able to take time to understand the JavaCC syntax and
possibilities.

I have some cleaning up, testing and documentation to do, but basically I
was able to expand the AND / OR / NOT patterns at runtime using the
ResourceBundle paradigm. I'll keep you posted.
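
Not the actual implementation, but a rough sketch of the ResourceBundle idea
as a simple pre-processing step (rather than changing the JavaCC grammar
itself): map localized operator words to QueryParser's AND/OR/NOT before
parsing. The bundle name, its contents (e.g. ET=AND, OU=OR, SAUF=NOT) and the
"contents" field are assumptions:

import java.util.Enumeration;
import java.util.Locale;
import java.util.ResourceBundle;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

class LocalizedOperators {
    static Query parse(String userQuery, Locale locale) throws Exception {
        // e.g. operators_fr.properties: ET=AND, OU=OR, SAUF=NOT
        ResourceBundle ops = ResourceBundle.getBundle("operators", locale);
        for (Enumeration keys = ops.getKeys(); keys.hasMoreElements();) {
            String localized = (String) keys.nextElement();
            userQuery = userQuery.replaceAll("\\b" + localized + "\\b", ops.getString(localized));
        }
        return new QueryParser("contents", new StandardAnalyzer()).parse(userQuery);
    }
}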

Patrick

 -Original Message-
 From: karl wettin [mailto:[EMAIL PROTECTED]
 Sent: 8 October 2006 10:14
 To: java-user@lucene.apache.org
 Subject: Re: QueryParser syntax French Operator


 On 10/8/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
  Hi Patrick,
 
  If I were trying to do this, I'd modify QueryParser.jj to
 construct the grammar for boolean operators based on something
 like Locale (or LANG env. variable?).  I'd try adding code a la:
  en_AND = AND
  en_OR = OR
  en_NOT = NOT
  fr_AND = ET
  fr_OR = OU
  fr_NOT = SAUF
 
  And then:
  if (locale is 'fr')
   // construct the grammar with fr_*
  ...
 
  Something like that.

 It is a good thought, but as the number of locales with similar languages
 grows you'll get deterministic errors in the lexer. So I would
 absolutely recommend one grammar file per language. I'm not sure if JavaCC
 allows inheritance, but with ANTLR this would be a very simple and
 effective way to solve the problem.

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FieldSelectorResult instance descriptions?

2006-10-09 Thread Grant Ingersoll
See http://www.gossamer-threads.com/lists/lucene/java-dev/33964?search_string=Lazy%20Field%20Loading;#33964
for the discussion on Java Dev from way back if you want more background info.


To some extent, I still think Lazy Fields are in the early-adopter
stage, since they haven't officially been released, so these
questions are good for vetting them.  And there is still the question
of how to handle Document.getField() versus Document.getFieldable()...
but that is a discussion for the dev list.



See below for more...

HTH,
Grant
On Oct 9, 2006, at 5:22 PM, Chris Hostetter wrote:



: If you read the entire source as I did, I becomes clear ! :)
: The interesting code is in FieldsReader.

Not neccessarily.  There can be differneces between how constants are
used and how they are suppose to be used (depending on wether or  
not the

code using them has any bugs in it)



I will put some javadocs on these (or if someone wants to add a  
patch...)





: NO_LOAD : skip the field, it's value won't be available

Should the client expecation for NO_LOAD fileds be that the
Document.getField/getFieldable will return will null, and that
the List returned by getFields() will not contain anything for these
fields, or should clients assume there may be an empty Fieldable  
object

returned by any of these methods (or included in the list)


My understanding is in the NO_LOAD case, doc.add(Field) is not  
called, so Document.getField() will return null.  Again, I will try  
to get some javadocs on this part.




: LAZY_LOAD : do not load the field value, but if you request it  
later, it will
: be loaded on request. Note that it can be lazy-loaded only if the  
reader is

: still opened.

What should clients expect to happen if the reader has already been
closed?



Search the dev list for "Semantics of a closed IndexInput" for some
discussion on this between Doug and me.  Unfortunately, the answer
isn't all that satisfying, since it is undefined.  I would prefer
better treatment than that, but it isn't obvious.  I originally
thought there would be an exception to catch or something (in fact,
my original test cases had expected it to be handled), but ended up
putting the handling on the application, since the app should know
when it has been closed.



: LOAD_FOR_MERGE : internal use when merging segments: it avoids
: uncompressing and recompressing data; the data is merged as raw bytes.

This seems like a second-class citizen then, correct? Not intended for
client code to use in their FieldSelector? ... so what if they do use
it? ... can they expect the data in the Field object to be uncompressed on
the fly if they attempt to access it later?



I would agree it is a second-class citizen, but maybe Otis can add  
his thoughts, as I think he added this feature.  I am unsure of the  
results of using it outside of the merge scope.




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: wildcard and span queries

2006-10-09 Thread Erick Erickson

Doron:

Thanks for the suggestion, I'll certainly put it on my list, depending upon
what the PM decides. This app is genealogy research, and users *can* put
in their own wildcards...

This is why I love this list... lots of smart people giving me suggestions I
never would have thought of G...

Thanks
Erick

On 10/9/06, Doron Cohen [EMAIL PROTECTED] wrote:


Erick Erickson [EMAIL PROTECTED] wrote on 09/10/2006 13:09:21:
 ... The kicker is that what we are indexing is OCR data, some of which is
 pretty trashy. So you wind up with interesting words in your index, things
 like rtyHrS. So the whole question of allowing very specific queries on
 detailed wildcards (combined with spans) is under discussion. It's not at
 all clear to me that there's any value to the end users in the capability
 of, say, two character prefixes. And, it's an easy rule that prefix
 queries must specify at least 3 non-wildcard characters

Erick, I may be off course here, but, fwiw, have you considered n-gram
indexing/search for a degree of fuzziness to compensate for OCR errors..?

For a four-word query you would probably get ~20 tokens (bigrams?) - no
matter what the index size is. You would then probably want to score higher
by LA (lexical affinity - query terms appear close to each other in the
document) - and I am not sure to what degree a span query (made of n-gram
terms) would serve that, because (1) all terms in the span need to be there
(well, I think :-); and, (2) you would like to increase doc score for
close-by terms only for close-by query n-grams.

So there might not be a ready-to-use solution in Lucene for this, but
perhaps this is a more robust direction to try than the wildcard approach
- I mean, if users want to type a wildcard query, it is their right to do
so, but for the application logic this does not seem the best choice.
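
Purely as an illustration of the bigram idea (not a complete n-gram
solution: it assumes the index side was built with matching character
bigrams in a hypothetically named "bigrams" field), the query side could be
as simple as:

// Illustrative only: build an OR query of character bigrams for one word.
// The "bigrams" field name is made up; the indexing side would need a
// matching analyzer that produces the same bigram terms.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class BigramQueryBuilder {

    public static BooleanQuery bigramQuery(String word) {
        BooleanQuery query = new BooleanQuery();
        // Slide a 2-character window over the word; each bigram is a SHOULD clause.
        for (int i = 0; i + 2 <= word.length(); i++) {
            String bigram = word.substring(i, i + 2);
            query.add(new TermQuery(new Term("bigrams", bigram)),
                      BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}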


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Incremental updates / slow searches.

2006-10-09 Thread Chris Hostetter

don't forget to optimize your index every now and then as well... deleting
a document just marks it as deleted; it still gets inspected by every
query during scoring at least once to see that it can skip it, and optimizing
is the only thing that truly removes the deleted documents.
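
For reference, a minimal sketch of such a periodic optimize pass with the
2.0-era API; the index path and analyzer are placeholders, not taken from
the original poster's setup:

// Minimal sketch of a periodic optimize pass; path and analyzer are placeholders.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OptimizeIndex {
    public static void main(String[] args) throws Exception {
        // false = open the existing index rather than creating a new one
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), false);
        writer.optimize();   // merges segments and drops deleted docs
        writer.close();
    }
}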


: Date: Mon, 9 Oct 2006 13:49:34 -0400
: From: Yonik Seeley [EMAIL PROTECTED]
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: Incremental updates / slow searches.
:
: The biggest thing would be to limit how often you open a new
: IndexSearcher, and when you do, warm up the new searcher in the
: background while you continue serving searches with the existing
: searcher.  This is the strategy that Solr uses.
:
: There is also the issue of whether you are analyzing/merging docs on the
: same servers that you are executing searches on.  You can use a
: separate box to build the index and distribute changes to boxes used
: for searching.
:
: -Yonik
: http://incubator.apache.org/solr Solr, the open-source Lucene search server
:
: On 10/9/06, Rickard Bäckman [EMAIL PROTECTED] wrote:
:  Hi,
: 
:  we are using a search system based on Lucene and have recently tried to add
:  incremental updating of the index instead of building a new index every now
:  and then. However we now run into problems as our searches starts to take
:  very long time to complete.
: 
:  Our index is about 8-9GB large and we are sending lots of updates / second
:  (we are probably merging in 200 - 300 in a few seconds). Today we buffer a
:  bunch of updates and then merge them into the existing index like a batch,
:  first doing deletes and then inserts.
: 
:  We are currently not using any special tuning of Lucene.
: 
:  Does anyone have any similar experiences with Lucene or advice on how to
:  reduce the amount of time it takes to perform a search? In particular what
:  would be an optimal combination of update size, merge factor, max buffered
:  docs?
: 
:  /Rickard
: 
: 
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:
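
A bare-bones sketch of the "open, warm in the background, then swap" idea
quoted above; the index path and warm-up query are placeholders, and real
code would need reference counting before closing the old searcher:

// Sketch only: keep serving with the current searcher, warm a fresh one,
// then swap. Path and warm-up query are placeholders.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SearcherManager {

    private volatile IndexSearcher current;

    public SearcherManager(String indexPath) throws Exception {
        current = new IndexSearcher(indexPath);
    }

    public IndexSearcher getSearcher() {
        return current;            // live traffic keeps using the old searcher
    }

    // Call this after a batch of updates has been committed.
    public void reopen(String indexPath) throws Exception {
        IndexSearcher fresh = new IndexSearcher(indexPath);

        // Warm the new searcher with a representative query so it is not
        // cold when it starts serving real traffic.
        Query warmup = new TermQuery(new Term("body", "lucene"));
        Hits hits = fresh.search(warmup);
        hits.length();             // make sure the search actually ran

        IndexSearcher old = current;
        current = fresh;           // swap
        old.close();               // NOTE: in-flight searches on the old
                                   // searcher must finish first; real code
                                   // needs reference counting here.
    }
}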



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Incremental updates / slow searches.

2006-10-09 Thread Yonik Seeley

On 10/9/06, Chris Hostetter [EMAIL PROTECTED] wrote:

don't forget to optimize your index every now and then as well... deleting
a document just marks it as deleted; it still gets inspected by every
query during scoring at least once to see that it can skip it, and optimizing
is the only thing that truly removes the deleted documents.


I'd refine that statement to "optimizing is the easiest way to remove
any deleted documents that still exist in the index."

Deleted documents are removed from segments that are merged, so it
depends on things like the mergeFactor, maxBufferedDocs, and where the
deleted docs are in the index (in the smallest or largest segments).
Some deleted docs will be removed quickly, but some won't.

Optimizing an index also has a beneficial effect on search speed even
beyond removing all of the deleted docs.  Each index segment is
actually a complete index on its own... so if search is generally
O(log(N)), searching across M segments of size N will take M *
log(N).  If those segments are optimized into a single segment, the
search will be O(log(M*N)).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]