Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

2007-07-25 Thread Mathieu Lecarme
Le mardi 24 juillet 2007 à 13:01 -0700, Shaw, James a écrit :
> Hi, guys,
> I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
> the Snowball stemmers only include European languages.  Does stemming
> not make sense for ideograph-based languages (i.e., no stemming is
> needed for Japanese, Korean and Chinese)?
No.

> Also for spell checking, does the default Lucene SpellChecker work for
> Japanese, Korean and Chinese?  Does edit distance make sense for these
> languages?
Japanese uses groups of ideograms, and Levenshtein distance doesn't make
sense with so few letters, but I'm not a CJK expert.

M.





Re: What replaced org.apache.lucene.document.Field.Text?

2007-07-25 Thread Patrick Kimber

Hi Andy

I think:
Field.Text("name", "value");

has been replaced with:
new Field("name", "value", Field.Store.YES, Field.Index.TOKENIZED);

Patrick

On 25/07/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Please see the FAQ entry "How do I get code written for Lucene 1.4.x to work
with Lucene 2.x?":
http://wiki.apache.org/lucene-java/LuceneFAQ#head-86d479476c63a2579e867b75d4faa9664ef6cf4d


Andy
-Original Message-
From: Lindsey Hess [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 25, 2007 12:31 PM
To: Lucene
Subject: What replaced org.apache.lucene.document.Field.Text?

I'm trying to get some relatively old Lucene code to compile (please see
below), and it appears that Field.Text has been deprecated.  Can someone
please suggest what I should use in its place?

  Thank you.

  Lindsey



  public static void main(String args[]) throws Exception
  {
  String indexDir =
  System.getProperty("java.io.tmpdir", "tmp") +
  System.getProperty("file.separator") + "address-book";
  Analyzer analyzer = new WhitespaceAnalyzer();
  boolean createFlag = true;

  IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);
  Document contactDocument = new Document();
  contactDocument.add(Field.Text("type", "individual"));

  contactDocument.add(Field.Text("name", "Zane Pasolini"));
  contactDocument.add(Field.Text("address", "999 W. Prince St."));
  contactDocument.add(Field.Text("city", "New York"));
  contactDocument.add(Field.Text("province", "NY"));
  contactDocument.add(Field.Text("postalcode", "10013"));
  contactDocument.add(Field.Text("country", "USA"));
  contactDocument.add(Field.Text("telephone", "1-212-345-6789"));
  writer.addDocument(contactDocument);
  writer.close();
  }





Re: Search for null

2007-07-25 Thread daniel rosher
You can't directly search for fields that do not exist (which is what you
originally wanted to do). Instead, you can do something like this:

-Establish the query that will select all non-null values

TermQuery tq1 = new TermQuery(new Term("field","value1"));
TermQuery tq2 = new TermQuery(new Term("field","value2"));
...
TermQuery tqn = new TermQuery(new Term("field","valuen"));
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(tq1, BooleanClause.Occur.SHOULD);
booleanQuery.add(tq2, BooleanClause.Occur.SHOULD);
...
booleanQuery.add(tqn, BooleanClause.Occur.SHOULD);

OR perhaps a range query, if your values are contiguous:

Term start = new Term("field","198805");
Term end = new Term("field","198810");
Query query = new RangeQuery(start, end, true);

OR just use the QueryParser

Query query = QueryParser.parse(parseCriteria,
"field", new StandardAnalyzer());

-Create the QueryFilter

QueryFilter queryFilter = new QueryFilter(query);

-flip the bits

final BitSet filterBitSet = queryFilter.bits(reader);
// Flip only real document IDs; BitSet.size() can be larger than maxDoc().
filterBitSet.flip(0, reader.maxDoc());

Now you have a filter that matches the opposite of what the query
specified, and you can use it in subsequent searches.
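A minimal sketch of using the flipped bits in a search (Lucene 2.x Filter
API; the term query and field names are placeholders):

final BitSet nullBits = filterBitSet; // the flipped bits from above
Filter nullFieldFilter = new Filter() {
    // Only valid for the same IndexReader the bits were computed against.
    public BitSet bits(IndexReader r) {
        return nullBits;
    }
};
Hits hits = searcher.search(new TermQuery(new Term("title", "foo")),
                            nullFieldFilter);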

Dan



On Tue, 2007-07-24 at 09:40 -0700, Jay Yu wrote:
> 
> daniel rosher wrote:
> > Perhaps you can use a filter in the following way.
> > 
> > -Create a filter (via QueryFilter) that would contain all documents that
> > do not have null values for the field
> Interesting: what does the QueryFilter look like? Isn't it just as hard 
> as finding out what docs have the null values for the field?
> I'd really like to know your trick here.
> > -flip the bits of the filter so that it now contains documents that have
> > null values for a field
> > -Use the filter in conjunction with subsequent queries.
> > 
> > This would also help with performance as filters are simply bitsets and
> > can cheaply be stored, generated once and used often.
> > 
> > Dan
> > 
> > On Mon, 2007-07-23 at 13:57 -0700, Jay Yu wrote:
> >> If you want performance, a better way might be to assign some special 
> >> string/value (if it's easy to create) to the missing field of docs and 
> >> index the field without tokenizing it. Then you may search for that 
> >> special value to find the docs.
> >>
> >> Jay
> >>
> >> Les Fletcher wrote:
> >>> Does this particular range query have any significant performance issues?
> >>>
> >>> Les
> >>>
> >>> Erik Hatcher wrote:
>  On Jul 23, 2007, at 11:32 AM, testn wrote:
> > Is it possible to search for the document that specified field 
> > doesn't exist
> > or such field value is null?
>  This is from Solr, so I'm not sure off the top of my head if this mojo 
>  applies by itself, but a search for -fieldname:[* TO *] will result in 
>  all documents that do not have the specified field.
> 
>  Erik

Recovering from a Crash

2007-07-25 Thread Simon Wistow
We were affected by the great SF outage yesterday and apparently the 
indexing machine crashed without being shutdown properly.

I've taken a backup of the indexes which has the usual smattering of 
write.lock, segments.gen, .cfs, .fdt, .fnm and .fdx etc. files, and looks 
to be about the right size.

However, if I start up my indexer with that directory it shrinks to a 
fraction of its size (500 times smaller) and (obviously) contains 
virtually no documents.

The data appears to be there - please tell me that I'm doing something 
stupid and I can recover from this.

Simon






Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
"Simon Wistow" <[EMAIL PROTECTED]> wrote:
> We were affected by the great SF outage yesterday and apparently the 
> indexing machine crashed without being shutdown properly.

Eek, sorry!  We are so reliant on electricity these days
 
> I've taken a backup of the indexes which has the usual smattering of 
> write.lock, segments.gen, .cfs, .fdt, .fnm and .fdx etc. files, and looks 
> to be about the right size.

Hmm, how do you do your backups?   Is there a segments_N file present
in the backup?

It's somewhat spooky that you have a write.lock present because that
means you backed up while a writer was actively writing to the index
which is a bit dangerous because if the timing is unlucky (backup does
an "ls" but before it can copy the segments_N file a commit has
happened) you could fail to copy a segments_N file.  It's best to
either pause the writer for backups to occur (simplest) or make a
custom deletion policy that safely allows the backup to slowly copy
even while indexing is continuing (advanced).

> However, if I start up my indexer with that directory it shrinks to a 
> fraction of its size (500 times smaller) and (obviously) contains 
> virtually no documents.

It seems like the segments_N file may be missing?

> The data appears to be there - please tell me that I'm doing something 
> stupid and I can recover from this.

Maybe try other (older) backups to see if they have the segments_N file?

Mike




Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 10:08:56AM +0100, me said:
> The data appears to be there - please tell me that I'm doing something 
> stupid and I can recover from this.

It appears that by deleting the write.lock files everything has recovered.

Is this best practice? Have I just done something so terribly wrong that 
I've brought about the end of the universe?







Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

2007-07-25 Thread Maximilian Hütter
Mathieu Lecarme schrieb:
> Le mardi 24 juillet 2007 à 13:01 -0700, Shaw, James a écrit :
>> Hi, guys,
>> I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
>> the Snowball stemmers only include European languages.  Does stemming
>> not make sense for ideograph-based languages (i.e., no stemming is
>> needed for Japanese, Korean and Chinese)?
> No.

This is not quite correct: Chinese doesn't need any stemming, but Japanese
is not completely ideograph-based and could use stemming. I doubt anyone
has done this, apart from some commercial software for the Japanese market.
I don't know about Korean.

>> Also for spell checking, does the default Lucene SpellChecker work for
>> Japanese, Korean and Chinese?  Does edit distance make sense for these
>> languages?
> Japanese uses groups of ideograms, and Levenshtein distance doesn't make
> sense with so few letters, but I'm not a CJK expert.
> 
> M.

Edit distance only seems to work for languages written in Latin-based
scripts. Spell checking Chinese, Japanese (and Korean?) is more or less
pointless, as they are entered using input methods, which should produce
"correct" words.

Best regards,

Max


-- 
Maximilian Hütter
blue elephant systems GmbH
Wollgrasweg 49
D-70599 Stuttgart

Tel:  (+49) 0711 - 45 10 17 578
Fax:  (+49) 0711 - 45 10 17 573
e-mail :  [EMAIL PROTECTED]
Sitz   :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich




Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said:
> It's somewhat spooky that you have a write.lock present because that
> means you backed up while a writer was actively writing to the index
> which is a bit dangerous because if the timing is unlucky (backup does
> an "ls" but before it can copy the segments_N file a commit has
> happened) you could fail to copy a segments_N file.  It's best to
> either pause the writer for backups to occur (simplest) or make a
> custom deletion policy that safely allows the backup to slowly copy
> even while indexing is continuing (advanced).

Sorry, I should have been clearer - I took the backup of the state of 
the index when the machine restarted after the crash. I did have another 
backup from a day or so ago, but I was hoping not to have to reindex a 
day's worth of data (which is a lot).

Our backup strategy is currently -
1) Stop the writer (and let write tasks queue up)
2) cp -lr indexes indexes-
3) Restart the writer

Is this something approximating best practice?

Simon






Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
"Simon Wistow" <[EMAIL PROTECTED]> wrote:
> On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said:
> > It's somewhat spooky that you have a write.lock present because that
> > means you backed up while a writer was actively writing to the index
> > which is a bit dangerous because if the timing is unlucky (backup does
> > an "ls" but before it can copy the segments_N file a commit has
> > happened) you could fail to copy a segments_N file.  It's best to
> > either pause the writer for backups to occur (simplest) or make a
> > custom deletion policy that safely allows the backup to slowly copy
> > even while indexing is continuing (advanced).
>
> Sorry, I should have been clearer - I took the backup of the state of
> the index when the machine restarted after the crash. I did have another
> backup from a day or so ago, but I was hoping not to have to reindex a
> day's worth of data (which is a lot).

Ahhh, OK.  But do you have a segments_N file?

> Our backup strategy is currently -
> 1) Stop the writer (and let write tasks queue up)
> 2) cp -lr indexes indexes-
> 3) Restart the writer
>
> Is this something approximating best practice?

Yes, this is perfect.  This is the "simple" option I described.  The
more complex option is to use a custom deletion policy which enables
you to safely do backups (even if the copy process is slow) without
pausing the write task (indexing).

Mike




Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
> > The data appears to be there - please tell me that I'm doing something
> > stupid and I can recover from this.
>
> It appears that by deleting the write.lock files everything has recovered.

Hmmm -- it's odd that the existence of the write.lock caused you to
lose most of your index.  All that should have happened here is on
creating a new writer it would throw a LockObtainTimedOut exception
saying it could not obtain the write lock.  I don't see how this would
cause most of your index to be deleted...

> Is this best practice? Have I just done something so terribly wrong
> that I've brought about the end of the universe?

Universe seems intact on my end :)  But yes deleting the write.lock is
the right thing to do in this case.  You can also switch to native
locking (NativeFSLockFactory) and then the OS would free the lock
so you would not have to delete the write.lock manually...
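A sketch of switching to native locks (Lucene 2.x LockFactory API; the
index path is a placeholder):

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NativeFSLockFactory;

File indexDir = new File("/path/to/index");
// OS-level locks are released automatically when the JVM exits or crashes,
// so no stale write.lock is left behind.
Directory dir = FSDirectory.getDirectory(indexDir, new NativeFSLockFactory(indexDir));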

Mike




Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said:
> Ahhh, OK.  But do you have a segments_N file?

Yup.
 
> Yes, this is perfect.  This is the "simple" option I described.  The
> more complex option is to use a custom deletion policy which enables
> you to safely do backups (even if the copy process is slow) without
> pausing the write task (indexing).

I vaguely remember seeing something about that going past. 

Is there any documentation on custom deletion policies? Or example code 
for such a beast? At the moment at any given point we have to have disk 
space to allow for 3x Index size - index, backup we've just taken and 
previous backup we're just about to delete. Since our indexes are large 
even 2x is quite an issue.

I've read through JIRA LUCENE-710 but a more point-and-drool explanation 
would be useful to someone who hasn't been up all night :)

Simon







Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless

"Simon Wistow" <[EMAIL PROTECTED]> wrote:
> On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said:
> > Ahhh, OK.  But do you have a segments_N file?
> 
> Yup.

OK, though I still don't understand why the existence of "write.lock"
caused you to lose most of your index on creating a new writer.

> > Yes, this is perfect.  This is the "simple" option I described.  The
> > more complex option is to use a custom deletion policy which enables
> > you to safely do backups (even if the copy process is slow) without
> > pausing the write task (indexing).
> 
> I vaguely remember seeing something about that going past. 
> 
> Is there any documentation on custom deletion policies? Or example code 
> for such a beast? At the moment at any given point we have to have disk 
> space to allow for 3x Index size - index, backup we've just taken and 
> previous backup we're just about to delete. Since our indexes are large 
> even 2x is quite an issue.
> 
> I've read through JIRA LUCENE-710 but a more point-and-drool explanation 
> would be useful to someone who hasn't been up all night :)

Good question ... there is no good documentation, sample code, etc.,
for this as of yet ... I've been secretly hoping the first person
who creates this deletion policy would share it :)  I don't think it's
very difficult to create.  If that doesn't happen sometime soon I'll
try to make time to create an example.
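In the meantime, a rough sketch of the idea (assuming the Lucene 2.2
IndexDeletionPolicy/IndexCommitPoint API from LUCENE-710; untested, and the
snapshot bookkeeping is deliberately simplified):

import java.util.List;
import org.apache.lucene.index.IndexCommitPoint;
import org.apache.lucene.index.IndexDeletionPolicy;

public class BackupFriendlyDeletionPolicy implements IndexDeletionPolicy {

    private IndexCommitPoint lastCommit; // most recent commit seen
    private IndexCommitPoint snapshot;   // commit pinned while a backup runs

    // Call before copying; back up the files named by the returned commit
    // (its getFileNames()) while it is pinned.
    public synchronized IndexCommitPoint takeSnapshot() {
        snapshot = lastCommit;
        return snapshot;
    }

    public synchronized void releaseSnapshot() {
        snapshot = null;
    }

    public synchronized void onInit(List commits) {
        onCommit(commits);
    }

    public synchronized void onCommit(List commits) {
        if (commits.isEmpty()) {
            return;
        }
        // The last commit in the list is the newest: keep it, keep the
        // pinned snapshot (if any), and delete everything else.
        lastCommit = (IndexCommitPoint) commits.get(commits.size() - 1);
        for (int i = 0; i < commits.size() - 1; i++) {
            IndexCommitPoint commit = (IndexCommitPoint) commits.get(i);
            if (commit != snapshot) {
                commit.delete();
            }
        }
    }
}

You would pass an instance to an IndexWriter constructor that accepts an
IndexDeletionPolicy, call takeSnapshot() before the copy starts, and
releaseSnapshot() when it finishes; only the pinned commit then needs extra
disk space while the backup runs.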

Mike




Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski


Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
limited by JavaCC speed. You cannot shave much more performance out of
the grammar as it is already about as simple as it gets.



JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years
ago :) switched to JFlex, which for roughly the same grammar would sometimes
be up to 10x (!) faster. You can have a look at our JFlex specification at:

http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup

This one seems more complex than the StandardAnalyzer's but it's much faster
anyway.

If anyone is interested, I could prepare a JFlex based Analyzer equivalent
(to the extent possible) to current StandardAnalyzer, which might offer nice
indexing and highlighting speed-ups.

Best,

Staszek

--
Stanislaw Osinski, [EMAIL PROTECTED]
http://www.carrot-search.com


Which field matched ?

2007-07-25 Thread makkhar

This problem has been baffling me for quite some time now and has no
perfect solution in the forum!

I have 10 documents, each with 10 fields with "parameterName and
parameterValue". Now, when I search for some term and get 5 hits, how do I
find out which paramName-value pair matched?

I am seeking an optimal solution for this. Explanation, the highlighter, etc.
are some of the solutions, but not the best, since the highlighter performs
very badly for wildcard queries and Explanation is generally not a nice way
of doing this! I am talking about really large datasets here.

Any help, highly appreciated.


-- 
View this message in context: 
http://www.nabble.com/Which-field-matched---tf4141549.html#a11780708
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Mark Miller
I would be very interested. I have been playing around with Antlr to see 
if it is any faster than JavaCC, but haven't seen great gains in my 
simple tests. I had not considered trying JFlex.


I am sure a faster StandardAnalyzer would be greatly appreciated. 
StandardAnalyzer appears widely used and horrendously slow. Even better 
would be a StandardAnalyzer that could have different recognizers 
enabled/disabled. For example, dropping NUM recognition if you don't 
need it in the current StandardAnalyzer gains like 25% speed.


- Mark

Stanislaw Osinski wrote:


Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
limited by JavaCC speed. You cannot shave much more performance out of
the grammar as it is already about as simple as it gets.



JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 
years
ago :) switched to JFlex, which for roughly the same grammar would 
sometimes
be up to 10x (!) faster. You can have a look at our JFlex 
specification at:


http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup 



This one seems more complex than the StandardAnalyzer's but it's much 
faster

anyway.

If anyone is interested, I could prepare a JFlex based Analyzer 
equivalent
(to the extent possible) to current StandardAnalyzer, which might 
offer nice

indexing and highlighting speed-ups.

Best,

Staszek






Re: Which field matched ?

2007-07-25 Thread makkhar


Currently, we use regular-expression pattern matching to get hold of which
field matched. Again, a pathetic solution, since we have to agree on a
subset of Lucene query syntax that our pattern matching can handle. We
cannot use Boolean queries etc. in this case.



makkhar wrote:
> 
> This problem has been baffling me for quite some time now and has no
> perfect solution in the forum!
> 
> I have 10 documents, each with 10 fields with "parameterName and
> parameterValue". Now, when I search for some term and get 5 hits, how do I
> find out which paramName-value pair matched?
> 
> I am seeking an optimal solution for this. Explanation, the highlighter, etc.
> are some of the solutions, but not the best, since the highlighter performs
> very badly for wildcard queries and Explanation is generally not a nice way
> of doing this! I am talking about really large datasets here.
> 
> Any help, highly appreciated.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Which-field-matched---tf4141549.html#a11780757
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski


I am sure a faster StandardAnalyzer would be greatly appreciated.



I'm increasing the priority of that task then :)

StandardAnalyzer appears widely used and horrendously slow. Even better

would be a StandardAnalyzer that could have different recognizers
enabled/disabled. For example, dropping NUM recognition if you don't
need it in the current StandardAnalyzer gains like 25% speed.



That's a good idea, though I'd need to check if in case of JFlex there would
be considerable performance differences depending on the grammar.

Staszek

--
Stanislaw Osinski, [EMAIL PROTECTED]
http://www.carrot-search.com


Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Grant Ingersoll


On Jul 25, 2007, at 7:19 AM, Stanislaw Osinski wrote:



Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
limited by JavaCC speed. You cannot shave much more performance  
out of

the grammar as it is already about as simple as it gets.



JavaCC is slow indeed. We used it for a while for Carrot2, but then  
(3 years
ago :) switched to JFlex, which for roughly the same grammar would  
sometimes
be up to 10x (!) faster. You can have a look at our JFlex  
specification at:


http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup


This one seems more complex than the StandardAnalyzer's but it's  
much faster

anyway.

If anyone is interested, I could prepare a JFlex based Analyzer  
equivalent
(to the extent possible) to current StandardAnalyzer, which might  
offer nice

indexing and highlighting speed-ups.


+1.  I think a lot of people would be interested in a faster  
StandardAnalyzer.






Lucene Highlighter linkage Error

2007-07-25 Thread ki

Hello!

I am working with Tomcat. I have put the Lucene highlighter.jar in the lib
folder, and I have created an extra CSS rule which says that the background
color has to be yellow. The search word now has to be highlighted.
I have a dataTable into which the result of the following Lucene method is
loaded:

[code]
public void search(String q, File index, String[] fields, ArrayList subresult,
                   int numresults) throws Exception {

    Directory fsDir = FSDirectory.getDirectory(index, false);
    IndexSearcher is = new IndexSearcher(fsDir);

    Analyzer analyzer = new StandardAnalyzer();
    Fragmenter fragmenter = new SimpleFragmenter(100);
    QueryParser queryparser = new MultiFieldQueryParser(fields, analyzer);
    Query query = queryparser.parse(q);
    Hits hits = is.search(query);
    // Rewrite against the searcher's reader; the original code passed a
    // null IndexReader here, which throws a NullPointerException.
    query = query.rewrite(is.getIndexReader());
    QueryScorer scorer = new QueryScorer(query);

    // The start/end tag arguments were stripped when this message was
    // posted; empty tags produce no visible highlighting.
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
    Highlighter high = new Highlighter(formatter, scorer);
    high.setTextFragmenter(fragmenter);

    numresults = numresults == -1 || numresults > hits.length()
            ? hits.length() : numresults;
    String rating = "";
    // The loop header was garbled in the post; reconstructed from context
    // ("schwelli" appears to be a score threshold defined elsewhere).
    for (int i = 0; i < numresults; i++) {
        if (hits.score(i) > schwelli) {
            float f = hits.score(i);
            if (0.9f <= f)                  { rating = "**"; }
            else if (0.8f <= f && f < 0.9f) { rating = "*"; }
            else if (0.7f <= f && f < 0.8f) { rating = ""; }
            else if (0.6f <= f && f < 0.7f) { rating = "***"; }
            else if (0.5f <= f && f < 0.6f) { rating = "**"; }
            else if (f <= 0.5f)             { rating = "*"; }

            Document doc = hits.doc(i);
            String abstracts = doc.get("ABSTRACTS");
            String title = doc.get("TITLE");
            // tokenStream() expects the field name, not the query string.
            TokenStream abstract_stream =
                    analyzer.tokenStream("ABSTRACTS", new StringReader(abstracts));
            TokenStream title_stream =
                    analyzer.tokenStream("TITLE", new StringReader(title));
            String fragment_abstract =
                    high.getBestFragments(abstract_stream, abstracts, 5, "...");
            String fragment_title =
                    high.getBestFragments(title_stream, title, 5, "...");

            if (fragment_title.length() == 0) {
                setAusgabeTitle(doc.get("TITLE"));
            } else {
                setAusgabeTitle(fragment_title);
            }

            if (fragment_abstract.length() == 0) {
                setAusgabeAbstract(doc.get("ABSTRACTS"));
            } else {
                setAusgabeAbstract(fragment_abstract);
            }

            //list.add(i+1+"\t"+q+"\t"+doc.get(entry_medline)+"\t"+hits.score(i)
            //        +"\t"+abstract_stream+"\t"+title_stream+"\t"+"MEDLINE");

            /*int No = i;
            subresult.add((new Integer(No)).toString());*/

            subresult.add(doc.get(entry_medline));

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey Guys,

I need to know how I can use the HitCollector class. I am using Hits and
looping over all the possible document hits (turns out it's 92 times I am
looping; for 300 searches, it's 300*92!). Can I avoid this using
HitCollector? I can't seem to understand how it's used.
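A minimal HitCollector sketch, assuming the Lucene 2.x API (the searcher
and query come from the surrounding code):

searcher.search(query, new HitCollector() {
    // Called once per matching document with its id and raw score;
    // no Hits object is built and nothing needs to be iterated twice.
    public void collect(int doc, float score) {
        // e.g. look up the doc's "item" field, or just remember the score
    }
});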

thanks a lot,

Askar

On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
>
> Askar,
> why do you need to add +id:?
> thanks,
> dt,
> www.ejinz.com
> search engine news forms
> - Original Message -
> From: "Askar Zaidi" <[EMAIL PROTECTED]>
> To: ; <[EMAIL PROTECTED]>
> Sent: Wednesday, July 25, 2007 12:39 AM
> Subject: Re: Fine Tuning Lucene implementation
>
>
> > Hey Hira ,
> >
> > Thanks so much for the reply. Much appreciate it.
> >
> > Quote:
> >
> > Would it be possible to just include a query clause?
> >   - i.e., instead of just contents:, also add
> > +id:
> >
> > How can I do that ?
> >
> > I see my query as :
> >
> > +contents:harvard +contents:business +contents:review
> >
> > where the search phrase was: harvard business review
> >
> > Now how can I add +id:  ??
> >
> > This would give me that one exact document I am looking for , for that
> id.
> > I
> > don't have to iterate through hits.
> >
> > thanks,
> >
> > Askar
> >
> >
> >
> > On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >>
> >> I'm no expert on this (so please accept the comments in that context)
> >> but 2 things seem weird to me:
> >>
> >> 1.  Iterating over each hit is an expensive proposition.  I've often
> >> seen people recommending a HitCollector.
> >>
> >> 2.  It seems that doBodySearch() is essentially saying, do this search
> >> and return the score pertinent to this ID (using an exhaustive loop).
> >> Would it be possible to just include a query clause?
> >> - i.e., instead of just contents:, also add
> >> +id:
> >>
> >> In general though, I think your algorithm seems inefficient (if I
> >> understand it correctly):-- if I want to search for one term among 3 in
> >> a "collection" of 300 documents (as defined by some external
> attribute),
> >> I will wind up executing 300 x 3 searches, and for each search that is
> >> executed, I will iterate over every Hit, even if I've already found the
> >> one that I "care about".
> >>
> >> What would break if you:
> >> 1.  Included "creator" in the Lucene index (or, filtered out the Hits
> >> using a BitSet or something like it)
> >> 2.  Executed 1 search
> >> 3.  Collected the results of the first N Hits (where N is some
> >> reasonable limit, like 100 or 500)
> >>
> >> -h
> >>
> >>
> >> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
> >>
> >> > Sure.
> >> >
> >> >  public float doBodySearch(Searcher searcher,String query, int id){
> >> >
> >> >  try{
> >> > score = search(searcher, query,id);
> >> >  }
> >> >   catch(IOException io){}
> >> >   catch(ParseException pe){}
> >> >
> >> >   return score;
> >> >
> >> > }
> >> >
> >> >  private float search(Searcher searcher, String queryString, int id)
> >> > throws ParseException, IOException {
> >> >
> >> > // Build a Query object
> >> >
> >> > QueryParser queryParser = new QueryParser("contents", new
> >> > KeywordAnalyzer());
> >> >
> >> > queryParser.setDefaultOperator(QueryParser.Operator.AND);
> >> >
> >> > Query query = queryParser.parse(queryString);
> >> >
> >> > // Search for the query
> >> >
> >> > Hits hits = searcher.search(query);
> >> > Document doc = null;
> >> >
> >> > // Examine the Hits object to see if there were any matches
> >> > int hitCount = hits.length();
> >> >
> >> > for(int i=0;i<hitCount;i++){
> >> > doc = hits.doc(i);
> >> > String str = doc.get("item");
> >> > int tmp = Integer.parseInt(str);
> >> > if(tmp==id)
> >> > score = hits.score(i);
> >> > }
> >> >
> >> > return score;
> >> > }
> >> >
> >> > I really need to optimize doBodySearch(...) as this takes the most
> >> > time.
> >> >
> >> > thanks guys,
> >> > Askar
> >> >
> >> >
> >> > On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >> >
> >> > Could you show us the relevant source from doBodySearch()?
> >> >
> >> > -h
> >> >
> >> > On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
> >> > > I ran some tests and it seems that the slowness is from
> >> > > Lucene calls when I do "doBodySearch"; if I remove that call,
> >> > > Lucene gives me results in 5 seconds. Otherwise it takes about 50 seconds.
> >> > >
> >> > > But I need to do Body search and that field contains lots of
> >> > > text. The field is . How can I optimize that ?
> >> > >
> >> > > thanks,
> >> > > Askar
> >> > >
> >> >   

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll

Hi Askar,

I suggest we take a step back, and ask the question, what are you  
trying to accomplish?  That is, what is your application trying to  
do?  Forget the code, etc. just explain what you want the end result  
to be and we can work from there.   Based on what you have described,  
I am not sure you need access to the hits.  It seems like you just  
need to make better queries.


Is your itemID a unique identifier?  If yes, then you shouldn't need  
to loop over hits at all, as you should only ever have one result IF  
your query contains a required term.  Also, if this is the case, why  
do you need to do a search at all?  Haven't you already identified  
the items of interest when you did your select query in the  
database?  Or is it that you want to score the item based on some  
terms as well.  If that is the case, there are other ways of doing  
this and we can discuss them.


-Grant

On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:


Hey Guys,

I need to know how I can use the HitCollector class ? I am using  
Hits and
looping over all the possible document hits (turns out its 92 times  
I am

looping; for 300 searches, its 300*92 !!). Can I avoid this using
HitCollector ? I can't seem to understand how its used.

thanks a lot,

Askar

On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:


Askar,
why do you need to add +id:?
thanks,
dt,
www.ejinz.com
search engine news forms
- Original Message -
From: "Askar Zaidi" <[EMAIL PROTECTED]>
To: ; <[EMAIL PROTECTED]>
Sent: Wednesday, July 25, 2007 12:39 AM
Subject: Re: Fine Tuning Lucene implementation



Hey Hira ,

Thanks so much for the reply. Much appreciate it.

Quote:

Would it be possible to just include a query clause?
  - i.e., instead of just contents:, also add
+id:

How can I do that ?

I see my query as :

+contents:harvard +contents:business +contents:review

where the search phrase was: harvard business review

Now how can I add +id:  ??

This would give me that one exact document I am looking for , for  
that

id.

I
don't have to iterate through hits.

thanks,

Askar



On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:


I'm no expert on this (so please accept the comments in that  
context)

but 2 things seem weird to me:

1.  Iterating over each hit is an expensive proposition.  I've  
often

seen people recommending a HitCollector.

2.  It seems that doBodySearch() is essentially saying, do this  
search
and return the score pertinent to this ID (using an exhaustive  
loop).

Would it be possible to just include a query clause?
- i.e., instead of just contents:, also add
+id:

In general though, I think your algorithm seems inefficient (if I
understand it correctly):-- if I want to search for one term  
among 3 in

a "collection" of 300 documents (as defined by some external

attribute),
I will wind up executing 300 x 3 searches, and for each search  
that is
executed, I will iterate over every Hit, even if I've already  
found the

one that I "care about".

What would break if you:
1.  Included "creator" in the Lucene index (or, filtered out the  
Hits

using a BitSet or something like it)
2.  Executed 1 search
3.  Collected the results of the first N Hits (where N is some
reasonable limit, like 100 or 500)

-h


On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:


Sure.

 public float doBodySearch(Searcher searcher,String query, int  
id){


 try{
score = search(searcher,  
query,id);

 }
  catch(IOException io){}
  catch(ParseException pe){}

  return score;

}

 private float search(Searcher searcher, String queryString,  
int id)

throws ParseException, IOException {

// Build a Query object

QueryParser queryParser = new QueryParser("contents", new
KeywordAnalyzer());

queryParser.setDefaultOperator(QueryParser.Operator.AND);

Query query = queryParser.parse(queryString);

// Search for the query

Hits hits = searcher.search(query);
Document doc = null;

// Examine the Hits object to see if there were any  
matches

int hitCount = hits.length();

for(int i=0;i<hitCount;i++){
doc = hits.doc(i);
String str = doc.get("item");
int tmp = Integer.parseInt(str);
if(tmp==id)
score = hits.score(i);
}

return score;
}

I really need to optimize doBodySearch(...) as this takes the most time.

thanks guys,
Askar

On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:

Could you show us the relevant source from doBodySearch()?

-h

On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:

I ran some tests and it seems that the slowness is from Lucene calls when I
do "doBodySearch"; if I remove that call, Lucene gives me results in 5
seconds. Otherwise it takes about 50 seconds.

But I need to do Body search and that field contains lots of
text. The field is . How can I optimize that ?

thanks,
Askar
















--
Grant Ingersoll
Center for Natural Lan

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hi Grant,

Thanks for the response. Here's what I am trying to accomplish:

1. Iterate over itemID (unique) in the database using one SQL query.
2. For every itemID found, run 4 searches on Lucene Index.
3. doTagSearch(itemID) ; collect score
4. doTitleSearch(itemID...) ; collect score
5. doSummarySearch(itemID...) ; collect score
6. doBodySearch(itemID) ; collect score

These scores are then added and I get a total score for each unique item in
the database.

Lucene Index has: 

So if I am running a body search, I have 92 hits from over 300 documents for
a query. I already know my hit with the  .

For instance, from step (1) if itemID 16 is passed to all the 4 searches, I
just need to get the score of the document which has itemID field = 16. I
don't have to iterate over all the hits.

I suppose I have to change my query to look for  where itemID=16.
Can you guide me as to how to do it ?

thanks a ton,

Askar

On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Hi Askar,
>
> I suggest we take a step back, and ask the question, what are you
> trying to accomplish?  That is, what is your application trying to
> do?  Forget the code, etc. just explain what you want the end result
> to be and we can work from there.   Based on what you have described,
> I am not sure you need access to the hits.  It seems like you just
> need to make better queries.
>
> Is your itemID a unique identifier?  If yes, then you shouldn't need
> to loop over hits at all, as you should only ever have one result IF
> your query contains a required term.  Also, if this is the case, why
> do you need to do a search at all?  Haven't you already identified
> the items of interest when you did your select query in the
> database?  Or is it that you want to score the item based on some
> terms as well.  If that is the case, there are other ways of doing
> this and we can discuss them.
>
> -Grant
>
> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
>
> > Hey Guys,
> >
> > I need to know how I can use the HitCollector class ? I am using
> > Hits and
> > looping over all the possible document hits (turns out its 92 times
> > I am
> > looping; for 300 searches, its 300*92 !!). Can I avoid this using
> > HitCollector ? I can't seem to understand how its used.
> >
> > thanks a lot,
> >
> > Askar
> >
> > On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
> >>
> >> Askar,
> >> why do you need to add +id:?
> >> thanks,
> >> dt,
> >> www.ejinz.com
> >> search engine news forms
> >> - Original Message -
> >> From: "Askar Zaidi" <[EMAIL PROTECTED]>
> >> To: ; <[EMAIL PROTECTED]>
> >> Sent: Wednesday, July 25, 2007 12:39 AM
> >> Subject: Re: Fine Tuning Lucene implementation
> >>
> >>
> >>> Hey Hira ,
> >>>
> >>> Thanks so much for the reply. Much appreciate it.
> >>>
> >>> Quote:
> >>>
> >>> Would it be possible to just include a query clause?
> >>>   - i.e., instead of just contents:, also add
> >>> +id:
> >>>
> >>> How can I do that ?
> >>>
> >>> I see my query as :
> >>>
> >>> +contents:harvard +contents:business +contents:review
> >>>
> >>> where the search phrase was: harvard business review
> >>>
> >>> Now how can I add +id:  ??
> >>>
> >>> This would give me that one exact document I am looking for , for
> >>> that
> >> id.
> >>> I
> >>> don't have to iterate through hits.
> >>>
> >>> thanks,
> >>>
> >>> Askar
> >>>
> >>>
> >>>
> >>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> 
>  I'm no expert on this (so please accept the comments in that
>  context)
>  but 2 things seem weird to me:
> 
>  1.  Iterating over each hit is an expensive proposition.  I've
>  often
>  seen people recommending a HitCollector.
> 
>  2.  It seems that doBodySearch() is essentially saying, do this
>  search
>  and return the score pertinent to this ID (using an exhaustive
>  loop).
>  Would it be possible to just include a query clause?
>  - i.e., instead of just contents:, also add
>  +id:
> 
>  In general though, I think your algorithm seems inefficient (if I
>  understand it correctly):-- if I want to search for one term
>  among 3 in
>  a "collection" of 300 documents (as defined by some external
> >> attribute),
>  I will wind up executing 300 x 3 searches, and for each search
>  that is
>  executed, I will iterate over every Hit, even if I've already
>  found the
>  one that I "care about".
> 
>  What would break if you:
>  1.  Included "creator" in the Lucene index (or, filtered out the
>  Hits
>  using a BitSet or something like it)
>  2.  Executed 1 search
>  3.  Collected the results of the first N Hits (where N is some
>  reasonable limit, like 100 or 500)
> 
>  -h
> 
> 
>  On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
> 
> > Sure.
> >
> >  public float doBodySearch(Searcher searcher,String query, int
> > id){
> >
> >  try{
> >

Re: Search for null

2007-07-25 Thread Jay Yu
What if I do not know all the possible values of that field, which is the
typical case in free-text search?


daniel rosher wrote:

You can't directly search for fields that do not exist (which is what you
originally wanted to do). Instead, you can do something like this:

-Establish the query that will select all non-null values

TermQuery tq1 = new TermQuery(new Term("field","value1"));
TermQuery tq2 = new TermQuery(new Term("field","value2"));
...
TermQuery tqn = new TermQuery(new Term("field","valuen"));
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(tq1, BooleanClause.Occur.SHOULD);
booleanQuery.add(tq2, BooleanClause.Occur.SHOULD);
...
booleanQuery.add(tqn, BooleanClause.Occur.SHOULD);

OR perhaps a range query, if your values are contiguous:

Term start = new Term("field","198805");
Term end = new Term("field","198810");
Query query = new RangeQuery(start, end, true);

OR just use the QueryParser

Query query = QueryParser.parse(parseCriteria,
"field", new StandardAnalyzer());

-Create the QueryFilter

QueryFilter queryFilter = new QueryFilter(query);

-flip the bits

final BitSet filterBitSet = queryFilter.bits(reader);
// Flip only real document IDs; BitSet.size() can be larger than maxDoc().
filterBitSet.flip(0, reader.maxDoc());

Now you have a filter that matches the opposite of what the query
specified, and you can use it in subsequent searches.

Dan



On Tue, 2007-07-24 at 09:40 -0700, Jay Yu wrote:

daniel rosher wrote:

Perhaps you can use a filter in the following way.

-Create a filter (via QueryFilter) that would contain all documents that
do not have null values for the field
Interesting: what does the QueryFilter look like? Isn't it just as hard 
as finding out what docs have the null values for the field?

I'd really like to know your trick here.

-flip the bits of the filter so that it now contains documents that have
null values for a field
-Use the filter in conjunction with subsequent queries.

This would also help with performance as filters are simply bitsets and
can cheaply be stored, generated once and used often.

Dan

On Mon, 2007-07-23 at 13:57 -0700, Jay Yu wrote:
If you want performance, a better way might be to assign some special 
string/value (if it's easy to create) to the missing field of docs and 
index the field without tokenizing it. Then you may search for that 
special value to find the docs.


Jay

Les Fletcher wrote:

Does this particular range query have any significant performance issues?

Les

Erik Hatcher wrote:

On Jul 23, 2007, at 11:32 AM, testn wrote:
Is it possible to search for the document that specified field 
doesn't exist

or such field value is null?
This is from Solr, so I'm not sure off the top of my head if this mojo 
applies by itself, but a search for -fieldname:[* TO *] will result in 
all documents that do not have the specified field.


Erik



Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
So, you really want a single Lucene score (based on the scores of  
your 4 fields) for every itemID, correct?  And this score consists of  
scoring the title, tag, summary and body against some keywords correct?


Here's what I would do:

while (rs.next())
{
    doc = getDocument(itemId);  // Get your document, including contents from
                                // your database; no need even to put them in
                                // Lucene, although you could
    add the doc to a MemoryIndex (see contrib/memory)
    Run your 4 searches against that memory index to get your score.  Even
    better, combine your query into a single query that searches all 4 fields
    at once, then Lucene will combine the score for you
}

MemoryIndex info can be found at
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
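A sketch of that loop body (MemoryIndex lives in contrib/memory; the row
accessors and field names here are illustrative, not from the thread):

MemoryIndex index = new MemoryIndex();
Analyzer analyzer = new StandardAnalyzer();
// One transient, in-memory "document" per database row.
index.addField("title", row.getTitle(), analyzer);   // row.getX() are hypothetical
index.addField("tag", row.getTags(), analyzer);
index.addField("summary", row.getSummary(), analyzer);
index.addField("body", row.getBody(), analyzer);

Query query = new MultiFieldQueryParser(
        new String[] {"title", "tag", "summary", "body"}, analyzer)
        .parse("harvard business review");
float score = index.search(query);   // 0.0f means the row did not match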


-Grant

On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:


Hi Grant,

Thanks for the response. Heres what I am trying to accomplish:

1. Iterate over itemID (unique) in the database using one SQL query.
2. For every itemID found, run 4 searches on Lucene Index.
3. doTagSearch(itemID) ; collect score
4. doTitleSearch(itemID...) ; collect score
5. doSummarySearch(itemID...) ; collect score
6. doBodySearch(itemID) ; collect score

These scores are then added and I get a total score for each unique  
item in

the database.

Lucene Index has: 

So if I am running a body search, I have 92 hits from over 300  
documents for

a query. I already know my hit with the  .

For instance, from step (1) if itemID 16 is passed to all the 4  
searches, I
just need to get the score of the document which has itemID field =  
16. I

don't have to iterate over all the hits.

I suppose I have to change my query to look for  where  
itemID=16.

Can you guide me as to how to do it ?

thanks a ton,

Askar

On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


Hi Askar,

I suggest we take a step back, and ask the question, what are you
trying to accomplish?  That is, what is your application trying to
do?  Forget the code, etc. just explain what you want the end result
to be and we can work from there.   Based on what you have described,
I am not sure you need access to the hits.  It seems like you just
need to make better queries.

Is your itemID a unique identifier?  If yes, then you shouldn't need
to loop over hits at all, as you should only ever have one result IF
your query contains a required term.  Also, if this is the case, why
do you need to do a search at all?  Haven't you already identified
the items of interest when you did your select query in the
database?  Or is it that you want to score the item based on some
terms as well.  If that is the case, there are other ways of doing
this and we can discuss them.

-Grant

On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:


Hey Guys,

I need to know how I can use the HitCollector class ? I am using
Hits and
looping over all the possible document hits (turns out its 92 times
I am
looping; for 300 searches, its 300*92 !!). Can I avoid this using
HitCollector ? I can't seem to understand how its used.

thanks a lot,

Askar

On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:


Askar,
why do you need to add +id:?
thanks,
dt,
www.ejinz.com
search engine news forms
- Original Message -
From: "Askar Zaidi" <[EMAIL PROTECTED]>
To: ; <[EMAIL PROTECTED]>
Sent: Wednesday, July 25, 2007 12:39 AM
Subject: Re: Fine Tuning Lucene implementation



Hey Hira ,

Thanks so much for the reply. Much appreciate it.

Quote:

Would it be possible to just include a query clause?
  - i.e., instead of just contents:, also add
+id:

How can I do that ?

I see my query as :

+contents:harvard +contents:business +contents:review

where the search phrase was: harvard business review

Now how can I add +id:  ??

This would give me that one exact document I am looking for , for
that

id.

I
don't have to iterate through hits.

thanks,

Askar



On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:


I'm no expert on this (so please accept the comments in that
context)
but 2 things seem weird to me:

1.  Iterating over each hit is an expensive proposition.  I've
often
seen people recommending a HitCollector.

2.  It seems that doBodySearch() is essentially saying, do this
search
and return the score pertinent to this ID (using an exhaustive
loop).
Would it be possible to just include a query clause?
- i.e., instead of just contents:, also add
+id:

In general though, I think your algorithm seems inefficient (if I
understand it correctly):-- if I want to search for one term
among 3 in
a "collection" of 300 documents (as defined by some external

attribute),

I will wind up executing 300 x 3 searches, and for each search
that is
executed, I will iterate over every Hit, even if I've already
found the
one that I "care about".

What would break if you:
1.  Included "creator" in the Lucene index (or, filtered out the
Hits
using a BitSet or something like it)
2.  Executed 1 search
3.  

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Instead of refactoring the code, would there be a way to just modify the
query in each search routine?

Such as, "search contents: and item:"; this means it would
just collect the score of that one document whose itemID field = the itemID
passed from while(rs.next()).

I just need to collect the score of the  already in the index.

Would there be a way to modify the query? Add a clause?

thanks,
Askar


On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> So, you really want a single Lucene score (based on the scores of
> your 4 fields) for every itemID, correct?  And this score consists of
> scoring the title, tag, summary and body against some keywords correct?
>
> Here's what I would do:
>
> while (rs.next())
> {
>  doc = getDocument(itemId);  // Get your document, including
> contents from your database, no need even to put them in Lucene,
> although you could
>  add the doc to a MemoryIndex (see contrib/memory)
>  Run your 4 searches against that memory index to get your
> score.  Even better, combine your query into a single query that
> searches all 4 fields at once, then Lucene will combine the score for
> you
> }
>
> MemoryIndex info can be found at http://lucene.zones.apache.org:8080/
> hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/
> package-summary.html
>
> -Grant
>
> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
>
> > Hi Grant,
> >
> > Thanks for the response. Heres what I am trying to accomplish:
> >
> > 1. Iterate over itemID (unique) in the database using one SQL query.
> > 2. For every itemID found, run 4 searches on Lucene Index.
> > 3. doTagSearch(itemID) ; collect score
> > 4. doTitleSearch(itemID...) ; collect score
> > 5. doSummarySearch(itemID...) ; collect score
> > 6. doBodySearch(itemID) ; collect score
> >
> > These scores are then added and I get a total score for each unique
> > item in
> > the database.
> >
> > Lucene Index has: 
> >
> > So if I am running a body search, I have 92 hits from over 300
> > documents for
> > a query. I already know my hit with the  .
> >
> > For instance, from step (1) if itemID 16 is passed to all the 4
> > searches, I
> > just need to get the score of the document which has itemID field =
> > 16. I
> > don't have to iterate over all the hits.
> >
> > I suppose I have to change my query to look for  where
> > itemID=16.
> > Can you guide me as to how to do it ?
> >
> > thanks a ton,
> >
> > Askar
> >
> > On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Askar,
> >>
> >> I suggest we take a step back, and ask the question, what are you
> >> trying to accomplish?  That is, what is your application trying to
> >> do?  Forget the code, etc. just explain what you want the end result
> >> to be and we can work from there.   Based on what you have described,
> >> I am not sure you need access to the hits.  It seems like you just
> >> need to make better queries.
> >>
> >> Is your itemID a unique identifier?  If yes, then you shouldn't need
> >> to loop over hits at all, as you should only ever have one result IF
> >> your query contains a required term.  Also, if this is the case, why
> >> do you need to do a search at all?  Haven't you already identified
> >> the items of interest when you did your select query in the
> >> database?  Or is it that you want to score the item based on some
> >> terms as well.  If that is the case, there are other ways of doing
> >> this and we can discuss them.
> >>
> >> -Grant
> >>
> >> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
> >>
> >>> Hey Guys,
> >>>
> >>> I need to know how I can use the HitCollector class ? I am using
> >>> Hits and
> >>> looping over all the possible document hits (turns out its 92 times
> >>> I am
> >>> looping; for 300 searches, its 300*92 !!). Can I avoid this using
> >>> HitCollector ? I can't seem to understand how its used.
> >>>
> >>> thanks a lot,
> >>>
> >>> Askar
> >>>
> >>> On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
> 
>  Askar,
>  why do you need to add +id:?
>  thanks,
>  dt,
>  www.ejinz.com
>  search engine news forms
>  - Original Message -
>  From: "Askar Zaidi" <[EMAIL PROTECTED]>
>  To: ; <[EMAIL PROTECTED]>
>  Sent: Wednesday, July 25, 2007 12:39 AM
>  Subject: Re: Fine Tuning Lucene implementation
> 
> 
> > Hey Hira ,
> >
> > Thanks so much for the reply. Much appreciate it.
> >
> > Quote:
> >
> > Would it be possible to just include a query clause?
> >   - i.e., instead of just contents:, also add
> > +id:
> >
> > How can I do that ?
> >
> > I see my query as :
> >
> > +contents:harvard +contents:business +contents:review
> >
> > where the search phrase was: harvard business review
> >
> > Now how can I add +id:  ??
> >
> > This would give me that one exact document I am looking for , for
> > that
>  id.
> > I
> > don't have to iterate thro

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll

Yes, you can do that.
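For reference, a sketch of doing it programmatically against the Lucene 2.x
API (this assumes itemID was indexed as an untokenized field; in query
syntax the extra clause would be +itemID:16, with a colon rather than an
equals sign):

BooleanQuery combined = new BooleanQuery();
combined.add(queryParser.parse("harvard business review"),
             BooleanClause.Occur.MUST);
// Pin the search to one document: at most one hit if itemID is unique.
combined.add(new TermQuery(new Term("itemID", String.valueOf(id))),
             BooleanClause.Occur.MUST);
Hits hits = searcher.search(combined);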


On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote:


Here's what I mean:

http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields

title:"The Right Way" AND text:go


Although, I am not searching for the title "the right way" , I am  
looking

for the score by specifying a unique field (itemID).

when I do System.out.println(query);

I get:

+contents:Harvard +contents:Business + contents: Review

Can I just add:

+contents:Harvard +contents:Business + contents: Review  
+itemID=id   ??


That query would just return one document.


Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Heres what I mean:

http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields

title:"The Right Way" AND text:go


Although, I am not searching for the title "the right way" , I am looking
for the score by specifying a unique field (itemID).

when I do System.out.println(query);

I get:

+contents:Harvard +contents:Business + contents: Review

Can I just add:

+contents:Harvard +contents:Business + contents: Review +itemID=id   ??

That query would just return one document.


Re: Search for null

2007-07-25 Thread daniel rosher
In this case you should look at the source for RangeFilter.java. 

Using this you could create your own filter using TermEnum and TermDocs
to find all documents that had some value for the field. 

You would then flip this filter (perhaps write a FlipFilter.java that
takes an existing filter in its constructor, for reuse) to get all
documents that didn't have a value for this field (i.e. null values). 

Depending on the time it takes to generate these filters, you could then
cache this filter with CachingWrapperFilter for subsequent searches.

Dan
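
A minimal sketch of that FlipFilter idea (hypothetical class name, written
against the Lucene 2.x Filter API). Note the flip range should be
reader.maxDoc() rather than the BitSet's size(), which only reports capacity:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Inverts the bits of a wrapped filter, so the documents the original
// excluded (e.g. docs with no value for a field) become the ones that match.
public class FlipFilter extends Filter {
    private final Filter wrapped;

    public FlipFilter(Filter wrapped) {
        this.wrapped = wrapped;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = (BitSet) wrapped.bits(reader).clone();
        bits.flip(0, reader.maxDoc()); // flip over every doc, not just set bits
        return bits;
    }
}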

On Wed, 2007-07-25 at 08:57 -0700, Jay Yu wrote:
> what if I do not know all possible values of that field which is a 
> typical case in a free text search?
> 
> daniel rosher wrote:
> > You will be unable to search for fields that do not exist which is what
> > you originally wanted to do, instead you can do something like:
> > 
> > -Establish the query that will select all non-null values
> > 
> > TermQuery tq1 = new TermQuery(new Term("field","value1"));
> > TermQuery tq2 = new TermQuery(new Term("field","value2"));
> > ...
> > TermQuery tqn = new TermQuery(new Term("field","valuen"));
> > BooleanQuery query = new BooleanQuery();
> > booleanQuery.add(tq1,BooleanClause.Occur.SHOULD);
> > booleanQuery.add(tq2,BooleanClause.Occur.SHOULD);
> > ...
> > booleanQuery.add(tqn,BooleanClause.Occur.SHOULD);
> > 
> > OR perhaps a range query if your values are contiguous
> > 
> > Term start = new Term("field","198805");
> > Term end = new Term("field","198810");
> > Query query = new RangeQuery(start, end, true);
> > ;
> > 
> > OR just use the QueryParser
> > 
> > Query query = QueryParser.parse(parseCriteria,
> > "field", new StandardAnalyzer());
> > 
> > -Create the QueryFilter
> > 
> > QueryFilter queryFilter = new QueryFilter(query);
> > 
> > -flip the bits
> > 
> > final BitSet filterBitSet = queryFilter.bits(reader);
> > filterBitSet.flip(0,filterBitSet.size());
> > 
> > Now you have a filter that contains document matching the opposite of
> > that specified by the query, and can use in subsequent queries
> > 
> > Dan
> > 
> > 
> > 
> > On Tue, 2007-07-24 at 09:40 -0700, Jay Yu wrote:
> >> daniel rosher wrote:
> >>> Perhaps you can use a filter in the following way.
> >>>
> >>> -Create a filter (via QueryFilter) that would contain all document that
> >>> do not have null values for the field
> >> Interesting: what does the QueryFilter look like? Isn't it just as hard 
> >> as finding out what docs have the null values for the field?
> >> I really like to know your trick here.
> >>> -flip the bits of the filter so that it now contains documents that have
> >>> null values for a field
> >>> -Use the filter in conjunction with subsequent queries.
> >>>
> >>> This would also help with performance as filters are simply bitsets and
> >>> can cheaply be stored, generated once and used often.
> >>>
> >>> Dan
> >>>
> >>> On Mon, 2007-07-23 at 13:57 -0700, Jay Yu wrote:
>  If you want performance, a better way might be to assign some special 
>  string/value (if it's easy to create) to the missing field of docs and 
>  index the field without tokenizing it. Then you may search for that 
>  special value to find the docs.
> 
>  Jay
> 
>  Les Fletcher wrote:
> > Does this particular range query have any significant performance 
> > issues?
> >
> > Les
> >
> > Erik Hatcher wrote:
> >> On Jul 23, 2007, at 11:32 AM, testn wrote:
> >>> Is it possible to search for the document that specified field 
> >>> doesn't exist
> >>> or such field value is null?
> >> This is from Solr, so I'm not sure off the top of my head if this mojo 
> >> applies by itself, but a search for -fieldname:[* TO *] will result in 
> >> all documents that do not have the specified field.
> >>
> >> Erik
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>  -
>  To unsubscribe, e-mail: [EMAIL PROTECTED]
>  For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 
>  <>
> >>> Daniel Rosher
> >>> Developer
> >>>
> >>>
> >>> d: 0207 3489 912
> >>> t: 0870 2020 121
> >>> f: 0870 2020 131
> >>> m: 
> >>> http://www.hotonline.com/

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey guys,

One last question and I think I'll have an optimized algorithm.

How can I build a query in my program ?

This is what I am doing:

QueryParser queryParser = new QueryParser("contents", new
StandardAnalyzer());

 queryParser.setDefaultOperator(QueryParser.Operator.AND);

 Query q = queryParser.parse(queryString);

So doing : System.out.println(q) shows:

+contents:harvard +contents:business +contents:review

I'd like to modify Query q to read:

+contents:harvard +contents:business +contents:review +itemID: (id passed in
the search method)

So this would pick the one document I need from the Index and give me the
score. I don't have to iterate over Hits.

Any clues ? I can't find any examples on query building .

thanks !

Askar


On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Yes, you can do that.

Re: What replaced org.apache.lucene.document.Field.Text?

2007-07-25 Thread Lindsey Hess
Andy, Patrick,
   
  Thank you.  I replaced Field.Text with new Field("name", "value", 
Field.Store.YES, Field.Index.TOKENIZED); and it works just fine.
   
  Cheers,
   
  Lindsey

   
  

Patrick Kimber <[EMAIL PROTECTED]> wrote:
  Hi Andy

I think:
Field.Text("name", "value");

has been replaced with:
new Field("name", "value", Field.Store.YES, Field.Index.TOKENIZED);

Patrick




   

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll


On Jul 25, 2007, at 1:26 PM, Askar Zaidi wrote:


Hey guys,

One last question and I think I'll have an optimized algorithm.

How can I build a query in my program ?

This is what I am doing:

QueryParser queryParser = new QueryParser("contents", new
StandardAnalyzer());

 queryParser.setDefaultOperator(QueryParser.Operator.AND);

 Query q = queryParser.parse(queryString);


Just concatenate it onto your string:

 Query q = queryParser.parse(queryString + " +itemID:" + itemID);
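
A programmatic alternative avoids string surgery altogether (a sketch; the
itemID field name is assumed, and the ID must have been indexed untokenized):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryBuilder {
    // Wraps the parsed text query with a required itemID clause.
    public static Query withItemId(Query textQuery, int itemId) {
        BooleanQuery combined = new BooleanQuery();
        combined.add(textQuery, BooleanClause.Occur.MUST);
        combined.add(new TermQuery(new Term("itemID", String.valueOf(itemId))),
                BooleanClause.Occur.MUST);
        return combined;
    }
}

This also sidesteps the analyzer: QueryParser would run the ID through
StandardAnalyzer, whereas the TermQuery matches the indexed term exactly.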



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



MoreLikeThis for multiple documents

2007-07-25 Thread Jens Grivolla

Hello,

I'm looking to extract significant terms characterizing a set of 
documents (which in turn relate to a topic).


This basically comes down to functionality similar to determining the 
terms with the greatest offer weight (as used for blind relevance 
feedback), or maximizing tf.idf (as is done in MoreLikeThis).


Is there anything like this already implemented, or do I need to iterate 
through all documents in the set "manually", re-tokenize each one (or 
maybe use TermVectors), and then calculate the weight for each term?


Thanks,
   Jens
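
One "manual" route, sketched under the assumption that the text field (called
"contents" here) was indexed with term vectors enabled; docFreq() supplies the
idf component:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;

public class SetTerms {
    // Accumulates tf.idf weights for every term over a set of documents.
    public static Map computeWeights(IndexReader reader, int[] docIds)
            throws Exception {
        Map weights = new HashMap(); // term -> Float weight
        int numDocs = reader.numDocs();
        for (int d = 0; d < docIds.length; d++) {
            TermFreqVector tfv = reader.getTermFreqVector(docIds[d], "contents");
            if (tfv == null) continue; // doc was indexed without term vectors
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                double idf = Math.log((double) numDocs
                        / (reader.docFreq(new Term("contents", terms[i])) + 1));
                Float old = (Float) weights.get(terms[i]);
                float w = (float) (freqs[i] * idf)
                        + (old == null ? 0f : old.floatValue());
                weights.put(terms[i], new Float(w));
            }
        }
        return weights; // sort by value to get the most significant terms
    }
}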

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Assembling a query from multiple fields

2007-07-25 Thread Joe Attardi

Hi all,

Apologies for the cryptic subject line, but I couldn't think of a more
descriptive one-liner to describe my problem/question to you all. Still
fairly new to Lucene here, although I'm hoping to have more of a clue once I
get a chance to read "Lucene In Action".

I am implementing a search engine using Lucene for a web application. It is
not really a free-text search like some other, more standard
implementations.
The requirement is for the search to be as easy and user-friendly as
possible, so instead of specifying the field to search in the query itself -
such as ip:192.168.102.230 - and being parsed with QueryParser, the field is
being selected via an HTML <select> element, and the search keywords are
entered in a text field.

As far as I can tell, I basically have two options:
(1) Manually prepend the field identifier to the query text, for example:
 String fullQuery = field + ":" + queryText;
then parse this query normally with QueryParser, OR
(2) Since I know it is only going to be searching one term, manually create
a TermQuery with a Term object representing what the user typed in, for
example:
 Query query = new TermQuery(new Term(field, queryText));

Is there any advantage or disadvantage to either of these, or is one preferable
over the other? My gut tells me that directly creating the TermQuery is more
efficient since it doesn't have to perform parsing, but I'm not sure.

I have other questions, too, but I don't want to get ahead of myself. One at
a time... :)

Appreciate any help you all might have!

--
Joe Attardi
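
For what it's worth, the practical difference between the two options is
analysis rather than parsing speed: QueryParser runs the text through the
analyzer (and interprets operator characters), while a hand-built TermQuery
matches the raw string against the indexed tokens. A small sketch, with the
field and query values assumed:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FieldSearch {
    public static Query build(String field, String queryText, boolean analyze)
            throws Exception {
        if (analyze) {
            // Option 1: analyzed and parsed; escape() guards against
            // user input that happens to contain query operators
            QueryParser parser = new QueryParser(field, new StandardAnalyzer());
            return parser.parse(QueryParser.escape(queryText));
        }
        // Option 2: raw term; the text must already match the indexed
        // form (e.g. lowercased by the analyzer at index time)
        return new TermQuery(new Term(field, queryText));
    }
}

So for an untokenized field such as an IP address the TermQuery is the safer
choice; for analyzed text fields, the parser is.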


Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Yonik Seeley

On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote:

JavaCC is slow indeed.


JavaCC is a very fast parser for a large document... the issue is
small fields and JavaCC's use of an exception for flow control at the
end of a value.  As JVMs have advanced, exception-as-control-flow has
gotten comparably slower.

Does JFlex have a jar associated with it?  It's GPL (although you can
freely use the files it generates under any license), so if there were
other non-generated files required, we wouldn't be able to incorporate
them.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Linear Hashing in Lucene?

2007-07-25 Thread Dmitry

Hey,
Some common questions about Lucene.
1. Does an Ontology Wrapper exist in the Lucene implementation?
2. Does Lucene use Linear Hashing?

thanks,
DT,
www.ejinz.com
Search news

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search for null

2007-07-25 Thread Daniel Noll
On Thursday 26 July 2007 03:12:20 daniel rosher wrote:
> In this case you should look at the source for RangeFilter.java.
>
> Using this you could create your own filter using TermEnum and TermDocs
> to find all documents that had some value for the field.

That's certainly the way to do it for speed.

For the least code you can probably do...

  BooleanFilter bf = new BooleanFilter();
  bf.add(new FilterClause(RangeFilter.More("field", ""),
 BooleanClause.Occur.MUST_NOT));
  Filter f = new CachingWrapperFilter(bf);
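
Applying it is then just (a sketch, assuming an open IndexSearcher named
searcher and a Query named query):

  Hits hits = searcher.search(query, f); // f keeps only the null-valued docs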

Daniel

-- 
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia   Ph: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Highlighter strategy in Lucene

2007-07-25 Thread Dmitry

What kind of Highlighter strategy is Lucene using?

thanks,
Dt
www.ejinz.com
Search Engine for News

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Displaying results in the order

2007-07-25 Thread Dmitry

Is there a way to update a document in the Index without causing any change
to the order in which it comes up in searches?

thanks,
DT,
www.ejinz.com
Search everything
news, tech, movies, music


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Strange Error while deleting Documents from index while indexing.

2007-07-25 Thread miztaken

Hi,
I am dumping the database tables into Lucene documents.
I am doing it like this:

1. Get the rowset from database to be stored as Lucene Document.
2. Open IndexReader and check if they are already indexed.
   If Indexed, delete them and add the new rowset.
   Continue this till the end
3. Close IndexReader
4. Open IndexWriter
5. Write the same rowset in the index.
6. delete the rowset from database..
7. Repeat the same process [step 1 - step 7] while there are records in the
database.


This is how I am doing indexing and deletion.
Some key points:
1. A new IndexWriter is opened when no IndexWriter instance is available, but
if one is available it is reused. i.e. my IndexWriter opens once in step 4 and
after that the whole process makes use of it.
2. But I open and close an IndexReader for each deletion.
3. I optimize the IndexWriter after a certain threshold is crossed.

Now my problem is:
On the first deletion of a document (if present) in step 2 and the closing of
the IndexReader in step 3, I get no error.
But in the second loop, I get the error while trying to close the
IndexReader.

The error is : 
Unable to cast object of type 'System.Collections.DictionaryEntry' to type
'System.String'.

Stack Trace:
   at Lucene.Net.Index.IndexFileDeleter.DeleteFiles(ArrayList files)
   at Lucene.Net.Index.IndexFileDeleter.DeleteFiles()
   at Lucene.Net.Index.IndexFileDeleter.CommitPendingFiles()
   at Lucene.Net.Index.IndexReader.Commit()
   at Lucene.Net.Index.IndexReader.Close()
   at QueryDatabaseForIndexing.Program.Main(String[] args) in E:\Test
Applications\ORS Lucene Developments\July 25\
TotalIndexingAndSearching_25_july\TotalIndexingAndSearching\QueryDatabaseForIndexing\Program2.cs:line 159

I don't know what the cause of this error is.

I am in real need of help.
Please help me find the error.
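
As an aside, the per-row delete/re-add dance can be collapsed: Java Lucene
2.1+ has IndexWriter.updateDocument(Term, Document), which deletes whatever
matches the term and adds the new document in one call, so no IndexReader has
to be opened and closed per loop (Lucene.Net should have an equivalent
UpdateDocument if your port tracks that version). A rough Java sketch, with
hypothetical field and helper names:

IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
while (rs.next()) {                    // the database rowset loop
    Document doc = buildDocument(rs);  // hypothetical row-to-document helper
    writer.updateDocument(new Term("rowId", rs.getString("rowId")), doc);
}
writer.close();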



-- 
View this message in context: 
http://www.nabble.com/Strange-Error-while-deleting-Documents-from-index-while-indexing.-tf4149570.html#a11804824
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski

On 25/07/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote:
> JavaCC is slow indeed.

JavaCC is a very fast parser for a large document... the issue is
small fields and JavaCC's use of an exception for flow control at the
end of a value.  As JVMs have advanced, exception-as-control-flow has
gotten comparably slower.



In Carrot2 we tokenize mostly very short documents (search results), so in
this context JFlex proved much faster. I did a very rough performance test
of Highlighter using JFlex and JavaCC-generated analyzers with medium-sized
documents (up to ~1kB), and JFlex was still faster. What size would a
'large' document be?

Does JFlex have a jar associated with it?  It's GPL (although you can

freely use the files it generates under any license), so if there were
other non-generated files required, we wouldn't be able to incorporate
them.



You need JFlex jar only to generate the tokenizer (one Java class). The
generated tokenizer is standalone and doesn't need the JFlex jar to run.

Staszek


Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey Guys,

Thanks for all the responses. I finally got it working with some query
modification.

The idea was to pick an itemID from the database and for that itemID in the
Index, get the scores across 4 fields; add them up and ta-da !

I still have to verify my scores.

Thanks a ton, I'll be active on this list from now on and try and answer
questions to which I was seeking answers.

later,
Askar
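
For the record, the four per-field searches can also be collapsed into a
single query per item; a sketch, with the field names assumed:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ItemScorer {
    // Scores one item across title/tag/summary/body in a single search.
    public static Query buildQuery(String keywords, String itemId)
            throws Exception {
        String[] fields = { "title", "tag", "summary", "body" };
        MultiFieldQueryParser parser =
                new MultiFieldQueryParser(fields, new StandardAnalyzer());
        BooleanQuery q = new BooleanQuery();
        q.add(parser.parse(keywords), BooleanClause.Occur.MUST);  // text part
        q.add(new TermQuery(new Term("itemID", itemId)),
                BooleanClause.Occur.MUST);                        // one item
        return q;
    }
}

Lucene then combines the per-field scores itself, as suggested earlier in the
thread.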

On 7/25/07, Doron Cohen <[EMAIL PROTECTED]> wrote:
>
> "Askar Zaidi" wrote:
>
> > ... Heres what I am trying to accomplish:
> >
> > 1. Iterate over itemID (unique) in the database using one SQL query.
> > 2. For every itemID found, run 4 searches on Lucene Index.
> > 3. doTagSearch(itemID) ; collect score
> > 4. doTitleSearch(itemID...) ; collect score
> > 5. doSummarySearch(itemID...) ; collect score
> > 6. doBodySearch(itemID) ; collect score
> >
> > These scores are then added and I get a total score for each
> > unique item in the database.
>
> Joining this late I might be missing something. Still I
> would like to understand better *what* you are trying to do
> here (before going into the *how*).
>
> By your description above, my understanding is this:
>
> 1. Assume one table in the DB, with textual
>columns: ItemID(unique), Title, Summary, Body, Tags.
> 2. The ItemID column is a unique key in the table.
> 3. Assume entries in the ItemID column looks like
>this: itemID=127, itemID=75, etc.
> 4. Some of the other columns (not the ItemID column)
>can contain IDs as well.
> 5. You are iterating over the ItemID column, and,
>for each value, (each ID), ranking all the documents
>in the index (all the rows in that table) for
>occurrences of that ID.
>
> Is that so?
>
> If so, you are actually trying to find for each row (doc),
> which (other) rows (docs) "refer" to it most. Right?
> Is this really a textual search problem?
>
> For instance, if row X has N references to row Z,
> and row Y has N+1 references to row Z, but the length
> of the text in row Z is much more than that of row X,
> would you expect row X to rank higher, because it is
> shorter (what Lucene is likely to do) or that row Y
> will rank higher, because it has slightly more
> references to row Z?
>
> In another email you have this:
>
> > Can I just add:
> >
> > +contents:Harvard +contents:Business +contents: Review +itemID=77
> ??
> >
> > That query would just return one document.
>
> Which is different than the above - it has a textual
> task, not only ID. Are you interested here in all docs
> (rows) that reference itemID=77 or only want to check
> if the specific row whose ID is itemID=77, satisfies
> the textual part of this query?
>
> This brings back to the start point: perhaps it would
> help more if you once again define the task/problem you
> are trying to solve? Forget about loops and doXyzSearch()
> methods - just define input; output; logic;
>
> Regards,
> Doron
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


java gc with a frequently changing index?

2007-07-25 Thread Tim Sturge

Hi,

I am indexing a set of constantly changing documents. The change rate is 
moderate (about 10 docs/sec over a 10M document collection with a 6G 
total size) but I want to be right up to date (ideally within a second 
but within 5 seconds is acceptable) with the index.


Right now I have code that adds new documents to the index and deletes 
old ones using updateDocument() in the 2.1 IndexWriter. In order to see 
the changes, I need to recreate the IndexReader/IndexSearcher every 
second or so. I am not calling optimize() on the index in the writer, 
and the mergeFactor is 10.


The problem I am facing is that java gc is terrible at collecting the 
IndexSearchers I am discarding. I usually have a 3msec query time, but I 
get gc pauses of 300msec to 3 sec (I assume it is collecting the 
"tenured" generation in these pauses, which is my old IndexSearcher)


I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and 
calling System.gc() right after I close the old index without much luck 
(I get the pauses down to 1sec, but get 3x as many. I want < 25 msec 
pauses). So my question is, should I be avoiding reloading my index in 
this way? Should I keep a separate IndexReader (which only deletes old 
documents) and one for new documents? Is there a standard technique for 
a quickly changing index?


Thanks,

Tim
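
For reference, the reload pattern described above looks roughly like this (a
minimal sketch; the class and field names are hypothetical, and in-flight
queries against the old searcher would need extra care before close()):

import org.apache.lucene.search.IndexSearcher;

public class SearcherReloader implements Runnable {
    private volatile IndexSearcher current; // query threads read this field
    private final String indexDir;

    public SearcherReloader(String indexDir) throws Exception {
        this.indexDir = indexDir;
        this.current = new IndexSearcher(indexDir);
    }

    public IndexSearcher getSearcher() { return current; }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Thread.sleep(1000);                     // refresh interval
                IndexSearcher fresh = new IndexSearcher(indexDir);
                IndexSearcher old = current;
                current = fresh; // new queries see the fresh index
                old.close();     // the old searcher becomes garbage here
            }
        } catch (Exception e) {
            // log and exit the reload loop
        }
    }
}

Each swap discards a searcher holding sizeable term and norm structures,
which is exactly the tenured garbage the long pauses point at.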


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Fine Tuning Lucene implementation

2007-07-25 Thread Doron Cohen
"Askar Zaidi" wrote:

> ... Heres what I am trying to accomplish:
>
> 1. Iterate over itemID (unique) in the database using one SQL query.
> 2. For every itemID found, run 4 searches on Lucene Index.
> 3. doTagSearch(itemID) ; collect score
> 4. doTitleSearch(itemID...) ; collect score
> 5. doSummarySearch(itemID...) ; collect score
> 6. doBodySearch(itemID) ; collect score
>
> These scores are then added and I get a total score for each
> unique item in the database.

Joining this late I might be missing something. Still I
would like to understand better *what* you are trying to do
here (before going into the *how*).

By your description above, my understanding is this:

1. Assume one table in the DB, with textual
   columns: ItemID(unique), Title, Summary, Body, Tags.
2. The ItemID column is a unique key in the table.
3. Assume entries in the ItemID column looks like
   this: itemID=127, itemID=75, etc.
4. Some of the other columns (not the ItemID column)
   can contain IDs as well.
5. You are iterating over the ItemID column, and,
   for each value, (each ID), ranking all the documents
   in the index (all the rows in that table) for
   occurrences of that ID.

Is that so?

If so, you are actually trying to find for each row (doc),
which (other) rows (docs) "refer" to it most. Right?
Is this really a textual search problem?

For instance, if row X has N references to row Z,
and row Y has N+1 references to row Z, but the length
of the text in row Z is much more than that of row X,
would you expect row X to rank higher, because it is
shorter (what Lucene is likely to do) or that row Y
will rank higher, because it has slightly more
references to row Z?

In another email you have this:

> Can I just add:
>
> +contents:Harvard +contents:Business +contents: Review +itemID=77
??
>
> That query would just return one document.

Which is different than the above - it has a textual
task, not only ID. Are you interested here in all docs
(rows) that reference itemID=77 or only want to check
if the specific row whose ID is itemID=77, satisfies
the textual part of this query?

This brings back to the start point: perhaps it would
help more if you once again define the task/problem you
are trying to solve? Forget about loops and doXyzSearch()
methods - just define input; output; logic;

Regards,
Doron


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query parsing?

2007-07-25 Thread Daniel Naber
On Wednesday 25 July 2007 00:44, Lindsey Hess wrote:

> Now, I do not need Lucene to index anything, but I'm wondering if Lucene
> has query parsing classes that will allow me to transform the queries. 

The Lucene QueryParser class can parse the format described at 
http://lucene.apache.org/java/docs/queryparsersyntax.html. To adapt it to 
other formats, the javacc grammar needs to be modified. To output in yet 
another format, either the Java code would need to be modified or you'd 
need to write some new Java code that iterates over the object produced by 
the QueryParser. In other words: this is not what Lucene's QueryParser was 
made for and it's not too simple unless you're already familiar with 
javacc.
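
For a flavor of the iteration approach, a rough sketch that walks a parsed
query and emits a made-up target syntax (only the two most common node types
are handled; phrase, range, etc. would need branches of their own):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryDumper {
    public static void dump(Query q, StringBuffer out) {
        if (q instanceof BooleanQuery) {
            BooleanClause[] clauses = ((BooleanQuery) q).getClauses();
            for (int i = 0; i < clauses.length; i++) {
                out.append(clauses[i].getOccur()).append(' ');
                dump(clauses[i].getQuery(), out);
            }
        } else if (q instanceof TermQuery) {
            Term t = ((TermQuery) q).getTerm();
            out.append(t.field()).append('=').append(t.text()).append(' ');
        } else {
            out.append(q.toString()).append(' '); // fallback: Lucene's syntax
        }
    }
}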

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Delete corrupted doc

2007-07-25 Thread Rafael Rossini

Hi guys,

   Is there a way of deleting a document that, because of some corruption,
got a docID larger than maxDoc()? I'm trying to do this but I get
this Exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 106577
  at org.apache.lucene.util.BitVector.set(BitVector.java:53)
  at org.apache.lucene.index.SegmentReader.doDelete(SegmentReader.java:301)
  at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:674)
  at org.apache.lucene.index.MultiReader.doDelete(MultiReader.java:125)
  at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:674)
  at teste.DeleteError.main(DeleteError.java:9)

Thanks