Re: WildcardQuery and SpanQuery

2007-07-18 Thread Cedric Ho

Thanks for the quick response Paul =)

However I am lost while looking at the surround package. Are you
suggesting I can solve my problem at hand using the surround package?


On 7/18/07, Paul Elschot <[EMAIL PROTECTED]> wrote:

On Wednesday 18 July 2007 05:58, Cedric Ho wrote:
> Hi everybody,
>
> We recently needed to support wildcard search terms "*", "?" together
> with SpanQuery. It seems that there's no SpanWildcardQuery available.
> After looking into the Lucene source code for a while, I guess we can
> either:
>
> 1. Use SpanRegexQuery, or
>
> 2. Write our own SpanWildcardQuery, and implement the
> rewrite(IndexReader) method to rewrite the query into a SpanOrQuery
> over some SpanTermQuerys.
>
> Of the two approaches, option 1 seems to be easier. But I am rather
> concerned about the performance of using regular expressions. On the
> other hand, I am not sure if there are any other concerns I am not
> aware of for option 2 (i.e. is there a reason why there's no
> SpanWildcardQuery in the first place?)
>
> Any advice?

The basic problem you are facing is that in Lucene
the expansion of the terms is tightly coupled to the generation
of a combination query using the expanded terms.

In contrib/surround the term expansion and query generation
are decoupled using a visitor pattern for the terms. The code is here:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/src/java/org/apache/lucene/queryParser/surround/query

In surround a wildcard term can provide either an OR of
normal term queries, or a SpanOrQuery of span term queries.
This query generation is in the class SimpleTerm, which has one method
for a normal boolean OR query over the terms, and one for
a span query over the terms.

In both cases surround uses a regular expression to expand
the matching terms, but that could be changed to use
wildcard expansion mechanisms other than the ones in
SrndPrefixQuery and SrndTruncQuery, which
are subclasses of SimpleTerm.

With the term expansion and the query combination split,
it is also necessary to limit the maximum number of expanded
terms in a different way than Lucene does. In surround the
classes BasicQueryFactory and TooManyBasicQueries are
used for that.

Regards,
Paul Elschot
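[Editor's note: Paul's point about regex-based expansion can be made concrete. A wildcard pattern such as te?t* is cheap to translate into a regular expression before enumerating terms against it. The sketch below is illustrative only; the class and method names are made up, and it is not the surround code.]

```java
// Illustrative sketch (not the surround implementation): translate a
// Lucene-style wildcard pattern into a java.util.regex pattern.
// '*' matches any run of characters, '?' matches a single character;
// everything else is escaped so it matches literally.
public class WildcardRegex {
    public static String toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                // Quote the character so regex metacharacters match literally.
                sb.append(java.util.regex.Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each enumerated term would be matched against the pattern.
        System.out.println("test".matches(toRegex("te?t*"))); // prints true
    }
}
```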



>
> Cedric
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: WildcardQuery and SpanQuery

2007-07-18 Thread Mark Miller

You could give this a shot (from my Qsol query parser):

package com.mhs.qsol.spans;

/**
* Copyright 2006 Mark Miller ([EMAIL PROTECTED])
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

/**
* @author mark miller
*
*/
public class SpanWildcardQuery extends SpanQuery {
   private Term term;

   public SpanWildcardQuery(Term term) {
      this.term = term;
   }

   public Term getTerm() {
      return term;
   }

   /**
    * Expands the wildcard term against the index and rewrites this
    * query to a SpanOrQuery over the matching terms.
    */
   public Query rewrite(IndexReader reader) throws IOException {
      WildcardQuery wildQuery = new WildcardQuery(term);

      // WildcardQuery rewrites to a BooleanQuery of TermQuerys,
      // one clause per matching term in the index.
      BooleanQuery bq = (BooleanQuery) wildQuery.rewrite(reader);

      BooleanClause[] clauses = bq.getClauses();
      SpanQuery[] sqs = new SpanQuery[clauses.length];

      for (int i = 0; i < clauses.length; i++) {
         TermQuery tq = (TermQuery) clauses[i].getQuery();
         sqs[i] = new SpanTermQuery(tq.getTerm());
         sqs[i].setBoost(tq.getBoost());
      }

      SpanOrQuery query = new SpanOrQuery(sqs);
      query.setBoost(wildQuery.getBoost());

      return query;
   }

   public Spans getSpans(IndexReader reader) throws IOException {
      throw new UnsupportedOperationException(
            "Query should have been rewritten");
   }

   public String getField() {
      return term.field();
   }

   /**
    * @deprecated use extractTerms instead
    * @see #extractTerms(Set)
    */
   public Collection getTerms() {
      Collection terms = new ArrayList();
      terms.add(term);
      return terms;
   }

   public void extractTerms(Set terms) {
      terms.add(term);
   }

   public String toString(String field) {
      StringBuffer buffer = new StringBuffer();
      buffer.append("spanWildcardQuery(");
      buffer.append(term);
      buffer.append(")");
      return buffer.toString();
   }
}


Cedric Ho wrote:

Hi everybody,

We recently needed to support wildcard search terms "*", "?" together
with SpanQuery. It seems that there's no SpanWildcardQuery available.
After looking into the Lucene source code for a while, I guess we can
either:

1. Use SpanRegexQuery, or

2. Write our own SpanWildcardQuery, and implement the
rewrite(IndexReader) method to rewrite the query into a SpanOrQuery
over some SpanTermQuerys.

Of the two approaches, option 1 seems to be easier. But I am rather
concerned about the performance of using regular expressions. On the
other hand, I am not sure if there are any other concerns I am not
aware of for option 2 (i.e. is there a reason why there's no
SpanWildcardQuery in the first place?)

Any advice?

Cedric




Query in lucene

2007-07-18 Thread WATHELET Thomas
Which analyser do I have to use to find text like this ''?


Re: Query in lucene

2007-07-18 Thread Erick Erickson

When in doubt, WhitespaceAnalyzer is the most predictable. Note that
it doesn't lower-case the tokens though. Depending upon your
requirements, you can always pre-process your query and indexing
streams and do your own lowercasing and/or character stripping.

You can always create your own analyzer with the building blocks
provided via Filters and Tokenizers.

Erick.

On 7/18/07, WATHELET Thomas <[EMAIL PROTECTED]> wrote:


Which analyser do I have to use to find text like this ''?
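[Editor's note: the pre-processing Erick mentions can be a single normalization function applied to both the indexing stream and the query string, so that WhitespaceAnalyzer sees consistent tokens. A minimal sketch; the class and method names are made up.]

```java
// Sketch: one normalization pass applied to both the text being indexed
// and the query string, so WhitespaceAnalyzer sees consistent tokens.
// Lowercases letters and strips everything except letters, digits and
// whitespace.
public class Normalize {
    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                sb.append(Character.toLowerCase(c));
            }
            // punctuation and other characters are dropped
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("W. Chan Kim")); // prints "w chan kim"
    }
}
```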



Re: Does Index have a Tokenizer Built into it

2007-07-18 Thread John Paul Sondag

Is there a way to know how big to make the array beforehand (how many terms
are in the topic in total?). I'm worried about the efficiency of this, since
I'd have to rebuild every document that is a "hit" on the fly to make a
snippet for each "hit" on the page (say 10 a page).

Now I have to wonder how storing the termPosition vectors in the index and
sorting them by position compares to storing the location of the document and
using a tokenizer on the document. Both in the end give me the result I
want.

Any opinions?

--JP

On 7/18/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: After indexing I have been able to retrieve the TermPositionVector from the
: index and it has all of the data, but I cannot find a way where, given a
: position, I can retrieve the term at that position. Which is how I was hoping
: to create my contextual snippets.

there is no easy way to go from a position to a term -- coincidentally there
is a very recent thread on this on java-dev...

http://www.nabble.com/Best-Practices-for-getting-Strings-from-a-position-range-tf4084187.html

...a new API may come out of it, but in the meantime you may be
interested in taking the approach the current highlighter uses (as
mentioned in that thread), of using the TermPositionVector to rebuild the
original token stream, then skipping ahead to the positions you are
interested in.



-Hoss
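[Editor's note: the rebuild Hoss describes boils down to inverting the term-to-positions data a TermPositionVector provides into a position-to-term map. A plain-Java sketch of that inversion; the hard-coded map stands in for real term vector output, and nothing here is Lucene API.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the rebuild idea: a term vector gives each term together with
// the positions at which it occurred; inverting that into a map keyed by
// position and walking it in order recovers the token order of the field.
public class PositionLookup {
    public static TreeMap<Integer, String> invert(Map<String, int[]> termPositions) {
        TreeMap<Integer, String> byPosition = new TreeMap<Integer, String>();
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int pos : e.getValue()) {
                byPosition.put(pos, e.getKey());
            }
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // Stand-in for what a real TermPositionVector would return.
        Map<String, int[]> tv = new HashMap<String, int[]>();
        tv.put("quick", new int[] { 1 });
        tv.put("brown", new int[] { 2 });
        tv.put("the", new int[] { 0 });
        tv.put("fox", new int[] { 3 });
        System.out.println(invert(tv).values()); // tokens in original order
    }
}
```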






Re: WildcardQuery and SpanQuery

2007-07-18 Thread Paul Elschot
On Wednesday 18 July 2007 12:30, Cedric Ho wrote:
> Thanks for the quick response Paul =)
> 
> However I am lost while looking at the surround package.

That is not really surprising; the code is factored to the bone, and it
is hardly documented.
You could have a look at the test code to start.
Also the surround.txt file in the contrib/surround directory should
be helpful.

> Are you 
> suggesting I can solve my problem at hand using the surround package?

In case the surround syntax fits what you need, you might use the surround
package.

You could also use your own parser and target the
o.a.l.queryParser.surround.query package.
The code posted by Mark Miller may solve your problem, too.

Regards,
Paul Elschot





Lucene shows parts of search query as a HIT

2007-07-18 Thread Askar Zaidi

Hey folks,

I am a new Lucene user. I used the following after indexing:

search(searcher, "W. Chan Kim");

Lucene showed me hits on documents where the word "channel" existed. Notice
that "Chan" is part of "Channel". How do I stop this?

I am keen to find the exact word.

I used the following, before the search method:

IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);

writer.addDocument(createDocument(item, words));
writer.optimize();
writer.close();
searcher = new IndexSearcher(indexPath);

thanks !

AZ


lucene version?

2007-07-18 Thread Akanksha Baid
Is there a way to tell which version of Lucene was used to build
an index?


-Akanksha




Re: Lucene shows parts of search query as a HIT

2007-07-18 Thread Erick Erickson

Are you sure that the hit wasn't on "w" or "kim"? The
default for searching is OR...

I recommend that you get a copy of Luke (google lucene luke)
which allows you to examine your index as well as see how
queries parse using various analyzers. It's an invaluable tool...

Best
Erick

On 7/18/07, Askar Zaidi <[EMAIL PROTECTED]> wrote:


Hey folks,

I am a new Lucene user , I used the following after indexing:

search(searcher, "W. Chan Kim");

Lucene showed me hits of documents where "channel" word existed. Notice
that
"Chan" is a part of "Channel" . How do I stop this ?

I am keen to find the exact word.

I used the following, before the search method:

IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(),
true);

writer.addDocument
(createDocument(item,words));
writer.optimize();
writer.close();
searcher = new IndexSearcher(indexPath);

thanks !

AZ



Re: lucene version?

2007-07-18 Thread Michael McCandless

I don't think this is stored in the index.

I think the closest you can get is the "format" of the segments_N file,
which changes every time the index file format changes.  That at least
lets you narrow it down, possibly to a single release if the file
format is changing frequently (e.g. it has in the past two releases).

There's no public API to read the format.  You could instead make your
own class, in package org.apache.lucene.index, that implements a
method similar to how the SegmentInfos.readCurrentVersion(...) method
is implemented, but just returns the format instead.

Mike

"Akanksha Baid" <[EMAIL PROTECTED]> wrote:
> Is there a way to test as to which version of Lucene was used to build 
> an index?
> 
> -Akanksha
> 



Re: Lucene shows parts of search query as a HIT

2007-07-18 Thread Askar Zaidi

Hey Guys,

I just checked my Lucene results. It shows a document containing the word
"change" when I am searching for "Chan", and it considers that a hit. Is
there a way to stop this and show just exact word matches?

I started using Lucene yesterday, so I am fairly new!

thanks
AZ

On 7/18/07, Erick Erickson <[EMAIL PROTECTED]> wrote:


Are you sure that the hit wasn't on "w" or "kim"? The
default for searching is OR...

I recommend that you get a copy of Luke (google lucene luke)
which allows you to examine your index as well as see how
queries parse using various analyzers. It's an invaluable tool...

Best
Erick




Dictionary Type Lookup

2007-07-18 Thread muraalee

Hi,

I am trying to model a dictionary-type search in Lucene. My approach was
this:

- Load the dictionary file (words and their meanings) and index each
dictionary term and its associated meaning as a Lucene Document.
- Use IndexReader's terms method to peek at the index and get a TermEnum.
TermEnum's next() returns the next term.

The snippet looks like this:
  TermEnum browseTermEnum = indexReader.terms(new Term(browseIndex,
      browsableTerm));
  while (browseTermEnum.next()) {
     System.out.println(browseTermEnum.term().text());
  }

This works fine, and I can fetch the next n terms.
The only problem I see with this route is that I can't get the previous
terms.

1. Is there a way to get previous terms from a TermEnum?
2. Is there a better way to model a dictionary-type lookup in Lucene?

I appreciate your suggestions.

Thanks
Murali V
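[Editor's note: one common workaround for the missing previous(): since TermEnum only walks forward, keep the browsable terms in a sorted in-memory array as well (e.g. collected once by walking a TermEnum), and page backward with a binary search. A sketch under that assumption, with made-up names and no Lucene API.]

```java
import java.util.Arrays;

// Sketch: keep the dictionary's browsable terms in a sorted String array.
// Binary search then supports paging backward as well as forward, which
// TermEnum alone cannot do.
public class TermBrowser {
    private final String[] sortedTerms;

    public TermBrowser(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    /** Returns up to n terms strictly before 'from', in ascending order. */
    public String[] previous(String from, int n) {
        int idx = Arrays.binarySearch(sortedTerms, from);
        if (idx < 0) {
            idx = -idx - 1; // insertion point when 'from' is not present
        }
        int start = Math.max(0, idx - n);
        return Arrays.copyOfRange(sortedTerms, start, idx);
    }

    public static void main(String[] args) {
        TermBrowser b = new TermBrowser(new String[] { "ant", "bee", "cat", "dog" });
        System.out.println(Arrays.toString(b.previous("cat", 2))); // prints [ant, bee]
    }
}
```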


-- 
View this message in context: 
http://www.nabble.com/Dictionary-Type-Lookup-tf4107251.html#a11679841
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





TermEnum - previous() method ?

2007-07-18 Thread muraalee

Hi All,
I searched this forum for anybody looking for a previous() method on
TermEnum. I found only this link:
http://www.nabble.com/How-to-navigate-through-indexed-terms-tf28148.html#a189225

Would it be possible to implement a previous() method? I know I am asking for
a quick solution here ;) I just want to make sure that if it is not
implemented, there might be a reason, so I can consider alternative
approaches to implement a similar feature.

I appreciate your thoughts...

Thanks
Murali V
-- 
View this message in context: 
http://www.nabble.com/TermEnumprevious%28%29-method---tf4107296.html#a11679947
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





MoreLikeThis

2007-07-18 Thread Akanksha Baid
I am using Lucene 2.1.0 and want to use MoreLikeThis for querying
documents. I understand that the jar file for it is in contrib. I
have the contrib folder extracted, but am not sure what to do from this
point on. What jar file am I looking for, and where should I put it? I am
using Eclipse.


If someone could please point me in the right direction, that
would be a big help.

Thanks.





Re: MoreLikeThis

2007-07-18 Thread yu
You can put lucene-queries-2.2.0.jar on your class path or your Eclipse 
project build path. That's all you need.



Jay

Akanksha Baid wrote:
I am using Lucene 2.1.0 and want to use MoreLikeThis for querying 
documents. I understand that the jar file for the same is in contrib. 
I have the contrib folder extracted, but am not sure what to do from 
this point on. What jar file am I looking for and where should  put 
it. I am using Eclipse.


If someone could please point me to some directions for the same , 
that would be a big help.

Thanks.





Re: MoreLikeThis

2007-07-18 Thread Akanksha Baid
Right, I was making a silly mistake there. I have it working now.
Thanks for the reply.


yu wrote:
You can put lucene-queries-2.2.0.jar on your class path or your 
Eclipse project build path. That's all you need.



Jay






StandardTokenizer is slowing down highlighting a lot

2007-07-18 Thread Michael Stoppelman

Hi all,

I was tracking down slowness in the contrib highlighter code, and it seems
the seemingly simple tokenStream.next() is the culprit.
I've seen multiple posts about this being a possible cause. Has anyone
looked into how to speed up StandardTokenizer? For my
documents it's taking about 70ms per document, which is a big ugh! I was
thinking I might just cache the TermVectors in memory if
that will be faster. Does anyone have another approach to solving this problem?

-M


Re: StandardTokenizer is slowing down highlighting a lot

2007-07-18 Thread Mark Miller
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really 
limited by JavaCC speed. You cannot shave much more performance out of 
the grammar, as it is already about as simple as it gets. You should 
first see if you can get away without it and use a different Analyzer, 
or if you can re-implement just the functionality you need in a custom 
Analyzer. Do you really need the support for abbreviations, companies, 
email addresses, etc.?


If so:

You can use the TokenSources class in the highlighter package to rebuild 
a TokenStream without re-analyzing if you store term offsets and 
positions in the index. I have not found this to be super beneficial, 
even when using the StandardAnalyzer to re-analyze, but it certainly 
could be faster if you have large enough documents.


Your best bet is probably to use 
https://issues.apache.org/jira/browse/LUCENE-644, which is a 
non-positional Highlighter that finds offsets to highlight by looking up 
query term offset information in the index. For larger documents this 
can be much faster than using the standard contrib Highlighter, even if 
you're using TokenSources. LUCENE-644 has a much flatter curve than the 
contrib Highlighter as document size goes up.


- Mark

Michael Stoppelman wrote:

Hi all,

I was tracking down slowness in the contrib highlighter code and it seems
the seemingly simple tokenStream.next() is the culprit.
I've seen multiple posts about this being a possible cause. Has anyone
looked into how to speed up StandardTokenizer? For my
documents it's taking about 70ms per document that's a big ugh! I was
thinking I might just cache the TermVectors in memory if
that will be faster. Anyone have another approach to solving this 
problem?


-M






Re: StandardTokenizer is slowing down highlighting a lot

2007-07-18 Thread Michael Stoppelman

It might be nice to add a line of documentation to the highlighter on the
possible performance hit if one uses StandardAnalyzer, which is probably a
common case.
Thanks for the speedy response.

-M

On 7/18/07, Mark Miller <[EMAIL PROTECTED]> wrote:


Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
limited by JavaCC speed. You cannot shave much more performance out of
the grammar as it is already about as simple as it gets. You should
first see if you can get away without it and use a different Analyzer,
or if you can re-implement just the functionality you need in a custom
Analyzer. Do you really need the support for abbreviations, companies,
email address, etc?

If so:

You can use the TokenSources class in the highlighter package to rebuild
a TokenStream without re-analyzing if you store term offsets and
positions in the index. I have not found this to be super beneficial,
even when using the StandardAnalyzer to re-analyze, but it certainly
could be faster if you have large enough documents.

Your best bet is probably to use
https://issues.apache.org/jira/browse/LUCENE-644, which is a
non-positional Highlighter that finds offsets to highlight by looking up
query term offset information in the index. For larger documents this
can be much faster than using the standard contrib Highlighter, even if
your using TokenSources. LUCENE-644 has a much flatter curve than the
contrib Highlighter as document size goes up.

- Mark






Inrease the performance of Indexing in Lucene

2007-07-18 Thread miztaken

Hi, please help me.
It's been a month since I started trying Lucene.
My requirements are huge: I have to index and search terabytes of data.
I have questions regarding three topics:

1. Problem in indexing

As I need to index terabytes of data, after googling and visiting different
forums I adopted the following approach for indexing:

1. First I created an array of RAMDirectory instances, then I added the
documents to them.
   After crossing a certain threshold I dumped them to disk as tempIndex1.

2. I repeated the same process until all documents were indexed on disk
as tempIndex1, tempIndex2, ...

3. Then finally I loaded the temp directories and merged them into one main
index directory.

4. I used threading for this purpose too.

5. This somewhat removed the optimize() overhead of IndexWriter, as I added
the directories together only at the end.

Am I doing this the right way or not? Is there any other solution to boost
the indexing process?

2. Problem in searching

As Lucene doesn't support LSI and SVD for conceptual search, I first search
the Lucene index for the user's input text, retrieve the documents, expand
the query using LSI and SVD, and then re-search the index.
With a few words in the query there doesn't seem to be a performance
problem, but when I expand the query, i.e. when the query contains ten words
ORed together, it takes an unacceptably long time to get Hits. Is this
expected, or am I missing something here too?
What are the ways to boost query performance when the query contains many
terms, especially when they are ORed? An ANDed query requires less time to
produce Hits.
I use a single IndexSearcher and my index is optimized as well.

3. Another problem

I need to dump my database table into Lucene along with the full-text
information.
What effect will this have on indexing and searching?
Also, I might need to change the name of a Field of a Document indexed in
Lucene; will that be possible?
I know it's not possible to change the value of a field, but will it be
possible to change the name of the field, or do we have to handle that
externally?

Please shed some light on these things.
Your help is highly appreciated.

-- 
View this message in context: 
http://www.nabble.com/Inrease-the-performance-of-Indexing-in-Lucene-tf4108165.html#a11682360
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

