Re: Databases

2010-07-23 Thread Chris Lu
3) It sounds like you want to use Lucene for storage, without a database like
MySQL. It may work, but it makes later data management hard.
1) and 2) You can use MySQL as the main storage and pull data out to create
the Lucene indexes. Pay attention to incremental changes: it's a continuous
process, not a one-time data import. Alternatively, you can put a hook in
your program that writes new content to the index (a sketch follows below).
Either way, you can get it to work, but maybe not as simply as you expected.
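
For illustration, a rough sketch of such a hook against the Lucene 3.0-era
API (the field names and the update path are assumptions for the example, not
DBSight code). updateDocument() makes re-indexing a changed row idempotent:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexSyncHook {

    private final IndexWriter writer;

    public IndexSyncHook(File indexDir) throws IOException {
        writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Call from the application wherever a row is inserted or updated.
    public void onRowWritten(String primaryKey, String content) throws IOException {
        Document doc = new Document();
        doc.add(new Field("pk", primaryKey, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED));
        // Deletes any earlier copy of this row, then adds the new one.
        writer.updateDocument(new Term("pk", primaryKey), doc);
    }

    // Call from the application wherever a row is deleted.
    public void onRowDeleted(String primaryKey) throws IOException {
        writer.deleteDocuments(new Term("pk", primaryKey));
    }

    public void close() throws IOException {
        writer.close(); // commits pending changes
    }
}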


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes 

DBSight customer, a shopping comparison site, (anonymous per request) 
got 2.6 Million Euro funding!



On 7/22/2010 10:46 PM, manjula wijewickrema wrote:

Hi,

Normally, when I am building my index directory for indexed documents, I
simply keep the files to be indexed in a directory called 'filesToIndex'.
So in this case, I do not use any standard database management system such
as MySQL or any other.

1) Will it be possible to use MySQL or any other database for the purpose of
managing indexed documents in Lucene?

2) Is it necessary to follow this kind of methodology with Lucene?

3) If we do not use such a database management system, will there be
any disadvantages with a large number of indexed files?

Appreciate any reply from you.
Thanks,
Manjula.

   






Re: Databases

2010-07-23 Thread tarun sapra
You can use Hibernate Search to keep the Lucene index and the MySQL RDBMS in
sync: it hooks into Hibernate's entity lifecycle events, so index updates
happen automatically on writes.
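
A minimal sketch of what that looks like (assuming the Hibernate Search 3.x
annotation mapping; the Article entity and its fields are made up for the
example):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed // mirror this entity into a Lucene index
public class Article {

    @Id @GeneratedValue
    @DocumentId // primary key stored in the index for cross-referencing
    private Long id;

    @Field // analyzed and re-indexed automatically whenever the row changes
    private String contents;

    // getters and setters omitted
}

With a mapping like that in place, Hibernate Search keeps the index up to
date as entities are saved, updated and deleted through Hibernate.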




-- 
Thanks & Regards
Tarun Sapra


Re: How to get word importance in lucene index

2010-07-23 Thread Karl Wettin

Hi,

Please define "important". Important to do what?

It would probably be helpful if you explained what it is you attempt to
achieve by doing this. Perhaps there is something in MoreLikeThis that will
help you?



karl




23 jul 2010 kl. 04.44 skrev Xaida:



Hi all!

hmmm, I need to get how important a word is in the entire document collection
that is indexed in the lucene index. I need to extract some "representable
words", let's say concepts that are common and can represent the whole
collection. Or collection "keywords". I did the fulltext indexing, and the
only field I am using is the text contents, because the titles of the
documents are mostly not representative (numbers, codes etc).

So, if I calculate tfidf, it gives me the importance of a single term with
respect to a single document. But if that word repeats across documents, how
can I calculate its total importance within the index?

All help appreciated!! Thank you!!!




Re: How to get word importance in lucene index

2010-07-23 Thread Xaida

Hi! Thanks for the reply! I will try to explain better, sorry if it was
unclear.

I have a user's text document collection. Not too big. The goal is to get the
most "important" concepts, which would in a way represent the user's
interests. That is what I mean when I say important :)

So let's say, in my collection I have my school documents, I have some
snowboarding articles, I have some backpacking and easy travelling guides, my
favorite cooking recipes.. and so on. The collection is more or less
supervised, so the number of documents for each "area" is similar. Not equal,
but there is some balance.

So I would like, as a result, to get the terms which are important in the
entire collection. For example, I think that the term "cheese" should appear
in my results, because I know there is a lot of cheese in my recipes. Also I
would like to get the term "database"... from my school documents. And so on.

So nothing smarter comes to my mind than this :)
step 1 take one document
step 2 calculate tfidf for all its terms
step 3 take the terms with the best tfidf and save them somewhere
step 4 go to step 1.. and so on for all the documents

And in the end merge these results somehow :/ (one way is sketched below)

I guess there is a better way :)
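
For what it's worth, steps 1-4 can be collapsed into a single pass over the
index. A sketch against the Lucene 3.0 API — the field name "contents" and
the exact tf-idf formula are assumptions, not the only way to do it:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class CollectionKeywords {

    public static Map<String, Double> score(File indexDir, String field) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir), true);
        int numDocs = reader.numDocs();
        Map<String, Double> scores = new HashMap<String, Double>();
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) break;
                double idf = Math.log((double) numDocs / (1 + reader.docFreq(t)));
                double sum = 0;
                TermDocs td = reader.termDocs(t);
                while (td.next()) {
                    sum += td.freq() * idf; // tf * idf, summed over all documents
                }
                td.close();
                scores.put(t.text(), sum);
            } while (terms.next());
        } finally {
            terms.close();
            reader.close();
        }
        return scores; // sort by value and keep the top N as collection "keywords"
    }
}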

Thank you!!








Re: Reverse Lucene queries

2010-07-23 Thread Karl Wettin


23 jul 2010 kl. 08.30 skrev sk...@sloan.mit.edu:


Hi all, I have an interesting problem... instead of going from a query to a
document collection, is it possible to come up with the best fit query for a
given document collection (results)? "Best fit" being a query which maximizes
the hit scores of the resulting document collection.


It would probably be helpful if you explained what it is you attempt to
achieve by doing this. Are you looking for MoreLikeThis?



How should I approach this? All suggestions appreciated.



How expensive an operation is this allowed to be? Can you spend seconds,
minutes, hours or days?

Are there any requirements on precision and recall?

No matter what, I would start by looking at the output from a feature
selection algorithm fed with the complete corpus divided into two classes:
the "query factory set" and "all other documents".

The output will not tell you why the terms are important, just that they are
probably useful when deciding whether to classify a document as part of the
query factory set or of all other documents.


It's hard to say where to go from there.

Create a set of selected terms available in the query factory set.
Create a set of selected terms available in all other documents.
Create a set of selected terms only available in the query factory set.
Create a set of selected terms only available in all other documents.

See if there is a simple strategy based on the above that produces a good
result; a sketch of building such term sets follows.
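
As a rough illustration of those set operations (Lucene 3.0 API; this assumes
the two classes have been indexed separately, and the field name is
hypothetical):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermSets {

    // Collect all terms of one field from one index, e.g. an index holding
    // only the "query factory set" documents.
    static Set<String> termsOf(IndexReader reader, String field) throws Exception {
        Set<String> result = new HashSet<String>();
        TermEnum te = reader.terms(new Term(field, ""));
        do {
            Term t = te.term();
            if (t == null || !t.field().equals(field)) break;
            result.add(t.text());
        } while (te.next());
        te.close();
        return result;
    }

    // Terms available only in the first set (e.g. only in the query factory set).
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new HashSet<String>(a);
        diff.removeAll(b);
        return diff;
    }
}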


If not, you might want to look into some evolutionary algorithm that executes
queries with permutations of selected features in order to find the best
query. Or, if you have the resources, simply create all permutations of
queries.


If it works, then I think all of the steps above could be optimized, cached
or simplified in several ways to make it speedy.


See Mahout, Weka (which has a good experimenter/explorer GUI), RapidMiner,
etc. for machine learning APIs.


It should not be too complicated to implement a gain ratio feature selector
using IndexReader if the term vector space is available.
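
For example, per-class term counts — the raw input to such a selector — could
be gathered like this (a sketch; it assumes the field was indexed with
TermVector.YES, and the field name is made up):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class ClassTermCounts {

    // Accumulate term frequencies over the documents of one class,
    // e.g. the "query factory set".
    public static Map<String, Integer> countTerms(IndexReader reader, int[] docIds)
            throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int docId : docIds) {
            TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
            if (tfv == null) continue; // no term vector stored for this doc
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                Integer prev = counts.get(terms[i]);
                counts.put(terms[i], prev == null ? freqs[i] : prev + freqs[i]);
            }
        }
        return counts;
    }
}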



karl





Re: How to get word importance in lucene index

2010-07-23 Thread Karl Wettin

Are you perhaps looking for this:

http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

?

karl




Re: Reverse Lucene queries

2010-07-23 Thread Grant Ingersoll

On Jul 23, 2010, at 5:06 AM, Karl Wettin wrote:

> 
> 23 jul 2010 kl. 08.30 skrev sk...@sloan.mit.edu:
> 
>> Hi all, I have an interesting problem...instead of going from a query
>> to a document collection, is it possible to come up with the best fit
>> query for a given document collection (results)? "Best fit" being a
>> query which maximizes the hit scores of the resulting document
>> collection.
> 
> It would probably be helpful if you explained what it is you attempt to 
> achieve by doing this. Are you looking for MoreLikeThis?

MatchAllDocsQuery returns the whole document collection, all with a score of
1. Somehow, I don't think this is what you are after. Perhaps you mean: given
all the queries you've seen in the past, find the "best" one?






Re: How to get word importance in lucene index

2010-07-23 Thread Xaida

Thanks!

I am not sure; I have to study this class more deeply today. It is a bit
complex, and I am not such an advanced user that I understand it all. But
this part of the description is important to me:

"An efficient, effective "more-like-this" query generator would be a great
contribution, if anyone's interested. I'd imagine that it would take a
Reader or a String (the document's text), analyzer Analyzer, and return a
set of representative terms using heuristics like those above."

I know that "more like this" finds the set of similar documents (but I don't
need documents, I need "important" terms from the whole index, to retrieve
them into some list for example)... so for my case, would the important terms
that I need actually be the terms generated in this query?
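
(As it happens, MoreLikeThis can hand those terms back directly, without
running a search — a sketch, assuming the 3.0 contrib API and a hypothetical
field name:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.similar.MoreLikeThis;

public class InterestingTerms {

    // Returns the highest-scoring terms of one document relative to the
    // whole index, i.e. the terms a MoreLikeThis query would be built from.
    public static String[] forDoc(IndexReader reader, int docId) throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "contents" }); // hypothetical field
        mlt.setMinTermFreq(1); // the defaults otherwise skip rare terms
        mlt.setMinDocFreq(1);
        return mlt.retrieveInterestingTerms(docId);
    }
}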




Re: Hot to get word importance in lucene index

2010-07-23 Thread Grant Ingersoll
Couple of thoughts inline...

On Jul 22, 2010, at 10:44 PM, Xaida wrote:

> 
> Hi all!
> 
> hmmm, i need to get how important is the word in entire document collection
> that is indexed in the lucene index. I need to extract some "representable
> words", lets say concepts that are common and can be representable to whole
> collection. Or collection "keywords". I did the fulltext indexing and the
> only field i am using are text contents, because titles of the documents are
> mostly not representable(numbers, codes etc)
> 
> So, if i calculate tfidf, it gives me importance of single term with respect
> to single document.

TF gives you the importance within a single document.
IDF gives you the inverse of how common the term is across the collection,
so rare terms score higher.

> But if that word is repeating in the documents, how can
> i calculate its total importance within index?


Lucene can also normalize by document length, which is often part of these
calculations too.

This information can be retrieved from TermDocs, TermEnum, etc.
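
A rough sketch of reading those raw statistics with the 3.0 API (the field
name "contents" is an assumption):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermStats {

    public static void print(IndexReader reader, String word) throws Exception {
        Term term = new Term("contents", word);
        // docFreq feeds idf: the fewer docs a term occurs in, the higher its idf.
        System.out.println("docFreq = " + reader.docFreq(term)
                + " of " + reader.numDocs() + " docs");
        TermDocs td = reader.termDocs(term);
        while (td.next()) {
            // freq() is the raw tf of the term in this document.
            System.out.println("doc " + td.doc() + " tf = " + td.freq());
        }
        td.close();
    }
}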

Also, as a related item, you may be interested in important phrases, which can 
often be more helpful.  Check out 
https://cwiki.apache.org/confluence/display/MAHOUT/Collocations for one way of 
doing that.

-Grant

-
Grant Ingersoll
http://www.lucidimagination.com



Re: Reverse Lucene queries

2010-07-23 Thread Ivan Provalov
You can also look at the Carrot2 open source project, which does search
results clustering. The cluster labels which Carrot2 generates can be used as
query terms "fitting" the documents in these clusters. Keep in mind that
Carrot2 is designed for a small set of documents (1000).

http://project.carrot2.org

Ivan Provalov




Re: Using lucene for substring matching

2010-07-23 Thread Ian Lea
So, if I've understood this correctly, you've got some text and want to loop
through a list of words and/or phrases, and see which of those match the
text.

e.g.

text "some random article about something or other of some random length"

words

some - matches
many - no match
article - matches
word - no match

You can certainly do that with lucene. Load the text into a document and loop
round the words or phrases, searching for each (a sketch follows below). You
are likely to need to look into analyzers, depending on your requirements
around stop words, punctuation, case, etc., and phrase/span queries for
phrases.
There are also probably some lucene techniques for speeding this up, but as
ever, start simple - lucene is usually plenty fast enough.
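
One of those techniques is the contrib MemoryIndex, which holds exactly one
document in RAM and is built for running many queries against it. A rough
sketch (the analyzer choice and field name are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

public class DictionaryMatcher {

    public static void main(String[] args) throws Exception {
        String text = "some random article about something or other";
        String[] dictionary = { "some", "many", "article", "word" };

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        MemoryIndex index = new MemoryIndex(); // one document, held in RAM
        index.addField("content", text, analyzer);

        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        for (String entry : dictionary) {
            // search() returns a score > 0 on a hit, 0.0f otherwise.
            // Quoting the entry makes multi-word entries phrase queries.
            if (index.search(parser.parse("\"" + entry + "\"")) > 0.0f) {
                System.out.println(entry + " - matches");
            }
        }
    }
}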


--
Ian.


On Thu, Jul 22, 2010 at 11:30 PM, Geir Gullestad Pettersen
 wrote:
> Hi,
>
> I'm about to write an application that does very simple text analysis,
> namely dictionary based entity entraction. The alternative is to do in
> memory matching with substring:
>
> String text; // could be any size, but normally "news paper length"
> List matches;
> for( String wordOrPhrase : dictionary) {
>   if ( text.substring( wordOrPhrase ) >= 0 ) {
>      matches.add( wordOrPhrase );
>   }
> }
>
> I am concerned the above code will be quite cpu intensitive, it will also be
> case sensitive and lot leave any room for fuzzy matching.
>
> I thought this task could also be solved by indexing every bit of text that
> is to be analyzed, and then executing a query per dicionary entry:
>
> (pseudo)
>
> lucene.index(text)
> List matches
> for( String wordOrPhrase : dictionary {
>   if( lucene.search( wordOrPharse, text_id) gives hit ) {
>      matches.add(wordOrPhrase)
>   }
> }
>
> I have not used lucene very much, so I don't know if it is a good idea or
> not to use lucene for this task at all. Could anyone please share their
> thoughs on this?
>
> Thanks,
> Geir
>




LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory)

2010-07-23 Thread Utku Can Topçu
Hi All,

I'm trying out the patch provided in the issue for testing.

I downloaded the patch and its dependency, LUCENE-2453. I tested this
contribution against the r942817 revision, which I assume the contributor
was using at the time of development. The tests seemed to fail.

This time, I updated CassandraDirectory.java to match the new Cassandra
interface. It unfortunately failed again.

Does anyone here have an idea which Cassandra revision and Lucene revision
this patch works against?

Best Regards,
Utku


RE: on-the-fly "filters" from docID lists

2010-07-23 Thread Burton-West, Tom
Hi all,

>> Re scalability of filter construction - the database is likely to hold
>> stable primary keys, not lucene doc ids, which are unstable in the face of
>> updates.

This is the scalability issue I was concerned about. Assume the database call
efficiently retrieves a sorted array of 50,000 stable primary keys. What is
the best way to efficiently convert that list of primary keys to Lucene
docIds?

I was looking at the Lucene in Action example code (which was not designed
for this use case), where the Lucene docId is retrieved by iteratively
calling termDocs.read. How expensive is this operation? Would 50,000 calls
return in a few seconds or less?

for (String isbn : isbns) {
    if (isbn != null) {
        TermDocs termDocs = reader.termDocs(new Term("isbn", isbn));
        // docs and freqs are pre-allocated int[] buffers
        int count = termDocs.read(docs, freqs);
        if (count == 1) { // the primary key matches exactly one document
            bits.set(docs[0]);
        }
    }
}
>> That could involve a lot of disk seeks unless you cache a pk->docid lookup in ram.
That sounds interesting. How would the pk->docid lookup get populated?
Wouldn't a pk->docid cache be invalidated with each commit or merge?

Tom

-Original Message-
From: Mark Harwood [mailto:markharw...@yahoo.co.uk] 
Sent: Friday, July 23, 2010 2:56 AM
To: java-user@lucene.apache.org
Subject: Re: on-the-fly "filters" from docID lists

Re scalability of filter construction - the database is likely to hold stable
primary keys, not lucene doc ids, which are unstable in the face of updates.
You therefore need a quick way of converting stable database keys read from
the db into current lucene doc ids to create the filter. That could involve a
lot of disk seeks unless you cache a pk->docid lookup in ram. You should use
CachingWrapperFilter too, to cache the computed user permissions from one
search to the next.
This can get messy. If the access permissions are centred around
roles/groups, it is normally faster to tag docs with these group names and
query them with the list of roles the user holds.
If individual user-doc-level perms are required, you could also consider
dynamically looking up perms for just the top n results being shown, at the
risk of needing to repeat the query with a larger n if insufficient matches
pass the lookup.

Cheers 
Mark






Re: on-the-fly "filters" from docID lists

2010-07-23 Thread Mark Harwood
> What is the best way to efficiently convert that list of primary keys to
> Lucene docIds?


Avoid disk seeks. Lucene is fast but still beholden to the laws of physics:
random disk seeks will cost you, e.g. 50,000 * 5ms = 250 seconds (minus any
effects of OS disk caching).
The best way to handle this lookup is a pk->docid cache which can be reused
for all users. Since 2.9, Lucene holds caches such as FieldCache down at
segment level, so a commit or merge should only invalidate a subset of cached
items. Trouble is, I think FieldCache is for docId->fieldValue lookups,
whereas you want a cache that works the other way around.
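
One way such a lookup could be populated is a single sequential pass over the
pk field's term dictionary, which is far cheaper than 50,000 random seeks. A
sketch (3.0 API; the field name "isbn" just follows the earlier example):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class PkToDocId {

    // Rebuild after each reader reopen (or cache per segment).
    public static Map<String, Integer> build(IndexReader reader) throws Exception {
        Map<String, Integer> map = new HashMap<String, Integer>();
        TermEnum te = reader.terms(new Term("isbn", ""));
        TermDocs td = reader.termDocs();
        do {
            Term t = te.term();
            if (t == null || !t.field().equals("isbn")) break;
            td.seek(te);
            if (td.next()) {
                map.put(t.text(), td.doc()); // pk is unique -> first doc wins
            }
        } while (te.next());
        td.close();
        te.close();
        return map;
    }
}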

Cheers
Mark




Searching for user agents

2010-07-23 Thread Maciej Bednarz
Hi,

I am using apache lucene 3.0.2 and looking for an analyzer suited to finding
the best-matching HTTP user agents. Imagine that we store the following HTTP
user agents in a field:

Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)

Now, as a search query, the best-matching agent for the following input
should be returned:

Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)

To my eye, the Mozilla/4.0 entry is the best-fit result. What analyzer do I
need to use to store and find it? The text is not natural language, so I
need some kind of n-gram search (I guess). My initial setup does not return
it at all:

String agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)";
final static Analyzer analyzer = new NGramAnalyzer(2, 4);
final Document doc = new Document();
doc.add(new Field("agent", agent, Field.Store.YES, Field.Index.ANALYZED));
...
final QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
final Query query = parser.parse("Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)");
final TopScoreDocCollector collector = TopScoreDocCollector.create(50, true);
searcher.search(query, collector);

NGramAnalyzer is defined as:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramAnalyzer extends Analyzer {

    private final int minGram;
    private final int maxGram;

    public NGramAnalyzer(final int minGram, final int maxGram) {
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    @Override
    public TokenStream tokenStream(final String fieldName, final Reader reader) {
        return new NGramTokenizer(reader, minGram, maxGram);
    }
}
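
One way to debug a setup like this is to print what the analyzer actually
emits for a given input. A small sketch reusing the NGramAnalyzer above
(3.0 attribute API):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ShowTokens {

    public static void main(String[] args) throws Exception {
        TokenStream ts = new NGramAnalyzer(2, 4)
                .tokenStream("agent", new StringReader("Mozilla/4.0"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // each 2-4 character gram
        }
    }
}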


Thank you very much for a solution or any other approach.

Maciej