RE: when indexing, java.io.FileNotFoundException

2005-02-03 Thread Will Allen
Increase the minMergeDocs and use the compound file format when creating your 
index.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)
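For reference, here is a minimal sketch of those two settings together
(Lucene 1.4-era API; the index path and choice of StandardAnalyzer are
assumptions for illustration, not from your report):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("C:\\data\\indexes\\customer",
    new StandardAnalyzer(), false);
writer.minMergeDocs = 1000;      // buffer more docs in RAM before writing a segment
writer.setUseCompoundFile(true); // one .cfs per segment instead of many small files
// ... addDocument() calls ...
writer.close();

Fewer, larger merges plus the compound format means far fewer intermediate
files on disk, which narrows the window for the Win32 renaming problem
described below.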

-Original Message-
From: Chris Lu [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 03, 2005 12:46 PM
To: Lucene Users List
Subject: when indexing, java.io.FileNotFoundException


Hi,
I am getting this exception now and then when I am indexing content.
It doesn't always happen. But when it happens, I have to delete the
index and start over again.
This is a serious problem for us.

In the email linked below, Doug said it has something to do with Win32's lack of
atomic renaming.
http://java2.5341.com/msg/1348.html

But how can I prevent this?

Chris Lu


java.io.FileNotFoundException: C:\data\indexes\customer\_temp\0\_1e.fnm
(The system cannot find the file specified)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
   at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
   at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
   at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
   at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
   at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
   at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
   at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
   at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)






literal search in quotes on non-tokenized field

2004-11-30 Thread Allen Atamer
Here is a problem I am experiencing with Lucene searches on non-tokenized
fields:
 
A search in quotes on a field named Build with the query "origi" does not
work, but the unquoted query origi yields 62 hits.
 
I have run indexing on the field with the following method
 
doc.add(Field.Keyword(data.getColumnName(j),
fieldValue.toString().toLowerCase()));
 
so even though the original data has ORIGI in the Build field, lowercase
is not the problem
 
Here's a log of the parsed query before going to the searcher:
 
Parsed query: (Build:"origi") for the first search
Parsed query: (Build:origi) for the second search
 
Right now we're not using a QueryParser/Analyzer to build the query; we're
building the query up ourselves.
The query mentioned above is a TermQuery object.
 
Thanks 


RE: literal search in quotes on non-tokenized field

2004-11-30 Thread Allen Atamer
Erik,


 -Original Message-
  Here's a log of the parsed query before going to the searcher:
 
  Parsed query: (Build:"origi") for the first search
  Parsed query: (Build:origi) for the second search
 
 What do you mean by "parsed", since below you say you're not using
 QueryParser/Analyzer?


Sorry, that's residual log text. The lines of code are:

BooleanQuery totalQuery = new BooleanQuery();

.. logic to build totalQuery ...

log.debug("Parsed query: " + totalQuery.toString());
dbSearchHits = searcher.search(totalQuery);


  Right now we're not using a QueryParser/Analyzer to build the
  query; we're building the query up ourselves.
  The query mentioned above is a TermQuery object.
 
 Let me hopefully clarify what you've said: you've indexed (I'm not
 using quotes on purpose) origi, but you're doing a TermQuery on "origi"
 (with the quotes) and expecting it to match?
 
 It doesn't work that way.  A TermQuery must match *exactly* what was
 indexed (either directly as a Keyword, or as tokens emitted from the
 analyzer).  Since you're building the query up yourself from, I'm
 assuming, user input, you may need to pre-process what the user entered
 to get the right term to query on.  Only the term origi would match.

Yeah but it doesn't. The exact text in the database is ORIGI. Keyword
doesn't work if you supply more than one word. In fact we're doing it wrong.
Fields with a small number of terms should not be indexed as keyword, but
tokenized. I'm going to change the indexing strategy to only use keyword
when there's one and only one keyword in the data itself. Fields with two to
three words will be tokenized with the NoTokenizingTokenizer that was posted
earlier, and fields with four or more words will be tokenized with
MyTokenizer.

All we need to do for searching keyword fields is remove the double quotes
to be consistent with searching in a tokenized field. Then use QueryParser
to parse the tokenized fields with the appropriate parser for the field.
This should solve the problem.
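Something like this sketch (the analyzer, searcher and field names are
hypothetical placeholders, not our real code):

// Keyword fields: strip the user's quotes and match the stored term exactly.
String userInput = "\"origi\"";
String keywordTerm = userInput.replaceAll("\"", "").toLowerCase();
Query keywordQuery = new TermQuery(new Term("Build", keywordTerm));

// Tokenized fields: let QueryParser and the field's analyzer handle it.
Query tokenizedQuery = QueryParser.parse(userInput, "description", analyzer);

Note the toLowerCase() on the search side mirrors the toLowerCase() we
apply at indexing time, so the TermQuery matches the stored term.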

Thanks





RE: modifying existing index

2004-11-23 Thread Will Allen
To update a document you need to insert the modified document, then delete the 
old one.

Here is some code that I use to get you going in the right direction (it won't
compile, but if you follow it closely you will see how I take an array of
Lucene documents with new properties and add them, then delete the old ones):


public void updateDocuments( Document[] documentsToUpdate )
{
    if ( documentsToUpdate.length > 0 )
    {
        String updateDate = Dates.formatDate( new Date(), "MMddHHmm" );
        // wait on some other modification to finish
        HashSet failedToAdd = new HashSet();
        waitToModify();
        synchronized ( directory )
        {
            IndexWriter indexWriter = null;
            try
            {
                indexWriter = getWriter();
                // this seems to be needed to accommodate a Lucene (ver 1.4.2) bug;
                // otherwise the index does not accurately reflect the change
                indexWriter.mergeFactor = 2;
                // load data from new document into old document
                for ( int i = 0; i < documentsToUpdate.length; i++ )
                {
                    try
                    {
                        Document newDoc = modifyDocument( documentsToUpdate[i], updateDate );
                        if ( newDoc != null )
                        {
                            documentsToUpdate[i] = newDoc;
                            indexWriter.addDocument( newDoc );
                        }
                        else
                        {
                            failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                        }
                    }
                    catch ( IOException addDocException )
                    {
                        // if we fail to add, make a note and don't delete it
                        logger.error( " [" + getContext().getID() + "] error updating message: "
                            + documentsToUpdate[i].get( "messageid" ), addDocException );
                        failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                    }
                    catch ( java.lang.IllegalStateException ise )
                    {
                        // if we fail to add, make a note and don't delete it
                        logger.error( " [" + getContext().getID() + "] error updating message: "
                            + documentsToUpdate[i].get( "messageid" ), ise );
                        failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                    }
                }
                // if we fail to close the writer, we don't want to continue
                closeWriter();
                searcherVersion = -1; // establish that the searcher needs to update
                IndexReader reader = IndexReader.open( indexPath );
                int testid = -1;
                for ( int i = 0; i < documentsToUpdate.length; i++ )
                {
                    Document newDoc = documentsToUpdate[i];
                    try
                    {
                        logger.debug( "delete id: " + newDoc.get( "deleteid" )
                            + " messageid: " + newDoc.get( "messageid" ) );
                        reader.delete( Integer.parseInt( newDoc.get( "deleteid" ) ) );
                        testid = Integer.parseInt( newDoc.get( "deleteid" ) );
                    }
                    catch ( NumberFormatException nfe )

RE: Too many open files issue

2004-11-22 Thread Will Allen
If you are on Linux, the number of file handles for a session is much lower
than that for the whole machine; ulimit -n will tell you. There are
instructions on the web for changing this setting; it involves
/etc/security/limits.conf and setting the values for nofile.

(bulkadm is my user)

bulkadm soft    nofile  8192
bulkadm hard    nofile  65536

Also, if you use the compound file format you will have many fewer files.
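
A rough sketch of what that looks like when building the index (indexDir
and analyzer are assumed variables; Lucene 1.4 API):

IndexWriter writer = new IndexWriter(indexDir, analyzer, false);
writer.setUseCompoundFile(true); // collapses each segment's 7+ files into a single .cfs
writer.mergeFactor = 10;         // the default; higher values keep more segments (files) live
// ... addDocument() calls ...
writer.optimize();               // merge down to one segment, fewer files open at search time
writer.close();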

-Original Message-
From: Neelam Bhatnagar [mailto:[EMAIL PROTECTED]
Sent: Monday, November 22, 2004 10:02 AM
To: Otis Gospodnetic
Cc: [EMAIL PROTECTED]
Subject: Too many open files issue


Hi,
 
I had requested help on an issue we have been facing with the "Too many
open files" exception garbling the search indexes and crashing the
search on the web site.
As a suggestion, you had asked us to look at the articles on O'Reilly
Network which had specific context around this exact problem.
One of the suggestions was to increase the limit on the number of file
descriptors on the file system. We tried it by first lowering the limit
to 200 from 256 in order to reproduce the exception. The exception did
get reproduced, but even after increasing the limit to 500 the exception
kept coming, until after several rounds of trying to rebuild the index
we finally got it working at the default file descriptor limit
of 256. This makes us wonder if your first suggestion of optimizing
indexes is a pre-requisite to trying this option.
 
Another piece of relevant information is that we have the default merge
factor of 10.
 
Kindly give us pointers to what it that we are doing wrong or should we
be trying something completely different.
 
Thanks and regards
Neelam Bhatnagar
 




RE: Best Implementation of Next and Prev in Lucene

2004-11-18 Thread Will Allen
See the demo JSP pages.

-Original Message-
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 16, 2004 9:26 PM
To: [EMAIL PROTECTED]
Subject: Best Implementation of Next and Prev in Lucene


Hi All,

 

What's the best implementation of displaying the Next and Prev search result
in Lucene?

 

Thanks,

Ramon





API request: isOpen on indexwriter and searcher

2004-11-18 Thread Will Allen
Could a developer consider adding an isOpen method to the writer and searcher?

I have looked at doing it myself, but I am not sure what I am doing.




RE: java.io.FileNotFoundException: ... (No such file or directory)

2004-11-18 Thread Will Allen
I have gotten this a few times. I am also using an NFS mount, but have seen it
in cases where a mount wasn't involved.

I cannot speak to why this is happening, but I have posted to this forum before
a way of repairing your index by modifying the segments file. Search for
"wallen".

The other thing I have done is use code to copy the documents that can be read
by a reader to a new index. I suppose I should submit those tools to open
source!

Anyway, this error will break the searcher, but the index can still be read 
with an indexreader.

-Will

Here is the source of a method that should get you started (logger is a log4j 
object):

public void transferDocuments()
    throws IOException
{
    IndexReader reader = IndexReader.open(brokenDir);
    logger.debug(reader.numDocs() + "");
    IndexWriter writer = new IndexWriter(newIndexDir, PopIndexer.popAnalyzer(), true);
    writer.minMergeDocs = 50;
    writer.mergeFactor = 200;
    writer.setUseCompoundFile(true);
    int docCount = reader.numDocs();
    Date start = new Date();
    //docCount = Math.min(docCount, 500);
    for (int x = 0; x < docCount; x++)
    {
        try
        {
            if (!reader.isDeleted(x))
            {
                Document doc = reader.document(x);
                if (x % 1000 == 0)
                {
                    logger.debug(doc.get("subject"));
                }
                //remove the new fields if they exist, and add new value
                //TODO test not having this in
                /*
                for (Enumeration newFields = doc.fields(); newFields.hasMoreElements(); )
                {
                    Field newField = (Field) newFields.nextElement();
                    doc.removeFields(newField.name());
                    doc.add(newField);
                }
                */
                doc.removeFields("counter");
                doc.add(Field.Keyword("counter", "counter"));
                // reinsert old document
                writer.addDocument(doc);
            }
        }
        catch (IOException ioe)
        {
            logger.error("doc: " + x + " failed, " + ioe.getMessage());
        }
        catch (IndexOutOfBoundsException ioobe)
        {
            logger.error("INDEX OUT OF BOUNDS! " + ioobe.getMessage());
            ioobe.printStackTrace();
        }
    }
    reader.close();
    //logger.debug("done, about to optimize");
    //writer.optimize();
    writer.close();
    long time = ((new Date()).getTime() - start.getTime()) / 1000;
    logger.info("done optimizing: " + time + " seconds or "
        + (docCount / time) + " rec/sec");
}



-Original Message-
From: Justin Swanhart [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 18, 2004 5:00 PM
To: Lucene Users List
Subject: java.io.FileNotFoundException: ... (No such file or directory)


I have two index processes.  One is an index server, the other is a
search server.  The processes run on different machines.

The index server is a single threaded process that reads from the
database and adds
unindexed rows to the index as needed.  It sleeps for a couple minutes
between each
batch to allow newly added/updated rows to accumulate.

The searcher process keeps an open cache of IndexSearcher objects and
is multithreaded.
It accepts connections on a tcp port, runs the query and stores the
results in a database.
After a set interval, the server checks to see if the index on disk is
a newer version.  If it is,
it loads the index into a new IndexSearcher as a RAMDirectory.

Every once in a while, the index reader process gets a FileNotFoundException:
20041118 1378 1383  (index number, old version, new version)
[newer version found] Loading index directory into RAM: 20041118
java.io.FileNotFoundException:
/path/omitted/for/obvious/reasons/_4zj6.cfs (No such file or
directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
at org.en.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
at org.en.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
at org.en.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
at org.en.lucene.store.RAMDirectory.<init>(RAMDirectory.java:60)
at org.en.lucene.store.RAMDirectory.<init>(RAMDirectory.java:89)
at org.en.global.searchserver.UpdateSearchers.createIndexSearchers(Search.java:89)
at org.en.global.searchserver.UpdateSearchers.run(Search.java:54)

the code being called at that point is:
// add the directory to the HashMap of IndexSearchers (dir# => IndexSearcher)
indexSearchers.put(subDirs[i], new IndexSearcher(new
    RAMDirectory(indexDir + "/" + subDirs[i])));

The indexes are located on an NFS mountpoint. Could 

RE: QueryParser: [stopword] AND something throws Exception

2004-11-12 Thread Will Allen
Holy cow!  This does happen!

-Original Message-
From: Peter Pimley [mailto:[EMAIL PROTECTED]
Sent: Friday, November 12, 2004 11:52 AM
To: Lucene Users List
Subject: QueryParser: [stopword] AND something throws Exception



[this is using lucene-1.4-final]

Hello.

I have just encountered a way to get the QueryParser to throw an 
ArrayIndexOutOfBoundsException.  It can be recreated with the demo 
org.apache.lucene.demo.SearchFiles program.  The way to trigger it is to 
parse a query of the form:

a AND b

...where 'a' is a stop word.  For example, "the AND vector".  It only 
happens when the -first- term is a stop word.  You could search for 
"vector AND the" or "vector AND the AND class", and it works as you 
would expect (i.e. the stop words are ignored).

Unfortunately I am up against a deadline right now so I can't fix this 
myself.  I'm just going to filter out stop words before feeding them to 
the query parser.  I'll try to have a look at it in roughly 2 weeks time 
if nobody else has solved it.

Peter Pimley,
Semantico

Here is the stack trace.

java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.Vector.elementAt(Vector.java:434)
at org.apache.lucene.queryParser.QueryParser.addClause(QueryParser.java:181)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:529)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)





RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Will Allen
Any wildcard search will automatically expand your query to include every term
in the index that matches the wildcard.

For example:

wild* would become wild OR wilderness OR wildman, etc., for each of the
matching terms that exist in your index.

It is because of this that you quickly reach the 1024-clause limit. I
automatically set it to max int with the following line:

BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
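
If you would rather not raise the limit globally, you can also catch the
overflow per query; a hedged sketch:

Hits hits;
try {
    hits = searcher.search(query);
} catch (BooleanQuery.TooManyClauses tmc) {
    // the wildcard expanded past the clause limit: raise it
    // (or ask the user to narrow the wildcard) and retry
    BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
    hits = searcher.search(query);
}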


-Original Message-
From: Sanyi [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:46 AM
To: [EMAIL PROTECTED]
Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses


Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
has a 1024-clause limit by default, which is good enough for me, but I still
think it behaves strangely.

Example:
I have an index with about 20Million documents.
Let's say that there is about 3000 variants in the entire document set of this 
word mask: cab*
Let's say that about 500 documents are containing the word: spectrum
Now, when I search for cab* AND spectrum, I don't expect it to throw an 
exception.
It should first restrict the search for the 500 documents containing the word 
spectrum, then it
should collect the variants of cab* withing these documents, which turns out 
in two or three
variants of cab* (cable, cables, maybe some more) and the search should 
return let's say 10
documents.

Similar example: when I search for "cab* AND nonexistingword" it still throws a
TooManyClauses exception instead of saying "No results", since there is no
nonexistingword in my document set, so it doesn't even have to start collecting
the variations of cab*.

Is there any patch for this issue?
Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)

P.S.: Sorry for re-sending this message; I first sent it as an accidental
reply to a wrong thread.








RE: Acedemic Question About Indexing

2004-11-11 Thread Will Allen
I have a servlet that instantiates a MultiSearcher over 6 indexes:
(du -h)
7.2G    ./0
7.2G    ./1
7.2G    ./2
7.2G    ./3
7.2G    ./4
7.2G    ./5
43G     .

I recreate the index from scratch each month based upon a 50-gig zip file with
all of the 40 million documents. I wanted to keep my indexing time as low as
possible without hurting search performance too much, as each searcher
allocates a certain amount of memory proportional to the number of terms it
has. A single large index has a lot of overlap in terms, so it needs less
memory than multiple indexes.

Anyway, for indexing, I am able to index ~100 documents per second. The total
indexing process takes 2.5 days. I have a powerful machine with 2
hyperthreaded processors (Linux sees 4 processors) and 1GB of RAM. I also have
pretty fast SCSI disks.

I perform no updates or deletes on my indexes.

The indexing process divides the work equally amongst the indexers. The
bottleneck of the indexing process is not memory or CPU, but rather the disk
I/O of 6 writers. If I had faster disks, I could create more indexers.
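
In outline, the setup looks something like this (paths and the worker
loop are illustrative, not my actual code):

// Indexing: one IndexWriter per partition, each fed by its own thread.
IndexWriter[] writers = new IndexWriter[6];
for (int i = 0; i < writers.length; i++) {
    writers[i] = new IndexWriter("/indexes/" + i, new StandardAnalyzer(), true);
}
// each worker thread calls writers[n].addDocument(doc) on its 1/6 share

// Searching: one MultiSearcher spans all six indexes.
Searchable[] parts = new Searchable[6];
for (int i = 0; i < parts.length; i++) {
    parts[i] = new IndexSearcher("/indexes/" + i);
}
Searcher searcher = new MultiSearcher(parts);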

-Original Message-
From: Sodel Vazquez-Reyes
[mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 11:37 AM
To: Lucene Users List
Cc: Will Allen
Subject: Re: Acedemic Question About Indexing


Will,
could you give more details about your architecture?
- do you update or create new indexes each time?
- what data is stored in each index?
etc.

because it is quite interesting, and I would like to test it.

Sodel



Quoting Luke Shannon [EMAIL PROTECTED]:

 40 Million! Wow. Ok this is the kind of answer I was looking for. The site I
 am working on indexes maybe 1000 at any given time. I think I am ok with a
 single index.

 Thanks.

 - Original Message -
 From: Will Allen [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, November 10, 2004 7:23 PM
 Subject: RE: Acedemic Question About Indexing


 I have an application that I run monthly that indexes 40 million documents
 into 6 indexes, then uses a multisearcher.  The advantage for me is that I
 can have multiple writers indexing 1/6 of that total data reducing the time
 it takes to index by about 5X.

 -Original Message-
 From: Luke Shannon [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, November 10, 2004 2:39 PM
 To: Lucene Users List
 Subject: Re: Acedemic Question About Indexing


 Don't worry, regardless of what I learn in this forum I am telling my
 company to get me a copy of that bad boy when it comes out (which as far as
 I am concerned can't be soon enough). I will pay for grama's myself.

 I think I have reviewed the code you are referring to and have something
 similar working in my own indexer (using the uid). All is well.

 My stupid question for the day is why would you ever want multiple indexes
 running if you can build one smart indexer that does everything as
 efficiently as possible? Does the answer to this question move me to multi
 threaded indexing territory?

 Thanks,

 Luke


 - Original Message -
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, November 10, 2004 2:08 PM
 Subject: Re: Acedemic Question About Indexing


 Uh, I hate to market it, but it's in the book.  But you don't have
 to wait for it, as there already is a Lucene demo that does what you
 described.  I am not sure if the demo always recreates the index or
 whether it deletes and re-adds only the new and modified files, but if
 it's the former, you would only need to modify the demo a little bit to
 check the timestamps of File objects and compare them to those stored
 in the index (if they are being stored - if not, you should add a field
 to hold that data)

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  I am working on debugging an existing Lucene implementation.
 
  Before I started, I built a demo to understand Lucene. In my demo I
  indexed the entire content hierarchy all at once, and then optimized
  this index and used it for queries. It was time consuming but very
  simple.
 
  The code I am currently trying to fix indexes the content hierarchy
  by folder, creating a separate index for each one. Thus it ends up
  with a bunch of indexes. I still don't understand how this works (I
  am assuming they get merged somewhere that I haven't tracked down
  yet), but I have noticed it doesn't always index the right folder.
  This results in the users reporting inconsistent behavior in
  searching after they make a change to a document.
  To keep things simple I would like to remove all the logic that
  figures out which folder to index and just do them all (usually less
  than 1000 files) so I end up with one index.
 
  Would indexing time be the only area I would be losing out in, or is
  there something more to the approach of creating multiple indexes
  and merging them?
 
  What is a good approach I can take to indexing a content hierarchy
  composed primarily of pdf, xsl, doc and xml
RE: Acedemic Question About Indexing

2004-11-10 Thread Will Allen
I have an application that I run monthly that indexes 40 million documents into 
6 indexes, then uses a multisearcher.  The advantage for me is that I can have 
multiple writers indexing 1/6 of that total data reducing the time it takes to 
index by about 5X.

-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:39 PM
To: Lucene Users List
Subject: Re: Acedemic Question About Indexing


Don't worry, regardless of what I learn in this forum I am telling my
company to get me a copy of that bad boy when it comes out (which as far as
I am concerned can't be soon enough). I will pay for grama's myself.

I think I have reviewed the code you are referring to and have something
similar working in my own indexer (using the uid). All is well.

My stupid question for the day is why would you ever want multiple indexes
running if you can build one smart indexer that does everything as
efficiently as possible? Does the answer to this question move me to multi
threaded indexing territory?

Thanks,

Luke


- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Acedemic Question About Indexing


 Uh, I hate to market it, but it's in the book.  But you don't have
 to wait for it, as there already is a Lucene demo that does what you
 described.  I am not sure if the demo always recreates the index or
 whether it deletes and re-adds only the new and modified files, but if
 it's the former, you would only need to modify the demo a little bit to
 check the timestamps of File objects and compare them to those stored
 in the index (if they are being stored - if not, you should add a field
 to hold that data)

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  I am working on debugging an existing Lucene implementation.
 
  Before I started, I built a demo to understand Lucene. In my demo I
  indexed the entire content hierarchy all at once, and then optimized
  this index and used it for queries. It was time consuming but very
  simple.
 
  The code I am currently trying to fix indexes the content hierarchy
  by folder, creating a separate index for each one. Thus it ends up
  with a bunch of indexes. I still don't understand how this works (I
  am assuming they get merged somewhere that I haven't tracked down
  yet), but I have noticed it doesn't always index the right folder.
  This results in the users reporting inconsistent behavior in
  searching after they make a change to a document.
  To keep things simple I would like to remove all the logic that
  figures out which folder to index and just do them all (usually less
  than 1000 files) so I end up with one index.
 
  Would indexing time be the only area I would be losing out in, or is
  there something more to the approach of creating multiple indexes
  and merging them?
 
  What is a good approach I can take to indexing a content hierarchy
  composed primarily of pdf, xsl, doc and xml where any of these
  documents can be changed several times a day?
 
  Thanks,
 
  Luke
 
 
 



RE: Highlighting in Lucene

2004-11-04 Thread Will Allen
There is a highlighting tool in the sandbox (3/4 of the way down):

http://jakarta.apache.org/lucene/docs/lucene-sandbox/
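
Usage is roughly like this (a sketch against the sandbox Highlighter API
of the time, from the lucene-highlighter jar; field and variable names
are placeholders):

Query query = QueryParser.parse(userQuery, "contents", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(query));
String fragment = highlighter.getBestFragment(analyzer, "contents", bodyText);
// fragment is the best-scoring snippet, with matches wrapped in <B>...</B>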

-Original Message-
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 04, 2004 3:40 PM
To: 'Lucene Users List'
Subject: Highlighting in Lucene


Hi All,

 

I would like to know if Lucene supports highlighting of the searched text?

 

Thanks in advance.

 

Thanks,

Ramon Aseniero





Searching for a phrase that contains quote character

2004-10-28 Thread Will Allen

I am having this same problem, but cannot find any help!

I have a keyword field that sometimes includes double quotes, but I am unable to 
search for that field because the escape for a quote doesn't work!

I have tried a number of things:

myfield:"lucene is \"cool\""

AND

myfield:"lucene is \\"cool\\""


http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=7351

From: [EMAIL PROTECTED] [EMAIL PROTECTED]
Subject: Searching for a phrase that contains quote character
Date: Wed, 24 Mar 2004 21:25:16 +

I'd like to search for a phrase that contains the quote character. I've tried 
escaping the quote character, but am receiving a ParseException from the 
QueryParser:

For example to search for the phrase:

 this is a "test"

I'm trying the following

 QueryParser.parse("field:\"This is a \\\"test\\\"\"", "field", new 
StandardAnalyzer());

This results in:

org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 31.
Encountered: <EOF> after : 
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:111)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
...

What is the proper way to accomplish this?

--Dan




RE: Searching for a phrase that contains quote character

2004-10-28 Thread Will Allen
I am using a NullAnalyzer for this field.  

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 28, 2004 2:00 PM
To: Lucene Users List
Subject: Re: Searching for a phrase that contains quote character



On Oct 28, 2004, at 1:03 PM, Justin Swanhart wrote:
 Have you tried making a term query by hand and testing to see if it  
 works?

 Term t = new Term("field", "this is a \"test\"");
 PhraseQuery pq = new PhraseQuery(t);

That's not accurate API, but had you used pq.add(t), it still would  
presume that text is all a single term.

Chances are, though, that even getting the query to have the quotes is  
not going to work as you've probably lost the quotes during indexing.   
Check out the AnalysisParalysis page on the wiki and analyze your  
Analyzer and make sure you are indexing the text with the quotes (no  
built-in analyzer besides WhitespaceAnalyzer would do that for you).
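
A quick way to check is to run the stored string through your analyzer
and print the tokens it emits (a small sketch; field name and text are
placeholders):

TokenStream stream = analyzer.tokenStream("myfield",
    new java.io.StringReader("lucene is \"cool\""));
for (Token tok = stream.next(); tok != null; tok = stream.next()) {
    System.out.println("[" + tok.termText() + "]");
}

If the quotes don't show up inside the brackets, they never made it into
the index.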

Erik


 ...



 On Thu, 28 Oct 2004 12:02:48 -0400, Will Allen  
 [EMAIL PROTECTED] wrote:

 I am having this same problem, but cannot find any help!

 I have a keyword field that sometimes includes double quotes, but I  
 am unable to search for that field because the escape for a quote  
 doesn't work!

 I have tried a number of things:

 myfield:"lucene is \"cool\""

 AND

 myfield:"lucene is \\"cool\\""

 http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene- 
 [EMAIL PROTECTED]msgNo=7351

 From: [EMAIL PROTECTED] [EMAIL PROTECTED]
 Subject: Searching for a phrase that contains quote character
 Date: Wed, 24 Mar 2004 21:25:16 +

 I'd like to search for a phrase that contains the quote character.  
 I've tried
 escaping the quote character, but am receiving a ParseException from  
 the
 QueryParser:

 For example to search for the phrase:

  this is a "test"

 I'm trying the following

  QueryParser.parse("field:\"This is a \\\"test\\\"\"", "field",  
 new StandardAnalyzer());

 This results in:

 org.apache.lucene.queryParser.ParseException: Lexical error at line  
 1, column 31.  Encountered: <EOF> after : 
 at  
 org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:111)
 at  
 org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
 ...

 What is the proper way to accomplish this?

 --Dan




RE: Searching for a phrase that contains quote character

2004-10-28 Thread Will Allen
The NullAnalyzer overrides the isTokenChar method to simply return true in the 
tokenizer class (http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1703655).

The situation is that it seems Lucene does not expect you to escape characters
that exist inside of a quoted string. So my search [ authorkeyword:MariaMy* ]
works, but [ authorkeyword:MariaMy\* ] does not, even though the * character
should be escaped
(http://jakarta.apache.org/lucene/docs/queryparsersyntax.html#Terms).

So, if this is true, then the rule might be: reserved characters must be escaped
EXCEPT when they are within double quotes as a phrase. When double quotes are
needed within a phrase, they should be escaped with a \" ... ?

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 28, 2004 3:05 PM
To: Lucene Users List
Subject: Re: Searching for a phrase that contains quote character


On Oct 28, 2004, at 2:02 PM, Will Allen wrote:
 I am using a NullAnalyzer for this field.

Which means that each field is added exactly as-is as a single term?

Then trying the PhraseQuery directly is a good first step  - if you can 
get that to work then you can move on to making QueryParser work with 
escaping.  But don't complicate things with QueryParser at first.  
Start with the queries constructed directly first.
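
For a Keyword field that means a single TermQuery carrying the exact
stored string, quotes and all (a sketch; the field name and value are
placeholders):

Term t = new Term("myfield", "lucene is \"cool\"");
Hits hits = searcher.search(new TermQuery(t));
// a Keyword field is indexed as one single term, so TermQuery, not PhraseQuery

If that finds the document, the index is fine and the remaining problem
is purely QueryParser escaping.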

Erik


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Thursday, October 28, 2004 2:00 PM
 To: Lucene Users List
 Subject: Re: Searching for a phrase that contains quote character



 On Oct 28, 2004, at 1:03 PM, Justin Swanhart wrote:
 Have you tried making a term query by hand and testing to see if it
 works?

  Term t = new Term("field", "this is a \"test\"");
 PhraseQuery pq = new PhraseQuery(t);

 That's not accurate API, but had you used pq.add(t), it still would
 presume that text is all a single term.

 Chances are, though, that even getting the query to have the quotes is
 not going to work as you've probably lost the quotes during indexing.
 Check out the AnalysisParalysis page on the wiki and analyze your
 Analyzer and make sure you are indexing the text with the quotes (no
 built-in analyzer besides WhitespaceAnalyzer would do that for you).

   Erik


 ...



 On Thu, 28 Oct 2004 12:02:48 -0400, Will Allen
 [EMAIL PROTECTED] wrote:

 I am having this same problem, but cannot find any help!

 I have a keyword field that sometimes includes double quotes, but I
 am unable to search for that field because the escape for a quote
 doesn't work!

 I have tried a number of things:

 myfield:"lucene is \"cool\""

 AND

 myfield:"lucene is \\"cool\\""

 http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-
 [EMAIL PROTECTED]msgNo=7351

 From: [EMAIL PROTECTED] [EMAIL PROTECTED]
 Subject: Searching for a phrase that contains quote character
 Date: Wed, 24 Mar 2004 21:25:16 +

 I'd like to search for a phrase that contains the quote character.
 I've tried
 escaping the quote character, but am receiving a ParseException from
 the
 QueryParser:

 For example to search for the phrase:

  this is a "test"

 I'm trying the following

  QueryParser.parse("field:\"This is a \\\"test\\\"\"", "field",
 new StandardAnalyzer());

 This results in:

 org.apache.lucene.queryParser.ParseException: Lexical error at line
 1, column 31.  Encountered: <EOF> after : 
 at
 org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:111)
 at
 org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
 ...

 What is the proper way to accomplish this?

 --Dan




RE: Multi + Parallel

2004-10-15 Thread Will Allen
I am using 6 indexers / indexes to balance the speed of indexing against query
performance for 40+ million documents. I came to this number through trial and
error, and performance testing on the indexing side with a fast 4-processor
machine. The trick is to max out the I/O throughput.

-Will

-Original Message-
From: Justin Swanhart [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 14, 2004 2:43 PM
To: Lucene Users List
Subject: Re: Multi + Parallel


The overhead of creating that many searcher objects is going to far
outweigh any performance benefit you could possibly hope to gain by
splitting your index up.


On Thu, 14 Oct 2004 04:42:27 -0700 (PDT), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Search a single merged index.
 
 Otis
 
 
 
 --- Karthik N S [EMAIL PROTECTED] wrote:
 
  Hi
 
  Apologies..
 
 
   Can somebody provide me approximate answers [ which is the better
   choice ]:
  
     A search of 10,000 subindexes using MultiSearcher
  
     or
  
     a search on one single merged index [ merged from 10,000 subindexes ]
  
  
   a) SubIndexes > 10,000 (future)
  
   b) Fields to be searched upon = 4
  
   c) Field types present in indexed format = 15
  
   d) RAM = 1GB
  
   e) O/S: Linux [ clustered environment ]
  
   f) Processor make: AMD [ probably high end ]
  
   g) WebServer: Tomcat 5.0.x
  
  
   1) Which would be faster???
  
   2) If not, what may be the probable solution?
 
 
  Karthik
 
 
 
 
  -Original Message-
  From: Erik Hatcher [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, October 13, 2004 3:53 PM
  To: Lucene Users List
  Subject: Re: Multi + Parallel
 
 
  On Oct 13, 2004, at 3:14 AM, Karthik N S wrote:
   I was Curious to Know the Difference between ParallelMultiSearcher
  and
   MultiSearcher ,
  
   1) Is the working internal functionality of these  are  same or
   different .
 
  They are different internally.  Externally they should return
  identical
  results and not appear different at all.
 
  Internally, ParallelMultiSearcher searches each index in a separate
  thread (searches wait until all threads finish before returning).
  In
  MultiSearcher, each index is searched serially.
 
  You will not likely see a benefit to using ParallelMultiSearcher
  unless
  your environment is specialized to accommodate multi-threading
  (multiple CPU's, indexes on separate drives that can operate
  independently, etc).
 
   2) In terms of time domain do these differ when searching same no
  of
   fields
   / words .
  
   3)What are the features used on each of  API.
 
  There is no external difference to using either implementation.
  Benchmark searches using both and see what is best, but generally
  MultiSeacher will be better in most environments as it avoids the
  overhead of starting up and managing multiple threads.
 
Erik
 
 



RE: -- TomCat/Lucene, filesystem

2004-09-08 Thread Will Allen
I think you might be referring to the XML files you keep in C:\Program 
Files\Apache\Tomcat\conf\Catalina\localhost

I have a file with the contents (myapp.xml):

<?xml version='1.0' encoding='utf-8'?>
<Context docBase="C:/work/aggregation/myapp/web" path="/myapp" reloadable="true">
</Context>



-Original Message-
From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 31, 2004 12:36 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: RE: -- TomCat/Lucene, filesystem


I have a web application using Lucene via Tomcat;
you may need to set
the correct permissions in your catalina.policy file.

I use a blanket policy of

grant {
   permission java.io.FilePermission "/", "read";
};

to allow access to Lucene.


-Original Message-
From: J.Ph DEGLETAGNE [mailto:[EMAIL PROTECTED]
Sent: 31 August 2004 17:12
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: -- TomCat/Lucene, filesystem


Hello Somebody, 
 
..I beg your pardon... 
 
Under Windows XP / Tomcat:

How do I customize the Lucene webapp to access filesystem directories that
are outside Tomcat?
Like this:
D:\Program Files\Apache Software Foundation\Tomcat 5.0\..
to access
E:\Data
 
Thank's a lot
 
JPhD





RE: too many open files

2004-09-07 Thread Will Allen

I suspect it has to do with this change:

--- jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java  2004/08/08 
13:03:59 1.12
+++ jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java  2004/08/11 
17:37:52 1.13

I wouldn't know where to start to reproduce the problem, as it was happening
just once a day or so on an index that was being both queried and added to in
real time, to the tune of 100,000 docs a day / 50 queries a day.

The corruption was always the same thing: the segments file listed an entry
for a file that was not there.

-Will

-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 07, 2004 1:54 PM
To: Lucene Users List
Subject: Re: Spam:too many open files


On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:

 A note to developers, the code checked into lucene CVS ~Aug 15th, post
 1.4.1, was causing frequent index corruptions. When I reverted back to
 version 1.4 I no longer am getting the corruptions.

Here are some changes from around that day:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java

Could you check which of those might have caused the problem? I guess 
there's not much the developers can do without the problem being 
reproducible.

regards
 Daniel

-- 
http://www.danielnaber.de




RE: Spam:too many open files

2004-09-07 Thread Will Allen
I will deploy and test through the end of the week and report back Friday if the 
problem persists.  Thank you!

-Original Message-
From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 07, 2004 8:40 PM
To: Lucene Users List
Subject: Re: Spam:too many open files


Hi Wallen,

Actually, the files Daniel listed were modified on 8/11 and then again 
on 8/15. In the time between 8/11 and 8/15, I believe there could have 
been any number of problems, including corrupt indexes and poor 
multithreaded performance. However, I think after 8/15 the files should 
be in good working order. If you are not sure whether you saw problems with 
the pre-8/15 or post-8/15 version of the code, is it possible for you to try 
the latest CVS and see if the problem exists now? If it does, it will of 
course require urgent attention.

Thanks very much!
Dmitry.


Daniel Naber wrote:

On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:

A note to developers, the code checked into lucene CVS ~Aug 15th, post
1.4.1, was causing frequent index corruptions.  When I reverted back to
version 1.4 I no longer am getting the corruptions.

Here are some changes from around that day:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java

Could you check which of those might have caused the problem? I guess 
there's not much the developers can do without the problem being 
reproducible.

regards
 Daniel




QueryParser handling a NOT query on its own

2004-06-21 Thread Allen Atamer
The Javadoc spec calls for one or more clauses in a query, but I had trouble
with a NOT query just on its own. For example:

QueryParser.parse("my_field:-exclude") throws a parsing exception.

Same with:

QueryParser.parse("my_field:-(exclude)")
QueryParser.parse("my_field:(* AND -exclude)")

The query QueryParser.parse("my_field:(-(exclude))") gives a legitimate
query that brings no results.

What I would expect is the following: if I have an index with 100 total
entries, and 20 records with the word exclude in them, then the above
queries should give 80 hits. There is no test case for this scenario in
TestQueryParser. Please confirm whether this is a bug or not.
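
In the meantime I can get the expected 80 hits by pairing the prohibited
clause with a clause that matches every document (a sketch; the all:yes
field is hypothetical and would have to be added to every document at
index time):

BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("all", "yes")), true, false);          // required: matches every doc
q.add(new TermQuery(new Term("my_field", "exclude")), false, true); // prohibited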

Thank you,

Allen Atamer





Query performance on a 315 Million document index (1TB)

2004-05-06 Thread Will Allen
Hi,
I am considering a project that would index 315+ million documents. I am
comfortable that the indexing will work well in creating an index ~800GB in
size, but am concerned about the query performance. (Is this a bad
assumption?)

What are the bottlenecks of performance as an index scales?  Memory?  Cost is
not a concern, so what would be the shortcomings of a theoretical machine with
16GB of RAM, 4-16 CPUs and 1-2 terabytes of space?  Would it be better to
cluster machines to break apart the query?

Thank you for your serious responses,
Will Allen





termPosition does not iterate properly in Lucene 1.3 rc1

2004-03-22 Thread Allen Atamer
Lucene does not iterate through the termPositions on one of my indexed data
sources. It used to iterate properly through this data source, but not
anymore. I tried on a different indexed data source and it iterates
properly. The Lucene index directory does not have any lock files either.

My code is as follows

TermPositions termPos = reader.termPositions(aTerm);
while (termPos.next()) {
// get doc
String docID = reader.document(termPos.doc()).get(keyName);
...
}

Is there anything wrong with that? Thanks for your help,
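
Two sanity checks I know of (a sketch, using the same reader and aTerm
as above):

// 0 here means the term simply is not in this index
System.out.println("docFreq = " + reader.docFreq(aTerm));

// walk the terms actually indexed for that field to see what is there
TermEnum terms = reader.terms(new Term(aTerm.field(), ""));
do {
    Term t = terms.term();
    if (t == null || !t.field().equals(aTerm.field())) break;
    System.out.println(t.text());
} while (terms.next());
terms.close();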

Allen





RE: implementing a TokenFilter for aliases

2003-12-05 Thread Allen Atamer
Erik,

Below are the results of a debug run on the piece of text that I want
aliased. The token "spitline" must be recognized as "splitline", i.e. when I
do a search for "splitline", this record will come up.

1: [173] , start:1, end:2
1: [missing] , start:1, end:6
2: [hardware] , start:9, end:7
3: [for] , start:18, end:2
4: [bypass] , start:22, end:5
5: [spitline] , start:29, end:37

I also added extra debug info after the token text: the startOffset and the
endOffset. Lucene has the first token, 173, only stored; it is not indexed.
The remaining terms are tokenized, indexed and stored. Does this make a
difference?

Allen





RE: implementing a TokenFilter for aliases

2003-12-05 Thread Allen Atamer
173 is the ID field from a database (which we use as a primary key). For
Lucene's purpose, it only stores the field, and does not index it.

The place where I put the print statements is before the actual filtering.
The goal of the AliasFilter is to replace "spitline". The debug line is in the
Tokenizer, and the filters are run afterwards, so I am not sure what is
happening inside Lucene.

I can't put the util line into the analyzer after the AliasFilter is run
because it will call recursively into tokenStream() and cause a stack
overflow. I will try to work on seeing what is happening after the AliasFilter
is run.

Allen


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: December 5, 2003 12:23 PM
 To: Lucene Users List
 Subject: Re: implementing a TokenFilter for aliases
 
 On Friday, December 5, 2003, at 11:59  AM, Allen Atamer wrote:
   Below are the results of a debug run on the piece of text that I want
   aliased. The token "spitline" must be recognized as "splitline", i.e.
   when I
   do a search for "splitline", this record will come up.
 
  1: [173] , start:1, end:2
  1: [missing] , start:1, end:6
  2: [hardware] , start:9, end:7
  3: [for] , start:18, end:2
  4: [bypass] , start:22, end:5
  5: [spitline] , start:29, end:37
 
  I also added extra debug info after the token text, which are the
  startOffset, and the endOffset. Lucene has the first token 173 only
  stored, it is not indexed. The remaining terms are tokenized, indexed
  and
  stored. Does this make a difference?
 
 I don't understand what you mean by "173" - is that output from a
 different string being analyzed?
 
 Well, it's obvious from this output that you cannot find "spitline"
 when "splitline" is used in a search.  Your analyzer isn't working as
 you expect, I'm guessing.
 
   Erik
 
 



implementing a TokenFilter for aliases

2003-12-04 Thread Allen Atamer
The FAQ describes implementing a TokenFilter for applying aliases. I am having
trouble accomplishing this.
 
This is the code that I have so far for the next() method within AliasFilter.
After reading some posts, I also got the idea to call
setPositionIncrement(). Neither way works, because when I search for the
alias, no search results come back.
 
Thank you for your help,
 
Allen Atamer
 

 
  public Token next() throws java.io.IOException {
    Token token = tokenStream.next();

    if (aliasMap == null || token == null) {
      return token;
    }

    TermData t = (TermData) aliasMap.get(token.termText());

    if (t == null) {
      return token;
    }

    String tokenText = AliasManager.replaceIgnoreCase(
        token.termText(), t.getTerm(), t.getTeach());

    int increment = tokenText.length() - token.termText().length();
    if (increment > 0) {
      token.setPositionIncrement(increment);
    }

    return new Token(tokenText, token.startOffset(), token.endOffset());
  }
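
An alternative I may try, assuming the goal is for both the original token
and its alias to be searchable: emit the alias as an extra token at the
same position (position increment 0) instead of adjusting the increment by
the length difference, which only shifts positions. A sketch, with the
alias map here assumed to hold the alias string directly:

private Token pendingAlias = null;

public Token next() throws java.io.IOException {
  if (pendingAlias != null) {           // flush a queued alias first
    Token alias = pendingAlias;
    pendingAlias = null;
    return alias;
  }
  Token token = tokenStream.next();
  if (token == null || aliasMap == null) {
    return token;
  }
  String aliasText = (String) aliasMap.get(token.termText());
  if (aliasText != null) {
    pendingAlias = new Token(aliasText, token.startOffset(), token.endOffset());
    pendingAlias.setPositionIncrement(0); // same position as the original token
  }
  return token;                          // the original token is kept too
}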