RE: Concurrent searching & re-indexing

2005-02-18 Thread Paul Mellor
Ok, I will change my reindex method to delete all documents and then re-add
them all, rather than using an IndexWriter to write a completely new index.

Thanks for the help on this everyone.

Paul

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: 17 February 2005 22:26
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing


Paul Mellor wrote:
 I've read from various sources on the Internet that it is perfectly safe to
 simultaneously search a Lucene index that is being updated from another
 Thread, as long as all write access to the index is synchronized.  But does
 this apply only to updating the index (i.e. deleting and adding documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
[...]
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
[...]
 This is running on Windows 2000.

On Windows one cannot delete a file while it is still open.  So, no, on 
Windows one cannot remove an index entirely while an IndexReader or 
Searcher is still open on it, since it is simply impossible to remove 
all the files in the index.

We might attempt to patch this by keeping a list of such files and 
attempting to delete them later (as is done when updating an index).  But 
this could cause problems, as a new index will eventually try to use 
these same file names again, and it would then conflict with the open 
IndexReader.  This is not a problem when updating an existing index, 
since filenames (except for a few which are not kept open, like 
segments) are never reused in the lifetime of an index.  So, in order 
for such a fix to work we would need to switch to globally unique 
segment names, e.g., long random strings, rather than increasing integers.

In the meantime, the safe way to rebuild an index from scratch while 
other processes are reading it is simply to delete all of its documents, 
then start adding new ones.

Doug
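
For reference, a minimal sketch of that delete-then-re-add rebuild against the
Lucene 1.4-era API (the index path, analyzer and document-building step are
placeholders, not anything from this thread):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class Rebuild {
    public static void rebuild(String indexDir) throws Exception {
        // Phase 1: mark every existing document deleted.  The index files stay
        // in place, so open IndexReaders/Searchers on Windows are not disturbed.
        IndexReader reader = IndexReader.open(indexDir);
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (!reader.isDeleted(i)) {
                reader.delete(i);   // renamed deleteDocument(i) in later releases
            }
        }
        reader.close();

        // Phase 2: re-add everything with create=false, i.e. without wiping
        // the directory the way 'new IndexWriter(dir, analyzer, true)' would.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        // for each rebuilt Document doc: writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}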





RE: Concurrent searching & re-indexing

2005-02-17 Thread Paul Mellor
Otis,

Looking at your reply again, I have a couple of questions -

IndexSearcher (IndexReader, really) does take a snapshot of the index state
when it is opened, so at that time the index segments listed in segments
should be in a complete state.  It also reads index files when searching, of
course.

1. If IndexReader takes a snapshot of the index state when opened and then
reads the files when searching, what would happen if the files it takes a
snapshot of are deleted before the search is performed (as would happen with
a reindexing in the period between opening an IndexSearcher and using it to
search)?

2. Does a similar potential problem exist when optimising an index, if this
combines all the segments into a single file?

Many thanks

Paul

-Original Message-
From: Paul Mellor [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:37
To: 'Lucene Users List'
Subject: RE: Concurrent searching & re-indexing


But all write access to the index is synchronized, so that although multiple
threads are creating an IndexWriter for the same directory and using it to
totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform
queries on an index that is in the process of being recreated from scratch,
but I guess that if the IndexSearcher takes a snapshot of the index when it
is created (and in my code this creation is synchronized with the write
operations as well so that the threads wait for the write operations to
finish before instantiating an IndexSearcher, and vice versa) this can't be
a problem.
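
A hypothetical sketch of the kind of synchronized wrapper described above,
where re-indexing and searcher creation share one lock (the class and method
names are invented for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class IndexAccess {
    private final String indexDir;

    public IndexAccess(String indexDir) {
        this.indexDir = indexDir;
    }

    // Only one thread at a time may recreate the index.
    public synchronized void reindex(Document[] docs) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        for (int i = 0; i < docs.length; i++) {
            writer.addDocument(docs[i]);
        }
        writer.optimize();
        writer.close();
    }

    // Sharing the same monitor means a searcher is never opened mid-rewrite.
    public synchronized IndexSearcher openSearcher() throws Exception {
        return new IndexSearcher(indexDir);
    }
}

Note that this only serializes re-indexing against the opening of a searcher;
a searcher that is already open still holds index files, which is exactly what
makes the create=true rewrite fail on Windows, as discussed elsewhere in this
thread.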

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing


Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that create an IndexWriter for the same directory.  That's
a no-no.

This section (first hit) describes all various concurrency issues with
regards to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
segments should be in a complete state.  It also reads index files when
searching, of course.

Otis


--- Paul Mellor [EMAIL PROTECTED] wrote:

 Hi,
 
 I've read from various sources on the Internet that it is perfectly
 safe to
 simultaneously search a Lucene index that is being updated from
 another
 Thread, as long as all write access to the index is synchronized. 
 But does
 this apply only to updating the index (i.e. deleting and adding
 documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
 
 I have a class which encapsulates all access to my index, so that
 writes can
 be synchronized.  This class also exposes a method to obtain an
 IndexSearcher for the index.  I'm running unit tests to test this
 which
 create many threads - each thread does a complete re-indexing and
 then
 obtains an IndexSearcher and does a query.
 
 I'm finding that with sufficiently high numbers of threads, I'm
 getting the
 occasional failure, with the following exception thrown when
 attempting to
 construct a new IndexWriter (during the reindexing) -
 
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
 ...
 
 The exception occurs quite infrequently (usually for somewhere
 between 1-5%
 of the Threads).
 
 Does the IndexSearcher take a 'snapshot' of the index at creation? 
 Or does
 it access the filesystem whilst searching?  I am also synchronizing
 creation
 of the IndexSearcher with the write lock, so that the IndexSearcher
 is not
 created whilst the index is being recreated (and vice versa).  But do
 I need
 to ensure that the IndexSearcher cannot search whilst the index is
 being
 recreated as well?
 
 Note that a similar unit test where the threads update the index
 (rather
 than recreate it from scratch) works fine, as expected.
 
 This is running on Windows 2000.
 
 Any help would be much appreciated!
 
 Paul
 

RE: Concurrent searching & re-indexing

2005-02-17 Thread Morus Walter
Paul Mellor writes:
 
 1. If IndexReader takes a snapshot of the index state when opened and then
 reads the files when searching, what would happen if the files it takes a
 snapshot of are deleted before the search is performed (as would happen with
 a reindexing in the period between opening an IndexSearcher and using it to
 search)?
 
On Unix, open files still exist even after they are deleted (that is,
there is no link (filename) to the file anymore, but the file's contents
still exist); on Windows you cannot delete open files, so Lucene
AFAIK (I don't use Windows) postpones the deletion until the
file is closed.
 
 2. Does a similar potential problem exist when optimising an index, if this
 combines all the segments into a single file?
 
AFAIK optimising creates new files.

The only problem that might occur is opening a reader during an index change,
but that's handled by a lock.

HTH
Morus




Re: Concurrent searching & re-indexing

2005-02-16 Thread Otis Gospodnetic
Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that create an IndexWriter for the same directory.  That's
a no-no.

This section (first hit) describes all various concurrency issues with
regards to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
segments should be in a complete state.  It also reads index files when
searching, of course.

Otis


--- Paul Mellor [EMAIL PROTECTED] wrote:

 Hi,
 
 I've read from various sources on the Internet that it is perfectly
 safe to
 simultaneously search a Lucene index that is being updated from
 another
 Thread, as long as all write access to the index is synchronized. 
 But does
 this apply only to updating the index (i.e. deleting and adding
 documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
 
 I have a class which encapsulates all access to my index, so that
 writes can
 be synchronized.  This class also exposes a method to obtain an
 IndexSearcher for the index.  I'm running unit tests to test this
 which
 create many threads - each thread does a complete re-indexing and
 then
 obtains an IndexSearcher and does a query.
 
 I'm finding that with sufficiently high numbers of threads, I'm
 getting the
 occasional failure, with the following exception thrown when
 attempting to
 construct a new IndexWriter (during the reindexing) -
 
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
 ...
 
 The exception occurs quite infrequently (usually for somewhere
 between 1-5%
 of the Threads).
 
 Does the IndexSearcher take a 'snapshot' of the index at creation? 
 Or does
 it access the filesystem whilst searching?  I am also synchronizing
 creation
 of the IndexSearcher with the write lock, so that the IndexSearcher
 is not
 created whilst the index is being recreated (and vice versa).  But do
 I need
 to ensure that the IndexSearcher cannot search whilst the index is
 being
 recreated as well?
 
 Note that a similar unit test where the threads update the index
 (rather
 than recreate it from scratch) works fine, as expected.
 
 This is running on Windows 2000.
 
 Any help would be much appreciated!
 
 Paul
 





RE: Concurrent searching & re-indexing

2005-02-16 Thread Paul Mellor
But all write access to the index is synchronized, so that although multiple
threads are creating an IndexWriter for the same directory and using it to
totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform
queries on an index that is in the process of being recreated from scratch,
but I guess that if the IndexSearcher takes a snapshot of the index when it
is created (and in my code this creation is synchronized with the write
operations as well so that the threads wait for the write operations to
finish before instantiating an IndexSearcher, and vice versa) this can't be
a problem.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing


Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that create an IndexWriter for the same directory.  That's
a no-no.

This section (first hit) describes all various concurrency issues with
regards to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
segments should be in a complete state.  It also reads index files when
searching, of course.

Otis


--- Paul Mellor [EMAIL PROTECTED] wrote:

 Hi,
 
 I've read from various sources on the Internet that it is perfectly
 safe to
 simultaneously search a Lucene index that is being updated from
 another
 Thread, as long as all write access to the index is synchronized. 
 But does
 this apply only to updating the index (i.e. deleting and adding
 documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
 
 I have a class which encapsulates all access to my index, so that
 writes can
 be synchronized.  This class also exposes a method to obtain an
 IndexSearcher for the index.  I'm running unit tests to test this
 which
 create many threads - each thread does a complete re-indexing and
 then
 obtains an IndexSearcher and does a query.
 
 I'm finding that with sufficiently high numbers of threads, I'm
 getting the
 occasional failure, with the following exception thrown when
 attempting to
 construct a new IndexWriter (during the reindexing) -
 
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
 ...
 
 The exception occurs quite infrequently (usually for somewhere
 between 1-5%
 of the Threads).
 
 Does the IndexSearcher take a 'snapshot' of the index at creation? 
 Or does
 it access the filesystem whilst searching?  I am also synchronizing
 creation
 of the IndexSearcher with the write lock, so that the IndexSearcher
 is not
 created whilst the index is being recreated (and vice versa).  But do
 I need
 to ensure that the IndexSearcher cannot search whilst the index is
 being
 recreated as well?
 
 Note that a similar unit test where the threads update the index
 (rather
 than recreate it from scratch) works fine, as expected.
 
 This is running on Windows 2000.
 
 Any help would be much appreciated!
 
 Paul
 





Re: Re-Indexing a moving target???

2005-02-01 Thread Nader Henein
details?

Yousef Ourabi wrote:

Saad,
Here is what I got. I will post again, and be more specific.
-Y

--- Nader Henein [EMAIL PROTECTED] wrote:

 We'll need a little more detail to help you, what are the sizes of your
 updates and how often are they updated.

 1) No, just re-open the index writer every time to re-index; according to
 you it's a moderately changing index, just keep a flag on the rows and
 batch indexing every so often.
 2) It all comes down to your needs, more detail would help us help you.

 Nader Henein

 Yousef Ourabi wrote:

  Hey,
  We are using lucene to index a moderately changing database, and I have a
  couple of questions on a performance strategy.
  1) Should we just have one index writer open until the system comes
  down...or create a new index writer each time we re-index our data-set.
  2) Does anyone have any thoughts...multi-threading and segments instead
  of one index?
  Thanks for your time and help.
  Best,
  Yousef

 --
 Nader S. Henein
 Senior Applications Developer
 Bayt.com


Re: Re-Indexing a moving target???

2005-01-31 Thread Yousef Ourabi
Saad,
Here is what I got. I will post again, and be more
specific.
-Y

--- Nader Henein [EMAIL PROTECTED] wrote:

 We'll need a little more detail to help you, what
 are the sizes of your 
 updates and how often are they updated.
 
 1) No just re-open the index writer every time to
 re-index, according to 
 you it's moderately changing index, just keep a flag
 on the rows and 
 batch indexing every so often.
 2) It all comes down to your needs, more detail
 would help us help you.
 
 Nader Henein
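
A rough sketch of that flag-and-batch idea (the documents table, the
needs_index column and the field names are invented for illustration):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchIndexer {
    public static void indexDirtyRows(Connection conn, String indexDir) throws Exception {
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT id, title, body FROM documents WHERE needs_index = 1");

        // One writer per batch, opened against the existing index (create=false).
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));
            doc.add(Field.UnStored("title", rs.getString("title")));
            doc.add(Field.UnStored("body", rs.getString("body")));
            writer.addDocument(doc);
        }
        writer.close();
        rs.close();

        // Clear the flag so the next batch only picks up rows changed since.
        stmt.executeUpdate("UPDATE documents SET needs_index = 0 WHERE needs_index = 1");
        stmt.close();
    }
}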
 
 Yousef Ourabi wrote:
 
 Hey,
 We are using lucene to index a moderately changing
 database, and I have a couple of questions on a
 performance strategy.
 
 1) Should we just have one index writer open until
 the
 system comes down...or create a new index writer
 each
 time we re-index our data-set.
 
 2) Does anyone have any thoughts...multi-threading
 and
 segments instead of one index?
 
 Thanks for your time and help.
 Best,
 Yousef
 

 
 
 
 
   
 
 
 -- 
 Nader S. Henein
 Senior Applications Developer
 
 Bayt.com
 
 

 
 





Re: Indexing flat files without .txt extension

2005-01-11 Thread Hetan Shah
Hi Erik,

Thanks for the pointers. I have modified the Indexer.java to index the
files from the directory by removing the file extension check for
.txt. Now I do get an index from the files.

The new situation is that when I run the file search (SearchFiles):

java org.apache.lucene.demo.SearchFiles
Query: tty
Searching for: tty
3 total matching documents
0. No path nor URL for this document
1. No path nor URL for this document
2. No path nor URL for this document

I do not get the actual path from the index, and using Luke I get the
three hits. The last two are from the index and not the real documents.

Any idea what is happening and how I can fix it?

Thanks.
-H
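
That "No path nor URL for this document" message comes from the demo
SearchFiles when a hit's Document has neither a stored "path" nor a "url"
field, so the usual fix is to store one of them at indexing time. A minimal
sketch, assuming you build the Document yourself and follow the demo's
field-name convention:

import java.io.File;
import java.io.FileReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PlainFileDocument {
    // Build a Document the demo SearchFiles can display: store the path,
    // index the contents without storing them.
    public static Document document(File f) throws Exception {
        Document doc = new Document();
        doc.add(Field.Keyword("path", f.getPath()));         // stored and indexed as a single term
        doc.add(Field.Text("contents", new FileReader(f)));  // tokenized and indexed, not stored
        return doc;
    }
}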

Erik Hatcher wrote:
 On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
 
Got the latest Ant and got the demo to work. I am however not sure 
which part in the whole source code is the indexing for different file 
types is done, say for example .html .txt and such?
 
 
 Your best bet is to dig around in the codebase.  The Indexer.java code 
 is hard-coded to only do .txt file extensions - this was on purpose as 
  the first example in the book, figuring someone using this code on
  their C:\ drive would be relatively safe and fast to run.
 
  There is also an example easily run from the Ant launcher to show how 
 various document types can be handled using an extensible framework.  
 Run ant ExtensionFileHandler.  It doesn't actually index the document 
 it creates, but displays it to the console.  It would be pretty trivial 
 to pair the Indexer.java code up with the file handler framework to 
 crawl a directory tree and index any content it recognizes.
 
 
Appreciate your help. If you have any sample code would certainly 
appreciate that also.
 
 
 You got all the code already.  It should be fairly straightforward to 
 navigate the src tree, especially with the Table of Contents handy:
 
   http://www.lucenebook.com/toc
 
 (incidentally, this dynamic TOC page is blending the blog content with 
 the TOC using an IndexReader to find all blog entries that refer to 
 each section - and you'll see the two, minor and cosmetic, errata 
 listed there already).
 
   Erik
 
 
 






Re: Indexing flat files without .txt extension

2005-01-10 Thread Hetan Shah
Hi erik,
Got the latest Ant and got the demo to work. I am however not sure in which 
part of the source code the indexing for different file types is done, 
say for example .html, .txt and such. From there I can work out 
how to index a plain text file which does not have any extension.

Appreciate your help. If you have any sample code I would certainly 
appreciate that also.
-H.

Erik Hatcher wrote:
On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
Hi Erik,
I got the source downloaded and unpacked. I am having difficulty in 
building any of the modules. Maybe something's wrong with my Ant 
installation.

LuceneInAction% ant test
Buildfile: build.xml

BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
available

The good ol' README says this:
R E Q U I R E M E N T S
---
  * JDK 1.4+
  * Ant 1.6+ (to run the automated examples)
  * JUnit 3.8.1+
- junit.jar should be in ANT_HOME/lib
You are not running Ant 1.6, I'm sure.  Upgrade your version of Ant, 
and of course follow the rest of the README and all should be well.

Erik


Re: Indexing flat files without .txt extension

2005-01-10 Thread Erik Hatcher
On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
Got the latest Ant and got the demo to work. I am however not sure 
which part in the whole source code is the indexing for different file 
types is done, say for example .html .txt and such?
Your best bet is to dig around in the codebase.  The Indexer.java code 
is hard-coded to only do .txt file extensions - this was on purpose as 
the first example in the book, figuring someone using this code on 
their C:\ drive would be relatively safe and fast to run.

There is also an example easily run from the Ant launcher to show how 
various document types can be handled using an extensible framework.  
Run ant ExtensionFileHandler.  It doesn't actually index the document 
it creates, but displays it to the console.  It would be pretty trivial 
to pair the Indexer.java code up with the file handler framework to 
crawl a directory tree and index any content it recognizes.
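
For the original question of indexing plain files with no extension, a
stripped-down sketch of an Indexer-style walk with no extension filter at all
(index and data directory names are placeholders):

import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexAnyFile {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        indexDirectory(writer, new File("data"));
        writer.optimize();
        writer.close();
    }

    static void indexDirectory(IndexWriter writer, File dir) throws Exception {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            if (files[i].isDirectory()) {
                indexDirectory(writer, files[i]);   // recurse; no endsWith() check anywhere
            } else {
                Document doc = new Document();
                doc.add(Field.Keyword("path", files[i].getPath()));
                doc.add(Field.Text("contents", new FileReader(files[i])));
                writer.addDocument(doc);
            }
        }
    }
}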

Appreciate your help. If you have any sample code would certainly 
appreciate that also.
You got all the code already.  It should be fairly straightforward to 
navigate the src tree, especially with the Table of Contents handy:

http://www.lucenebook.com/toc
(incidentally, this dynamic TOC page is blending the blog content with 
the TOC using an IndexReader to find all blog entries that refer to 
each section - and you'll see the two, minor and cosmetic, errata 
listed there already).

Erik


Re: Indexing flat files without .txt extension

2005-01-06 Thread Hetan Shah
Hi Erik,
I got the source downloaded and unpacked. I am having difficulty in 
building any of the modules. Maybe something's wrong with my Ant 
installation.

LuceneInAction% ant test
Buildfile: build.xml

BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
available

Total time: 5 seconds
LuceneInAction% ant Indexer
Buildfile: build.xml
BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
available

Total time: 5 seconds
**
Can you point me to the proper module for creating my own indexer? I tried 
looking into the indexing module but was not sure.

TIA,
-H
Erik Hatcher wrote:
On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
How can one index simple text files without the .txt extension? I am 
trying to use the IndexFiles and IndexHTML but not to my 
satisfaction. In the IndexFiles I do not get any control over the 
content of the file and in case of IndexHTML the files without any 
extension do not get indexed at all. Any pointers are really 
appreciated.

Try out the Indexer code from Lucene in Action.  You can download it 
from the link here: 
http://www.lucenebook.com/blog/announcements/sourcecode.html

It'll be cleaner to follow and borrow from.  The code that ships with 
Lucene is for demonstration purposes.  It surprises me how often folks 
use that code to build real indexes.  It's quite straightforward to 
create your own Java code to do the indexing in whatever manner you 
like, borrowing from examples.

When you get the download unpacked, simply run ant Indexer to see it 
in action.  And then ant Searcher to search the index just built.

Erik




Re: Indexing flat files without .txt extension

2005-01-06 Thread Erik Hatcher
On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
Hi Erik,
I got the source downloaded and unpacked. I am having difficulty in 
building any of the modules. Maybe something's wrong with my Ant 
installation.

LuceneInAction% ant test
Buildfile: build.xml

BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
available
The good ol' README says this:
R E Q U I R E M E N T S
---
  * JDK 1.4+
  * Ant 1.6+ (to run the automated examples)
  * JUnit 3.8.1+
- junit.jar should be in ANT_HOME/lib
You are not running Ant 1.6, I'm sure.  Upgrade your version of Ant, 
and of course follow the rest of the README and all should be well.

Erik


Re: Indexing flat files without .txt extension

2005-01-05 Thread Erik Hatcher
On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
How can one index simple text files without the .txt extension? I am 
trying to use the IndexFiles and IndexHTML but not to my satisfaction. 
In the IndexFiles I do not get any control over the content of the 
file, and in case of IndexHTML the files without any extension do not 
get indexed at all. Any pointers are really appreciated.
Try out the Indexer code from Lucene in Action.  You can download it 
from the link here: 
http://www.lucenebook.com/blog/announcements/sourcecode.html

It'll be cleaner to follow and borrow from.  The code that ships with 
Lucene is for demonstration purposes.  It surprises me how often folks 
use that code to build real indexes.  It's quite straightforward to 
create your own Java code to do the indexing in whatever manner you 
like, borrowing from examples.

When you get the download unpacked, simply run ant Indexer to see it 
in action.  And then ant Searcher to search the index just built.

Erik


Re: Indexing terms only

2004-12-22 Thread Mike Snare
Whether or not the text is stored in the index is a different concern
than how it is analyzed.  If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method or the
appropriate constructor when adding a field to the Document.  You'll
need to also store a reference to the actual file (URL, Path, etc) in
the document so it can be retrieved from the doc returned in the Hits
object.

Or did I completely misunderstand the question?

-Mike

On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
 hi
 
 i need to index my text so that index contains only tokenized stemmed words 
 without stopwords etc. The text is German, so I tried to use GermanAnalyzer, 
 but it stores whole text, not terms. Please give me a tip how to index terms 
 only. Thanks!
 
 DES





Re: Indexing terms only

2004-12-22 Thread DES
I actually use Field.Text(String,String) to add documents to my index. Maybe 
I do not understand the way an analyzer works, but I thought that all German 
articles (der, die, das etc) should be filtered out. However if I use Luke 
to view my index, the original text is completely stored in a field. And 
what I need is a term vector that I can create from an indexed document 
field. So this field should contain terms only.

Whether or not the text is stored in the index is a different concern
than how it is analyzed.  If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method or the
appropriate constructor when adding a field to the Document.  You'll
need to also store a reference to the actual file (URL, Path, etc) in
the document so it can be retrieved from the doc returned in the Hits
object.
Or did I completely misunderstand the question?
-Mike
On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
hi
i need to index my text so that index contains only tokenized stemmed 
words without stopwords etc. The text is German, so I tried to use 
GermanAnalyzer, but it stores whole text, not terms. Please give me a tip 
how to index terms only. Thanks!

DES


Re: Indexing terms only

2004-12-22 Thread Erik Hatcher
On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
Whether or not the text is stored in the index is a different concern
than how it is analyzed.  If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method
Correction: Field.Text(String, String) is a stored field.  If you want 
unstored, use Field.UnStored(String, String).
This is a bit confusing because Field.Text(String, Reader) is not 
stored.  This confusion has been cleared up in the CVS version of 
Lucene; these methods will be deprecated in the 1.9 release and removed in 
the 2.0 release.

Erik
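
For DES's case, the German text would therefore go into an unstored field
while the analyzer does the stemming and stop-word removal. A small sketch
(field names and sample text are arbitrary):

import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class UnstoredGermanField {
    public static void main(String[] args) throws Exception {
        // The analyzer passed to the writer is what strips stop words and stems;
        // whether a field is stored is independent of how it is analyzed.
        IndexWriter writer = new IndexWriter("index", new GermanAnalyzer(), true);

        Document doc = new Document();
        doc.add(Field.Keyword("id", "doc-1"));                             // stored identifier
        doc.add(Field.UnStored("contents", "Der schnelle braune Fuchs"));  // indexed, not stored
        writer.addDocument(doc);

        writer.close();
    }
}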


Re: Indexing terms only

2004-12-22 Thread Mike Snare
I've never used the german analyzer, so I don't know what stop words
it defines/uses.  Someone else will have to answer that.  Sorry

On Wed, 22 Dec 2004 17:45:17 +0100, DES [EMAIL PROTECTED] wrote:
 I actually use Field.Text(String,String) to add documents to my index. Maybe
 I do not understand the way an analyzer works, but I thought that all German
 articles (der, die, das etc) should be filtered out. However if I use Luke
 to view my index, the original text is completely stored in a field. And
 what I need is term vector, that I can create from an indexed document
 field. So this field should contain terms only.
 
  Whether or not the text is stored in the index is a different concern
  than how it is analyzed.  If you want the text to be indexed, and not
  stored, then use the Field.Text(String, String) method or the
  appropriate constructor when adding a field to the Document.  You'll
  need to also store a reference to the actual file (URL, Path, etc) in
  the document so it can be retrieved from the doc returned in the Hits
  object.
 
  Or did I completely misunderstand the question?
 
  -Mike
 
  On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
  hi
 
  i need to index my text so that index contains only tokenized stemmed
  words without stopwords etc. The text is German, so I tried to use
  GermanAnalyzer, but it stores whole text, not terms. Please give me a tip
  how to index terms only. Thanks!
 
  DES
 
 
 
 





Re: Indexing terms only

2004-12-22 Thread Mike Snare
Thanks for correcting me.  I use the reader version -- hence my confusion.

-Mike

On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher
[EMAIL PROTECTED] wrote:
 
 On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
  Whether or not the text is stored in the index is a different concern
   than how it is analyzed.  If you want the text to be indexed, and not
  stored, then use the Field.Text(String, String) method
 
 Correction: Field.Text(String, String) is a stored field.  If you want
 unstored, use Field.UnStored(String, String).
 This is a bit confusing because Field.Text(String, Reader) is not
 stored.  This confusion has been cleared up in the CVS version of
 Lucene and will be deprecated in the 1.9 release, and removed in the
 2.0 release.
 
 Erik
 
 





Re: Indexing with Lucene 1.4.3

2004-12-17 Thread Bernhard Messer

That looks right to me, assuming you have done an optimize.  All of your
index segments are merged into the one .cfs file (which is large,
right?).  Try searching -- it should work.
 

Chuck is right, the index looks fine and will be searchable. Since Lucene version 1.4, the index is stored by default using the compound file format. The index files you are missing are merged into one compound file, which has the extension .cfs. You can disable the compound file option using
IndexWriter's setUseCompoundFile(false).

Bernhard
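
As a small sketch, that toggle sits on the writer and nothing corresponding is
needed on the reader side (the directory name and analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CompoundFileSetting {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);   // write classic multi-file segments instead of .cfs
        // ... addDocument() calls ...
        writer.optimize();
        writer.close();
    }
}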
  -Original Message-
  From: Hetan Shah [mailto:[EMAIL PROTECTED]
  Sent: Thursday, December 16, 2004 11:00 AM
  To: Lucene Users List
  Subject: Indexing with Lucene 1.4.3
  
  Hello,
  
  I have been trying to index around 6000 documents using IndexHTML
from
  1.4.3 and at the end of indexing in my index directory I only have 3
  files.
  segments
  deletable and
  _5en.cfs
  
  Can someone tell me what is going on and where are the actual index
  files? How can I resolve this issue?
  Thanks.
  -H
  
  
 
 




Re: Indexing with Lucene 1.4.3

2004-12-17 Thread Hetan Shah
Thanks Chuck,
I now understand why I see only one file. Another question: do I have 
to specify somewhere in my code, or in some configuration setting, that I 
am now using the compound file format (.cfs file) for the index? I have 
an application that was working in version 1.3-final; since I moved to 
1.4.3 I do not get any results back from my searches.

I tried using Luke and it shows me the content of the index. I can 
search using Luke but no success so far with my own application.

Any pointers?
Thanks.
-H
Chuck Williams wrote:
That looks right to me, assuming you have done an optimize.  All of your
index segments are merged into the one .cfs file (which is large,
right?).  Try searching -- it should work.
Chuck
  -Original Message-
  From: Hetan Shah [mailto:[EMAIL PROTECTED]
  Sent: Thursday, December 16, 2004 11:00 AM
  To: Lucene Users List
  Subject: Indexing with Lucene 1.4.3
  
  Hello,
  
  I have been trying to index around 6000 documents using IndexHTML
from
  1.4.3 and at the end of indexing in my index directory I only have 3
  files.
  segments
  deletable and
  _5en.cfs
  
  Can someone tell me what is going on and where are the actual index
  files? How can I resolve this issue?
  Thanks.
  -H
  
  
 




Re: Indexing with Lucene 1.4.3

2004-12-17 Thread Otis Gospodnetic
The only place where you have to specify that you are using the
compound index format is on the IndexWriter instance.  Nothing needs to be
done at search time on the IndexSearcher.

Otis

--- Hetan Shah [EMAIL PROTECTED] wrote:

 Thanks Chuck,
 
 I now understand why I see only one file. Another question is do I
 have 
 to specify somewhere in my code or some configuration setting that I 
 would now be using a compound file format (.cfs file) for index. I
 have 
 an application that was working in version 1.3-final till I moved to 
 1.4.3 now I do not get any results back from my searches.
 
 I tried using Luke and it shows me the content of the index. I can 
 search using Luke but no success so far with my own application.
 
 Any pointers?
 
 Thanks.
 -H
 
 Chuck Williams wrote:
 
 That looks right to me, assuming you have done an optimize.  All of
 your
 index segments are merged into the one .cfs file (which is large,
 right?).  Try searching -- it should work.
 
 Chuck
 
-Original Message-
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 11:00 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3

Hello,

I have been trying to index around 6000 documents using
 IndexHTML
 from
1.4.3 and at the end of indexing in my index directory I only
 have 3
files.
segments
deletable and
_5en.cfs

Can someone tell me what is going on and where are the actual
 index
files? How can I resolve this issue?
Thanks.
-H


   

 
 
 
 
 
 
 





RE: Indexing a large number of DB records

2004-12-16 Thread Garrett Heaver
There were other reasons for my choice of going with a temp index - namely I
was having terrible write times to my live index as it was stored on a
different server. Also, while I was writing to my live index people were
trying to search on it and were getting file not found exceptions, so
rather than spend hours or days trying to fix it I took the easiest route of
creating a temp index on the server which had the application and merging to
the server with the live index. This greatly increased my indexing speed.

Best of luck
Garrett

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 18:43
To: Lucene Users List
Subject: RE: Indexing a large number of DB records

Note that this really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  No need to call addIndexes nor optimize
until the end.  Adding Documents to an index takes a constant amount of
time, regardless of the index size, because new segments are created as
documents are added, and existing segments don't need to be updated
(only when merges happen).  Again, I'd run your app under a profiler to
see where the time and memory are going.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi Homan
 
 I had a similar problem as you in that I was indexing A LOT of data
 
 Essentially how I got round it was to batch the index.
 
 What I was doing was to add 10,000 documents to a temporary index,
 use
  addIndexes() to merge the temporary index into the live index (which
 also
 optimizes the live index) then delete the temporary index. On the
 next loop
 I'd only query rows from the db above the id in the maxdoc of the
 live index
  and set the max rows of the query to 10,000
 i.e
 
  SELECT TOP 1 [fields] FROM [tables] WHERE [id_field] > {ID from
 Index.MaxDoc()} ORDER BY [id_field] ASC
 
 Ensuring that the documents go into the index sequentially your
 problem is
 solved and memory usage on mine (dotlucene 1.3) is low
 
 Regards
 Garrett
 
 -Original Message-
 From: Homam S.A. [mailto:[EMAIL PROTECTED] 
 Sent: 15 December 2004 02:43
 To: Lucene Users List
 Subject: Indexing a large number of DB records
 
 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is each document is holding
 references to the string objects returned from
 ToString() on the DB field, and the IndexWriter is
 holding references to all these document objects in
  memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter
 iw) {
   Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
   doc.Add(Field.UnStored(BrowseFieldNames[i],
 rdr.GetValue(i).ToString()));
   }
   iw.AddDocument(doc);
 }
 
 
 
 
   
 
 
 





RE: Indexing with Lucene 1.4.3

2004-12-16 Thread Chuck Williams
That looks right to me, assuming you have done an optimize.  All of your
index segments are merged into the one .cfs file (which is large,
right?).  Try searching -- it should work.

Chuck

   -Original Message-
   From: Hetan Shah [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 16, 2004 11:00 AM
   To: Lucene Users List
   Subject: Indexing with Lucene 1.4.3
   
   Hello,
   
   I have been trying to index around 6000 documents using IndexHTML
from
   1.4.3 and at the end of indexing in my index directory I only have 3
   files.
   segments
   deletable and
   _5en.cfs
   
   Can someone tell me what is going on and where are the actual index
   files? How can I resolve this issue?
   Thanks.
   -H
   
   
  





RE: Indexing with Lucene 1.4.3

2004-12-16 Thread Karthik N S

Hi there

Apologies.

If you are using the IndexHTML from the demo.jar package which is available
in the Lucene 1.4.3 zip, then you had better look at the file extensions of
your files; they may be filtered out of the indexing process due to this
code present in IndexHTML.java:

  } else if (file.getPath().endsWith(".html") || // index .html files
  file.getPath().endsWith(".htm") || // index .htm files
  file.getPath().endsWith(".txt")) { // index .txt files

If the extensions you have are within the 'endsWith' options then you have
successfully indexed your 6000 documents.

Try to use the Luke monitoring software available from the Jakarta Lucene web
site and check for the same.

[Hint: try to use the SearchFiles class from the Lucene 1.4.3 zip to search
the documents you have indexed successfully.]

with regards
Karthik






-Original Message-
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Friday, December 17, 2004 12:30 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3


Hello,

I have been trying to index around 6000 documents using IndexHTML from
1.4.3 and at the end of indexing in my index directory I only have 3 files.
segments
deletable and
_5en.cfs

Can someone tell me what is going on and where are the actual index
files? How can I resolve this issue?
Thanks.
-H





Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Hello Homam,

The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table... do SELECT * FROM table ... OFFSET=X
LIMIT=Y.

Don't close IndexWriter - use the single instance.

There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory Documents, the frequency of segment merges, and the
maximal size of index segments with 3 IndexWriter parameters,
described fairly verbosely in the javadocs.

Since you are using the .Net version, you should really consult
dotLucene guy(s).  Running under the profiler should also tell you
where the time and memory go.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

 Thanks Otis!
 
 What do you mean by building it in batches? Does it
 mean I should close the IndexWriter every 1000 rows
 and reopen it? Does that releases references to the
 document objects so that they can be
 garbage-collected?
 
 I'm calling optimize() only at the end.
 
 I agree that 1500 documents is very small. I'm
 building the index on a PC with 512 megs, and the
 indexing process is quickly gobbling up around 400
 megs when I index around 1800 documents and the whole
 machine is grinding to a virtual halt. I'm using the
 latest DotLucene .NET port, so maybe there's a memory
 leak in it.
 
 I have experience with AltaVista search (acquired by
 FastSearch), and I used to call MakeStable() every
 20,000 documents to flush memory structures to disk.
 There doesn't seem to be an equivalent in Lucene.
 
 -- Homam
 
 
 
 
 
 
 --- Otis Gospodnetic [EMAIL PROTECTED]
 wrote:
 
  Hello,
  
  There are a few things you can do:
  
  1) Don't just pull all rows from the DB at once.  Do
  that in batches.
  
  2) If you can get a Reader from your SqlDataReader,
  consider this:
 

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
  
  3) Give the JVM more memory to play with by using
  -Xms and -Xmx JVM
  parameters
  
  4) See IndexWriter's minMergeDocs parameter.
  
  5) Are you calling optimize() at some point by any
  chance?  Leave that
  call for the end.
  
  1500 documents with 30 columns of short
  String/number values is not a
  lot.  You may be doing something else not Lucene
  related that's slowing
  things down.
  
  Otis
  
  
  --- Homam S.A. [EMAIL PROTECTED] wrote:
  
   I'm trying to index a large number of records from
  the
   DB (a few millions). Each record will be stored as
  a
   document with about 30 fields, most of them are
   UnStored and represent small strings or numbers.
  No
   huge DB Text fields.
   
   But I'm running out of memory very fast, and the
   indexing is slowing down to a crawl once I hit
  around
   1500 records. The problem is each document is
  holding
   references to the string objects returned from
   ToString() on the DB field, and the IndexWriter is
   holding references to all these document objects
  in
    memory, so the garbage collector isn't getting a
  chance
   to clean these up.
   
   How do you guys go about indexing a large DB
  table?
   Here's a snippet of my code (this method is called
  for
   each record in the DB):
   
   private void IndexRow(SqlDataReader rdr,
  IndexWriter
   iw) {
 Document doc = new Document();
  for (int i = 0; i < BrowseFieldNames.Length; i++)
  {
 doc.Add(Field.UnStored(BrowseFieldNames[i],
   rdr.GetValue(i).ToString()));
 }
 iw.AddDocument(doc);
   }
   
   
   
   
 
   
  
 
   
   
  
  
 
  
  
 
 
 
   
 
 
 





RE: Indexing a large number of DB records

2004-12-15 Thread Garrett Heaver
Hi Homam

I had a similar problem as you in that I was indexing A LOT of data

Essentially how I got round it was to batch the index.

What I was doing was to add 10,000 documents to a temporary index, use
addIndexes() to merge the temporary index into the live index (which also
optimizes the live index) then delete the temporary index. On the next loop
I'd only query rows from the db above the id in the maxdoc of the live index
and set the max rows of the query to 10,000
i.e

SELECT TOP 1 [fields] FROM [tables] WHERE [id_field] > {ID from
Index.MaxDoc()} ORDER BY [id_field] ASC

Ensuring that the documents go into the index sequentially your problem is
solved and memory usage on mine (dotlucene 1.3) is low

Regards
Garrett
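
A minimal sketch of that temp-index-then-merge loop (directory names are
placeholders and the document source is elided):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TempIndexMerge {
    public static void mergeBatch(String liveDir, String tempDir) throws Exception {
        // 1. Build the batch in a scratch index (create=true wipes any leftovers).
        IndexWriter temp = new IndexWriter(tempDir, new StandardAnalyzer(), true);
        // ... add up to ~10,000 documents to temp ...
        temp.close();

        // 2. Merge the scratch index into the live one; addIndexes() also optimizes.
        Directory[] toMerge = { FSDirectory.getDirectory(tempDir, false) };
        IndexWriter live = new IndexWriter(liveDir, new StandardAnalyzer(), false);
        live.addIndexes(toMerge);
        live.close();

        // 3. The temp directory can now be deleted before the next batch.
    }
}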

-Original Message-
From: Homam S.A. [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the
DB (a few millions). Each record will be stored as a
document with about 30 fields, most of them are
UnStored and represent small strings or numbers. No
huge DB Text fields.

But I'm running out of memory very fast, and the
indexing is slowing down to a crawl once I hit around
1500 records. The problem is each document is holding
references to the string objects returned from
ToString() on the DB field, and the IndexWriter is
holding references to all these document objects in
memory, so the garbage collector isn't getting a chance
to clean these up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter
iw) {
Document doc = new Document();
for (int i = 0; i < BrowseFieldNames.Length; i++) {
doc.Add(Field.UnStored(BrowseFieldNames[i],
rdr.GetValue(i).ToString()));
}
iw.AddDocument(doc);
}











RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  No need to call addIndexes nor optimize
until the end.  Adding Documents to an index takes a constant amount of
time, regardless of the index size, because new segments are created as
documents are added, and existing segments don't need to be updated
(only when merges happen).  Again, I'd run your app under a profiler to
see where the time and memory are going.

Otis
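
A sketch of the simpler shape Otis describes -- one long-lived IndexWriter,
rows pulled in batches, optimize() only once at the end (the SQL, table and
column names are invented for illustration):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SingleWriterIndexer {
    public static void indexAll(Connection conn, String indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        int batchSize = 10000;
        long lastId = 0;
        while (true) {
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT TOP " + batchSize + " id, title, body FROM docs " +
                "WHERE id > " + lastId + " ORDER BY id ASC");
            int rows = 0;
            while (rs.next()) {
                lastId = rs.getLong("id");
                Document doc = new Document();
                doc.add(Field.Keyword("id", String.valueOf(lastId)));
                doc.add(Field.UnStored("title", rs.getString("title")));
                doc.add(Field.UnStored("body", rs.getString("body")));
                writer.addDocument(doc);   // same writer reused for every batch
                rows++;
            }
            rs.close();
            stmt.close();
            if (rows == 0) {
                break;                     // no more rows to index
            }
        }
        writer.optimize();                 // only once, at the very end
        writer.close();
    }
}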

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi Homan
 
 I had a similar problem as you in that I was indexing A LOT of data
 
 Essentially how I got round it was to batch the index.
 
 What I was doing was to add 10,000 documents to a temporary index,
 use
 addIndexes() to merge the temporary index into the live index (which
 also
 optimizes the live index) then delete the temporary index. On the
 next loop
 I'd only query rows from the db above the id in the maxdoc of the
 live index
 and set the max rows of the query to 10,000
 i.e
 
 SELECT TOP 1 [fields] FROM [tables] WHERE [id_field] > {ID from
 Index.MaxDoc()} ORDER BY [id_field] ASC
 
 Ensuring that the documents go into the index sequentially your
 problem is
 solved and memory usage on mine (dotlucene 1.3) is low
 
 Regards
 Garrett
 
 -Original Message-
 From: Homam S.A. [mailto:[EMAIL PROTECTED] 
 Sent: 15 December 2004 02:43
 To: Lucene Users List
 Subject: Indexing a large number of DB records
 
 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is each document is holding
 references to the string objects returned from
 ToString() on the DB field, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector is getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter
 iw) {
   Document doc = new Document();
   for (int i = 0; i < BrowseFieldNames.Length; i++) {
   doc.Add(Field.UnStored(BrowseFieldNames[i],
 rdr.GetValue(i).ToString()));
   }
   iw.AddDocument(doc);
 }
 
 
 
 
   
 
 
 
 
 





Re: Indexing a large number of DB records

2004-12-14 Thread Otis Gospodnetic
Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once.  Do that in batches.

2) If you can get a Reader from your SqlDataReader, consider this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)

3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
parameters

4) See IndexWriter's minMergeDocs parameter.

5) Are you calling optimize() at some point by any chance?  Leave that
call for the end.

1500 documents with 30 columns of short String/number values is not a
lot.  You may be doing something else not Lucene related that's slowing
things down.
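(A quick illustration of suggestions 2 and 4 above, using the Lucene 1.4 API;
the ResultSet, column name and index path are placeholders, not code from this
thread:)

import java.io.Reader;
import java.sql.ResultSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ReaderFieldExample {
  // 2) Stream large text through a Reader instead of building a big String.
  public static void addRow(IndexWriter writer, ResultSet rs) throws Exception {
    Reader bodyReader = rs.getCharacterStream("body");
    Document doc = new Document();
    doc.add(Field.Text("body", bodyReader));   // tokenized, not stored
    writer.addDocument(doc);
  }

  // 4) Buffer more documents in RAM before segments are flushed and merged.
  public static IndexWriter openWriter() throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.minMergeDocs = 100;  // default is 10 in Lucene 1.4
    return writer;
  }
}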

Otis


--- Homam S.A. [EMAIL PROTECTED] wrote:

 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is each document is holding
 references to the string objects returned from
 ToString() on the DB field, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter
 iw) {
   Document doc = new Document();
   for (int i = 0; i < BrowseFieldNames.Length; i++) {
   doc.Add(Field.UnStored(BrowseFieldNames[i],
 rdr.GetValue(i).ToString()));
   }
   iw.AddDocument(doc);
 }
 
 
 
 
   
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Indexing a large number of DB records

2004-12-14 Thread Homam S.A.
Thanks Otis!

What do you mean by building it in batches? Does it
mean I should close the IndexWriter every 1000 rows
and reopen it? Does that releases references to the
document objects so that they can be
garbage-collected?

I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm
building the index on a PC with 512 megs, and the
indexing process is quickly gobbling up around 400
megs when I index around 1800 documents and the whole
machine is grinding to a virtual halt. I'm using the
latest DotLucene .NET port, so maybe there's a memory
leak in it.

I have experience with AltaVista search (acquired by
FastSearch), and I used to call MakeStable() every
20,000 documents to flush memory structures to disk.
There doesn't seem to be an equivalent in Lucene.

-- Homam






--- Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Hello,
 
 There are a few things you can do:
 
 1) Don't just pull all rows from the DB at once.  Do
 that in batches.
 
 2) If you can get a Reader from your SqlDataReader,
 consider this:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
 
 3) Give the JVM more memory to play with by using
 -Xms and -Xmx JVM
 parameters
 
 4) See IndexWriter's minMergeDocs parameter.
 
 5) Are you calling optimize() at some point by any
 chance?  Leave that
 call for the end.
 
 1500 documents with 30 columns of short
 String/number values is not a
 lot.  You may be doing something else not Lucene
 related that's slowing
 things down.
 
 Otis
 
 
 --- Homam S.A. [EMAIL PROTECTED] wrote:
 
  I'm trying to index a large number of records from
 the
  DB (a few millions). Each record will be stored as
 a
  document with about 30 fields, most of them are
  UnStored and represent small strings or numbers.
 No
  huge DB Text fields.
  
  But I'm running out of memory very fast, and the
  indexing is slowing down to a crawl once I hit
 around
  1500 records. The problem is each document is
 holding
  references to the string objects returned from
  ToString() on the DB field, and the IndexWriter is
  holding references to all these document objects
 in
  memory, so the garbage collector isn't getting a
 chance
  to clean these up.
  
  How do you guys go about indexing a large DB
 table?
  Here's a snippet of my code (this method is called
 for
  each record in the DB):
  
  private void IndexRow(SqlDataReader rdr,
 IndexWriter
  iw) {
  Document doc = new Document();
  for (int i = 0; i < BrowseFieldNames.Length; i++)
 {
  doc.Add(Field.UnStored(BrowseFieldNames[i],
  rdr.GetValue(i).ToString()));
  }
  iw.AddDocument(doc);
  }
  
  
  
  
  
  
 

-
  To unsubscribe, e-mail:
 [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
  
 
 

-
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Indexing HTML files give following message

2004-12-12 Thread Otis Gospodnetic
Hello,

This is probably due to some bad HTML.  The application you are using
is just a demo, and uses a JavaCC-based HTML parser, which may not be
resilient to invalid HTML.  For Lucene in Action we developed a little
extensible indexing framework, and for HTML indexing we used 2 tools to
handle HTML parsing: JTidy and NekoHTML.  Since the code for the book
is freely available... http://www.manning.com.  NekoHTML knows how
to deal with some bad HTML, that's why I'm suggesting this.
The indexing framework could come handy for those working on various
'desktop search' applications (Roosster, LDesktop (if that's really
happening), Lucidity, etc.)
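As a rough idea of the NekoHTML route (this is not the book's code; the class
and file names here are just placeholders), something along these lines works:

import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NekoHtmlDocument {
  public static Document parse(String fileName) throws Exception {
    DOMParser parser = new DOMParser();                   // tolerant of bad HTML
    parser.parse(new InputSource(new FileReader(fileName)));
    StringBuffer text = new StringBuffer();
    collectText(parser.getDocument(), text);              // walk the DOM tree
    Document doc = new Document();
    doc.add(Field.Keyword("path", fileName));
    doc.add(Field.Text("contents", text.toString()));
    return doc;
  }

  private static void collectText(Node node, StringBuffer buf) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      buf.append(node.getNodeValue()).append(' ');
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      collectText(child, buf);
    }
  }
}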

Otis


--- Hetan Shah [EMAIL PROTECTED] wrote:

 java org.apache.lucene.demo.IndexHTML -create -index
 /source/workarea/hs152827/newIndex ..
 adding ../0/10037.html
 adding ../0/10050.html
 adding ../0/1006132.html
 adding ../0/1013223.html
 Parse Aborted: Encountered \ at line 5, column 1.
 Was expecting one of:
 <ArgName> ...
 "=" ...
 <TagEnd> ...
 
 And then the indexing hangs on this line. Earlier it used to go on
 and
 index remaining pages in the directory. Any idea why would the
 indexer
 stop at this error.
 
 Pointers are much needed and appreciated.
 -H
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing within an XML document

2004-11-10 Thread Otis Gospodnetic
Redirecting to lucene-user, which is more appropriate.

I'm not sure what exactly the question is here, but:

Parse your XML document and for each p element you encounter create a
new Document instance, and then populate its fields with some data,
like the URI data you mentioned.
If you parse with DOM - just walk the node tree and make new Document
whenever you encounter an element you want as a separate Document.  If
you are using the SAX API you'll probably want some logic in
start/endElement and characters methods. When you reach the end of the
element you are done with your Document instance, so add it to the
IndexWriter instance that you opened once, before the parser.
When you are done with the whole XML document close the IndexWriter.
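A bare-bones SAX sketch of that idea, using the p/id markup quoted below (the
field names and index path are just an example, not tested code from the thread):

import javax.xml.parsers.SAXParserFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ElementIndexer extends DefaultHandler {
  private final IndexWriter writer;
  private final String baseUri;
  private StringBuffer text;
  private String id;

  public ElementIndexer(IndexWriter writer, String baseUri) {
    this.writer = writer;
    this.baseUri = baseUri;
  }

  public void startElement(String uri, String local, String qName, Attributes atts) {
    if ("p".equals(qName)) {           // one Lucene Document per p element
      id = atts.getValue("id");
      text = new StringBuffer();
    }
  }

  public void characters(char[] ch, int start, int length) {
    if (text != null) text.append(ch, start, length);
  }

  public void endElement(String uri, String local, String qName) {
    if ("p".equals(qName) && id != null) {
      try {
        Document doc = new Document();
        doc.add(Field.Keyword("url", baseUri + "#" + id));   // e.g. ...blat.xml#p1
        doc.add(Field.Text("contents", text.toString()));
        writer.addDocument(doc);
      } catch (java.io.IOException e) {
        throw new RuntimeException(e.toString());
      }
      text = null;
      id = null;
    }
  }

  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/xml-index", new StandardAnalyzer(), true);
    ElementIndexer handler = new ElementIndexer(writer, "http://purl.org/ceryle/blat.xml");
    SAXParserFactory.newInstance().newSAXParser().parse(new java.io.File(args[0]), handler);
    writer.optimize();
    writer.close();
  }
}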

Otis


--- Murray Altheim [EMAIL PROTECTED] wrote:

 Hi,
 
 I'm trying to develop a class to handle an XML document, where
 the contents aren't so much indexed on a per-document basis,
 rather on an element basis. Each element has a unique ID, so
 I'm looking to create a class/method similar to Lucene's
 Document.Document(). By way of example, I'll use some XHTML
 markup to illustrate what I'm trying to do:
 
   <html>
 <base href="http://purl.org/ceryle/blat.xml"/>
 [...]
 <body>
   <p id="p1">
  some text to index...
   </p>
   <p id="p2">
  some more text to index...
   </p>
   <p id="p3">
  even more text to index...
   </p>
 </body>
</html>
 
 I'd very much appreciate any help in explaining how I'd go about
 creating a method to return a Lucene Document to index this via
 ID. Would I want a separate Document per p? (There are many
 thousands of such elements.) Everything in my system, both at the
 document and the individual element level is done via URL, so
 the method should create URLs for each p element like
 
 http://purl.org/ceryle/blat.xml#p1
 http://purl.org/ceryle/blat.xml#p2
 http://purl.org/ceryle/blat.xml#p3
 etc.
 
 I don't need anyone to go to the trouble of coding this, just point
 me to how it might be done, or to any existing examples that do this
 kind of thing.
 
 Thanks very much!
 
 Murray
 

..
 Murray Altheim   
 http://kmi.open.ac.uk/people/murray/
 Knowledge Media Institute
 The Open University, Milton Keynes, Bucks, MK7 6AA, UK  
 .
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing MS Files

2004-11-10 Thread Otis Gospodnetic
That's one place to start.  The other one would be textmining.org, at
least for Word files.
I used both POI and Textmining API in Lucene in Action, and the latter
was much simpler to use.  You can also find some comments about both
libs in lucene-user archives.  People tend to like Textmining API
better.
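For what it's worth, the textmining.org usage is roughly as follows (from
memory, so treat the class and method names as approximate; the file name is a
placeholder):

import java.io.FileInputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.textmining.text.extraction.WordExtractor;

public class WordIndexing {
  public static Document wordDocument(String fileName) throws Exception {
    FileInputStream in = new FileInputStream(fileName);
    String bodyText = new WordExtractor().extractText(in);  // plain text out of the .doc
    in.close();
    Document doc = new Document();
    doc.add(Field.Keyword("filename", fileName));
    doc.add(Field.Text("contents", bodyText));
    return doc;
  }
}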

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

 I need to index Word, Excel and Power Point files.
 
 Is this the place to start?
 
 http://jakarta.apache.org/poi/
 
 Is there something better?
 
 Thanks,
 
 Luke


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing MS Files

2004-11-10 Thread Luke Shannon
Thanks Otis. I am looking forward to this book. Any idea when it may be
released?

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 11:54 AM
Subject: Re: Indexing MS Files


 That's one place to start.  The other one would be textmining.org, at
 least for Word files.
 I used both POI and Textmining API in Lucene in Action, and the latter
 was much simpler to use.  You can also find some comments about both
 libs in lucene-user archives.  People tend to like Textmining API
 better.

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  I need to index Word, Excel and Power Point files.
 
  Is this the place to start?
 
  http://jakarta.apache.org/poi/
 
  Is there something better?
 
  Thanks,
 
  Luke


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing MS Files

2004-11-10 Thread Otis Gospodnetic
As Manning publications said, you should be able to get it for your
grandma this Christmas.

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

 Thanks Otis. I am looking forward to this book. Any idea when it may
 be
 released?
 
 - Original Message - 
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, November 10, 2004 11:54 AM
 Subject: Re: Indexing MS Files
 
 
  That's one place to start.  The other one would be textmining.org,
 at
  least for Word files.
  I used both POI and Textmining API in Lucene in Action, and the
 latter
  was much simpler to use.  You can also find some comments about
 both
  libs in lucene-user archives.  People tend to like Textmining API
  better.
 
  Otis
 
  --- Luke Shannon [EMAIL PROTECTED] wrote:
 
   I need to index Word, Excel and Power Point files.
  
   Is this the place to start?
  
   http://jakarta.apache.org/poi/
  
   Is there something better?
  
   Thanks,
  
   Luke
 
 
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing MS Files

2004-11-10 Thread Thierry Ferrero
 for loading the document
  PropertyValue propertyvalue[] = new PropertyValue[ 1 ];
  // Setting the flag for hiding the open document
  propertyvalue[ 0 ] = new PropertyValue();
  propertyvalue[ 0 ].Name = "Hidden";
  propertyvalue[ 0 ].Value = new Boolean(true);


  // Loading the wanted document
  Object objectDocumentToStore =
  xcomponentloader.loadComponentFromURL(
  stringUrl, "_blank", 0, propertyvalue );

  // Getting an object that will offer a simple way to store a document
to a URL.
  XStorable xstorable =
  ( XStorable ) UnoRuntime.queryInterface( XStorable.class,
  objectDocumentToStore );

  // Preparing properties for converting the document
  propertyvalue = new PropertyValue[2];
  // Setting the flag for overwriting
  propertyvalue[0] = new PropertyValue();
  propertyvalue[0].Name = "Overwrite";
  propertyvalue[0].Value = new Boolean(true);

  // Setting the filter name
  propertyvalue[1] = new PropertyValue();
  propertyvalue[1].Name = "FilterName";
  propertyvalue[1].Value = stringConvertType;

// Appending the favoured extension to the origin document name
//if(stringUrl.lastIndexOf(".")!=0){
   //stringUrl=stringUrl.substring(0,stringUrl.lastIndexOf("."));
  //}

if(namedoc.lastIndexOf(".")!=-1){
   namedoc=namedoc.substring(0,namedoc.lastIndexOf("."));
  }

  //stringConvertedFile = stringUrl + "." + stringExtension;

stringConvertedFile = xbase.getAlias("local") + "/oo_tmp/" + namedoc + "." + stringExtension;

stringConvertedFile = stringConvertedFile.replace( '\\', '/' );

  // Storing and converting the document
xstorable.storeToURL( stringConvertedFile, propertyvalue );

  // Getting the method dispose() for closing the document
  XComponent xcomponent =
  ( XComponent ) UnoRuntime.queryInterface( XComponent.class,
  xstorable );

  // Closing the converted document
  xcomponent.dispose();
}

 catch(NoConnectException ex ) {
  return( "" );
}
 catch( IOException ex ) {
 return( "" );
 }
catch( Exception ex ) {
return( "" );
}


// Returning the name of the converted file
return( stringConvertedFile );
  }


- Original Message - 
From: Luke Shannon [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 5:59 PM
Subject: Re: Indexing MS Files


 Thanks Otis. I am looking forward to this book. Any idea when it may be
 released?

 - Original Message - 
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, November 10, 2004 11:54 AM
 Subject: Re: Indexing MS Files


  That's one place to start.  The other one would be textmining.org, at
  least for Word files.
  I used both POI and Textmining API in Lucene in Action, and the latter
  was much simpler to use.  You can also find some comments about both
  libs in lucene-user archives.  People tend to like Textmining API
  better.
 
  Otis
 
  --- Luke Shannon [EMAIL PROTECTED] wrote:
 
   I need to index Word, Excel and Power Point files.
  
   Is this the place to start?
  
   http://jakarta.apache.org/poi/
  
   Is there something better?
  
   Thanks,
  
   Luke
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing MS Files

2004-11-10 Thread Luke Shannon
Thanks. Grandmas around the world will certainly be surprised this
Christmas.

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 12:18 PM
Subject: Re: Indexing MS Files


 As Manning publications said, you should be able to get it for your
 grandma this Christmas.

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  Thanks Otis. I am looking forward to this book. Any idea when it may
  be
  released?
 
  - Original Message - 
  From: Otis Gospodnetic [EMAIL PROTECTED]
  To: Lucene Users List [EMAIL PROTECTED]
  Sent: Wednesday, November 10, 2004 11:54 AM
  Subject: Re: Indexing MS Files
 
 
   That's one place to start.  The other one would be textmining.org,
  at
   least for Word files.
   I used both POI and Textmining API in Lucene in Action, and the
  latter
   was much simpler to use.  You can also find some comments about
  both
   libs in lucene-user archives.  People tend to like Textmining API
   better.
  
   Otis
  
   --- Luke Shannon [EMAIL PROTECTED] wrote:
  
I need to index Word, Excel and Power Point files.
   
Is this the place to start?
   
http://jakarta.apache.org/poi/
   
Is there something better?
   
Thanks,
   
Luke
  
  
  
  -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail:
  [EMAIL PROTECTED]
  
  
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing MS Files

2004-11-10 Thread Luke Shannon
This looks great. Thank you Thierry!

- Original Message - 
From: Thierry Ferrero [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 12:23 PM
Subject: Re: Indexing MS Files


 I used OpenOffice API to convert all Word and Excel version.
 For me it's the solution for complex Word and Excel document.
 http://api.openoffice.org/
 Good luck !

 // UNO API
 import com.sun.star.bridge.XUnoUrlResolver;
 import com.sun.star.uno.XComponentContext;
 import com.sun.star.uno.UnoRuntime;
 import com.sun.star.frame.XComponentLoader;
 import com.sun.star.frame.XStorable;
 import com.sun.star.beans.PropertyValue;
 import com.sun.star.beans.XPropertySet;
 import com.sun.star.lang.XComponent;
 import com.sun.star.lang.XMultiComponentFactory;
 import com.sun.star.connection.NoConnectException;
 import com.sun.star.io.IOException;


 /** This class implements a http servlet in order to convert an incoming
 document
  * with help of a running OpenOffice.org and to push the converted file
back
  * to the client.
  */
 public class DocConverter {

  private String stringHost;
  private String stringPort;
  private Xcontext xcontext;
  private Xbase xbase;

  public DocConverter(Xbase xbase,Xcontext xcontext,ServletContext sc) {

   this.xbase=xbase;
   this.xcontext=xcontext;
 stringHost=ApplicationUtil.getParameter(sc,"openoffice.oohost");
 stringPort=ApplicationUtil.getParameter(sc,"openoffice.ooport");
}

  public synchronized String convertToTxt(String namedoc, String pathdoc,
 String stringConvertType, String stringExtension) {

 String stringConvertedFile = this.convertDocument(namedoc,
pathdoc,
 stringConvertType, stringExtension);
   return stringConvertedFile;
  }


  /** This method converts a document to a given type by using a running
  * OpenOffice.org and saves the converted document to the specified
  * working directory.
  * @param stringDocumentName The full path name of the file on the server
to
 be converted.
  * @param stringConvertType Type to convert to.
  * @param stringExtension This string will be appended to the file name of
 the converted file.
  * @return The full path name of the converted file will be returned.
  * @see stringWorkingDirectory
  */
  private String convertDocument(String namedoc, String pathdoc, String
 stringConvertType, String stringExtension ) {

  String tagerr = "";
 String stringUrl = "";
 String stringConvertedFile = "";
 // Converting the document to the favoured type
 try {
   tagerr = "0";
   // Composing the URL - removing the extension
   stringUrl = pathdoc + "/" + namedoc;
  stringUrl=stringUrl.replace( '\\', '/' );
   /* Bootstraps a component context with the jurt base components
  registered. Component context to be granted to a component for
 running.
  Arbitrary values can be retrieved from the context. */
   XComponentContext xcomponentcontext =
   com.sun.star.comp.helper.Bootstrap.createInitialComponentContext(
 null );

   /* Gets the service manager instance to be used (or null). This
method
 has
  been added for convenience, because the service manager is a
often
 used
  object. */
   XMultiComponentFactory xmulticomponentfactory =
   xcomponentcontext.getServiceManager();
 tagerr = "2";
   /* Creates an instance of the component UnoUrlResolver which
  supports the services specified by the factory. */
   Object objectUrlResolver =
   xmulticomponentfactory.createInstanceWithContext(
   "com.sun.star.bridge.UnoUrlResolver", xcomponentcontext );
// Create a new url resolver
   XUnoUrlResolver xurlresolver = ( XUnoUrlResolver )
   UnoRuntime.queryInterface( XUnoUrlResolver.class,
   objectUrlResolver );
 // Resolves an object that is specified as follow:
   // uno:connection description;protocol description;initial
object
 name
   Object objectInitial = xurlresolver.resolve(
   "uno:socket,host=" + stringHost + ",port=" + stringPort +
 ";urp;StarOffice.ServiceManager" );

   // Create a service manager from the initial object
   xmulticomponentfactory = ( XMultiComponentFactory )
   UnoRuntime.queryInterface( XMultiComponentFactory.class,
 objectInitial );
   // Query for the XPropertySet interface.
   XPropertySet xpropertysetMultiComponentFactory = ( XPropertySet )
   UnoRuntime.queryInterface( XPropertySet.class,
 xmulticomponentfactory );
// Get the default context from the office server.
   Object objectDefaultContext =
   xpropertysetMultiComponentFactory.getPropertyValue(
 "DefaultContext" );

   // Query for the interface XComponentContext.
   xcomponentcontext = ( XComponentContext ) UnoRuntime.queryInterface(
   XComponentContext.class, objectDefaultContext );

   /* A desktop environment contains tasks with one or more
  frames in which components can be loaded. Desktop is the
  environment

RE: Indexing process causes Tomcat to stop working

2004-10-28 Thread iouli . golovatyi
before screwing tomcat too much...

1. make sure both indexing and reading processes use the same locking 
directory (i.e. set it explicitly, take a look in the wiki how-to)
2. try to execute queries from the command line and see what happens
3. in case your queries use sorting, there is a memory leak in 1.4.1 - 
upgrade to 1.4.2
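For point 1, the usual way in Lucene 1.4 is the org.apache.lucene.lockDir
system property (if I remember the name right; the path below is just an
example), set in both the indexing JVM and the searching webapp:

// Set once, before any FSDirectory is opened:
System.setProperty("org.apache.lucene.lockDir", "/var/lucene/locks");

// or on the command line / in CATALINA_OPTS:
//   java -Dorg.apache.lucene.lockDir=/var/lucene/locks ...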

Regards,
J.






James Tyrrell [EMAIL PROTECTED]
28.10.2004 10:13
Please respond to Lucene Users List

 
To: [EMAIL PROTECTED]
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject:RE: Indexing process causes Tomcat to stop working
Category: 



From: Armbrust, Daniel C. [EMAIL PROTECTED]

Right got back to work with newly created  index to try these ideas,

So, are you creating the indexes from inside the tomcat runtime, or are 
you 
creating them on the command line (which would be in a different runtime 
than tomcat)?

I'm creating them on the command line using a variation on the standard 
shown in the demo (has some additional optimisation input that is set to 
default until I can fix this bug).

What happens to tomcat?  Does it hang - still running but not responsive? 
 
Or does it crash?
If it hangs, maybe you are running out of memory.  By default, Tomcat's 
limit is set pretty low...

It definitely hangs; when shut down you can't access it, and when restarted it 
just sits there trying to access port 8080

There is no reason at all you should have to reboot... If you stop and 
start tomcat, (make sure it actually stopped - sometimes it requires a 
kill -9 when it really gets hung) it should start working again. 
Depending on your setup of Tomcat + apache, you may  have to restart 
apache 
as well to get them linked to each other again...

Good news: this did work. However, I never see tomcat in top or even using ps 
-A | grep tomcat; the only way I've found tomcat is using ps -auwx | grep 
tomcat. The output is

*after tomcat shutdown.sh run*
---
root  2266  0.0  3.8 243740 4860 pts/0   SOct26   0:36 
/opt/jdk1.4/bin/java -Djava.endorsed.dirs=/opt/tomcat/common/endorsed 
-classpath 
/opt/jdk1.4/lib/tools.jar:/opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/commons-logging-api.jar
 

-Dcatalina.base=/opt/tomcat -Dcatalina.home=/opt/tomcat 
-Djava.io.tmpdir=/opt/to
root 16050  0.0  0.4  3576  620 pts/0S08:41   0:00 grep tomcat
--

I did however find two java processes running, so I dutifully used kill -9 
on both pids; hey presto, when I restarted Tomcat it ran perfectly. So while 
I can work around this, I guess now the question becomes: does 
anyone have any advice as to what could be causing this? Bearing in mind I 
can still run java processes (even create new indexes) on the same machine, 
so it is just Tomcat that's affected.

Meanwhile, I will try as Dan suggested to raise the default memory of 
Tomcat 
significantly and run another index (it seems a likely culprit).

Thanks for all the help thus far, its more than appreciated regards,

JT


Original Message-
From: James Tyrrell [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 27, 2004 10:49 AM
To: [EMAIL PROTECTED]
Subject: RE: Indexing process causes Tomcat to stop working

Aad,
   D'oh forgot to mention that mildly important info. Rather than
re-index I am just creating a new index each time, this makes things 
easier
to roll-back etc (which is what my boss wants). the command line is
something like java com.lucene.IndexHTML -create -index indexstore/ .. 
I
have wondered about whether sessions could be a problem, but I don't 
think
so, otherwise wouldn't a restart of Tomcat be sufficient rather than a
reboot? I even tried the killall command on java & tomcat then started
everything again to no avail.

cheers,

JT



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





RE: Indexing process causes Tomcat to stop working

2004-10-28 Thread James Tyrrell

From: [EMAIL PROTECTED]
Hello!
before scewing tomcat too much...
A little late but probably good advice - thankfully it hasn't gone wrong
1.make it sure both indexing and reading processes use the same locking
directory (i.e. set it explicitly, take a look in wiky how to)
Working on this; I'm not so good at Java yet (until recently I mostly worked on 
PHP). I looked at the wiki how-tos, but could you be more specific, as I 
couldn't find much on locking directories? But I will struggle on.

2. try to execute queries from command line and see what happends
I only execute from the command line, so all the info in previous posts 
is what happens

3. in case your queries use sorting, there is a memory leak it 1.4.1 -
upgrade to 1.4.2
My queries do use sorting! So I have placed the 1.4 final jar onto my 
classpath and have started 'another' index; as the company I work for is 
moving home tomorrow, I may not be able to tell you if that worked till next 
week, mind.

To Dan, the increased memory allocation for Tomcat didn't work, unfortunately, 
but I do know a lot more about CATALINA_OPTS and Tomcat now, which has proved 
handy for other things.

Cheers for all the advice, people; I'll keep you posted if I make a 
breakthrough.
Thanks for your patience. Regards,

JT

RE: Indexing process causes Tomcat to stop working

2004-10-28 Thread Armbrust, Daniel C.
You want version 1.4.2, not version 1.4.

The website makes it hard to find 1.4.2, because the mirrors have not been updated yet.

Get 1.4.2 here:  http://cvs.apache.org/dist/jakarta/lucene/v1.4.2/
 

My queries do use sorting! So I have placed the 1.4 final jar onto my 
classpath and have started 'another' index, as the company I work for is 
moving home tomorrow may not be able to tell you if that worked till next 
week mind.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing process causes Tomcat to stop working

2004-10-27 Thread Aad Nales
James,

How do you kick off your reindex? Could it be a session timeout? 

cheers,
Aad


Hello,

I am a Java/Lucene/Tomcat newbie - I know that does not bode well as a start 
to a post - but I really am in dire straits as far as Lucene goes, so bear with 
me. I am working on indexing and replacing search functionality for a 
website (about 10 gig in size, although only about 7 gig is indexed) I 
presently have a working model based on the luceneweb demo dispatched
with 
Lucene, this has already proven functional when tested on various sites 
(admittedly much smaller 200-400mb etc). However, issues occur when 
performing the index on the main site that I haven't found explained on
any 
of the Lucene forums thus far.

After a successful index and optimisation of the website (takes around
4hrs 
40m unoptimised) I can't get to the index.jsp or even access tomcat. My 
first thought was to restart tomcat. No joy and no access. Thinking the 
larger index had killed the test server I accessed apache on port 80,
which 
worked perfectly.  After a few checks I realised the test server was
fine, 
apache was fine, used the same application to create an index of the
tomcat 
docs so java was working. Confused I went back to the forums, FAQ's and 
groups to see if anyone had any similar problems and have come up with a

brief list of what my problem is not;

There are no index write.lock files found for Lucene in either /tmp or 
opt/tomcat/temp directories so the index is open to be searched. Nor
does 
'top' reveal anything overloading the system. Apache is running fine and

displays all relevant pages. Tomcat cannot be reached with a browser 
(neither the default congratulations page or the Luceneweb application) 
Tomcat was a fresh install as was Java, Tomcat logs show nothing
different 
to standard startup logs. So I logged the entire indexing process and
saw 
two errors occurring infrequently.

Parse Aborted: Encountered \ at line 6, column 129. //where these
values 
vary
Was expecting one of:
    <ArgName> ...
    "=" ...
    <TagEnd> ...

I'm satisfied this is just the HTML parser kicking off about some badly 
formatted HTML and is only affecting what is indexed but its here for 
completeness. The other error is more serious:

java.io.IOException: Pipe closed
   at java.io.PipedInputStream.receive(PipedInputStream.java:136)
   at java.io.PipedInputStream.receive(PipedInputStream.java:176)
   at java.io.PipedOutputStream.write(PipedOutputStream.java:129)
   at 
sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
   at 
sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java:395)
   at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
   at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:146)
   at java.io.OutputStreamWriter.write(OutputStreamWriter.java:204)
   at java.io.Writer.write(Writer.java:126)
   at 
org.apache.lucene.demo.html.HTMLParser.addText(HTMLParser.java:137)
   at 
org.apache.lucene.demo.html.HTMLParser.HTMLDocument(HTMLParser.java:203)
   at
org.apache.lucene.demo.html.ParserThread.run(ParserThread.java:31)

I'm again pretty sure that this is the same error that occurred once
before 
when I was using the maxFieldLength to limit the number of terms
recorded. 
I'm also confident it's a threading error and found the following post
by 
Doug Cutting that seemed to explain it
http://java2.5341.com/msg/80502.html 
however I am assuming that's what it is and haven't yet attempted to change 
the threading system of the demo due to my lack of Java knowledge.

The strange thing is that after restarting the server all aspects of the Lucene 
web application work perfectly - stemming, alphanumeric indexing, summaries 
etc. are all as expected - so I am left assuming from this (and from running out 
of options) that Lucene has somehow done something to Tomcat by doing such a 
large index. Since both run on Java I guess it's something to do with 
that, but I have nowhere near enough experience in Java to work out what.

The system I am currently running on is Java - 1.4.2_05, Tomcat -
5.0.27, 
Lucene - 1.4.1, Linux version - 2.4.20-8 (gcc version 3.2.2 20030222
(Red 
Hat Linux 3.2.2-5)), Apache 2.0.42. I have not modified the mergeFactor
or 
MaxMergeDocuments nor am I using RAMdirectories. The processor is 800MHz
and 
there is 128mb of RAM.

If more info is required on setup, source code etc or you think this
should 
be moved to a tomcat forum just post.

Best regards and thanks in advance for any advice you can offer,

J Tyrrell



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing process causes Tomcat to stop working

2004-10-27 Thread James Tyrrell
Aad,
 D'oh forgot to mention that mildly important info. Rather than 
re-index I am just creating a new index each time, this makes things easier 
to roll-back etc (which is what my boss wants). the command line is 
something like java com.lucene.IndexHTML -create -index indexstore/ .. I 
have wondered about whether sessions could be a problem, but I don't think 
so, otherwise wouldn't a restart of Tomcat be sufficient rather than a 
reboot? I even tried the killall command on java & tomcat then started 
everything again to no avail.

cheers,
JT

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Indexing process causes Tomcat to stop working

2004-10-27 Thread Armbrust, Daniel C.
So, are you creating the indexes from inside the tomcat runtime, or are you creating 
them on the command line (which would be in a different runtime than tomcat)?

What happens to tomcat?  Does it hang - still running but not responsive?  Or does it 
crash?  

If it hangs, maybe you are running out of memory.  By default, Tomcat's limit is set 
pretty low...

There is no reason at all you should have to reboot... If you stop and start tomcat, 
(make sure it actually stopped - sometimes it requires a kill -9 when it really gets 
hung) it should start working again.  Depending on your setup of Tomcat + apache, you 
may  have to restart apache as well to get them linked to each other again...

Dan




-Original Message-
From: James Tyrrell [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 27, 2004 10:49 AM
To: [EMAIL PROTECTED]
Subject: RE: Indexing process causes Tomcat to stop working

Aad,
  D'oh forgot to mention that mildly important info. Rather than 
re-index I am just creating a new index each time, this makes things easier 
to roll-back etc (which is what my boss wants). the command line is 
something like java com.lucene.IndexHTML -create -index indexstore/ .. I 
have wondered about whether sessions could be a problem, but I don't think 
so, otherwise wouldn't a restart of Tomcat be sufficient rather than a 
reboot? I even tried the killall command on java & tomcat then started 
everything again to no avail.

cheers,

JT



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing Strategy for 20 million documents

2004-10-12 Thread Otis Gospodnetic

--- Christoph Kiehl [EMAIL PROTECTED] wrote:

 Otis Gospodnetic wrote:
 
  I would try putting everything in a single index first, and split
 it up
  only if I see performance issues.  
 
 Why would you put everything into a single index? I found some benchmark 
 results on the list (starting with your post from 06/08/04) from
 which I 
 got the impression that the performance loss is very small if I
 choose 
 to search in multiple indexes with MultiSearcher instead of using one
 
 big index.

I think it's simpler to deal with a single index.  One directory, one
set of lock files, etc.  If you don't gain anything by having multiple
indices, why have them?

  Going from 1 index to N indices is
  not a lot of work (not a lot of Lucene-related code). 
 
 How do you get from 1 index to N indices without adding the documents
 again?

Yes, you would have to re-create N Lucene indices.

Otis


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing numeric entities?

2004-10-12 Thread Damian Gajda
Yes, you need to parse the entities yourself. I implemented an HTML
entity parser as part of the http://objectledge.org project. You may use
it if it fits your needs. It is in the ledge-components project
module. See http://objectledge.org/modules/ledge-components/index.html
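If you only need the numeric (&#NNNN;) form and don't want a full HTML entity
library, a tiny hand-rolled decoder along these lines is enough to run over the
text before analysis (a sketch, not the objectledge code; requires JDK 1.4 for
java.util.regex):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {
  // Decimal numeric character references only; hex (&#x...) is not handled here.
  private static final Pattern ENTITY = Pattern.compile("&#(\\d+);");

  public static String decode(String html) {
    Matcher m = ENTITY.matcher(html);
    StringBuffer out = new StringBuffer();
    int last = 0;
    while (m.find()) {
      out.append(html.substring(last, m.start()));
      out.append((char) Integer.parseInt(m.group(1)));  // replace &#NNNN; with the char
      last = m.end();
    }
    out.append(html.substring(last));
    return out.toString();
  }
}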

Have fun,
-- 
Damian Gajda
Caltha Sp. j.
http://www.caltha.pl/




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: indexing numeric entities?

2004-10-12 Thread Patel, Viral


-Original Message-
From: Damian Gajda [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 12, 2004 10:23 AM
To: Lucene Users List
Subject: Re: indexing numeric entities?


Yes You need to parse the entities Yourself. I implemented an HTML
entity parser as a part of http://objectledge.org project. You may use
it if it will fit Your needs. It is in a ledge-components project
module. See http://objectledge.org/modules/ledge-components/index.html

Have fun,
-- 
Damian Gajda
Caltha Sp. j.
http://www.caltha.pl/




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing Strategy for 20 million documents

2004-10-08 Thread Justin Swanhart
It depends on a lot of factors.  I myself use multiple indexes for
about 10M documents.
My documents are transient.  Each day I get about 400K and I remove
about 400K.  I
always remove an entire day's documents at one time.  It is much
faster/easier to delete
the lucene index for the day that I am removing than looping through
one big index and
removing the entries with the IndexReader.  Since my data is also
partitioned by day in
my database, I essentially do the same thing there with truncate table.

I use a ParallelMultiSearcher object to search the indexes.  I store
my indexes on a 14
disk 15k rpm  fibre channel RAID 1+0 array (striped mirrors).

I get very good performance in both updating and searching indexes.
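In code, the search side of that layout looks roughly like this (Lucene 1.4;
the directory layout and field name are invented for the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class DailyIndexSearch {
  public static Hits search(String[] dayIndexPaths, String queryText) throws Exception {
    Searchable[] searchers = new Searchable[dayIndexPaths.length];
    for (int i = 0; i < dayIndexPaths.length; i++) {
      searchers[i] = new IndexSearcher(dayIndexPaths[i]);  // e.g. /indexes/2004-10-08
    }
    // Searches all per-day indexes in parallel, one thread per index.
    ParallelMultiSearcher searcher = new ParallelMultiSearcher(searchers);
    Query query = QueryParser.parse(queryText, "contents", new StandardAnalyzer());
    return searcher.search(query);  // caller should close the searcher when done
  }
}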

On Fri, 8 Oct 2004 06:11:37 -0700 (PDT), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Jeff,
 
 These questions are difficult to answer, because the answer depends on
 a number of factors, such as:
 - hardware (memory, disk speed, number of disks...)
 - index complexity and size (number of fields and their size)
 - number of queries/second
 - complexity of queries
 etc.
 
 I would try putting everything in a single index first, and split it up
 only if I see performance issues.  Going from 1 index to N indices is
 not a lot of work (not a lot of Lucene-related code).  If searching 1
 big index is too slow, split your index, put each index on a separate
 disk, and use ParallelMultiSearcher
 (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html)
 to search your indices.
 
 Otis
 
 
 
 
 --- Jeff Munson [EMAIL PROTECTED] wrote:
 
  I am a new user of Lucene.  I am looking to index over 20 million
  documents (and a lot more someday) and am looking for ideas on the
  best
  indexing/search strategy.
 
  Which will optimize the Lucene search, one index or multiple indexes?
  Do I create multiple indexes and merge them all together?  Or do I
  create multiple indexes and search on the multiple indexes?
 
  Any helpful ideas would be appreciated!
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
Daan Hoogland wrote:

Daan Hoogland wrote:

  

Hello,

Does anyone do indexing of numeric entities for Japanese characters? I 
have (non-x)html containing those entities and need to index and search 
them.


 



Can the CJKAnalyzer index a string like &#9679;&#20837;&#31038;? It 
seems to be ignored completely when used with the demo. There was talk 
on this list of fixes for the demo HTMLParser, do these address this 
issue? When I look at the code it seems that the entities should have 
been interpreted before indexing. What am I missing?

Any comment please?
Or a pointer to a howto for dumm^H^H^H^H^H westerners?
  

Indexing the attached document using the HTMLParser demo and the 
CJKAnalyzer, only the term japan is found in the content. This is not 
correct, is it?
Should I convert the entities by hand?


thanks,


  





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
maybe inline?

<html xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <head>
  <title>japan</title>
 </head>
 <body bgcolor="#FF" alink="black">
  <p>

&#12501;&#12451;&#12540;&#12523;&#12489;&#12469;&#12540;&#12499;&#12473;&#12456;&#12531;&#12472;&#12491;&#12450;

  </p>

</html>

Indexing the above document using the HTMLParser demo and the 
CJKAnalyzer, only the term japan is found in the content. This is not 
correct, is it?
Should I convert the entities by hand?


Sorry for the mess I send before.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



re-indexing

2004-09-28 Thread Jason
I am having trouble reindexing.
Basically what I want to do is:
1. Delete the old index
2. Write the new index.
The environment:
The index is searched by a web app running from the Orion App Server. This
code runs fine and reindexes fine prior to any searches.  After the first
search against the index is completed the index ends up being read-only
(or not writeable); I cannot reindex and subsequently cannot search
because the index is incomplete.
1. Why doesn't IndexReader.delete(i) really delete the file? It seems to
just make another 1K file with a .del extension that the IndexWriter still
cannot contend with.
2. How can I make this work?
Thanks,
Jason
The code below produces the following output when run AFTER an initial
search against the index has been completed:
IndexerDrug-disableLuceneLocks: true
Directory: [EMAIL PROTECTED]:\lucene_index_drug
Deleted [0]: true
... (out put form for loop confirming deleted items)
Deleted [367]: true
Hit uncaught exception java.io.IOException
java.io.IOException: Cannot delete _ba.cfs
   at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
   at
org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:105)
   at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:193)
   at IndexerDrug.index(IndexerDrug.java:103)
   at IndexerDrug.main(IndexerDrug.java:246)
Exception in thread main
=-=-=-=-=-=-=-=-=-=-=-=-=-
My indexing code  (some items have been deleted to protect the innocent)
=-=-=-=-=-=-=-=-=-=-=-=-=-
import java.io.*;
import java.sql.*;
import javax.naming.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
public class IndexerDrug {
 private String sql = "my query code";
 public static String[] stopWords =
org.apache.lucene.analysis.standard.StandardAnalyzer.STOP_WORDS;
 public File indexDir = new File("C:\\lucene_index_drug\\");
 public Directory fsDir;
 public void index() throws IOException {
   try {
   // Delete old index
   fsDir = FSDirectory.getDirectory(indexDir, false);
   if (indexDir.list().length > 0) {
   IndexReader reader = IndexReader.open(fsDir);
   
   System.out.println("Directory: " + reader.directory().toString());
   reader.unlock(fsDir);
   for (int i = 0; i < reader.maxDoc()-1; i++) {
   reader.delete(i);
   System.out.println("Deleted [" + i + "]: "
 + reader.isDeleted(i));
   }
   reader.close();
   }
   }
   catch (Exception ex) {
   System.out.println("Error while deleting index: "
 + ex.getMessage());
   }
   // Write new index
   Analyzer analyzer = new StandardAnalyzer(stopWords);
   IndexWriter writer = new IndexWriter(indexDir, analyzer, 
true);//  fails here *
   writer.mergeFactor = 1000;
   indexDirectory(writer);
   writer.setUseCompoundFile(true);
   writer.optimize();
   writer.close();

 }
 private void indexDirectory(IndexWriter writer) throws IOException {
   Connection c = null;
   ResultSet rs = null;
   Statement stmt = null;
   long startTime = System.currentTimeMillis();
   System.out.println("Start Time: " + new
java.sql.Timestamp(System.currentTimeMillis()).toString());
   try {
 Class.forName();
 c = DriverManager.getConnection( , , );
 stmt = c.createStatement();
 rs = stmt.executeQuery(this.sql);
  System.out.println("Query Completed: " + new
java.sql.Timestamp(System.currentTimeMillis()).toString());
 int total = 0;
  String resourceID = "";
  String resourceName = "";
  String summary = "";
  String shortSummary = "";
  String hciPick = "";
  String url = "";
  String format = "";
  String orgType = "";
  String holdingType = "";
  String indexText = "";
  String c_indexText = "";
 boolean ready = false;
 Document doc = null;
 String oldResourceID = null;
 String newResourceID = null;
 while (rs.next()) {
    newResourceID = rs.getString("resourceID") != null ?
rs.getString("resourceID") : "";
    resourceID = newResourceID;
    resourceName = rs.getString("resourceName") != null ?
rs.getString("resourceName") : "";
    summary = rs.getString("summary") != null ?
rs.getString("summary") : "";
    if (summary.length() > 300) {
  shortSummary = summary.substring(0, 300) + "...";
    } else {
  shortSummary = summary;
    }
    hciPick = rs.getString("hciPick") != null 
? rs.getString("hciPick") : "";
    url = rs.getString("url") != null ? rs.getString("url") : "";
    format = rs.getString("format") != null ? 
rs.getString("format") : "";
    orgType = rs.getString("orgType") != null 
? rs.getString("orgType") : "";
    holdingType = rs.getString("holdingType") != null 
? rs.getString("holdingType") : "";
    indexText = rs.getString("indexText") != null 
? rs.getString("indexText") : "";

   if 

Re: re-indexing

2004-09-28 Thread Bo Gundersen
Jason wrote:
I am having touble reindexing.
Basically what I want to do is:
1. Delete the old index
2. Write the new index.
The enviroment:
The index is search by a web app running from the Orion App Server. This
code runs fin and reindexes fine prior to any searches.  After the first
search against the index is completed the index ends up beiong read-only
( or not writeable), I cannot reindex and subsequently cannot search
because the index is incomplete.
We have several apps running like this only on Tomcat and JBoss with no 
problems...

1. Why doesn't IndexReader.delete(i) really delete the file. it seems to
just make anothe 1K file with a .del extension the IndexWriter still
cannot content with?
Never tried the IndexReader.delete() method, we generally build the new 
index in a temporary directory and when the index is done we delete the 
current online directory (using java.io.File methods) and then rename 
the temp directory to online.
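A stripped-down version of that swap, using plain java.io.File (the paths and
class name are examples; a real version needs more error handling, for instance
on Windows where files that are still open cannot be deleted):

import java.io.File;

public class IndexSwap {
  // Delete the old online index tree, then move the freshly built index into its place.
  public static void swap(File tempIndex, File onlineIndex) {
    deleteRecursively(onlineIndex);
    if (!tempIndex.renameTo(onlineIndex)) {
      throw new RuntimeException("could not rename " + tempIndex + " to " + onlineIndex);
    }
  }

  private static void deleteRecursively(File dir) {
    File[] children = dir.listFiles();
    for (int i = 0; children != null && i < children.length; i++) {
      if (children[i].isDirectory()) {
        deleteRecursively(children[i]);
      } else {
        children[i].delete();
      }
    }
    dir.delete();
  }
}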

2. How can I make this work?
This may just be silly, but do you remember to close your 
org.apache.lucene.search.IndexSearcher when you are done with your search?

--
Bo Gundersen
DBA/Software Developer
M.Sc.CS.
www.atira.dk
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing date ranges

2004-09-21 Thread Erik Hatcher
If it is unindexed, then you cannot query on it, so you do not have a 
choice.  The other option is to use a field that is indexed, not 
tokenized, and not stored (you have to use new Field(...) to accomplish 
that) if you don't want to store the field data.
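Concretely, with the 1.4 Field constructor that would be something like the
sketch below (the field name and the zero-padding scheme are only an example;
padding just keeps lexicographic range queries in the same order as the
numbers):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class EpochDateField {
  // new Field(name, value, store, index, token): indexed, not tokenized, not stored.
  public static void addEpochField(Document doc, long epochSeconds) {
    String padded = pad(epochSeconds, 12);
    doc.add(new Field("epoc_date", padded, false, true, false));
  }

  private static String pad(long value, int width) {
    StringBuffer buf = new StringBuffer(Long.toString(value));
    while (buf.length() < width) {
      buf.insert(0, '0');
    }
    return buf.toString();
  }
}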

Erik
On Sep 21, 2004, at 5:54 PM, Chris Fraschetti wrote:
is it most efficient to index or not index 'numeric' ranges that i
will do a range search by? epoc_date:[110448 TO 820483200]
would be be better to treat it as Field.Keyword or Field.UnIndexed  ?
--
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-13 Thread Daniel Taurat
Hi Doug,
you are absolutely right about the older version of the JDK: it is 1.3.1 
(ibm).
Unfortunately we cannot upgrade since we are bound to IBM Portalserver 4 
environment.
Results:
I patched the Lucene1.4.1:
it has not improved much: after indexing 1897 objects the number of 
SegmentTermEnum instances is up to 17936.
To be realistic: this is even a deterioration :(((
My next check will be with a JDK1.4.2 for the test environment, but this 
can only be a reference run for now.

Thanks,
Daniel
Doug Cutting wrote:
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
is that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it 
works for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before 
FieldCache was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem: i have a database of documents with 24 
fields, and an average content of 7K, with 16M+ records

 i had to split the jobs into slabs of 1M each and merge the 
resulting indexes; submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had an OutOfMemory exception; the solution that i created 
was, after every 200K documents, to create a temp directory and 
merge them together. This was done for the first production run; 
updates are now being handled incrementally

 

Exception in thread main java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems 
to be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced 
gc runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:
 

Hi all
Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory.
Can you check whether or not your garbage collection is being triggered?
Anomalously therefore if this is the case, by reducing the heap space you can improve performance and get rid of the out of memory errors.
Cheers
Pete Lewis
- Original Message - From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
I am facing an out of memory problem using Lucene 1.4.1.
Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-13 Thread Daniel Taurat
Okay,  reference test is done:
on JDK 1.4.2 Lucene 1.4.1 really seems to run fine: just a moderate 
number of SegmentTermEnums that is controlled by gc (about 500 for the 
1900 test objects).

Daniel Taurat wrote:
Hi Doug,
you are absolutely right about the older version of the JDK: it is 
1.3.1 (ibm).
Unfortunately we cannot upgrade since we are bound to IBM Portalserver 
4 environment.
Results:
I patched the Lucene1.4.1:
it has improved not much: after indexing 1897 Objects  the number of 
SegmentTermEnum is up to 17936.
To be realistic: This is even a deterioration :(((
My next check will be with a JDK1.4.2 for the test environment, but 
this can only be a reference run for now.

Thanks,
Daniel
Doug Cutting wrote:
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
is that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it 
works for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before 
FieldCache was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i 
created was to after every 200K, documents create a temp 
directory, and merge them together, this was done to do the first 
production run, updates are now being handled incrementally

 

Exception in thread main java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems 
to be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced 
gc runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:
 

Hi all
Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory.
Can you check whether or not your garbage collection is being triggered?
Anomalously therefore if this is the case, by reducing the heap space you can improve performance and get rid of the out of memory errors.
Cheers
Pete Lewis
- Original Message - From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
 

I am facing an out of memory problem using  Lucene 1.4.1.
   

Could you try with a recent CVS version? There has been a fix about files 
not being deleted after 1.4.1. Not sure if that could cause the problems 
you're experiencing.

Regards
Daniel
 

Well, it seems not to be files, it looks more like those SegmentTermEnum 
objects accumulating in memory.
I've seen some discussion on these objects in the developer-newsgroup 
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Pete Lewis
Hi all

Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.

If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously, therefore, if this is the case, by reducing the heap space you
can improve performance and get rid of the out of memory errors.

Cheers
Pete Lewis

- Original Message - 
From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents


 Daniel Aber schrieb:

 On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
 
 
 
 I am facing an out of memory problem using  Lucene 1.4.1.
 
 
 
 Could you try with a recent CVS version? There has been a fix about files
 not being deleted after 1.4.1. Not sure if that could cause the problems
 you're experiencing.
 
 Regards
  Daniel
 
 
 
 Well, it seems not to be files, it looks more like those SegmentTermEnum
 objects accumulating in memory.
 #I've seen some discussion on these objects in the developer-newsgroup
 that had taken place some time ago.
 I am afraid this is some kind of runaway caching I have to deal with.
 Maybe not  correctly addressed in this newsgroup, after all...

 Anyway: any idea if there is an API command to re-init caches?

 Thanks,

 Daniel



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm jdk1.3.1 
that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to be 
1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type objects, 
that is): No effect.

regards,
Daniel
Pete Lewis wrote:
Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.
Can you check whether or not your garbage collection is being triggered?
Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.
Cheers
Pete Lewis
- Original Message - 
From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

 

Daniel Aber schrieb:
   

On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

 

I am facing an out of memory problem using  Lucene 1.4.1.
   

Could you try with a recent CVS version? There has been a fix about files
not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.
Regards
Daniel

 

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...
Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Ben Litchfield
 I can say that gc is not collecting these objects since I  forced gc
 runs when indexing every now and then (when parsing pdf-type objects,
 that is): No effect.

What PDF parser are you using?  Is the problem within the parser and not
lucene?  Are you releasing all resources?

Ben

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Rupinder Singh Mazara


hi all 

  I had a similar problem: I have a database of documents with 24 fields, an average 
content of 7K, and 16M+ records.

  I had to split the job into slabs of 1M each and merge the resulting indexes; 
submissions to our job queue looked like
 
  java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
  
 and I still had an OutOfMemory exception. The solution I came up with was, after 
every 200K documents, to create a temp directory and merge them together. This was done 
for the first production run; updates are now being handled incrementally.
 
  

Exception in thread main java.lang.OutOfMemoryError
at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)
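
A minimal sketch of the slab-and-merge approach described above, using the 1.4-era IndexWriter.addIndexes(Directory[]) call. The paths, the number of slabs, and the buildSlab() helper are placeholders for illustration only, not Rupinder's actual code.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SlabMergeSketch {
    public static void main(String[] args) throws Exception {
        // 1. Index each slab of documents into its own temporary directory.
        Directory[] slabs = new Directory[3];
        for (int i = 0; i < slabs.length; i++) {
            slabs[i] = FSDirectory.getDirectory("/tmp/slab-" + i, true);
            IndexWriter slabWriter = new IndexWriter(slabs[i], new StandardAnalyzer(), true);
            // buildSlab(slabWriter, i); // hypothetical: add this slab's batch of documents
            slabWriter.close();
        }

        // 2. Merge all slab indexes into the final index in one pass.
        IndexWriter merged = new IndexWriter("/tmp/final-index", new StandardAnalyzer(), true);
        merged.addIndexes(slabs);
        merged.close();
    }
}
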

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number
of documents


Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm jdk1.3.1 
that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to be 
1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type objects, 
that is): No effect.

regards,

Daniel


Pete Lewis wrote:

Hi all

Reading the thread with interest, there is another way I've come 
across out
of memory errors when indexing large batches of documents.

If you have your heap space settings too high, then you get 
swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out 
of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis

- Original Message - 
From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number of
documents


  

Daniel Aber schrieb:



On Thursday 09 September 2004 19:47, Daniel Taurat wrote:



  

I am facing an out of memory problem using  Lucene 1.4.1.




Could you try with a recent CVS version? There has been a fix 
about files
not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.

Regards
Daniel



  

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?

Thanks,

Daniel



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
The parser is PDFBox. PDF is about 25% of the overall indexing volume 
on the production system. I also have Word docs and loads of HTML 
resources to be indexed.
In my testing environment I have merely 5 PDF docs and still those 
permanent objects hanging around, though.
Cheers,
Daniel

Ben Litchfield wrote:
I can say that gc is not collecting these objects since I  forced gc
runs when indexing every now and then (when parsing pdf-type objects,
that is): No effect.
   


What PDF parser are you using? Is the problem within the parser and not
lucene? Are you releasing all resources?
Ben
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after indexing 
my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all 

 I had a similar problem, i have  database of documents with 24 fields, and a average 
content of 7K, with  16M+ records
 i had to split the jobs into slabs of 1M each and merging the resulting indexes, 
submissions to our job queue looked like
 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally

 

Exception in thread main java.lang.OutOfMemoryError
at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)
 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number
of documents
Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm jdk1.3.1 
that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to be 
1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type objects, 
that is): No effect.

regards,
Daniel
Pete Lewis wrote:
   

Hi all
Reading the thread with interest, there is another way I've come 
 

across out
   

of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get 
 

swapping (which
   

impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out 
 

of memory.
   

Can you check whether or not your garbage collection is being triggered?
Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.
Cheers
Pete Lewis
- Original Message - 
From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
 

number of
   

documents

 

Daniel Aber schrieb:
  

   

On Thursday 09 September 2004 19:47, Daniel Taurat wrote:



 

I am facing an out of memory problem using  Lucene 1.4.1.
  

   

Could you try with a recent CVS version? There has been a fix 
 

about files
   

not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.
Regards
Daniel



 

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...
Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
  

   

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created 
was to after every 200K, documents create a temp directory, and merge 
them together, this was done to do the first production run, updates 
are now being handled incrementally

 

Exception in thread main java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:
  

Hi all
Reading the thread with interest, there is another way I've come 
across out
  

of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get 
swapping (which
  

impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out 
of memory.
  

Can you check whether or not your garbage collection is being 
triggered?

Anomalously therefore if this is the case, by reducing the heap 
space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: Daniel Taurat 
[EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number of
  

documents



Daniel Aber schrieb:
 
  

On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

   


I am facing an out of memory problem using  Lucene 1.4.1.
 
  
Could you try with a recent CVS version? There has been a fix 


about files
  

not being deleted after 1.4.1. Not sure if that could cause the 
problems
you're experiencing.

Regards
Daniel

   

Well, it seems not to be files, it looks more like those 
SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the 
developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 
  
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Doug Cutting
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is 
that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it works 
for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created 
was to after every 200K, documents create a temp directory, and merge 
them together, this was done to do the first production run, updates 
are now being handled incrementally

 

Exception in thread main java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:
 

Hi all
Reading the thread with interest, there is another way I've come 

across out
 

of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get 

swapping (which
 

impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out 

of memory.
 

Can you check whether or not your garbage collection is being 
triggered?

Anomalously therefore if this is the case, by reducing the heap 
space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: Daniel Taurat 
[EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 

number of
 

documents

   

Daniel Aber schrieb:
 
 

On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

  

I am facing an out of memory problem using  Lucene 1.4.1.
   

Could you try with a recent CVS version? There has been a fix 


about files
 

not being deleted after 1.4.1. Not sure if that could cause the 
problems
you're experiencing.

Regards
Daniel

   

Well, it seems not to be files, it looks more like those 
SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the 
developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

Re: indexing size

2004-09-09 Thread Bernhard Messer
Dmitry Serebrennikov wrote:
Niraj Alok wrote:
Hi PA,
Thanks for the detail ! Since we are using lucene to store the data 
also, I
guess I would not be able to use it.
 

By the way, I could be wrong, but I think the 35% figure you 
referenced in the your first e-mail actually does not include any 
stored fields. The deal with 35% was, I think, to illustrate that 
index data structures used for searching by Lucene are efficient. But 
Lucene does nothing special about stored content - no compression or 
anything like that. So you end up with the pure size of your data plus 
the 35% of the indexed data.
There will be a patch available by the end of this week, which allows 
you to store binary values compressed within a Lucene index. It means 
that you will be able to store and retrieve whole documents within 
Lucene in a very efficient way ;-)

regards
bernhard

Cheers.
Dmitry.
Regards,
Niraj
- Original Message -
From: petite_abeille [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:14 PM
Subject: Re: indexing size
 

Hi Niraj,
On Sep 01, 2004, at 06:45, Niraj Alok wrote:
  

If I make some of them Field.Unstored, I can see from the javadocs
that it
will be indexed and tokenized but not stored. If it is not stored, how
can I
use it while searching?

The different type of fields don't impact how you do your search. This
is always the same.
Using Unstored fields simply means that you use Lucene as a pure index
for search purpose only, not for storing any data.
Specifically, the assumption is that your original data lives somewhere
else, outside of Lucene. If this assumption is true, then you can index
everything as Unstored with the addition of one Keyword per document.
The Keyword field holds some sort of unique identifier which allows you
to retrieve the original data if necessary (e.g. a primary key, an URI,
what not).
Here is an example of this approach:
(1) For indexing, check the indexValuesWithID() method
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
SZIndex.java?view=markup
Note the addition of a Field.Keyword for each document and the use of
Field.UnStored for everything else
(2) For fetching, check objectsWithSpecificationAndHitsInStore()
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
SZFinder.java?view=markup
HTH.
Cheers,
PA.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
  

 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-09 Thread Daniel Taurat
Hi,
I am facing an out of memory problem using  Lucene 1.4.1.
I am  re-indexing a pretty large number ( about 30.000 ) of documents.
I identify old instances by checking for a unique ID field, delete those 
with indexReader.delete() and add the new document version.

A heap dump says I have a huge number of HashMaps with 
SegmentTermEnum objects (256891).

The IndexReader is closed directly after delete(term)...
It seems to me that this did not happen with version 1.2 (same number of 
objects and all...).
Does anyone have an idea how I end up with these hanging objects? Or what to do in 
order to avoid them?

Thanks
Daniel
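
For reference, a minimal sketch of the delete-by-unique-ID-then-re-add pattern Daniel describes, against the 1.4-era API; the field names and the document contents are placeholders. The reader is closed before the writer is opened, since only one object may modify the index at a time.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class ReindexByIdSketch {
    public static void reindex(String indexPath, String uid, String contents) throws Exception {
        // 1. Remove any existing document carrying this unique ID.
        IndexReader reader = IndexReader.open(indexPath);
        try {
            reader.delete(new Term("uid", uid));
        } finally {
            reader.close();
        }

        // 2. Add the new version of the document (create=false: append to the existing index).
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        try {
            Document doc = new Document();
            doc.add(Field.Keyword("uid", uid));            // stored, untokenized ID
            doc.add(Field.UnStored("contents", contents)); // indexed for search only
            writer.addDocument(doc);
        } finally {
            writer.close();
        }
    }
}
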
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-09 Thread Daniel Naber
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

 I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files 
not being deleted after 1.4.1. Not sure if that could cause the problems 
you're experiencing.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing size

2004-09-08 Thread Dmitry Serebrennikov
Niraj Alok wrote:
Hi PA,
Thanks for the detail ! Since we are using lucene to store the data also, I
guess I would not be able to use it.
 

By the way, I could be wrong, but I think the 35% figure you referenced 
in your first e-mail actually does not include any stored fields. 
The deal with 35% was, I think, to illustrate that index data structures 
used for searching by Lucene are efficient. But Lucene does nothing 
special about stored content - no compression or anything like that. So 
you end up with the pure size of your data plus the 35% of the indexed 
data.

Cheers.
Dmitry.
Regards,
Niraj
- Original Message -
From: petite_abeille [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:14 PM
Subject: Re: indexing size
 

Hi Niraj,
On Sep 01, 2004, at 06:45, Niraj Alok wrote:
   

If I make some of them Field.Unstored, I can see from the javadocs
that it
will be indexed and tokenized but not stored. If it is not stored, how
can I
use it while searching?
 

The different type of fields don't impact how you do your search. This
is always the same.
Using Unstored fields simply means that you use Lucene as a pure index
for search purpose only, not for storing any data.
Specifically, the assumption is that your original data lives somewhere
else, outside of Lucene. If this assumption is true, then you can index
everything as Unstored with the addition of one Keyword per document.
The Keyword field holds some sort of unique identifier which allows you
to retrieve the original data if necessary (e.g. a primary key, an URI,
what not).
Here is an example of this approach:
(1) For indexing, check the indexValuesWithID() method
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
SZIndex.java?view=markup
Note the addition of a Field.Keyword for each document and the use of
Field.UnStored for everything else
(2) For fetching, check objectsWithSpecificationAndHitsInStore()
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
SZFinder.java?view=markup
HTH.
Cheers,
PA.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   

 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing size

2004-09-01 Thread Stephane James Vaucher
Hi Niraj,

I'd rather respond to the list as others may be interested in your
questions, and since I don't consider myself a guru, I appreciate being
corrected.

For a title, I'd say yes, use the Field.Text(String name, String value)
factory. Not the ones that take a Reader, as they do not store the
value.

You want it to be:
1) tokenised (so that its fragments are saved for searching, not only the
totality of the text)
2) indexed, to make it searchable
3) stored, to make the field retrievable from the index

hth,
sv
p.s. my name is Stephane, it's been a while since I've been in Oz
that I haven't been called James

On Wed, 1 Sep 2004, Niraj Alok wrote:

 Hi James,

 Since this would be a minor issue hence I am not posting it on the lucene.

 Lets say I have one field as title which has a value of George Bush.
 I would need to search on that title and also retrieve its value. So you are
 saying that I should have it as Field.Text?

 Also, if I need to just search on that title but want to retrieve the
 value of another field content, then title should be unstored while
 content should be stored?

 Regards,
 Niraj
 - Original Message -
 From: Stephane James Vaucher [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, September 01, 2004 10:59 AM
 Subject: Re: indexing size


  On Wed, 1 Sep 2004, Niraj Alok wrote
   I was also thinking on the same lines.
   Actually the original code was written by some one else who has left and
 so
   I have to own this.
  
   At almost all the places, it is Field.Text and at some few places its
   Field.UnIndexed.
   I looked at the javadocs and found that there is Field.UnStored also.
  
   The problem is I am not too sure which one to change to what. It would
 be
   really enlightening if you could point the differences
   between those three and what would I need to change in my search code.
  
   If I make some of them Field.Unstored, I can see from the javadocs that
   it will be indexed and tokenized but not stored. If it is not stored,
   how can I use it while searching? Basically what is meant by indexed and
   stored, indexed and not stored and not indexed and stored?
 
  If all you need is to seach a field, you do not need to store it. If it is
  not stored it can still be tokenised and analysed by lucene. It will then
  be only stored as a set of token, but not as whole. You can thus use it
  for fields that you never need to retrieve from the index.
 
  For example:
  the quick brown fox jumped over the lazy dog.
 
  will be store in lucene only as tokens, not as a whole, so using a
  whitespace analyser using a stopword list {the}:
 
  You will have these tokens in lucene:
  quick
  brown
  fox
  jumped
  over
  dog
 
  You will NOT be able to retrieve the original text, but you will be able
  to search it.
 
  HTH,
  sv
 
  
   Regards,
   Niraj
   - Original Message -
   From: petite_abeille [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Tuesday, August 31, 2004 8:57 PM
   Subject: Re: indexing size
  
  
   
On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote:
   
 You also have a large number of
 fields, and it looks like a lot (all?) of them are stored and
 indexed.
 That's what that large .fdt file indicated.  That file is  206 MB
 in
 size.
   
Try using Field.UnStored() to avoid storing all those data in your
indices as it's usually not necessary.
   
PA.
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing size

2004-09-01 Thread Niraj Alok
Thanks a lot Stephane and Otis for your detailed explanations. I am now on
the path to make a judicious choice between the different options on offer
and hope to reduce the overall size. Will surely get back if there are any
more hiccups (hope not!)

Thanks again!

Niraj
- Original Message -
From: Stephane James Vaucher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 12:48 PM
Subject: Re: indexing size


 Hi Niraj,

 I'd rather respond to the list as others may be interested in your
 questions, and since I don't consider myself a guru, I appreciate being
 corrected.

 For a title, I'd say yes, use the Field Text(String name, String value)
 constructor. Not the others that use a reader as they do not store the
 value.

 You want for it to be:
 1) tokenised (so to have its fragments saved for searching, not only the
 totality of the text)
 2) indexed so to make it searchable
 3) store as to make the field retrievable from the index

 hth,
 sv
 p.s. my name is Stephane, it's been a while since I've been in Oz
 that I haven't been called James




Re: indexing size

2004-09-01 Thread petite_abeille
Hi Niraj,
On Sep 01, 2004, at 06:45, Niraj Alok wrote:
If I make some of them Field.Unstored, I can see from the javadocs  
that it
will be indexed and tokenized but not stored. If it is not stored, how  
can I
use it while searching?
The different type of fields don't impact how you do your search. This  
is always the same.

Using Unstored fields simply means that you use Lucene as a pure index  
for search purpose only, not for storing any data.

Specifically, the assumption is that your original data lives somewhere  
else, outside of Lucene. If this assumption is true, then you can index  
everything as Unstored with the addition of one Keyword per document.  
The Keyword field holds some sort of unique identifier which allows you  
to retrieve the original data if necessary (e.g. a primary key, an URI,  
what not).

Here is an example of this approach:
(1) For indexing, check the indexValuesWithID() method
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/ 
SZIndex.java?view=markup

Note the addition of a Field.Keyword for each document and the use of  
Field.UnStored for everything else

(2) For fetching, check objectsWithSpecificationAndHitsInStore()
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/ 
SZFinder.java?view=markup

HTH.
Cheers,
PA.
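
To make PA's pattern concrete, a small hedged sketch: everything searchable goes in as UnStored, one Keyword per document carries the identifier, and the identifier is all you read back from a hit. The field names and the loadFromPrimaryStore() lookup are hypothetical placeholders for whatever external store actually holds the data.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PureIndexSketch {

    // Indexing: one stored Keyword identifier, everything else UnStored.
    static Document toDocument(String primaryKey, String title, String body) {
        Document doc = new Document();
        doc.add(Field.Keyword("pk", primaryKey)); // stored, untokenized identifier
        doc.add(Field.UnStored("title", title));  // searchable, not retrievable
        doc.add(Field.UnStored("body", body));    // searchable, not retrievable
        return doc;
    }

    // Searching: only the identifier comes back from Lucene; the data comes from elsewhere.
    static void find(IndexSearcher searcher, String queryText) throws Exception {
        Query query = QueryParser.parse(queryText, "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            String pk = hits.doc(i).get("pk");
            // Object original = loadFromPrimaryStore(pk); // hypothetical external lookup
            System.out.println("hit " + i + " -> primary key " + pk);
        }
    }
}
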
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing size

2004-09-01 Thread Niraj Alok
Hi PA,

Thanks for the detail ! Since we are using lucene to store the data also, I
guess I would not be able to use it.

Regards,
Niraj
- Original Message -
From: petite_abeille [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:14 PM
Subject: Re: indexing size


 Hi Niraj,

 On Sep 01, 2004, at 06:45, Niraj Alok wrote:

  If I make some of them Field.Unstored, I can see from the javadocs
  that it
  will be indexed and tokenized but not stored. If it is not stored, how
  can I
  use it while searching?

 The different type of fields don't impact how you do your search. This
 is always the same.

 Using Unstored fields simply means that you use Lucene as a pure index
 for search purpose only, not for storing any data.

 Specifically, the assumption is that your original data lives somewhere
 else, outside of Lucene. If this assumption is true, then you can index
 everything as Unstored with the addition of one Keyword per document.
 The Keyword field holds some sort of unique identifier which allows you
 to retrieve the original data if necessary (e.g. a primary key, an URI,
 what not).

 Here is an example of this approach:

 (1) For indexing, check the indexValuesWithID() method

 http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
 SZIndex.java?view=markup

 Note the addition of a Field.Keyword for each document and the use of
 Field.UnStored for everything else

 (2) For fetching, check objectsWithSpecificationAndHitsInStore()

 http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
 SZFinder.java?view=markup

 HTH.

 Cheers,

 PA.


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing size

2004-08-31 Thread Otis Gospodnetic
Are you using pre-1.4.1 version of Lucene?  There was a bug in one of
the older versions that left multiple, old index files around, instead
of deleting them.  Maybe that's using up the disk space.  Give us your
index directory's 'ls -al' or 'dir'.

Otis

--- Niraj Alok [EMAIL PROTECTED] wrote:

 Hi Guys,
 
 If you have any ideas, please help me out. I have looked into most of
 the
 lucene archives and they are suggesting what I am currently doing. So
 the
 only possible solution for me right now would be to reduce the no. of
 fields
 which could severely change the logic used for searching.
 
 
 Regards,
 Niraj
 - Original Message -
 From: Niraj Alok [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, August 31, 2004 11:17 AM
 Subject: indexing size
 
 
  Hi,
 
  I am indexing plain xml files , total size of which is around 100
 MB. I am
  creating two indexes for different modules, and they are stored in
 different
  directories as I am not merging them. The problem is that the
 combined
 size
  of these indexes is about 300 MB, ( 3 times the data size), which
 is in
  contrast to the 35% I have read it should create.
  Both these indexes have different fields and different data is
 stored in
  them and hence there is no duplication occuring.
 
  I have one indexwriter for each index. After both the indexes have
 been
  created, I am simply calling optimize on these two writers and
 closing
 them.
 
  Is there something I am doing wrong? I am using
 writer.addDocument(doc).
 
  Regards,
  Niraj
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing size

2004-08-31 Thread Niraj Alok
/2004  17:31   183,796 _4dkv.f75
21/08/2004  17:31   183,796 _4dkv.f76
21/08/2004  17:31   183,796 _4dkv.f77
21/08/2004  17:31   183,796 _4dkv.f78
21/08/2004  17:31   183,796 _4dkv.f79
21/08/2004  17:31   183,796 _4dkv.f8
21/08/2004  17:31   183,796 _4dkv.f80
21/08/2004  17:31   183,796 _4dkv.f81
21/08/2004  17:31   183,796 _4dkv.f82
21/08/2004  17:31   183,796 _4dkv.f83
21/08/2004  17:31   183,796 _4dkv.f84
21/08/2004  17:31   183,796 _4dkv.f85
21/08/2004  17:31   183,796 _4dkv.f86
21/08/2004  17:31   183,796 _4dkv.f87
21/08/2004  17:31   183,796 _4dkv.f88
21/08/2004  17:31   183,796 _4dkv.f89
21/08/2004  17:31   183,796 _4dkv.f9
21/08/2004  17:31   183,796 _4dkv.f90
21/08/2004  17:31   183,796 _4dkv.f91
21/08/2004  17:31   183,796 _4dkv.f92
21/08/2004  17:31   183,796 _4dkv.f93
21/08/2004  17:31   183,796 _4dkv.f94
21/08/2004  17:31   183,796 _4dkv.f95
21/08/2004  17:31   183,796 _4dkv.f96
21/08/2004  17:31   183,796 _4dkv.f97
21/08/2004  17:31   183,796 _4dkv.f98
21/08/2004  17:31   183,796 _4dkv.f99
21/08/2004  17:30   206,637,045 _4dkv.fdt
21/08/2004  17:30 1,470,368 _4dkv.fdx
21/08/2004  17:29 5,509 _4dkv.fnm
21/08/2004  17:3130,953,033 _4dkv.frq
21/08/2004  17:3129,334,297 _4dkv.prx
21/08/2004  17:31   225,415 _4dkv.tii
21/08/2004  17:3116,814,807 _4dkv.tis
 455 File(s)367,413,520 bytes
   2 Dir(s)   6,854,688,768 bytes free

Regards,
Niraj
- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 31, 2004 6:02 PM
Subject: Re: indexing size


 Are you using pre-1.4.1 version of Lucene?  There was a bug in one of
 the older versions that left multiple, old index files around, instead
 of deleting them.  Maybe that's using up the disk space.  Give us your
 index directory's 'ls -al' or 'dir'.

 Otis




Re: indexing size

2004-08-31 Thread Otis Gospodnetic
/2004  17:31   183,796 _4dkv.f58
 21/08/2004  17:31   183,796 _4dkv.f59
 21/08/2004  17:31   183,796 _4dkv.f6
 21/08/2004  17:31   183,796 _4dkv.f60
 21/08/2004  17:31   183,796 _4dkv.f61
 21/08/2004  17:31   183,796 _4dkv.f62
 21/08/2004  17:31   183,796 _4dkv.f63
 21/08/2004  17:31   183,796 _4dkv.f64
 21/08/2004  17:31   183,796 _4dkv.f65
 21/08/2004  17:31   183,796 _4dkv.f66
 21/08/2004  17:31   183,796 _4dkv.f67
 21/08/2004  17:31   183,796 _4dkv.f68
 21/08/2004  17:31   183,796 _4dkv.f69
 21/08/2004  17:31   183,796 _4dkv.f7
 21/08/2004  17:31   183,796 _4dkv.f70
 21/08/2004  17:31   183,796 _4dkv.f71
 21/08/2004  17:31   183,796 _4dkv.f72
 21/08/2004  17:31   183,796 _4dkv.f73
 21/08/2004  17:31   183,796 _4dkv.f74
 21/08/2004  17:31   183,796 _4dkv.f75
 21/08/2004  17:31   183,796 _4dkv.f76
 21/08/2004  17:31   183,796 _4dkv.f77
 21/08/2004  17:31   183,796 _4dkv.f78
 21/08/2004  17:31   183,796 _4dkv.f79
 21/08/2004  17:31   183,796 _4dkv.f8
 21/08/2004  17:31   183,796 _4dkv.f80
 21/08/2004  17:31   183,796 _4dkv.f81
 21/08/2004  17:31   183,796 _4dkv.f82
 21/08/2004  17:31   183,796 _4dkv.f83
 21/08/2004  17:31   183,796 _4dkv.f84
 21/08/2004  17:31   183,796 _4dkv.f85
 21/08/2004  17:31   183,796 _4dkv.f86
 21/08/2004  17:31   183,796 _4dkv.f87
 21/08/2004  17:31   183,796 _4dkv.f88
 21/08/2004  17:31   183,796 _4dkv.f89
 21/08/2004  17:31   183,796 _4dkv.f9
 21/08/2004  17:31   183,796 _4dkv.f90
 21/08/2004  17:31   183,796 _4dkv.f91
 21/08/2004  17:31   183,796 _4dkv.f92
 21/08/2004  17:31   183,796 _4dkv.f93
 21/08/2004  17:31   183,796 _4dkv.f94
 21/08/2004  17:31   183,796 _4dkv.f95
 21/08/2004  17:31   183,796 _4dkv.f96
 21/08/2004  17:31   183,796 _4dkv.f97
 21/08/2004  17:31   183,796 _4dkv.f98
 21/08/2004  17:31   183,796 _4dkv.f99
 21/08/2004  17:30   206,637,045 _4dkv.fdt
 21/08/2004  17:30 1,470,368 _4dkv.fdx
 21/08/2004  17:29 5,509 _4dkv.fnm
 21/08/2004  17:3130,953,033 _4dkv.frq
 21/08/2004  17:3129,334,297 _4dkv.prx
 21/08/2004  17:31   225,415 _4dkv.tii
 21/08/2004  17:3116,814,807 _4dkv.tis
  455 File(s)367,413,520 bytes
2 Dir(s)   6,854,688,768 bytes free
 
 Regards,
 Niraj
 - Original Message -
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, August 31, 2004 6:02 PM
 Subject: Re: indexing size
 
 
  Are you using pre-1.4.1 version of Lucene?  There was a bug in one
 of
  the older versions that left multiple, old index files around,
 instead
  of deleting them.  Maybe that's using up the disk space.  Give us
 your
  index directory's 'ls -al' or 'dir'.
 
  Otis
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing size

2004-08-31 Thread petite_abeille
On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote:
You also have a large number of
fields, and it looks like a lot (all?) of them are stored and indexed.
That's what that large .fdt file indicated.  That file is  206 MB in
size.
Try using Field.UnStored() to avoid storing all that data in your 
indices, as it's usually not necessary.

PA.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing size

2004-08-31 Thread Niraj Alok
I was also thinking along the same lines.
Actually, the original code was written by someone else who has left, so
I have to own this.

At almost all the places it is Field.Text, and at a few places it's
Field.UnIndexed.
I looked at the javadocs and found that there is Field.UnStored also.

The problem is I am not too sure which one to change to what. It would be
really enlightening if you could point out the differences
between those three and what I would need to change in my search code.

If I make some of them Field.Unstored, I can see from the javadocs that it
will be indexed and tokenized but not stored. If it is not stored, how can I
use it while searching? Basically what is meant by indexed and stored,
indexed and not stored and not indexed and stored?


Regards,
Niraj
- Original Message -
From: petite_abeille [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 31, 2004 8:57 PM
Subject: Re: indexing size



 On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote:

  You also have a large number of
  fields, and it looks like a lot (all?) of them are stored and indexed.
  That's what that large .fdt file indicated.  That file is  206 MB in
  size.

 Try using Field.UnStored() to avoid storing all those data in your
 indices as it's usually not necessary.

 PA.


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: indexing size

2004-08-31 Thread Stephane James Vaucher
On Wed, 1 Sep 2004, Niraj Alok wrote
 I was also thinking on the same lines.
 Actually the original code was written by some one else who has left and so
 I have to own this.

 At almost all the places, it is Field.Text and at some few places its
 Field.UnIndexed.
 I looked at the javadocs and found that there is Field.UnStored also.

 The problem is I am not too sure which one to change to what. It would be
 really enlightening if you could point the differences
 between those three and what would I need to change in my search code.

 If I make some of them Field.Unstored, I can see from the javadocs that
 it will be indexed and tokenized but not stored. If it is not stored,
 how can I use it while searching? Basically what is meant by indexed and
 stored, indexed and not stored and not indexed and stored?

If all you need is to search a field, you do not need to store it. If it is
not stored it can still be tokenised and analysed by Lucene. It will then
be kept only as a set of tokens, but not as a whole. You can thus use it
for fields that you never need to retrieve from the index.

For example:
the quick brown fox jumped over the lazy dog.

will be stored in Lucene only as tokens, not as a whole. So, using a
whitespace analyser with a stopword list {the}:

You will have these tokens in Lucene:
quick
brown
fox
jumped
over
lazy
dog

You will NOT be able to retrieve the original text, but you will be able
to search it.
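
For illustration, a minimal sketch of the three combinations using the
Field factory methods (assuming the Lucene 1.4-era API; the field names
and values here are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldFlavours {
  public static Document build() {
    Document doc = new Document();
    // indexed, tokenized AND stored: searchable, and retrievable from hits
    doc.add(Field.Text("title", "The quick brown fox"));
    // indexed and tokenized but NOT stored: searchable, not retrievable
    doc.add(Field.UnStored("contents",
        "The quick brown fox jumped over the lazy dog."));
    // stored but NOT indexed: retrievable only, never searchable
    doc.add(Field.UnIndexed("path", "/docs/fox.txt"));
    return doc;
  }
}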

HTH,
sv


 Regards,
 Niraj
 - Original Message -
 From: petite_abeille [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, August 31, 2004 8:57 PM
 Subject: Re: indexing size


 
  On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote:
 
   You also have a large number of
   fields, and it looks like a lot (all?) of them are stored and indexed.
   That's what that large .fdt file indicated.  That file is  206 MB in
   size.
 
  Try using Field.UnStored() to avoid storing all those data in your
  indices as it's usually not necessary.
 
  PA.
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing and Searching Database in Lucene

2004-08-20 Thread Aviran
You need to create a Lucene index from the database:
just index the columns and the records from the database.
It is also useful to have a field in Lucene that contains the
database's primary key, so you can retrieve the actual record from the
database.
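
For example, a rough sketch along those lines (the table and column names
are made up; assumes the Lucene 1.4-era API and plain JDBC):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DbIndexer {
  // Hypothetical table "articles" with columns id, title and body.
  public static void index(Connection conn, String indexDir) throws Exception {
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
    Statement st = conn.createStatement();
    ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles");
    while (rs.next()) {
      Document doc = new Document();
      // keep the primary key so the full record can be fetched later
      doc.add(Field.Keyword("id", rs.getString("id")));
      doc.add(Field.Text("title", rs.getString("title")));
      doc.add(Field.UnStored("body", rs.getString("body")));
      writer.addDocument(doc);
    }
    rs.close();
    st.close();
    writer.optimize();
    writer.close();
  }
}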

Aviran

-Original Message-
From: sivalingam T [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 20, 2004 10:55 AM
To: [EMAIL PROTECTED]
Subject: Indexing and Searching Database in Lucene


  Hi

  Can we index and search a database with the Lucene search engine?
  If anybody has done this, please send a reply.


With Warm Regards,
Sivalingam.T

Sai Eswar Innovations (P) Ltd,
Chennai-92



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing and Searching Database in Lucene

2004-08-20 Thread Don Vaillancourt




Funny thing is, I was thinking of doing something like this just today.
This is especially good when you perform a lot of queries using the
LIKE statement; Lucene would increase search performance a great deal.
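
Continuing the hypothetical articles table from the indexing sketch above,
a rough sketch of the search side, where the stored primary key is used to
pull the full record back out of the database instead of running a LIKE scan:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DbSearcher {
  public static void search(Connection conn, String indexDir, String text)
      throws Exception {
    IndexSearcher searcher = new IndexSearcher(indexDir);
    Query query = QueryParser.parse(text, "body", new StandardAnalyzer());
    Hits hits = searcher.search(query);
    PreparedStatement ps =
        conn.prepareStatement("SELECT * FROM articles WHERE id = ?");
    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      ps.setString(1, doc.get("id"));   // the stored primary key
      ResultSet rs = ps.executeQuery();
      // ... use the full database record ...
      rs.close();
    }
    ps.close();
    searcher.close();
  }
}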

Aviran wrote:

  You need to create a lucene index from the database.
Just  index the columns and the records from the database.
It will be useful to have also a field in lucene that contains the
database's primary key, so you can retrieve the actual record from the
database

Aviran

-Original Message-
From: sivalingam T [mailto:[EMAIL PROTECTED]] 
Sent: Friday, August 20, 2004 10:55 AM
To: [EMAIL PROTECTED]
Subject: Indexing and Searching Database in Lucene


  Hi

  Can we index and search database in Lucene Search Engine?
  if anybody have please send reply.


With Warm Regards,
Sivalingam.T

Sai Eswar Innovations (P) Ltd,
Chennai-92



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-- 

Don Vaillancourt
Director of Software Development


WEB IMPACT INC.
phone: 416-815-2000 ext. 245
fax: 416-815-2001
email: [EMAIL PROTECTED]
web: http://www.web-impact.com




This email message is intended only for the addressee(s)
and contains information that may be confidential and/or
copyright. If you are not the intended recipient please
notify the sender by reply email and immediately delete
this email. Use, disclosure or reproduction of this email
by anyone other than the intended recipient(s) is strictly
prohibited. No representation is made that this email or
any attachments are free of viruses. Virus scanning is
recommended and is the responsibility of the recipient.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: indexing help

2004-07-08 Thread Grant Ingersoll
Hi John,

The source code is available from CVS, make it non-final and do what you need to do.  
Of course, you may have a hard time finding help later if you aren't using something 
everyone else is and your solution doesn't work...  :-)

If I understand correctly what you are trying to do, you already know all of the 
answers for indexing, you just want Lucene to do the retrieval side of the coin, 
correct?  I suppose a crazy idea might be to write a program that took your info and 
output it in the Lucene file format, but that seems a bit like overkill.

-Grant

 [EMAIL PROTECTED] 07/07/04 07:37PM 
Hi Doug:
 Thanks for the response!

 The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 is
neccessary.

Given many documents with many terms and frequencies, it would
create many extra Token instances.

   The reason I was looking to derving the Field class is because I
can directly manipulate the FieldInfo by setting the frequency. But
the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
   While lucene tokenizes the words in the document, it counts the
  frequency and figures out the position, we are trying to bypass this
  stage: For each document, I have a set of words with a know frequency,
  e.g. java (5), lucene (6) etc. (I don't care about the position, so it
  can always be 0.)
 
   What I can do now is to create a dummy document, e.g. java java
  java java java lucene lucene lucene lucene lucene and pass it to
  lucene.
 
   This seems hacky and cumbersome. Is there a better alternative? I
  browsed around in the source code, but couldn't find anything.
 
 Write an analyzer that returns terms with the appropriate distribution.
 
 For example:
 
 public class VectorTokenStream extends TokenStream {
   private int term;
   private int freq;
   public VectorTokenStream(String[] terms, int[] freqs) {
 this.terms = terms;
 this.freqs = freqs;
   }
   public Token next() {
 if (freq == 0) {
   term++;
   if (term = terms.length)
 return null;
   freq = freqs[term];
 }
 freq--;
 return new Token(terms[term], 0, 0);
   }
 }
 
 Document doc = new Document();
 doc.add(Field.Text(content, ));
 indexWriter.addDocument(doc, new Analyzer() {
   public TokenStream tokenStream(String field, Reader reader) {
 return new VectorTokenStream(new String[] {java,lucene},
  new int[] {5,6});
   }
 });
 
Too bad the Field class is final, otherwise I can derive from it
  and do something on that line...
 
 Extending Field would not help.  That's why it's final.
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED] 
 For additional commands, e-mail: [EMAIL PROTECTED] 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED] 
For additional commands, e-mail: [EMAIL PROTECTED] 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing help

2004-07-08 Thread John Wang
Hi Grant:
 Thanks for the options. How likely is it that the Lucene file formats will change?

 Are there really no other options? :(...

Thanks

-John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
 Hi John,
 
 The source code is available from CVS, make it non-final and do what you need to do. 
  Of course, you may have a hard time finding help later if you aren't using 
 something everyone else is and your solution doesn't work...  :-)
 
 If I understand correctly what you are trying to do, you already know all of the 
 answers for indexing, you just want Lucene to do the retrieval side of the coin, 
 correct?  I suppose a crazy idea might be to write a program that took your info and 
 output it in the Lucene file format, but that seems a bit like overkill.
 
 -Grant
 
  [EMAIL PROTECTED] 07/07/04 07:37PM 
 
 
 Hi Doug:
 Thanks for the response!
 
 The solution you proposed is still a derivative of creating a
 dummy document stream. Taking the same example, java (5), lucene (6),
 VectorTokenStream would create a total of 11 Tokens whereas only 2 is
 neccessary.
 
Given many documents with many terms and frequencies, it would
 create many extra Token instances.
 
   The reason I was looking to derving the Field class is because I
 can directly manipulate the FieldInfo by setting the frequency. But
 the class is final...
 
   Any other suggestions?
 
 Thanks
 
 -John
 
 On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
  John Wang wrote:
While lucene tokenizes the words in the document, it counts the
   frequency and figures out the position, we are trying to bypass this
   stage: For each document, I have a set of words with a know frequency,
   e.g. java (5), lucene (6) etc. (I don't care about the position, so it
   can always be 0.)
  
What I can do now is to create a dummy document, e.g. java java
   java java java lucene lucene lucene lucene lucene and pass it to
   lucene.
  
This seems hacky and cumbersome. Is there a better alternative? I
   browsed around in the source code, but couldn't find anything.
 
  Write an analyzer that returns terms with the appropriate distribution.
 
  For example:
 
  public class VectorTokenStream extends TokenStream {
private int term;
private int freq;
public VectorTokenStream(String[] terms, int[] freqs) {
  this.terms = terms;
  this.freqs = freqs;
}
public Token next() {
  if (freq == 0) {
term++;
if (term = terms.length)
  return null;
freq = freqs[term];
  }
  freq--;
  return new Token(terms[term], 0, 0);
}
  }
 
  Document doc = new Document();
  doc.add(Field.Text(content, ));
  indexWriter.addDocument(doc, new Analyzer() {
public TokenStream tokenStream(String field, Reader reader) {
  return new VectorTokenStream(new String[] {java,lucene},
   new int[] {5,6});
}
  });
 
 Too bad the Field class is final, otherwise I can derive from it
   and do something on that line...
 
  Extending Field would not help.  That's why it's final.
 
  Doug
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing help

2004-07-08 Thread John Wang
Hi Grant:

 I have something that extracts only the important words from
a document along with their importance. Furthermore, these important
words may not be physically in the document; they could be synonyms of
some of the words in the document. So the output of the process for a
document is a list of word/importance pairs.

I want to be able to query the document using only these words.

   I don't think Lucene has such a capability.

   Can you suggest how I can do this with the analyzer process
without replicating words/tokens?

Thanks

-John

On Thu, 08 Jul 2004 11:10:07 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
 Hey John,
 
 Those are just options, didn't say they were good ones!  :-)
 
 I guess the real question is, what is the background of what you are trying to do?  
 Presumably you have some other program that is generating frequencies for you, do 
 you really need that in the current form?  Can't the Lucene indexing engine act as a 
 stand-in for this process since your end result _should_ be the same?  The Lucene 
 Analyzer process is quite flexible, I bet you could even find a way to hook in your 
 existing tools into the Analyzer process.
 
 -Grant
 
  [EMAIL PROTECTED] 07/08/04 10:42AM 
 
 
 Hi Grant:
 Thanks for the options. How likely will the lucene file formats change?
 
 Are there really no more optiosn? :(...
 
 Thanks
 
 -John
 
 On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
  Hi John,
 
  The source code is available from CVS, make it non-final and do what you need to 
  do.  Of course, you may have a hard time finding help later if you aren't using 
  something everyone else is and your solution doesn't work...  :-)
 
  If I understand correctly what you are trying to do, you already know all of the 
  answers for indexing, you just want Lucene to do the retrieval side of the coin, 
  correct?  I suppose a crazy idea might be to write a program that took your info 
  and output it in the Lucene file format, but that seems a bit like overkill.
 
  -Grant
 
   [EMAIL PROTECTED] 07/07/04 07:37PM 
 
 
  Hi Doug:
  Thanks for the response!
 
  The solution you proposed is still a derivative of creating a
  dummy document stream. Taking the same example, java (5), lucene (6),
  VectorTokenStream would create a total of 11 Tokens whereas only 2 is
  neccessary.
 
 Given many documents with many terms and frequencies, it would
  create many extra Token instances.
 
The reason I was looking to derving the Field class is because I
  can directly manipulate the FieldInfo by setting the frequency. But
  the class is final...
 
Any other suggestions?
 
  Thanks
 
  -John
 
  On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
   John Wang wrote:
 While lucene tokenizes the words in the document, it counts the
frequency and figures out the position, we are trying to bypass this
stage: For each document, I have a set of words with a know frequency,
e.g. java (5), lucene (6) etc. (I don't care about the position, so it
can always be 0.)
   
 What I can do now is to create a dummy document, e.g. java java
java java java lucene lucene lucene lucene lucene and pass it to
lucene.
   
 This seems hacky and cumbersome. Is there a better alternative? I
browsed around in the source code, but couldn't find anything.
  
   Write an analyzer that returns terms with the appropriate distribution.
  
   For example:
  
   public class VectorTokenStream extends TokenStream {
 private int term;
 private int freq;
 public VectorTokenStream(String[] terms, int[] freqs) {
   this.terms = terms;
   this.freqs = freqs;
 }
 public Token next() {
   if (freq == 0) {
 term++;
 if (term = terms.length)
   return null;
 freq = freqs[term];
   }
   freq--;
   return new Token(terms[term], 0, 0);
 }
   }
  
   Document doc = new Document();
   doc.add(Field.Text(content, ));
   indexWriter.addDocument(doc, new Analyzer() {
 public TokenStream tokenStream(String field, Reader reader) {
   return new VectorTokenStream(new String[] {java,lucene},
new int[] {5,6});
 }
   });
  
  Too bad the Field class is final, otherwise I can derive from it
and do something on that line...
  
   Extending Field would not help.  That's why it's final.
  
   Doug
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional 

Re: indexing help

2004-07-08 Thread John Wang
Thanks Doug. I will do just that.

Just for my education, can you maybe elaborate on using the
"implement an IndexReader that delivers a synthetic index" approach?

Thanks in advance

-John

On Thu, 08 Jul 2004 10:01:59 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
   The solution you proposed is still a derivative of creating a
  dummy document stream. Taking the same example, java (5), lucene (6),
  VectorTokenStream would create a total of 11 Tokens whereas only 2 is
  neccessary.
 
 That's easy to fix.  We just need to reuse the token:
 
 public class VectorTokenStream extends TokenStream {
   private String[] terms;
   private int[] freqs;
   private int term = -1;
   private int freq = 0;
   private Token token;
   public VectorTokenStream(String[] terms, int[] freqs) {
     this.terms = terms;
     this.freqs = freqs;
   }
   public Token next() {
     if (freq == 0) {
       term++;
       if (term >= terms.length)
         return null;
       token = new Token(terms[term], 0, 0);
       freq = freqs[term];
     }
     freq--;
     return token;
   }
 }
 
 Then only two tokens are created, as you desire.
 
 If you for some reason don't want to create a dummy document stream,
 then you could instead implement an IndexReader that delivers a
 synthetic index for a single document.  Then use
 IndexWriter.addIndexes() to turn this into a real, FSDirectory-based
 index.  However that would be a lot more work and only very marginally
 faster.  So I'd stick with the approach I've outlined above.  (Note:
 this code has not been compiled or run.  It may have bugs.)
 
 
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote:
Just for my education, can you maybe elaborate on using the
"implement an IndexReader that delivers a synthetic index" approach?
IndexReader is an abstract class.  It has few data fields, and few 
non-static methods that are not implemented in terms of abstract 
methods.  So, in effect, it is an interface.

When Lucene indexes a token stream it creates a single-document index 
that is then merged with other single- and multi-document indexes to 
form an index that is searched.  You could bypass the first step of this 
(indexing a token stream) by instead directly implementing all of 
IndexReader's abstract methods to return the same thing as the 
single-document index that Lucene would create.  This would be 
marginally faster, as no Token objects would be created at all.  But, 
since IndexReader has a lot of abstract methods, it would be a lot of 
work.  I didn't really mean it as a practical suggestion.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing help

2004-07-07 Thread Doug Cutting
John Wang wrote:
 While lucene tokenizes the words in the document, it counts the
frequency and figures out the position, we are trying to bypass this
stage: For each document, I have a set of words with a know frequency,
e.g. java (5), lucene (6) etc. (I don't care about the position, so it
can always be 0.)
 What I can do now is to create a dummy document, e.g. java java
java java java lucene lucene lucene lucene lucene and pass it to
lucene.
 This seems hacky and cumbersome. Is there a better alternative? I
browsed around in the source code, but couldn't find anything.
Write an analyzer that returns terms with the appropriate distribution.
For example:
public class VectorTokenStream extends TokenStream {
  private String[] terms;
  private int[] freqs;
  private int term = -1;
  private int freq = 0;
  public VectorTokenStream(String[] terms, int[] freqs) {
    this.terms = terms;
    this.freqs = freqs;
  }
  public Token next() {
    if (freq == 0) {
      term++;
      if (term >= terms.length)
        return null;
      freq = freqs[term];
    }
    freq--;
    return new Token(terms[term], 0, 0);
  }
}
Document doc = new Document();
doc.add(Field.Text("content", ""));
indexWriter.addDocument(doc, new Analyzer() {
  public TokenStream tokenStream(String field, Reader reader) {
    return new VectorTokenStream(new String[] {"java", "lucene"},
                                 new int[] {5, 6});
  }
});
  Too bad the Field class is final, otherwise I can derive from it
and do something on that line...
Extending Field would not help.  That's why it's final.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing help

2004-07-07 Thread John Wang
Hi Doug:
 Thanks for the response!

 The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 is
neccessary.

Given many documents with many terms and frequencies, it would
create many extra Token instances.

   The reason I was looking to derving the Field class is because I
can directly manipulate the FieldInfo by setting the frequency. But
the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
   While lucene tokenizes the words in the document, it counts the
  frequency and figures out the position, we are trying to bypass this
  stage: For each document, I have a set of words with a know frequency,
  e.g. java (5), lucene (6) etc. (I don't care about the position, so it
  can always be 0.)
 
   What I can do now is to create a dummy document, e.g. java java
  java java java lucene lucene lucene lucene lucene and pass it to
  lucene.
 
   This seems hacky and cumbersome. Is there a better alternative? I
  browsed around in the source code, but couldn't find anything.
 
 Write an analyzer that returns terms with the appropriate distribution.
 
 For example:
 
 public class VectorTokenStream extends TokenStream {
   private int term;
   private int freq;
   public VectorTokenStream(String[] terms, int[] freqs) {
 this.terms = terms;
 this.freqs = freqs;
   }
   public Token next() {
 if (freq == 0) {
   term++;
   if (term = terms.length)
 return null;
   freq = freqs[term];
 }
 freq--;
 return new Token(terms[term], 0, 0);
   }
 }
 
 Document doc = new Document();
 doc.add(Field.Text(content, ));
 indexWriter.addDocument(doc, new Analyzer() {
   public TokenStream tokenStream(String field, Reader reader) {
 return new VectorTokenStream(new String[] {java,lucene},
  new int[] {5,6});
   }
 });
 
Too bad the Field class is final, otherwise I can derive from it
  and do something on that line...
 
 Extending Field would not help.  That's why it's final.
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing incrementally concurrently

2004-07-05 Thread Erik Hatcher
On Jul 5, 2004, at 9:00 AM, Michael Wechner wrote:
If several users are saving documents on the server concurrently,
and during saving the index shall be updated incrementally ... do
I have to make sure that it's going to be thread-safe, or does Lucene
take care of this?
Only a single IndexWriter instance at a time can be used - so you will 
need to coordinate things.  Multiple threads can share a single 
IndexWriter though, so no worries there.
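
A minimal sketch of that arrangement (the class name and wiring are made up;
the point is just the single shared IndexWriter):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class SharedWriter {
  private final IndexWriter writer;   // one writer for the whole application

  public SharedWriter(String indexDir, boolean create) throws IOException {
    writer = new IndexWriter(indexDir, new StandardAnalyzer(), create);
  }

  // may be called from several request threads at once
  public void save(Document doc) throws IOException {
    writer.addDocument(doc);
  }

  public void close() throws IOException {
    writer.close();
  }
}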

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing incrementally concurrently

2004-07-05 Thread Michael Wechner
Erik Hatcher wrote:
On Jul 5, 2004, at 9:00 AM, Michael Wechner wrote:
If several users are saving documents on the server concurrently
and during saving the index shall be updated incrementally ... do
I have to make sure that it's going to be threadsave or does Lucene
take care of this?

Only a single IndexWriter instance at a time can be used - so you will 
need to coordinate things.  Multiple threads can share a single 
IndexWriter though, so no worries there.

ok. Thanks very much for the info
Michi
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing with 1.4-rc3 only yields single .cfs file

2004-06-16 Thread iouli . golovatyi
Otis,
Can you please explain why 1.4-rc3 leaves old files like _*.cfs in the index
folder after optimization?
References to them can also be found in the 'deletable' file. Is it a bug?






We switched from multi-file to compound index structure in one of the
recent RCs.  This should be mentioned in CHANGES.txt file.  The change
was made to make it more difficult for people to reach 'Too many open
files' situations.

Otis


--- Claude Devarenne [EMAIL PROTECTED] wrote:
 Hi,
 
 I just upgraded to 1.4-rc3 and re-indexed my data.  I did not change 
 any code and noticed that in the index directory there is a single
 .cfs 
 file which I am guessing stands for compound file system.  Search
 works 
 fine but after checking out the latest from CVS I did not see this 
 mentioned in the fileformats documentation.  Is this the normal 
 behavior for indexes from now on or is something else going on?  When
 
 creating the index I see the .tis, .frq and other files being
 created. 
 Maybe I need to update my indexer, sorry if I did not RTFM.
 
 Claude
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: indexing with 1.4-rc3 only yields single .cfs file

2004-06-15 Thread Otis Gospodnetic
We switched from multi-file to compound index structure in one of the
recent RCs.  This should be mentioned in the CHANGES.txt file.  The change
was made to make it more difficult for people to reach 'Too many open
files' situations.
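
If you prefer the old multi-file layout, the writer can be told not to use
the compound format, a quick sketch (I believe the setter is available on
IndexWriter in the 1.4 RCs; the index path is a placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MultiFileIndex {
  public static void main(String[] args) throws Exception {
    IndexWriter writer =
        new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.setUseCompoundFile(false);   // back to the .tis/.frq/... files
    // ... add documents ...
    writer.close();
  }
}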

Otis


--- Claude Devarenne [EMAIL PROTECTED] wrote:
 Hi,
 
 I just upgraded to 1.4-rc3 and re-indexed my data.  I did not change 
 any code and noticed that in the index directory there is a single
 .cfs 
 file which I am guessing stands for compound file system.  Search
 works 
 fine but after checking out the latest from CVS I did not see this 
 mentioned in the fileformats documentation.  Is this the normal 
 behavior for indexes from now on or is something else going on?  When
 
 creating the index I see the .tis, .frq and other files being
 created.  
 Maybe I need to update my indexer, sorry if I did not RTFM.
 
 Claude
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing japanese PDF documents

2004-03-22 Thread Otis Gospodnetic
I have not tried these other tools yet.
Have you asked Ben Litchfield, the PDFBox author, about handling of
Japanese text?

Otis

--- Chandan Tamrakar [EMAIL PROTECTED] wrote:
 I am using the latest PDFBox library for parsing. I can parse English
 documents successfully, but when I parse a document containing English
 and
 Japanese I do not get what I expected.
 
 Has anyone tried using the PDFBox library for parsing Japanese
 documents? Or
 do I need to use another parser like xPDF or JPedal?
 
 Thanks in advance
 Chandan
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing japanese PDF documents

2004-03-22 Thread Ben Litchfield

Yes he did, but I was away for the past couple of days.  As this is more of a
PDFBox issue I responded in the PDFBox forums; please follow the thread
there if you are interested.

Ben



On Mon, 22 Mar 2004, Otis Gospodnetic wrote:

 I have not tried these other tools yet.
 Have you asked Ben Litchfield, the PDFBox author, about handling of
 Japanese text?

 Otis

 --- Chandan Tamrakar [EMAIL PROTECTED] wrote:
  I am using latest PDFbox library for parsing . I can parse a english
  documents successfully but when I parse a document containing english
  and
  japanese I do not get as I expected .
 
  Have anyone tried using PDFBox library for parsing a japanese
  documents ? Or
  do i need to use other parser like xPDF ,Jpedal ?
 
  Thanks in advace
  Chandan
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing HTML

2004-03-19 Thread Chandan Tamrakar
How do I index an HTML document which may have any encoding, like
EUC, SJIS, Western European or UTF-8? Can I parse and extract the HTML into a
string and then convert it into a text file in Unicode?
Is this an appropriate way to index HTML files? Can anyone suggest a
simple parser other than the parser found in the Lucene demo?

Also, how do I find the encoding of the files? Whenever there are ANSI text
files containing Japanese characters I am not able to convert them into UTF-16;
Lucene indexes properly when I convert them into SJIS.

thanks
chandan



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing multiple instances of the same field for each document

2004-03-01 Thread Roy Klein
I don't have access to the process that created the XML, it was done in
the past.

As I stated in the beginning of this thread, this is just an example of
the type of thing I'm trying to accomplish.

I think the real issue herein is that the fields are being inserted in
reverse order.  Here's the comments in the code (for Document.add()):

  /** Adds a field to a document.  Several fields may be added with
   * the same name.  In this case, if the fields are indexed, their text
is
   * treated as though appended for the purposes of search. */

I guess it doesn't specify the order in which they're appended; however, when I
read that comment, I thought it meant in the order added.  It's a
pretty simple change to the Document class to make this work as I'd
expect.  From Doug's initial response, I think he expected this
behavior as well.


Thanks again for all your help!

Roy


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Sunday, February 29, 2004 9:10 AM
To: Lucene Users List
Subject: Re: Indexing multiple instances of the same field for each
document


What you are doing is really the job of an Analyzer.  You are doing 
pre-analysis, when instead you could do all of this within the context 
of a custom analyzer and avoid many of these issues altogether.

Do you use the XML only during indexing?  If so, you could bypass the 
whole conversion to XML and then back through Digester all within an 
analyzer.

Or am I missing something that prevents you from doing it this way?

Erik


On Feb 28, 2004, at 10:05 PM, Roy Klein wrote:
 Erik,
 Here's a brief example of the type of thing I'm trying to do:

 I have a file that contains the words:

 The quick brown fox jumped over the lazy dog.

 I run that file through a utility that produces the following xml
 document:
 <document>
   <field name="wordposition1">
     <word>The</word>
   </field>
   <field name="wordposition2">
     <word>quick</word>
     <word>fast</word>
     <word>speedy</word>
   </field>
   <field name="wordposition3">
     <word>brown</word>
     <word>tan</word>
     <word>dark</word>
   </field>
   .
   .
   .

 I parse that document (via the digester), and add all the words from 
 each of the fields to one lucene field: contents.  The tricky part 
 is that I want to have each word position contain all the words at 
 that position in the lucene index.  I.e. word location 1 in the index 
 contains The, word location 2: quick, fast, and speedy, word 
 location 3: brown, tan, and dark, etc.

 That way, all the following phrase queries will match this document:
   "fast tan"
   "quick brown"
   "fast brown"

 I wrote a TermAnalyzer that adds all the words from a field into the

 index at the same position. (via setPositionIncrement(0)).  That way I

 can simply add each set of words to the contents field, and it'll 
 just keep adding them to the same field.  However, since it's 
 reversing them,
 I can't match phrases.


 Roy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing multiple instances of the same field for each document

2004-03-01 Thread Roy Klein
Thanks Doug!

I was in the midst of testing my fix to it and noticed your checkin...

Roy

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 01, 2004 12:33 PM
To: Lucene Users List
Subject: Re: Indexing multiple instances of the same field for each
document


Erik Hatcher wrote:
 On Feb 27, 2004, at 6:17 PM, Doug Cutting wrote:
 
 I think it's document.add().  Fields are pushed onto the front, 
 rather
 than added to the end.
 
 
 Ah, ok DocumentFieldList/DocumentFieldEnumeration are the 
 culprits.
 
 This is certainly a bug.

Yes, a bug that's been there since the genesis of Lucene, six years ago.

  It is surprising that something like this could go so long unnoticed.

I just fixed this in CVS.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple instances of the same field for each document

2004-02-29 Thread Markus Spath
Roy Klein wrote:

Erik,

Indexing a single field in chunks solves a design problem I'm working
on. It's not the only way to do it, but, it would certainly be the most
straightforward.  However, if using this method makes phrase searching
unusable, then I'll have to go another route.
hmm, wouldn't it be easier to index only one term for a list of synonyms instead
of indexing each synonym for one term?

quick, fast, speedy -> quick (both when building the index and building the query)

this would also solve your problems with the (somewhat counterintuitive but
probably well reasoned) behaviour of Lucene adding Fields with the same name at
the beginning instead of appending them.
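
A rough sketch of that approach as a TokenFilter (the synonym lookup is
hard-coded here purely for illustration); the same analyzer would be used
for indexing and for the QueryParser so both sides agree:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Maps every synonym in a group onto one canonical term, e.g. fast/speedy -> quick.
public class CanonicalTermFilter extends TokenFilter {
  public CanonicalTermFilter(TokenStream in) {
    super(in);
  }
  public Token next() throws IOException {
    Token t = input.next();
    if (t == null)
      return null;
    String term = t.termText();
    if (term.equals("fast") || term.equals("speedy"))   // made-up lookup
      return new Token("quick", t.startOffset(), t.endOffset());
    return t;
  }
}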

Markus

Here's a brief example of the type of thing I'm trying to do:

I have a file that contains the words:

The quick brown fox jumped over the lazy dog.

I run that file through a utility that produces the following xml
document:
document
  field name=wordposition1
wordThe/word
  /field
  field name=wordposition2
wordquick/word
wordfast/word
wordspeedy/word
  /field
  field name=wordposition3
wordbrown/word
wordtan/word
worddark/word
  /field
  .
  .
  .
I parse that document (via the digester), and add all the words from
each of the fields to one lucene field: contents.  The tricky part is
that I want to have each word position contain all the words at that
position in the lucene index.  I.e. word location 1 in the index
contains The, word location 2: quick, fast, and speedy, word
location 3: brown, tan, and dark, etc.
That way, all the following phrase queries will match this document:
fast tan
quick brown
  fast brown


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Indexing multiple instances of the same field for each document

2004-02-29 Thread Roy Klein
Hi Markus,

What you're saying would work if I wasn't concerned about query
performance.

If I add the synonyms at document index time, then I only process the
word "quick" once (when I insert the doc into the index).

If I process each query to convert "fast" and "speedy" to "quick" at
query time, then I might wind up processing those words millions of
times (once for each query).  Yes, I could come up with a cache so that
the processing is kept to a minimum; however, it still makes more sense to do
it once, at index time.
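
For reference, a rough sketch of index-time synonym injection along these
lines (this is not Roy's actual TermAnalyzer; lookUpSynonyms() is a made-up
stand-in for whatever produces the synonym list):

import java.io.IOException;
import java.util.Stack;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SynonymFilter extends TokenFilter {
  private final Stack pending = new Stack();

  public SynonymFilter(TokenStream in) {
    super(in);
  }

  public Token next() throws IOException {
    if (!pending.isEmpty())
      return (Token) pending.pop();
    Token t = input.next();
    if (t == null)
      return null;
    String[] synonyms = lookUpSynonyms(t.termText());
    for (int i = 0; i < synonyms.length; i++) {
      Token syn = new Token(synonyms[i], t.startOffset(), t.endOffset());
      syn.setPositionIncrement(0);   // stack the synonym on the same position
      pending.push(syn);
    }
    return t;
  }

  private String[] lookUpSynonyms(String term) {
    // hypothetical: "quick" -> {"fast", "speedy"}
    return term.equals("quick")
        ? new String[] { "fast", "speedy" }
        : new String[0];
  }
}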

Roy

-Original Message-
From: Markus Spath [mailto:[EMAIL PROTECTED] 
Sent: Sunday, February 29, 2004 5:45 AM
To: Lucene Users List
Subject: Re: Indexing multiple instances of the same field for each
document


Roy Klein wrote:

 Erik,
 
 Indexing a single field in chunks solves a design problem I'm working 
 on. It's not the only way to do it, but, it would certainly be the 
 most straightforward.  However, if using this method makes phrase 
 searching unusable, then I'll have to go another route.
 

hmm, wouldn't it be easier to index only one term for a list of synomys
instead 
of indexing each synonym for one term?

quick, fast, speedy - quick (both when building the index and building
the query)

this also would solve your problems with the (somehow counterintuative
but 
probably well reasoned) behaviour of lucene to add Fields with the same
name at 
the beginning instead of appending them.


Markus

 Here's a brief example of the type of thing I'm trying to do:
 
 I have a file that contains the words:
 
 The quick brown fox jumped over the lazy dog.
 
 I run that file through a utility that produces the following xml
 document:
 document
   field name=wordposition1
 wordThe/word
   /field
   field name=wordposition2
 wordquick/word
 wordfast/word
 wordspeedy/word
   /field
   field name=wordposition3
 wordbrown/word
 wordtan/word
 worddark/word
   /field
   .
   .
   .
 
 I parse that document (via the digester), and add all the words from 
 each of the fields to one lucene field: contents.  The tricky part 
 is that I want to have each word position contain all the words at 
 that position in the lucene index.  I.e. word location 1 in the index 
 contains The, word location 2: quick, fast, and speedy, word 
 location 3: brown, tan, and dark, etc.
 
 That way, all the following phrase queries will match this document:
   fast tan
   quick brown
   fast brown
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple instances of the same field for each document

2004-02-29 Thread Erik Hatcher
What you are doing is really the job of an Analyzer.  You are doing 
pre-analysis, when instead you could do all of this within the context 
of a custom analyzer and avoid many of these issues altogether.

Do you use the XML only during indexing?  If so, you could bypass the 
whole conversion to XML and then back through Digester all within an 
analyzer.

Or am I missing something that prevents you from doing it this way?

	Erik

On Feb 28, 2004, at 10:05 PM, Roy Klein wrote:
Erik,
Here's a brief example of the type of thing I'm trying to do:
I have a file that contains the words:

The quick brown fox jumped over the lazy dog.

I run that file through a utility that produces the following xml
document:
<document>
  <field name="wordposition1">
    <word>The</word>
  </field>
  <field name="wordposition2">
    <word>quick</word>
    <word>fast</word>
    <word>speedy</word>
  </field>
  <field name="wordposition3">
    <word>brown</word>
    <word>tan</word>
    <word>dark</word>
  </field>
  .
  .
  .
I parse that document (via the digester), and add all the words from
each of the fields to one lucene field: contents.  The tricky part is
that I want to have each word position contain all the words at that
position in the lucene index.  I.e. word location 1 in the index
contains The, word location 2: quick, fast, and speedy, word
location 3: brown, tan, and dark, etc.
That way, all the following phrase queries will match this document:
"fast tan"
"quick brown"
  "fast brown"
I wrote a TermAnalyzer that adds all the words from a field into the
index at the same position. (via setPositionIncrement(0)).  That way I
can simply add each set of words to the contents field, and it'll 
just
keep adding them to the same field.  However, since it's reversing 
them,
I can't match phrases.

Roy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Indexing multiple instances of the same field for each docume nt

2004-02-28 Thread Moray McConnachie \(OA\)
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, February 27, 2004 4:07 PM
Subject: Re: Indexing multiple instances of the same field for each docume
nt

  Does this mean that whenever I want to do keyword searches, I must
  avoid
  QueryParser?

 Not necessarily.  This is a bit of an involved issue, and I posted a
 more extensive reply on this a few weeks ago (pasting a bit of our
 Lucene in Action discussion on it - perhaps search for
 KeywordAnalyzer to find that mail)

 Look into PerFieldAnalyzerWrapper.

Thanks for this tip, I've mostly done it now using this route - I guess one
could also derive a new Analyzer that does a switch on the basis of
FieldName but that wouldn't be so flexible.

I see from the DocumentWriter class that all keyword fields are indexed
exactly, including case-sensitivity. This really tripped me up, since my
version of the KeywordAnalyzer (left by Eric as an exercise to the reader)
was applying the LowerCaseFilter, and therefore I got no matches.

I guess the best way to handle this problem, other than getting the
application to transform values prior to query or indexing, is actually to
tokenize the field after all, but use the same KeywordAnalyzer to do it!

Yours,
Moray McConnachie


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple instances of the same field for each docume nt

2004-02-28 Thread Erik Hatcher
On Feb 28, 2004, at 5:38 PM, Moray McConnachie (OA) wrote:
- Original Message -
I guess the best way to handle this problem, other than getting the
application to transform values prior to query or indexing, is 
actually to
tokenize the field after all, but use the same KeywordAnalyzer to do 
it!
Bingo... this is the same thinking I've had on this subject.  Why even 
bother with Field.Keyword and the confusion that occurs with 
QueryParser and such?  Just use a KeywordAnalyzer and 
PerFieldAnalyzerWrapper setup instead for both indexing and 
querying... at least that seems a more confusion-free route to go in a 
lot of ways.
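
A quick sketch of such a setup (a home-grown KeywordAnalyzer, since the
Lucene distribution doesn't ship one; the "partnum" field name is made up,
and very long keyword values would be truncated by CharTokenizer's internal
buffer limit):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Emits the entire field value as a single, un-lowercased token.
class KeywordAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new CharTokenizer(reader) {
      protected boolean isTokenChar(char c) {
        return true;   // never split
      }
    };
  }
}

public class AnalyzerSetup {
  // the same wrapper is used when indexing and when parsing queries
  public static Analyzer buildAnalyzer() {
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    wrapper.addAnalyzer("partnum", new KeywordAnalyzer());
    return wrapper;
  }
}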

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

