Re: Access Lucene from PHP or Perl
why not use something like XML-RPC? Bernhard Greetings. Can anyone point me to a how-to tutorial on how to access Lucene from a web page generated by PHP or Perl? I've been looking but couldn't find anything. Thanks a lot. And - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Retrieve all documents - possible?
you could use something like: int maxDoc = reader.maxDoc(); for (int i = 0; i < maxDoc; i++) { if (!reader.isDeleted(i)) { Document doc = reader.document(i); } } Bernhard Hi, is it possible to retrieve ALL documents from a Lucene index? This should then actually not be a search... Karl
Re: Disk space used by optimize
However, three times the space sounds a bit too much, or I made a mistake in the book. :) There already was a discussion about disk usage during index optimize. Please have a look at the developers list at: http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1797569 where i made some measurements about the disk usage within lucene. At that time i proposed a patch which reduced the total used disk size from 3 times to a little more than 2 times the final index size. Together with Christoph we implemented some improvements to the optimization patch and finally committed the changes. Bernhard
Re: English and French documents together / analysis, indexing, searching
i think the easiest way is to use Lucene's StandardAnalyzer. If you want to use the Snowball stemmers, you have to add a language guesser to get the language for the particular document before creating the analyzer. regards Bernhard [EMAIL PROTECTED] schrieb: Greetings everyone. I wonder, is there a solution for analyzing both English and French documents using the same analyzer? The reason being that we have predominantly English documents but there are some French ones, yet it all has to go into the same index and be searchable from the same location during any particular search. Is there a way to analyze both types of documents with the same analyzer (and which one)? I've looked around and I see there's a Snowball analyzer, but you have to specify the language of analysis, and I do not know that ahead of time during indexing, nor do I know it most of the time during searching (users would like to search in both document types). There's also the issue of letter accents in French words and searching for the same (how are they indexed in the first place even)? Has anyone dealt with this before and how did you solve the problem? thanks -pedja
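The language guesser Bernhard mentions is not a Lucene class; a very small sketch of one (entirely hypothetical names, and assuming you only need to separate English from French) is to count stopword hits per language and pick the winner, then choose the matching Snowball analyzer from the guess:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal stopword-based language guesser (hypothetical helper, not part of
// Lucene): whichever language's stopwords appear more often in the text wins.
public class LanguageGuesser {
    private static final Set<String> EN = new HashSet<String>(Arrays.asList(
            "the", "and", "of", "to", "is", "in", "that", "it"));
    private static final Set<String> FR = new HashSet<String>(Arrays.asList(
            "le", "la", "les", "et", "de", "un", "une", "est"));

    public static String guess(String text) {
        int en = 0, fr = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (EN.contains(token)) en++;
            if (FR.contains(token)) fr++;
        }
        return fr > en ? "fr" : "en"; // defaults to English on a tie
    }

    public static void main(String[] args) {
        System.out.println(guess("le chat est sur la table")); // fr
        System.out.println(guess("the cat is on the table"));  // en
    }
}
```

Real corpora would need larger stopword lists (or n-gram statistics), but for document-sized inputs even this crude count is usually decisive.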
Re: TermPositionVector
Siddharth, i tested your code and the return is true and not false as you wrote. I assume that there is something else which is wrong. Bernhard Siddharth Vijayakrishnan schrieb: Hi, I am adding a field to a document in the index as follows: doc.add(new Field("contents", reader, Field.TermVector.WITH_POSITIONS)) Later, I query the index and get the document id of this document. The following code, however, prints false. TermFreqVector tfv = reader.getTermFreqVector(docId, "contents"); System.out.println("Is a TermPositionVector " + (tfv instanceof TermPositionVector)); Using Field.TermVector.WITH_POSITIONS_OFFSETS while creating the field also produces the same result. Can someone tell me why this is happening? Thanks, Siddharth
Re: English and French documents together / analysis, indexing, searching
Right now I am using StandardAnalyzer but the results are not what I'd hoped for. Also, since my understanding is that we should use the same analyzer for searching that was used for indexing, even if I can manage to guess the language during indexing and apply it to the Snowball analyzer, I wouldn't be able to use Snowball for searching because users want to search through both English and French, and I suppose I would not get the same results if it were used with StandardAnalyzer? You could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem? Another problem with StandardAnalyzer is that it breaks up some words that should not be broken (in our case document identifiers such as ABC-1234 etc.), but that's a secondary issue... This behaviour is implemented in the StandardTokenizer used by StandardAnalyzer. Look at the documentation of StandardTokenizer: Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. Bernhard
Re: English and French documents together / analysis, indexing, searching
You could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem? Would that mean I would have to actually conduct two searches (one in English and one in French), then merge the results and display them to the user? It sounds to me like a long way around, so then actually writing an analyzer that has the language guesser might be a better solution in the long run? It's no problem to guess the language based on the document corpus. But how do you want to guess the language of a simple term query? What if your users are searching for names like George Bush? You can't guess the language of such a query, and you have to expand it into both languages. I don't see an easier way of solving that problem. This behaviour is implemented in the StandardTokenizer used by StandardAnalyzer. Look at the documentation of StandardTokenizer: Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. Hmm, I feel writing my own tokenizer is beyond my abilities at the moment, without more in-depth knowledge of everything else. Perhaps I'll try taking the StandardTokenizer and expanding or changing it based on other tokenizers available in Lucene such as WhitespaceTokenizer. What about using the WhitespaceAnalyzer directly? Maybe this fits your requirement better, and you could use it for both languages. Bernhard
Re: problem indexing large document collction on windows xp
Thilo, thanks for your effort. Could you please open a new entry in Bugzilla, mark it as [PATCH] and add the diff file with your changes. This ensures that the sources and the information will not get lost in the huge universe of mailing lists. As soon as there is time, one of the committers will review it and decide if it should be committed. Bernhard Hello, I encountered a problem when i tried to index large document collections (about 20 million documents). The indexing failed with the IOException: Cannot delete deletables. I tried several times (with the same document collection) and always received the error, but after a different number of documents. The exception is thrown after failing to delete the specified file at line 212 in FSDirectory.java. I found the following cure: after the lines if (nu.exists()) if (!nu.delete()) { i replaced throw new IOException("Cannot delete " + to); with while (nu.exists()) { nu.delete(); System.out.println("delete loop"); try { Thread.sleep(5000); } catch (InterruptedException e) { throw new RuntimeException(e); } } That is, i now retry deleting the file until it is successful. After the changes, i was able to index all documents. From the fact that i observed "delete loop" several times on the output console, it can be deduced that the body of the while loop was entered (and left) several times. I am running lucene on windows xp. Regards Thilo
Re: CFS file?
Steve Rajavuori schrieb: Can someone tell me the purpose of the .CFS files? The Index File Formats page does not mention this type of file. uuuh, you're right, it is not documented at fileformats.html. Since Lucene 1.4, the individual index files are by default stored within one single compound file which has the file extension .cfs. You can switch that behaviour off using IndexWriter's setUseCompoundFile(false). Bernhard
Re: Indexing with Lucene 1.4.3
That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work. Chuck is right, the index looks fine and will be searchable. Since Lucene version 1.4, the index is by default stored using the compound file format. The index files you are missing are merged within one compound file which has the extension .cfs. You can disable the compound file option using IndexWriter's setUseCompoundFile(false). Bernhard -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 11:00 AM To: Lucene Users List Subject: Indexing with Lucene 1.4.3 Hello, I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing my index directory only has 3 files: segments, deletable and _5en.cfs. Can someone tell me what is going on and where the actual index files are? How can I resolve this issue? Thanks. -H
Re: auto-generate uid?
Just to clarify: I have a field 'uid' whose value is a unique integer. I use it as a key to the document stored externally. I don't mean Lucene's internal document number. I was wondering if there is a method to query the highest value of a field, perhaps something like: IndexReader.maxTerm('uid'). What you could do is write your own IndexWriter class by extending the original one found in org.apache.lucene.index.IndexWriter. Then you have direct access to lucene's segment counter, which could provide you a unique id for each document in the index. Those ids would stay sticky even if you modify the index after the initial creation process. Is that the hint you need to get started? regards Bernhard What would the purpose of an auto-generated UID be? But no, Lucene does not generate UIDs for you. Documents are numbered internally by their insertion order. This number changes, however, when documents are deleted in the middle and the index is optimized. Erik On Nov 22, 2004, at 1:50 PM, aurora wrote: Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid and let the application add one to it would do. Thanks.
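Since Lucene offers no built-in uid generator, one application-side sketch (all names hypothetical) is to collect the existing 'uid' values once at startup — e.g. by walking IndexReader.terms() for that field — seed a counter with the maximum, and hand out increments from there:

```java
// Hypothetical application-side uid generator (not a Lucene API): seed it
// once from the uids already in the index, then hand out the next value.
public class UidGenerator {
    private int next;

    public UidGenerator(int[] existingUids) {
        int max = -1; // empty index: the first uid handed out will be 0
        for (int i = 0; i < existingUids.length; i++) {
            if (existingUids[i] > max) {
                max = existingUids[i];
            }
        }
        next = max + 1;
    }

    // synchronized so concurrent indexing threads never share a uid
    public synchronized int nextUid() {
        return next++;
    }

    public static void main(String[] args) {
        UidGenerator gen = new UidGenerator(new int[] { 3, 7, 2 });
        System.out.println(gen.nextUid()); // 8
        System.out.println(gen.nextUid()); // 9
    }
}
```

Unlike Lucene's internal document numbers, these uids survive deletes and optimizes because they live in your own stored field.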
Re: Parsing .ppt
Hi, i tested the implementation. It seems to work with basic Powerpoint slides. The problem i have is that it doesn't extract special characters like German umlauts. Has anybody already addressed this problem? thanks Bernhard Magnus Johansson schrieb: There's some code using POI at http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html /magnus Luke Shannon wrote: Hey All; Anyone know a good API for parsing MS Powerpoint files? Luke
Re: about Stemming
Miguel Angel schrieb: Hi, I have used the demos of lucene and I want to know how it is possible to add stemming to my applications. Have a look at the lucene-sandbox. Under contributions there are stemmers for many different languages.
Re: Transaction in Lucene
The message "No tvx file" can appear if you have term vectors enabled during indexing and the documents you are adding have empty fields. As an example, if you try to index html documents where many of them don't have a valid html title, the message will show up. Looking at the term vector relevant code, this is nothing you have to worry about, it is just a status message. Otis is right, it is planned for future releases to avoid System.out.println() statements within lucene. regards Bernhard Otis Gospodnetic schrieb: I'm not sure about the tvx error, but I think I recall somebody changing some code around it a month or two ago. I also believe System.out.println is on the TODO list for elimination. Otis --- commandor [EMAIL PROTECTED] wrote: Hello, I came across the following problem with "No tvx file". How could I manage to get it? I would like to have transaction processes in Lucene. After reading the dev-lucene and user-lucene lists and analysing what people suggested, I made up my own. The problem in my case is that I had to make several changes and only then make a commit. That's why I did the following: 1. Turn off the Lucene lock (setting the corresponding system variable = false) 2. Start the loop (from the first document to the last one to change in the index) 2.1. Open IndexReader 2.2. Get a document by its id 2.3. Store it as a local variable 2.4. IndexReader.delete(document id) 2.5. IndexReader.close() 2.6. Merge new Terms (changes) and old ones in the document I retrieved 2.7. Open IndexWriter 2.8. Add the newly made document 3. end of loop 4. When other actions in my program end, I close the IndexWriter. The Result: Everything works fine but I got "No tvx file". I really worried about it because I read what the tvx file is for... Might anybody explain to me what I did wrong?
In spite of your answer, I'd like to point out the following about the way of logging messages: this message appeared with the help of System.out.println(). Investigating the code of Lucene I found a lot of places using System.out. I guess it is not a very good solution, especially in so beautiful a search/indexing API. I guess Lucene should have a normal log to write its messages. Thanks in advance...
Re: Does lucene makes any compression
The lucene version from CVS head now has an option to store and compress whole text files (binary fields within a lucene document) through GZip. The index itself is not GZip compressed. Due to the nature of how the index is created and stored, it is very effective regarding disk space without the need for additional compression. I have no idea if the new functionality has already been adopted within the c# port. regards Bernhard abdulrahman galal schrieb: i got the c# version of lucene, thanks god @ http://sourceforge.net/projects/nlucene what about the new version that includes the compression facility? you didn't reply to my question: does it compress the original text files and its indexes like the Great MG? thanks alot
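The GZip handling described above is Lucene-independent; a sketch of the round trip such a compressed stored field relies on, using only java.util.zip (class and method names here are illustrative, not the patch's actual API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of the GZip round trip behind a compressed stored field.
public class GzipFieldDemo {

    static byte[] compress(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(bos);
            gz.write(data);
            gz.close(); // flushes and writes the GZip trailer
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should not fail
        }
    }

    static byte[] decompress(byte[] data) {
        try {
            GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data));
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience round trip over UTF-8 text.
    static String roundTrip(String text) {
        try {
            return new String(decompress(compress(text.getBytes("UTF-8"))), "UTF-8");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String text = "some stored document text, repeated text compresses well well well";
        System.out.println(roundTrip(text).equals(text)); // lossless round trip
    }
}
```

The compressed bytes would be what gets stored as the binary field value; the index terms themselves stay uncompressed and searchable.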
Re: LUCENE INDEX STATISTICS
please take a look at http://jakarta.apache.org/lucene/docs/benchmarks.html bernhard Karthik N S schrieb: Hi Guys, Apologies. Can somebody provide approximate statistics about the following factors for development and deployment of Lucene [ it may be useful for pro developers ]: a) Creating indexes: 1) X [ say 100 million ] number of documents of Y [ kilobytes ] with Z no of fields - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] 2) X [ say 100 million ] number of merged indexes to create - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] b) Searching on indexes [ 2 persons searching per sec ]: 1) X [ say 100 million ] number of documents of Y [ kilobytes ] with Z no of fields - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] 2) X [ say 100 million ] number of merged indexes - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] Thx in Advance Karthik WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK ]
Re: Searching against index in memory
Ravi schrieb: If I have a document set of 10,000 docs and my merge factor is 1000, then for every 1000 documents Lucene creates a new segment. By the time Lucene has indexed 4500 documents, the index will have 4000 documents on disk and the index for 500 documents is stored in memory. How can I search against this index at the same time from a different JVM? I can access the 4000 docs on disk, but what about those in memory on the indexing box? Is there a way to do this? Currently, i'm not sure there is a solution for this. The easiest way would be to reduce the merge factor so that not too many documents will be in memory, but this will also slow down your indexing process. bernhard Thanks, Ravi.
Re: Seraching in Keyword Field
Hi, try this query: MyKeywordField:"ABC" regards bernhard Rosen Marinov wrote: Hi all, I have a Keyword field in my Lucene docs and i am trying to execute some queries on this field. 1. MyKeywordField:([ABC TO ABC]) - this query is OK and returns the expected result 2. MyKeywordField:(ABC) - but this returns nothing. I am using SimpleAnalyzer - is the problem in the analyzer? If yes, which one do i have to use to make query 2 work? How can i make query 2 work? I know that Keyword fields are not analyzed, so maybe the problem is not in the analyzer. But for QueryParser i use SimpleAnalyzer again, maybe that is my mistake? However, how do i make query 2 work properly (as i expect)? I know that it will find only fields with the exact ABC value, is this the expected behaviour? Best Regards Rosen
Re: indexing size
Dmitry Serebrennikov wrote: Niraj Alok wrote: Hi PA, Thanks for the detail! Since we are using lucene to store the data also, I guess I would not be able to use it. By the way, I could be wrong, but I think the 35% figure you referenced in your first e-mail actually does not include any stored fields. The deal with 35% was, I think, to illustrate that the index data structures used for searching by Lucene are efficient. But Lucene does nothing special with stored content - no compression or anything like that. So you end up with the pure size of your data plus the 35% for the indexed data. There will be a patch available by the end of this week which allows you to store binary values compressed within a lucene index. It means that you will be able to store and retrieve whole documents within lucene in a very efficient way ;-) regards bernhard Cheers. Dmitry. Regards, Niraj - Original Message - From: petite_abeille [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 01, 2004 1:14 PM Subject: Re: indexing size Hi Niraj, On Sep 01, 2004, at 06:45, Niraj Alok wrote: If I make some of them Field.UnStored, I can see from the javadocs that it will be indexed and tokenized but not stored. If it is not stored, how can I use it while searching? The different types of fields don't impact how you do your search. This is always the same. Using UnStored fields simply means that you use Lucene as a pure index for search purposes only, not for storing any data. Specifically, the assumption is that your original data lives somewhere else, outside of Lucene. If this assumption is true, then you can index everything as UnStored with the addition of one Keyword per document. The Keyword field holds some sort of unique identifier which allows you to retrieve the original data if necessary (e.g. a primary key, an URI, what not).
Here is an example of this approach: (1) For indexing, check the indexValuesWithID() method http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup Note the addition of a Field.Keyword for each document and the use of Field.UnStored for everything else. (2) For fetching, check objectsWithSpecificationAndHitsInStore() http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup HTH. Cheers, PA.
Re: Full web search engine package using Lucene
Anne Y. Zhang wrote: Thanks, David. But it seems that this is downloadable. Could you please provide me the link for the download? Thank you very much! http://www.nutch.org/release/ Ya - Original Message - From: David Spencer [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 2:43 PM Subject: Re: Full web search engine package using Lucene Anne Y. Zhang wrote: Hi, I am assisting a professor with an IR course. We need to provide the students with a fully-functional search engine package, and the professor prefers it being powered by lucene. Since I am new to lucene, can anyone give me some information on where I can get the package? We also want the package to contain the crawling function. Thank you very much! http://www.nutch.org/ Ya
Re: complex searche (newbie)
hi, in general the query parser doesn't allow queries which start with a wildcard. Those queries could end up with very long response times and block your system; this is not what you want. I'm not sure if i understand what you want to do. I expect that you have a field named "type" within a lucene document. For this field you can have different values like "contact", "account" etc. Now you want to search all documents where type is contact. So the query to do this would be type:contact, nothing else is required. Can you try that and give some feedback? best regards Bernhard Wermus Fernando wrote: I am using MultiFieldQueryParser to look up some models. I have several models: account, contacts, tasks, etc. The user chooses models and a query string to look up. Besides the fields for searching, I add some conditions to the query string. If he puts in "john" and chooses contacts, I add the following to the query string: john AND type:contact. But if he wants to look up any contact, MultiFieldQueryParser throws an exception. In this case, the query string is the following: * AND type:contact. Am I choosing the wrong QueryParser, or is there another easy way to look up several fields and at the same time any content?
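On the application side, the advice above can be as small as dropping the free-text part when the user typed nothing, so the query string never starts with a wildcard. A sketch (helper name hypothetical):

```java
// Hypothetical query-string builder: always appends the type filter, and
// only includes the user's text when there is any, so no query ever starts
// with "*".
public class TypeQueryBuilder {
    public static String build(String userText, String type) {
        String filter = "type:" + type;
        if (userText == null || userText.trim().length() == 0) {
            return filter; // no free text: the filter alone, no wildcard
        }
        return "(" + userText.trim() + ") AND " + filter;
    }

    public static void main(String[] args) {
        System.out.println(build("john", "contact")); // (john) AND type:contact
        System.out.println(build("", "contact"));     // type:contact
    }
}
```

For the no-text case you could also skip the parser entirely and build a TermQuery on the type field programmatically.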
Re: weird lock behavior
hi, the IndexReader class provides some public static methods to check if an index is locked. If this is the case, there is also a method to unlock an existing index. You could do something like: Directory dir = FSDirectory.getDirectory(indexDir, false); if (IndexReader.isLocked(dir)) { IndexReader.unlock(dir); } dir.close(); You should also catch the possible IOException in case of an error or if the index can't be unlocked. have fun with it Bernhard [EMAIL PROTECTED] wrote: Hi, I experienced the following situation: Suddenly my query became too slow (c. 10sec instead of c. 1sec) and the number of returned hits changed from c. 2000 to c. 1800. Tracing the case I found the locking file abc...-commit.lck. After deletion of this file everything turned back to normal behavior, i.e. I got my 2000 hits in 1sec. There were no concurrent writing or reading processes running in parallel. Probably the lock file was left over because of an abnormal termination (during development it's ok, but it may happen in production as well). My question is how to handle such a situation, detect it and repair it in case it happens (in real life there are many concurrent processes and I have no idea which lock file to kill).
Re: Advanced timestamp usage (or global value storage)
Avi, i would prefer the second approach. If you already store the date/time when the doc was indexed, you could use the following trick to get the last document added to the index: IndexReader ir = IndexReader.open("/tmp/testindex"); int maxDoc = ir.maxDoc(); while (--maxDoc >= 0) { if (!ir.isDeleted(maxDoc)) { Document doc = ir.document(maxDoc); System.out.println(doc.getField("indexDate")); break; } } What do you think about this implementation? No extra properties, nothing to worry about; all the information is within your index. regards Bernhard Avi Drissman wrote: I've used Lucene for a long time, but only in the most basic way. I have a custom analyzer and a slightly hacked query parser, but in general it's the basic add document/remove document/query documents cycle. In my system, I'm indexing a store of external documents, maintaining an index for full-text querying. However, I might be turned off when documents are added, and then when I'm restarted, I'm going to need to determine the timestamp of the last document added to the index so that I can pick up where I left off. There are three approaches to doing this, two using Lucene. I don't know how I would do the two Lucene approaches, or even if they're possible. 1. Just keep a file in parallel with the index, reading and writing the timestamp of the last indexed document in it. I know how to do this, but I don't like the idea of keeping a separate file. 2. Drop a timestamp onto each document as it's indexed. I've attached timestamp fields to documents in the past so that I could do range queries on them. However, I don't know how to do a query like "the document with the latest timestamp" or even if that's possible. 3. Create a dummy document (with some unique field identifier so you could quickly query for it) with a field "last timestamp". This is a global value storage approach, as you could just store any field with any value on it.
But I'd be updating this timestamp field a lot, which means that every time I updated the index I'd have to remove this special document and reindex it. Is there any way to update the value of a field in a document directly in the index, without removing it and adding it again to the index? The field I'd want to update would just be stored, not indexed or tokenized. Thanks for your help in guiding my exploration into the capabilities of Lucene. Avi
Re: integration of lucene with pdfbox
Santosh, please have a look at the lucene demo package. There are several samples (IndexFiles.java) showing how to add a document to a writer. regards Bernhard Santosh wrote: I don't know how to add a lucene document to the index; i know how to add a given directory. Can anybody please tell me how to add a lucene document to the index? - Original Message - From: Ben Litchfield [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, August 23, 2004 8:13 PM Subject: Re: integration of lucene with pdfbox If you can use lucene on its own then you already know how to add a lucene Document to the index. So you need to be able to take a PDF and get a lucene Document. org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument() does that for you. Ben On Mon, 23 Aug 2004, Santosh wrote: I have downloaded pdfbox and lucene and kept the jar files in the class path. I am able to work with both of them independently, but how can I integrate both? regards Santosh kumar
Re: term frequency data of terms of all documents
Serkan, it's easier to use the IndexReader class to get the information you need. If you just need the doc frequency of each term you could use this sample: IndexReader ir = null; try { if (!IndexReader.indexExists("/tmp/index")) return; ir = IndexReader.open("/tmp/index"); TermEnum termEnum = ir.terms(); while (termEnum.next()) { Term t = termEnum.term(); System.out.println(t.text() + " -- " + ir.docFreq(t)); } } catch (IOException e) { System.out.println(e.toString()); } finally { if (ir != null) { try { ir.close(); } catch (IOException e) { System.err.println("IOException, opened IndexReader can't be closed: " + e.toString()); } } } hope this helps, Bernhard Serkan Oktar wrote: I want to build a list of the terms of all documents and their frequency data. It seems the information I need is in the .tis and .tii files. However I haven't found a way to handle them till now. How can I get the term frequency data? Thanks, Serkan
Re: speeding up queries (MySQL faster)
Yonik, there is another synchronized block in CSInputStream which could lock your second cpu out. Do you think there is a chance to recreate the index (maybe a smaller subset) without the compound file option enabled and run your test again, so that we can see if this helps? regards Bernhard Otis Gospodnetic wrote: Ah, you may be right (no stack trace in the email any more). Somebody recently identified a few bottlenecks that, if I recall correctly, were related to synchronized blocks. I believe Doug committed some improvements, but I can't remember which version of Lucene that is in. It's definitely in 1.4.1. Otis --- Yonik Seeley [EMAIL PROTECTED] wrote: --- Otis Gospodnetic [EMAIL PROTECTED] wrote: The bottleneck seems to be disk IO. But it's not. Linux is caching the whole file, and there really isn't any disk activity at all. Most of the threads are blocked on InputStream.refill, not waiting for the disk, but waiting for their turn in the synchronized block to read from the disk (which is why I asked about caching above that level). CPU is a constant 50% on a dual CPU system (meaning 100% of 1 cpu). -Yonik
Re: Index Size
Rob, as Doug and Paul already mentioned, the index size is definitely too big :-(. What could cause the problem, especially when running on a windows platform, is that an IndexReader is open during the whole indexing process. During indexing, the writer creates temporary segment files which will be merged into bigger segments. When done, the old segment files will be deleted. If there is an open IndexReader, the environment is unable to unlock the files and they stay in the index directory. You will end up with an index several times bigger than the dataset. Can you check your code for any open IndexReaders during indexing, or paste the relevant part to the list so we can have a look at it. hope this helps Bernhard Rob Jose wrote: Hello, I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the files, just indexing and tokenizing. I am using Lucene 1.3 final. Can you guys let me know what you are experiencing? I don't want to go into production with something that I should be configuring better. I am not sure if this helps, but I have a temp index and a real index. I index the file into the temp index, and then merge the temp index into the real index using the addIndexes method on the IndexWriter. I have also set the production writer's setUseCompoundFile to true. I did not set this on the temp index. The last thing that I do before closing the production writer is to call the optimize method. I would really appreciate any ideas to get the index size smaller if it is at all possible. Thanks Rob