RE: Which searched words are found in a document

2004-05-26 Thread Nader S. Henein
Take a look at the highlighter code; you could implement this on the front
end while processing the page.

Nader

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 25, 2004 10:51 AM
To: [EMAIL PROTECTED]
Subject: Which searched words are found in a document


Hi,

I have the following question:
Is there an easy way to see which words from a query were found in a
resulting document?

So if I search for 'cat OR dog' and get a result document with only 'cat' in
it, I would like to ask the searcher object (or something similar) to tell me
that for this result document 'cat' was the only word found.

I did see that it is somehow possible with the explain method, but that does
not give a clean answer. I could also get the contents of the document and do
an indexOf for each search term, but there could be quite a lot of terms in
our case.

Any suggestions?

Thanks,

Edvard Scheffers



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: SELECTIVE Indexing

2004-05-26 Thread Nader S. Henein
So you basically only want to index the parts of your document within
<table> Foo Bar </table> tags.

I'm not sure if there's an easier way, but here's what I do:
1)  Parse the XML files using JDOM (or any XML parser that floats your boat)
into a Map or an ArrayList.
2)  Create a Lucene document and loop through the aforementioned structure
(Map or ArrayList), adding field/value pairs to it like so:
contentDoc.add(new Field(fieldName, fieldValue, true, true, true));

So all you would need to do is put an if statement around the latter
statement, to the effect of:

if (fieldName.equalsIgnoreCase("table")) {
    contentDoc.add(new Field(fieldName, fieldValue, true, true, true));
}
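
Put together, the loop from steps 1) and 2) with that guard in place might look roughly like this. This is a minimal sketch only: the `parsed` map, the class name, and reliance on the "table" field name are my own assumptions, and the `Field` constructor shown is the Lucene 1.3 `(name, value, store, index, token)` form.

```java
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SelectiveIndexer {
    // `parsed` is assumed to be a Map of element-name -> element-text
    // built by whatever XML parser you used in step 1).
    public static Document build(Map parsed) {
        Document contentDoc = new Document();
        for (Iterator it = parsed.entrySet().iterator(); it.hasNext();) {
            Map.Entry entry = (Map.Entry) it.next();
            String fieldName = (String) entry.getKey();
            String fieldValue = (String) entry.getValue();
            // only keep content that came from <table> elements
            if (fieldName.equalsIgnoreCase("table")) {
                // store, index and tokenize the value
                contentDoc.add(new Field(fieldName, fieldValue, true, true, true));
            }
        }
        return contentDoc;
    }
}
```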


This may be overkill, someone feel free to correct me if I'm wrong

Nader

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 19, 2004 1:01 PM
To: Lucene Users List
Subject: RE: SELECTIVE Indexing


Hey Lucene Users

My original intention for indexing was to
index certain portions of the HTML [not the whole document].
If JTidy does not support this, then what are my options?

Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 1:43 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing


I doubt if it can be used as a plug in.
Would be good to know if it can be used as a plug in.

Regards,
Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 12:30
To: Lucene Users List
Subject: RE: SELECTIVE Indexing


Hi

Can I use Tidy [as a plug-in] with Lucene?


with regards
Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 3:27 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing



Try using Tidy. It creates a Document from the HTML and allows you to apply
XPath. Hope this helps.

Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing



Hi all

   Can somebody tell me how to index a CERTAIN PORTION OF THE HTML FILE only?

   ex:-
<table> ...
   ...
</table>


with regards
Karthik




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Which searched words are found in a document

2004-05-26 Thread Edvard Scheffers
I looked at the highlighter code, but the query term extractor retrieves
the terms from the original query. Since I only want the terms that were
actually found, the best way is probably to parse the result of the explain
method.
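
If parsing the explain output proves too fragile, one alternative (a sketch under my own assumptions, not a confirmed recipe from this thread) is to take the query's terms, e.g. from the highlighter's term extraction, and check each term's postings for the hit's internal document number via the Lucene 1.3 `IndexReader.termDocs(Term)` API. The class and method names here are illustrative.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class MatchedTerms {
    // `queryTerms` is a Set of Term objects extracted from the query;
    // `docNum` is the internal document number of the hit.
    public static Set findIn(IndexReader reader, Set queryTerms, int docNum)
            throws IOException {
        Set matched = new HashSet();
        for (Iterator it = queryTerms.iterator(); it.hasNext();) {
            Term term = (Term) it.next();
            TermDocs td = reader.termDocs(term); // postings for this term
            try {
                while (td.next()) {
                    if (td.doc() == docNum) {
                        matched.add(term);  // this hit contains the term
                        break;
                    }
                    if (td.doc() > docNum) {
                        break;  // postings are ordered by doc number
                    }
                }
            } finally {
                td.close();
            }
        }
        return matched;
    }
}
```

For 'cat OR dog' this would return just the term for 'cat' when only 'cat' occurs in the hit, without scanning the stored document text.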

Edvard




Memo: RE: RE: Query parser and minus signs

2004-05-26 Thread alex . bourne




I switched to indexing using a text field instead of keyword, then I tried
the following based on various pieces of advice:

PerFieldAnalyzerWrapper pfaw = new PerFieldAnalyzerWrapper(new
ChineseAnalyzer());
pfaw.addAnalyzer("language", new WhitespaceAnalyzer());

try
{
    query = MultiFieldQueryParser.parse(queryString, new
        String[]{"contents", "keywords", "title", "language"}, (Analyzer) pfaw);
    System.out.println("Parsed query: " + query.toString());
}
catch (ParseException e)
{
    error = true;
    e.printStackTrace();
}

I have tried both "language:zh-HK" and "language:zh\-HK" (which appears in
the debugger as "language:zh\\-HK") as the query, and neither returns any
hits. I've tried stepping through the code to see what is being indexed
(which looks OK, at least to a relative beginner like myself), and also
through the search code, but I'm still none the wiser.

Am I doing something wrong, or have I completely missed the point?



To:Alex BOURNE/IBEU/[EMAIL PROTECTED]
cc:
bcc:

Subject:RE: RE: Query parser and minus signs


Remember that Luke does not display the indexed tokens but the stored field,
so you would expect to see en-uk in the field.

doc.add(Field.Keyword("locale", "test-uk"));

Are you adding to the document like this?

Also, which analyzer are you using to parse the query?

org.apache.lucene.analysis.WhitespaceAnalyzer : parses as locale:en-uk
org.apache.lucene.analysis.SimpleAnalyzer : parses as locale:"en uk"
org.apache.lucene.analysis.standard.StandardAnalyzer : parses as locale:"en uk"

Try using the whitespace analyzer in Luke and see how it's interpreting the
query. If you are storing as a keyword but searching with tokens, that may
be your problem.
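
If the field really is an untokenized Field.Keyword, one way to sidestep QueryParser escaping entirely is to add the locale restriction programmatically as a TermQuery, so the hyphen is never seen by any analyzer or parser. A hedged sketch against the Lucene 1.3 API; the class name, method name, and field name are illustrative.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LocaleFilter {
    // Combine the user's already-parsed query with an exact, un-analyzed
    // match on the keyword field. The 1.3 BooleanQuery.add signature is
    // add(Query, boolean required, boolean prohibited).
    public static Query restrictTo(Query userQuery, String localeCode) {
        BooleanQuery combined = new BooleanQuery();
        combined.add(userQuery, true, false);  // required clause
        combined.add(new TermQuery(new Term("locale", localeCode)),
                     true, false);             // required exact locale match
        return combined;
    }
}
```

Called with the parsed user query and "en-uk", this matches the stored keyword term exactly, hyphen and all; flipping the last two booleans to (false, true) would make it a prohibited clause instead.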



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 24 May 2004 09:50
To: Lucene Users List
Subject: RE: RE: Query parser and minus signs






I tried this, but no, it does not work. I'm concerned that escaping the
minus symbol does not appear to work. The field is indexed as a keyword, so it
is not tokenized - I've checked the contents using Luke, which confirms this.




David Townsend [EMAIL PROTECTED] on 21 May 2004 17:02

Please respond to Lucene Users List [EMAIL PROTECTED]

To:Lucene Users List [EMAIL PROTECTED]
cc:
bcc:

Subject:RE: RE: Query parser and minus signs


Doesn't "en UK" as a phrase query work?

You're probably indexing it as a text field, so it's being tokenised.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 21 May 2004 16:47
To: Lucene Users List
Subject: Memo: RE: Query parser and minus signs






Hmm, we may have to if there is no workaround. We're not using Java
locales, but we were trying to stick to the ISO standard, which uses hyphens.




Ryan Sonnek [EMAIL PROTECTED] on 21 May 2004 16:38

Please respond to Lucene Users List [EMAIL PROTECTED]

To:Lucene Users List [EMAIL PROTECTED]
cc:
bcc:

Subject:RE: Query parser and minus signs


If you're dealing with locales, why not use Java's built-in locale syntax
(e.g. en_UK, zh_HK)?

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Friday, May 21, 2004 10:36 AM
 To: [EMAIL PROTECTED]
 Subject: Query parser and minus signs






 Hi All,

 I'm using Lucene on a site that has split content, with a branch containing
 pages in English and a separate branch in Chinese. Some of the Chinese
 pages include some (untranslatable) English words, so when a search is
 carried out in either language you can get pages from the wrong branch. To
 combat this we introduced a language field into the index which contains
 the standard language codes: en-UK and zh-HK.

 When you parse a query, e.g. language:en\-UK, you could reasonably expect
 the search to recover all pages with the language field set to en-UK (the
 minus symbol should be escaped by the backslash according to the FAQ).
 Unfortunately the parser seems to return "en UK" as the parsed query and
 hence returns no documents.

 Has anyone else had this problem, or could anyone suggest a workaround? I
 have yet to find a solution in the mailing archives or elsewhere.

 Many thanks in advance,

 Alex Bourne



 _

 This transmission has been issued by a member of the HSBC Group
 (HSBC) for the information of the addressee only and should not be
 reproduced and / or distributed to any other person. Each page
 attached hereto must be read in conjunction with any disclaimer which
 forms part of it. This transmission is neither an offer nor
 the solicitation
 of an offer to sell or purchase any investment. Its contents
 are based
 on information obtained from sources believed to be reliable but HSBC
 makes no representation and accepts no responsibility or
 liability as to
 its completeness or accuracy.


 

Re: Memo: RE: RE: Query parser and minus signs

2004-05-26 Thread Erik Hatcher
What is the value of your Parsed query: output?

Memo: Re: RE: RE: Query parser and minus signs

2004-05-26 Thread alex . bourne




Being a bit of a newbie, I had tried putting -language:zh-HK by itself,
which it seems will always return no results unless you combine it with
a positive term. However, I then tried the following, and it does not build
the query I had hoped for:

Query: hsbc
Parsed query: contents:hsbc keywords:hsbc title:hsbc language:hsbc
Hits: 206

Query: hsbc -language:zh-HK
Parsed query: (contents:hsbc -language:zh -contents:hk) (keywords:hsbc -language:zh 
-keywords:hk) (title:hsbc -language:zh -title:hk) (language:hsbc
-language:zh -language:HK)
Hits: 169
Not quite what I was expecting from the parsed query - the zh and HK are now separated.

Query: hsbc -language:zh\-HK
Parsed query: (contents:hsbc -language:zh\-HK) (keywords:hsbc -language:zh\-HK) 
(title:hsbc -language:zh\-HK) (language:hsbc -language:zh\-HK)
Hits: 206
And I'm guessing here, but I don't think the backslash is escaping; does it
just become part of the query?






Erik Hatcher [EMAIL PROTECTED] on 26 May 2004 15:11

Please respond to Lucene Users List [EMAIL PROTECTED]

To:Lucene Users List [EMAIL PROTECTED]
cc:
bcc:

Subject:Re: RE: RE: Query parser and minus signs


What is the value of your Parsed query: output?



Re: Memo: Re: RE: RE: Query parser and minus signs

2004-05-26 Thread Erik Hatcher
On May 26, 2004, at 10:48 AM, [EMAIL PROTECTED] wrote:
Query: hsbc -language:zh-HK
Parsed query: (contents:hsbc -language:zh -contents:hk) (keywords:hsbc 
-language:zh -keywords:hk) (title:hsbc -language:zh -title:hk) 
(language:hsbc
-language:zh -language:HK)
Hits: 169
Not quite what I was expecting from the parsed query - the zh and HK 
are now separated.
I think I can safely say that you are not running the latest version of 
Lucene.  This has been corrected in the 1.4 versions.

I've tested this with "Wal-Mart" (without the quotes) and QueryParser,
and it works as expected.


Query: hsbc -language:zh\-HK
Parsed query: (contents:hsbc -language:zh\-HK) (keywords:hsbc 
-language:zh\-HK) (title:hsbc -language:zh\-HK) (language:hsbc 
-language:zh\-HK)
Hits: 206
And I'm guessing here, but I don't think the backslash is escaping; does
it just become part of the query?
Now that is odd.
QueryParser is an awkward beast at times, and combining it with 
MultiFieldQueryParser (which I'd recommend against, as you can see with 
the odd queries it built for you) gets even more confusing.

Hopefully the latest Lucene 1.4 RC release will fix up your situation.
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Asian languages

2004-05-26 Thread Christophe Lombart
Which Asian languages are supported by Lucene?
What about Korean, Japanese, Thai, ...?
If they are not yet supported, what do I need to do?
Thanks,
Christophe
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Memory usage

2004-05-26 Thread James Dunn
Hello,

I was wondering if anyone has had problems with memory
usage and MultiSearcher.

My index is composed of two sub-indexes that I search
with a MultiSearcher.  The total size of the index is
about 3.7GB with the larger sub-index being 3.6GB and
the smaller being 117MB.

I am using Lucene 1.3 Final with the compound file
format.

Also I search across about 50 fields but I don't use
wildcard or range queries. 

Doing repeated searches in this way seems to
eventually chew up about 500MB of memory which seems
excessive to me.

Does anyone have any ideas where I could look to
reduce the memory my queries consume?

Thanks,

Jim




__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/ 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Memory usage

2004-05-26 Thread wallen
This sounds like a memory leak situation. If you are using Tomcat, I
would suggest you make sure you are on a recent version, as it is known to
have some memory leaks in version 4. It doesn't make sense that repeated
queries would use more memory than the most demanding query unless objects
are not getting freed from memory.
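
Another thing worth ruling out is opening a fresh searcher for every request, since each open IndexSearcher keeps its own term-index and norm buffers in memory. A hedged sketch of sharing one MultiSearcher across queries, using the Lucene 1.3 constructors; the index paths and class name are hypothetical.

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class SearcherHolder {
    // Open once and reuse for all queries; opening per request makes
    // every search re-read term indexes and norms from disk into memory.
    private static MultiSearcher searcher;

    public static synchronized MultiSearcher get() throws IOException {
        if (searcher == null) {
            searcher = new MultiSearcher(new Searchable[] {
                new IndexSearcher("/indexes/large"),  // hypothetical paths
                new IndexSearcher("/indexes/small")
            });
        }
        return searcher;
    }
}
```

With this in place, a leak would show up as growth beyond the (one-time) cost of the shared searcher rather than per-query growth.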

-Will




Problem Indexing Large Document Field

2004-05-26 Thread Gilberto Rodriguez
I am trying to index a field in a Lucene document with about 90,000
characters. The problem is that it only indexes part of the document.
It seems to only index about 65,000 characters. So, if I search on terms
that are at the beginning of the text, the search works, but it fails
for terms that are at the end of the document.

Is there a limitation on how many characters can be stored in a 
document field? Any help would be appreciated, thanks

Gilberto Rodriguez
Software Engineer
 
370 CenterPointe Circle, Suite 1178
Altamonte Springs, FL 32701-3451
 
407.339.1177 (Ext.112) • phone
407.339.6704 • fax
[EMAIL PROTECTED] • email
www.conviveon.com • web

This e-mail contains legally privileged and confidential information 
intended only for the individual or entity named within the message. If 
the reader of this message is not the intended recipient, or the agent 
responsible to deliver it to the intended recipient, the recipient is 
hereby notified that any review, dissemination, distribution or copying 
of this communication is prohibited. If this communication was received 
in error, please notify me by reply e-mail and delete the original 
message.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Memory usage

2004-05-26 Thread James Dunn
Will,

Thanks for your response.  It may be an object leak. 
I will look into that.

I just ran some more tests, and this time I created a
20GB index by repeatedly merging my large index into
itself.

When I ran my test query against that index I got an
OutOfMemoryError on the very first query.  I have my
heap set to 512MB.  Should a query against a 20GB
index require that much memory?  I page through the
results 100 at a time, so I should never have more
than 100 Document objects in memory.  

Any help would be appreciated, thanks!

Jim



Re: Problem Indexing Large Document Field

2004-05-26 Thread James Dunn
Gilberto,

Look at the IndexWriter class.  It has a property,
maxFieldLength, which you can set to determine the maximum
number of terms indexed per field.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html
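
As a hedged sketch (the index path is hypothetical): in the 1.3 API, maxFieldLength is a plain public int on the writer, counted in terms rather than characters, with a default of 10,000, which is consistent with indexing cutting off around 65,000 characters of ordinary text.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BigFieldIndexer {
    public static void index(String bigText) throws IOException {
        // "/tmp/index" is a hypothetical location; true = create a new index
        IndexWriter writer = new IndexWriter("/tmp/index",
                                             new StandardAnalyzer(), true);
        // raise the per-field cap (counted in terms, not characters)
        writer.maxFieldLength = 1000000;
        Document doc = new Document();
        doc.add(Field.Text("contents", bigText));  // stored, indexed, tokenized
        writer.addDocument(doc);
        writer.close();
    }
}
```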

Jim




Re: Problem Indexing Large Document Field

2004-05-26 Thread Gilberto Rodriguez
Thanks,  James... That solved the problem.


Re: Memory usage

2004-05-26 Thread Erik Hatcher
How big are your actual Documents?  Are you caching Hits?  It stores, 
internally, up to 200 documents.

Erik
On May 26, 2004, at 4:08 PM, James Dunn wrote:
Will,
Thanks for your response.  It may be an object leak.
I will look into that.
I just ran some more tests and this time I create a
20GB index by repeatedly merging my large index into
itself.
When I ran my test query against that index I got an
OutOfMemoryError on the very first query.  I have my
heap set to 512MB.  Should a query against a 20GB
index require that much memory?  I page through the
results 100 at a time, so I should never have more
than 100 Document objects in memory.
Any help would be appreciated, thanks!
Jim
--- [EMAIL PROTECTED] wrote:
This sounds like a memory leakage situation.  If you
are using tomcat I
would suggest you make sure you are on a recent
version, as it is known to
have some memory leaks in version 4.  It doesn't
make sense that repeated
queries would use more memory that the most
demanding query unless objects
are not getting freed from memory.
-Will
-Original Message-
From: James Dunn [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 3:02 PM
To: [EMAIL PROTECTED]
Subject: Memory usage
Hello,
I was wondering if anyone has had problems with
memory
usage and MultiSearcher.
My index is composed of two sub-indexes that I
search
with a MultiSearcher.  The total size of the index
is
about 3.7GB with the larger sub-index being 3.6GB
and
the smaller being 117MB.
I am using Lucene 1.3 Final with the compound file
format.
Also I search across about 50 fields but I don't use
wildcard or range queries.
Doing repeated searches in this way seems to
eventually chew up about 500MB of memory which seems
excessive to me.
Does anyone have any ideas where I could look to
reduce the memory my queries consume?
Thanks,
Jim



__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com


Re: Memory usage

2004-05-26 Thread Doug Cutting
James Dunn wrote:
Also I search across about 50 fields but I don't use
wildcard or range queries. 
Lucene uses one byte of RAM per document per searched field, to hold the 
normalization values.  So if you search a 10M document collection with 
50 fields, then you'll end up using 500MB of RAM.

If you're using unanalyzed fields, then an easy workaround to reduce the 
number of fields is to combine many in a single field.  So, instead of, 
e.g., using an "f1" field with value "abc", and an "f2" field with value 
"efg", use a single field named "f" with values "1_abc" and "2_efg".

We could optimize this in Lucene.  If no values of an indexed field are 
analyzed, then we could store no norms for the field and hence read none 
into memory.  This wouldn't be too hard to implement...

Doug
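Doug's field-combining workaround can be sketched outside of Lucene itself: map many keyword fields into one field whose values carry a field-number prefix, and rewrite queries accordingly. The class, method names and prefix scheme below are illustrative, not from the original post:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinedField {
    // Turn {"f1":"abc", "f2":"efg"} into ["1_abc", "2_efg"] so a single
    // indexed field "f" can stand in for many keyword fields, avoiding
    // one norm byte per document per field.
    static List<String> combine(Map<String, String> fields) {
        List<String> values = new ArrayList<String>();
        int i = 1;
        for (String value : fields.values()) {
            values.add(i + "_" + value);
            i++;
        }
        return values;
    }

    // A query against the old field fN:term becomes f:N_term.
    static String rewriteQuery(int fieldNumber, String term) {
        return fieldNumber + "_" + term;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("f1", "abc");
        doc.put("f2", "efg");
        System.out.println(combine(doc));           // [1_abc, 2_efg]
        System.out.println(rewriteQuery(2, "efg")); // 2_efg
    }
}
```

The prefixed values would then be added as multiple values of the one field; this only works for unanalyzed (keyword) fields, since an analyzer would split the prefix off.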


Re: Memory usage

2004-05-26 Thread James Dunn
Erik,

Thanks for the response.  

My actual documents are fairly small.  Most docs only
have about 10 fields.  Some of those fields are
stored, however, like the OBJECT_ID, NAME and DESC
fields.  The stored fields are pretty small as well. 
None should be more than 4KB and very few will
approach that limit.

I'm also using the default maxFieldLength value of 10,000.  

I'm not caching hits, either.

Could it be my query?  I have about 80 total unique
fields in the index although no document has all 80. 
My query ends up looking like this:

+(F1:test F2:test ..  F80:test)

From previous mails that doesn't look like an enormous
amount of fields to be searching against.  Is there
some formula for the amount of memory required for a
query based on the number of clauses and terms?

Jim



--- Erik Hatcher [EMAIL PROTECTED] wrote:
 How big are your actual Documents?  Are you caching
 Hits?  It stores, 
 internally, up to 200 documents.
 
   Erik
 
 
 On May 26, 2004, at 4:08 PM, James Dunn wrote:
 
  Will,
 
  Thanks for your response.  It may be an object
 leak.
  I will look into that.
 
  I just ran some more tests and this time I create
 a
  20GB index by repeatedly merging my large index
 into
  itself.
 
  When I ran my test query against that index I got
 an
  OutOfMemoryError on the very first query.  I have
 my
  heap set to 512MB.  Should a query against a 20GB
  index require that much memory?  I page through
 the
  results 100 at a time, so I should never have more
  than 100 Document objects in memory.
 
  Any help would be appreciated, thanks!
 
  Jim
  --- [EMAIL PROTECTED] wrote:
  This sounds like a memory leakage situation.  If
 you
  are using tomcat I
  would suggest you make sure you are on a recent
  version, as it is known to
  have some memory leaks in version 4.  It doesn't
  make sense that repeated
  queries would use more memory that the most
  demanding query unless objects
  are not getting freed from memory.
 
  -Will
 
  -Original Message-
  From: James Dunn [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, May 26, 2004 3:02 PM
  To: [EMAIL PROTECTED]
  Subject: Memory usage
 
 
  Hello,
 
  I was wondering if anyone has had problems with
  memory
  usage and MultiSearcher.
 
  My index is composed of two sub-indexes that I
  search
  with a MultiSearcher.  The total size of the
 index
  is
  about 3.7GB with the larger sub-index being 3.6GB
  and
  the smaller being 117MB.
 
  I am using Lucene 1.3 Final with the compound
 file
  format.
 
  Also I search across about 50 fields but I don't
 use
  wildcard or range queries.
 
  Doing repeated searches in this way seems to
  eventually chew up about 500MB of memory which
 seems
  excessive to me.
 
  Does anyone have any ideas where I could look to
  reduce the memory my queries consume?
 
  Thanks,
 
  Jim
 
 
 
 



classic scenario

2004-05-26 Thread Adrian Dumitru
I salute the Lucene community!
it will be a great help for me if I get your valuable opinions on the
following issue; I know I could have found more answers to my questions by
reading the documentation, but I did invest some time on this and still
have these questions:

I am (also) building a web crawler, a topic-specific one to be more
precise, for a vortal. I recently learned about Lucene and I'd very much
like to use it in order to handle keyword-specific searches on the info
that I collect.
I suspect this is a classic project, at least for Lucene; probably
something like this has been addressed already on this discussion list. I'm
interested to hear any experience anyone might have with this subject.
My crawler goes out on the internet, extracts/parses/ranks and saves
websites; most of the information is also categorized and stored in the
database, but I also save about 10 top pages from each site in the
filesystem.
The first question is: should I care about indexing these files at the
time I extract them from internet? Or should I index them later, when I
make them available for search?
If yes, then can I still name my files the way I want? (i.e. are there any
constraints on the filenames from Lucene's perspective?)
Is it an OK idea to have the same files repository (or index) where the
crawler writes (indexes files) and the search function searches? I guess
performance issues are important here.
Can I still organize the files that I save the way I want? (I planned to
write all the files from a given website on different folders...and the
folders will have as name the id from my database)
I maintain a taxonomy (list of categories)...each website will fall into
one or more of these categories, also each website will have a rank. Does
Lucene have something that I should be aware of related to what I said?

I guess that's it for now...this is more like a pet project for me, a pet
which keeps growing :) I wouldn't mind any help and opinions you can
provide, source code samples, etc.

Big thanks in advance and good luck on your work.
adrian.




Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

Thanks!  

I just asked a question regarding how to calculate the
memory requirements for a search.  Does this memory
get used only during the search operation itself,
or is it referenced by the Hits object or anything
else after the actual search completes?

Thanks again,

Jim


--- Doug Cutting [EMAIL PROTECTED] wrote:
 James Dunn wrote:
  Also I search across about 50 fields but I don't
 use
  wildcard or range queries. 
 
 Lucene uses one byte of RAM per document per
 searched field, to hold the 
 normalization values.  So if you search a 10M
 document collection with 
 50 fields, then you'll end up using 500MB of RAM.
 
 If you're using unanalyzed fields, then an easy
 workaround to reduce the 
 number of fields is to combine many in a single
 field.  So, instead of, 
 e.g., using an "f1" field with value "abc", and an
 "f2" field with value 
 "efg", use a single field named "f" with values
 "1_abc" and "2_efg".
 
 We could optimize this in Lucene.  If no values of
 an indexed field are 
 analyzed, then we could store no norms for the field
 and hence read none 
 into memory.  This wouldn't be too hard to
 implement...
 
 Doug
 




Re: Memory usage

2004-05-26 Thread Doug Cutting
It is cached by the IndexReader and lives until the index reader is 
garbage collected.  50-70 searchable fields is a *lot*.  How many are 
analyzed text, and how many are simply keywords?

Doug
James Dunn wrote:
Doug,
Thanks!  

I just asked a question regarding how to calculate the
memory requirements for a search.  Does this memory
only get used only during the search operation itself,
or is it referenced by the Hits object or anything
else after the actual search completes?
Thanks again,
Jim
--- Doug Cutting [EMAIL PROTECTED] wrote:
James Dunn wrote:
Also I search across about 50 fields but I don't
use
wildcard or range queries. 
Lucene uses one byte of RAM per document per
searched field, to hold the 
normalization values.  So if you search a 10M
document collection with 
50 fields, then you'll end up using 500MB of RAM.

If you're using unanalyzed fields, then an easy
workaround to reduce the 
number of fields is to combine many in a single
field.  So, instead of, 
e.g., using an "f1" field with value "abc", and an
"f2" field with value 
"efg", use a single field named "f" with values
"1_abc" and "2_efg".

We could optimize this in Lucene.  If no values of
an indexed field are 
analyzed, then we could store no norms for the field
and hence read none 
into memory.  This wouldn't be too hard to
implement...

Doug



RE: Problem Indexing Large Document Field

2004-05-26 Thread wallen
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH

maxFieldLength
public int maxFieldLength

The maximum number of terms that will be indexed for a single field in a
document. This limits the amount of memory required for indexing, so that
collections with very large files will not crash the indexing process by
running out of memory.
Note that this effectively truncates large documents, excluding from the
index terms that occur further in the document. If you know your source
documents are large, be sure to set this value high enough to accommodate
the expected size. If you set it to Integer.MAX_VALUE, then the only limit
is your memory, but you should anticipate an OutOfMemoryError.

By default, no more than 10,000 terms will be indexed for a field. 
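The truncation described in that javadoc can be illustrated without Lucene at all: an indexer that stops after maxFieldLength terms simply never sees later terms, so searches for them return nothing. A self-contained sketch (class and method names are mine, not Lucene's):

```java
import java.util.HashSet;
import java.util.Set;

public class TruncationDemo {
    // Index at most maxTerms whitespace-separated terms, mimicking the
    // effect of IndexWriter.maxFieldLength (default 10,000 in Lucene 1.x).
    static Set<String> indexTerms(String text, int maxTerms) {
        Set<String> indexed = new HashSet<String>();
        String[] terms = text.split("\\s+");
        for (int i = 0; i < terms.length && i < maxTerms; i++) {
            indexed.add(terms[i]);
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Build a "document" of 20,000 distinct terms.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("term").append(i).append(' ');
        }
        Set<String> index = indexTerms(sb.toString(), 10000);
        System.out.println(index.contains("term42"));    // true: early term is found
        System.out.println(index.contains("term19999")); // false: past the cap
    }
}
```

This matches the symptom in the thread: terms from the start of a ~90,000-character field are searchable, while terms near the end are not, until maxFieldLength is raised.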



-Original Message-
From: Gilberto Rodriguez [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 4:04 PM
To: [EMAIL PROTECTED]
Subject: Problem Indexing Large Document Field


I am trying to index a field in a Lucene document with about 90,000 
characters. The problem is that it only indexes part of the document. 
It seems to only index about 65,000 characters. So, if I search on terms 
that are at the beginning of the text, the search works, but it fails 
for terms that are at the end of the document.

Is there a limitation on how many characters can be stored in a 
document field? Any help would be appreciated, thanks


Gilberto Rodriguez
Software Engineer
   
370 CenterPointe Circle, Suite 1178
Altamonte Springs, FL 32701-3451
   
407.339.1177 (Ext.112) • phone
407.339.6704 • fax
[EMAIL PROTECTED] • email
www.conviveon.com • web
 
This e-mail contains legally privileged and confidential information 
intended only for the individual or entity named within the message. If 
the reader of this message is not the intended recipient, or the agent 
responsible to deliver it to the intended recipient, the recipient is 
hereby notified that any review, dissemination, distribution or copying 
of this communication is prohibited. If this communication was received 
in error, please notify me by reply e-mail and delete the original 
message.




Re: Problem Indexing Large Document Field

2004-05-26 Thread Gilberto Rodriguez
Yeap, that was the problem...  I just needed to increase the  
maxFieldLength number.

Thanks...
On May 26, 2004, at 5:56 PM, [EMAIL PROTECTED] wrote:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH

maxFieldLength
public int maxFieldLength

The maximum number of terms that will be indexed for a single field in a
document. This limits the amount of memory required for indexing, so that
collections with very large files will not crash the indexing process by
running out of memory.
Note that this effectively truncates large documents, excluding from the
index terms that occur further in the document. If you know your source
documents are large, be sure to set this value high enough to accommodate
the expected size. If you set it to Integer.MAX_VALUE, then the only limit
is your memory, but you should anticipate an OutOfMemoryError.

By default, no more than 10,000 terms will be indexed for a field.

-Original Message-
From: Gilberto Rodriguez [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 4:04 PM
To: [EMAIL PROTECTED]
Subject: Problem Indexing Large Document Field
I am trying to index a field in a Lucene document with about 90,000
characters. The problem is that it only indexes part of the document.
It seems to only index about 65,000 characters. So, if I search on terms
that are at the beginning of the text, the search works, but it fails
for terms that are at the end of the document.
Is there a limitation on how many characters can be stored in a
document field? Any help would be appreciated, thanks


Number query not working

2004-05-26 Thread Reece . 1247688
Hi,

I have a bunch of digits in a field.  When I do this search it returns
nothing:

  myField:001085609805100

It returns the correct document when I add a * to the end like this:

  myField:001085609805100*  -- added the *

I'm not sure what is happening here.  I'm thinking that Lucene is doing
some number conversion internally when it sees only digits.  When I add
the * maybe it presumes it is still a string.

How do I get a string of digits to work without adding a *?

Thanks,
Reece




Re: Number query not working

2004-05-26 Thread Reece . 1247688
Hi,

It looks like its because I'm using the SimpleAnalyzer instead of the
StandardAnalyzer.  What is the SimpleAnalyzer to this query to make it not
work?

Thanks,
Reece

--- Lucene Users List [EMAIL PROTECTED] wrote:
 Hi,

 I have a bunch of digits in a field.  When I do this search it returns
 nothing:

   myField:001085609805100

 It returns the correct document when I add a * to the end like this:

   myField:001085609805100*  -- added the *

 I'm not sure what is happening here.  I'm thinking that Lucene is doing
 some number conversion internally when it sees only digits.  When I add
 the * maybe it presumes it is still a string.

 How do I get a string of digits to work without adding a *?

 Thanks,
 Reece



Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

We only search on analyzed text fields.  There are a
couple of additional fields in the index like
OBJECT_ID that are keywords but we don't search
against those, we only use them once we get a result
back to find the thing that document represents.

Thanks,

Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:
 It is cached by the IndexReader and lives until the
 index reader is 
 garbage collected.  50-70 searchable fields is a
 *lot*.  How many are 
 analyzed text, and how many are simply keywords?
 
 Doug
 
 James Dunn wrote:
  Doug,
  
  Thanks!  
  
  I just asked a question regarding how to calculate
 the
  memory requirements for a search.  Does this
 memory
  only get used only during the search operation
 itself,
  or is it referenced by the Hits object or anything
  else after the actual search completes?
  
  Thanks again,
  
  Jim
  
  
  --- Doug Cutting [EMAIL PROTECTED] wrote:
  
 James Dunn wrote:
 
 Also I search across about 50 fields but I don't
 
 use
 
 wildcard or range queries. 
 
 Lucene uses one byte of RAM per document per
 searched field, to hold the 
 normalization values.  So if you search a 10M
 document collection with 
 50 fields, then you'll end up using 500MB of RAM.
 
 If you're using unanalyzed fields, then an easy
 workaround to reduce the 
 number of fields is to combine many in a single
 field.  So, instead of, 
 e.g., using an f1 field with value abc, and an
 f2 field with value 
 efg, use a single field named f with values
 1_abc and 2_efg.
 
 We could optimize this in Lucene.  If no values of
 an indexed field are 
 analyzed, then we could store no norms for the
 field
 and hence read none 
 into memory.  This wouldn't be too hard to
 implement...
 
 Doug
 
 
  
 

 
  
  
  
  
  
  
  __
  Do you Yahoo!?
  Friends.  Fun.  Try the all-new Yahoo! Messenger.
  http://messenger.yahoo.com/ 
  
 

 








Re: Number query not working

2004-05-26 Thread Reece . 1247688
Whoa!  I reread my last post and the last sentence didn't make much sense.
This is what I meant to say:

What is the SimpleAnalyzer doing to this query to make it not work?

--- Lucene Users List [EMAIL PROTECTED] wrote:
 Hi,

 It looks like its because I'm using the SimpleAnalyzer instead of the
 StandardAnalyzer.  What is the SimpleAnalyzer to this query to make it not
 work?

 Thanks,
 Reece

 --- Lucene Users List [EMAIL PROTECTED] wrote:
  Hi,

  I have a bunch of digits in a field.  When I do this search it returns
  nothing:

    myField:001085609805100

  It returns the correct document when I add a * to the end like this:

    myField:001085609805100*  -- added the *

  I'm not sure what is happening here.  I'm thinking that Lucene is doing
  some number conversion internally when it sees only digits.  When I add
  the * maybe it presumes it is still a string.

  How do I get a string of digits to work without adding a *?

  Thanks,
  Reece



Re: Number query not working

2004-05-26 Thread Erik Hatcher
On May 26, 2004, at 6:38 PM, [EMAIL PROTECTED] wrote:
It looks like its because I'm using the SimpleAnalyzer instead of the
StandardAnalyzer.  What is the SimpleAnalyzer to this query to make it not
work?
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis
It is a good idea to analyze the analyzer.  Do a .toString output of 
the Query and you'll see clearly what happened.

Erik

Thanks,
Reece
--- Lucene Users List [EMAIL PROTECTED]
wrote:
Hi,
I have a bunch of digits in a field.  When I do this search
it returns
nothing:
  myField:001085609805100
It returns
the correct document
when I add a * to the end like this:
  myField:001085609805100*
--
added the *
I'm not sure what is happening here.  I'm thinking
that Lucene
is doing some number conversion internally when it sees only
digits.  When
I add the * maybe it presumes it is still a string.

How do I get a string
of digits to work without adding a *?

Thanks,
Reece


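Erik's "analyze the analyzer" advice can be made concrete here: SimpleAnalyzer tokenizes with a letter-based tokenizer, which keeps only runs of letters, so an all-digit value produces no tokens at all and the query can never match. The sketch below approximates that behavior in plain Java (it is not Lucene's code):

```java
import java.util.ArrayList;
import java.util.List;

public class LetterTokenizerDemo {
    // Approximates the LetterTokenizer used by SimpleAnalyzer: tokens are
    // maximal runs of letters, lowercased; digits are treated as gaps.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("001085609805100")); // [] -- nothing to match
        System.out.println(tokenize("cat 123 dog"));     // [cat, dog]
    }
}
```

This also explains the "*" behavior reported in the thread: a trailing wildcard turns the term into a prefix query, which the query parser does not pass through the analyzer, so the raw digits survive and match the indexed value.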


Re: Asian languages

2004-05-26 Thread Chandan Tamrakar
The CJKAnalyzer supports Chinese, Japanese and Korean; I'm not sure
about Thai.
I got the CJKAnalyzer from the Lucene sandbox.
- Original Message - 
From: Christophe Lombart [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, May 27, 2004 12:01 AM
Subject: Asian languages


 Which Asian languages are supported by Lucene?
 What about Korean, Japanese, Thai, ...?
 If they are not yet supported, what do I need to do?

 Thanks,
 Christophe

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]








Range Query Sombody HELP please

2004-05-26 Thread Karthik N S

Hi Lucene developers,

Is it possible to search and retrieve relevant information from an indexed
document within a specific range, similar to a

query in SQL = select * from BOOKSHELF where book1 between 100 and 200

ex:-

   search_word, Book between 100 AND 200

[ Note: Book is a unique field in the hit info which is already indexed ]

Somebody please help me :(

with regards
Karthik
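For what it's worth, Lucene's query parser does accept range queries (e.g. Book:[100 TO 200]), but the comparison is lexicographic on the indexed terms, so numeric fields generally need zero-padding at index time. A small sketch of why, with an illustrative padding helper (names are mine):

```java
public class PaddedRange {
    // String comparison misorders unpadded numbers: "20" sorts between
    // "100" and "200" even though 20 is outside the numeric range.
    static boolean lexBetween(String value, String lo, String hi) {
        return value.compareTo(lo) >= 0 && value.compareTo(hi) <= 0;
    }

    // Zero-pad to a fixed width so string order matches numeric order.
    static String pad(int n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        System.out.println(lexBetween("150", "100", "200")); // true
        System.out.println(lexBetween("20", "100", "200"));  // true -- wrong! 20 is not in [100,200]
        System.out.println(lexBetween(pad(20, 5), pad(100, 5), pad(200, 5)));  // false, correct
        System.out.println(lexBetween(pad(150, 5), pad(100, 5), pad(200, 5))); // true, correct
    }
}
```

So one common approach is to index the Book field as a fixed-width, zero-padded keyword and query it as Book:[00100 TO 00200].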

