Re: Lucene FieldCache memory requirements

2009-11-03 Thread Michael McCandless
On Mon, Nov 2, 2009 at 9:27 PM, Fuad Efendi f...@efendi.ca wrote:
 I believe this is the correct estimate:

 C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]

   same as
 [String1_Document_Count + ... + String10_Document_Count + ...]
 x [4 bytes per DocumentID]

That's right.

Except: as Mark said, you'll also need transient memory = pointer (4
or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded.  After
it's done being loaded, this sizes down to the number of unique terms.

But, if Lucene did the basic int packing, which really we should do,
since you only have 10 unique values, with a naive 4 bits per doc
encoding, you'd only need 1/8th the memory usage.  We could do a bit
better by encoding more than one document at a time...
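
To make the arithmetic concrete: 4 bits per doc over 100,000,000 docs is
roughly 50Mb, versus roughly 400Mb for a plain int[] of ordinals. A minimal
sketch of such nibble packing (hypothetical code, not Lucene's actual API;
the PackedOrds class below is invented for illustration):

  /** Hypothetical 4-bits-per-document ordinal packing: with only ~10
   *  unique values, each document's term ordinal fits in 4 bits, so
   *  two documents share one byte. */
  class PackedOrds {
    private final byte[] packed; // two 4-bit ordinals per byte

    PackedOrds(int maxDoc) {
      packed = new byte[(maxDoc + 1) / 2];
    }

    void set(int docId, int ord) { // ord must be in 0..15
      int idx = docId >> 1;
      if ((docId & 1) == 0) {
        packed[idx] = (byte) ((packed[idx] & 0xF0) | ord);        // low nibble
      } else {
        packed[idx] = (byte) ((packed[idx] & 0x0F) | (ord << 4)); // high nibble
      }
    }

    int get(int docId) {
      int b = packed[docId >> 1] & 0xFF; // undo sign extension
      return (docId & 1) == 0 ? (b & 0x0F) : (b >>> 4);
    }
  }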

Mike


RE: Lucene FieldCache memory requirements

2009-11-03 Thread Fuad Efendi
Sorry Mike, Mark, I am confused again...

Yes, I need some more memory for processing (while the FieldCache is being
loaded), obviously, but that was not the main subject...

With StringIndexCache, I have 10 arrays (the cardinality of this field is 10)
storing (int) Lucene Document IDs.

 Except: as Mark said, you'll also need transient memory = pointer (4
 or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded.

Ok, I see it:
  final int[] retArray = new int[reader.maxDoc()];
  String[] mterms = new String[reader.maxDoc()+1];

I can't trace it right now (limited in time), but I think mterms is a local
variable and will size down to 0...



So the correct formula is... a weird one... if you don't want unexpected OOMs
or an overloaded GC (WeakHashMaps...):

  [some heap] + [Non-Tokenized_Field_Count] x [maxdoc] x [4 bytes + 8 bytes]

(for 64-bit)
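
As a worked example (numbers assumed): with maxdoc = 100,000,000 and three
such fields on a 64-bit JVM, that is [some heap] + 3 x 100,000,000 x 12 bytes,
i.e. about 3.6Gb on top of the base heap.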


-Fuad





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Any thoughts regarding the subject? I hope FieldCache doesn't use more than
6 bytes per document-field instance... I am too lazy to research the Lucene
source code; I hope someone can provide an exact answer... Thanks







Re: Lucene FieldCache memory requirements

2009-11-02 Thread Michael McCandless
Which FieldCache API are you using?  getStrings?  or getStringIndex
(which is used, under the hood, if you sort by this field).

Mike





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I am not using Lucene API directly; I am using SOLR which uses Lucene
FieldCache for faceting on non-tokenized fields...
I think this cache will be lazily loaded, until user executes sorted (by
this field) SOLR query for all documents *:* - in this case it will be fully
populated...






Re: Lucene FieldCache memory requirements

2009-11-02 Thread Michael McCandless
OK I think someone who knows how Solr uses the fieldCache for this
type of field will have to pipe up.

For Lucene directly, simple strings would consume a pointer (4 or 8
bytes, depending on whether your JRE is 64-bit) per doc, and the string
index would consume an int (4 bytes) per doc.  (Each also consumes
negligible (for your case) memory to hold the actual string values.)

Note that for your use case, this is exceptionally wasteful.  If
Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
then it'd take much fewer bits to reference the values, since you have
only 10 unique string values.

Mike






RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Thank you very much Mike,

I found it:
org.apache.solr.request.SimpleFacets
...
// TODO: future logic could use filters instead of the fieldcache if
// the number of terms in the field is small enough.
counts = getFieldCacheCounts(searcher, base, field, offset, limit,
    mincount, missing, sort, prefix);
...
FieldCache.StringIndex si =
FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
final String[] terms = si.lookup;
final int[] termNum = si.order;
...


So 64-bit requires more memory :)


Mike, am I right here?
[(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
(64-bit JVM)
1.2Gb RAM for this...

Or maybe I am wrong:
 For Lucene directly, simple strings would consume a pointer (4 or 8
 bytes, depending on whether your JRE is 64-bit) per doc, and the string
 index would consume an int (4 bytes) per doc.

[8 bytes (64bit)] x [number of documents (100mlns)]? 
0.8Gb

A kind of Map between String and DocSet, saving 4 bytes... the Key is a String,
and the Value is an array of 64-bit pointers to Documents. Why 64-bit (on a
64-bit JVM)? I always thought it was the (int) documentId...

Am I right?


Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!

 Note that for your use case, this is exceptionally wasteful.  
This is probably a very common case... I think it should be confirmed by the
Lucene developers too... The FieldCache is warmed anyway, even when we don't use
SOLR...

 
-Fuad











Re: Lucene FieldCache memory requirements

2009-11-02 Thread Mark Miller
It also briefly requires more memory than just that - it allocates an
array the size of maxdoc+1 to hold the unique terms - and then sizes down.

Possibly we can use the getUniqueTermCount method in the flexible
indexing branch to get rid of that - which is why I was thinking it
might be a good idea to drop the unsupported exception in that method
for things like MultiReader and just do the work to get the right
number (currently there is a comment that the user should do that work
if necessary, making the call unreliable for this).


-- 
- Mark

http://www.lucidimagination.com





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
difference between maxdoc and maxdoc + 1 for such estimate... difference is
between 0.4Gb and 1.2Gb...


So, let's vote ;)

A. [maxdoc] x [8 bytes ~ pointer to String object]

B. [maxdoc] x [8 bytes ~ pointer to Document object]

C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] 
- same as [String1_Document_Count + ... + String10_Document_Count] x [4
bytes ~ DocumentID]

D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]


Please confirm that it is a Pointer to an Object and not a Lucene Document
ID... I hope it is the (int) Document ID...






Re: Lucene FieldCache memory requirements

2009-11-02 Thread Mark Miller
Fuad Efendi wrote:
 Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
 difference between maxdoc and maxdoc + 1 for such estimate... difference is
 between 0.4Gb and 1.2Gb...

   
I'm not sure I understand - but I didn't mean to imply the +1 on maxdoc
meant anything. The issue is that in the end, it only needs a String
array the size of String[UniqueTerms] - but because it can't easily
figure out that number, it first creates an array of String[MaxDoc+1] -
so with a ton of docs and a few uniques, you get a temp boost in the RAM
reqs until it sizes it down. A pointer for each doc.
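
With the numbers from this thread (assuming a 64-bit JVM): String[maxdoc+1] is
100,000,001 x 8 bytes ~ 0.8Gb of transient pointers, which then sizes down to
String[11] (10 countries plus the null entry at index 0) - a few hundred bytes.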

-- 
- Mark

http://www.lucidimagination.com





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I just did some tests on a completely new index (Slave): sorting by a
low-cardinality non-tokenized field (such as Country) takes milliseconds, but a
sort (ascending) on a tokenized field with heavy distribution took 30 seconds
(initially). The second sort (descending) took milliseconds. Generic query *:*;
FieldCache is not used for tokenized fields... so how is it sorted :)
Fortunately, no OOM.
-Fuad




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Mark,

I don't understand this: 
 so with a ton of docs and a few uniques, you get a temp boost in the RAM
 reqs until it sizes it down.

Sizes down??? Why is it called a Cache, then? And how does SOLR use it if it
is not a cache?


And this:
 A pointer for each doc.

Why can't we use the (int) DocumentID? To me that is natural; a 64-bit pointer
to an object in RAM is not natural (in the Lucene world)...


So, is it [maxdoc]x[4-bytes], or [maxdoc]x[8-bytes]?... 
-Fuad







RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I believe this is the correct estimate:

 C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]

   same as 
 [String1_Document_Count + ... + String10_Document_Count + ...] 
 x [4 bytes per DocumentID]


So, for 100 million docs we need 400Mb for each(!) non-tokenized field.
Although FieldCacheImpl is based on WeakHashMap (somewhere...), we can't rely
on it sizing down with SOLR's faceting features.


I think I finally found the answer...

  /** Expert: Stores term text values and document ordering data. */
  public static class StringIndex {
...   
/** All the term values, in natural order. */
public final String[] lookup;

/** For each document, an index into the lookup array. */
public final int[] order;
...
  }
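
Plugging in the numbers from this thread: order is an int[maxdoc], i.e. 4 bytes
x 100,000,000 ~ 0.4Gb per non-tokenized field, while lookup holds only the
unique terms (11 entries here: 10 countries plus the null slot) - negligible.
That matches estimate C above.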



Another API:
  /** Checks the internal cache for an appropriate entry, and if none
   * is found, reads the term values in <code>field</code> and returns an array
   * of size <code>reader.maxDoc()</code> containing the value each document
   * has in the given field.
   * @param reader  Used to get field values.
   * @param field   Which field contains the strings.
   * @return The values in the given field for each document.
   * @throws IOException  If any error occurs.
   */
  public String[] getStrings (IndexReader reader, String field)
  throws IOException;


Looks similar; the cache size is [maxdoc]; however, the values stored are
8-byte pointers on a 64-bit JVM.


  private Map<Class<?>,Cache> caches;
  private synchronized void init() {
    caches = new HashMap<Class<?>,Cache>(7);
    ...
    caches.put(String.class, new StringCache(this));
    caches.put(StringIndex.class, new StringIndexCache(this));
    ...
  }


StringCache and StringIndexCache use WeakHashMap internally... but the objects
won't ever be garbage collected in a faceted production system...

SOLR's SimpleFacets doesn't use the getStrings API, so the hope is that memory
requirements are minimized.


However, Lucene may use it internally for some queries (or, for instance, to
get access to a non-tokenized cached field without reading the index)... to be
safe, use this in your basic memory estimates:


[512Mb ~ 1Gb] + [non_tokenized_fields_count] x [maxdoc] x [8 bytes]
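
For example (numbers assumed again): with two non-tokenized fields and
maxdoc = 100,000,000, that gives 1Gb + 2 x 100,000,000 x 8 bytes ~ 2.6Gb.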


-Fuad




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Hi Mark,

Yes, I understand it now; however, how will StringIndexCache size down in a
production system faceting by Country on a homepage? This is SOLR
specific...


Lucene specific: Lucene doesn't read from disk if it can retrieve a field
value for a specific document ID from the cache. How will it size down in a
purely Lucene-based, heavily loaded production system? Especially if this cache
is used for query optimizations.



 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: November-02-09 8:53 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
  static final class StringIndexCache extends Cache {
 StringIndexCache(FieldCache wrapper) {
   super(wrapper);
 }
 
 @Override
 protected Object createValue(IndexReader reader, Entry entryKey)
 throws IOException {
   String field = StringHelper.intern(entryKey.field);
   final int[] retArray = new int[reader.maxDoc()];
   String[] mterms = new String[reader.maxDoc()+1];
   TermDocs termDocs = reader.termDocs();
   TermEnum termEnum = reader.terms (new Term (field));
   int t = 0;  // current term number
 
   // an entry for documents that have no terms in this field
   // should a document with no terms be at top or bottom?
    // this puts them at the top - if it is changed, FieldDocSortedHitQueue
    // needs to change as well.
   mterms[t++] = null;
 
   try {
 do {
   Term term = termEnum.term();
   if (term==null || term.field() != field) break;
 
   // store term text
   // we expect that there is at most one term per document
    if (t >= mterms.length) throw new RuntimeException ("there are more terms than " +
        "documents in field \"" + field + "\", but it's impossible to sort on " +
        "tokenized fields");
   mterms[t] = term.text();
 
   termDocs.seek (termEnum);
   while (termDocs.next()) {
 retArray[termDocs.doc()] = t;
   }
 
   t++;
 } while (termEnum.next());
   } finally {
 termDocs.close();
 termEnum.close();
   }
 
   if (t == 0) {
 // if there are no terms, make the term array
 // have a single null entry
 mterms = new String[1];
    } else if (t < mterms.length) {
 // if there are less terms than documents,
 // trim off the dead array space
 String[] terms = new String[t];
 System.arraycopy (mterms, 0, terms, 0, t);
 mterms = terms;
   }
 
   StringIndex value = new StringIndex (retArray, mterms);
   return value;
 }
   };
 
 The formula for a String Index fieldcache is essentially the String
 array of unique terms (which does indeed size down at the bottom) and
 the int array indexing into the String array.
 
 
 Fuad Efendi wrote:
 To be correct: I analyzed FieldCache a while ago and I believe it never
 sizes down...

 /**
  * Expert: The default cache implementation, storing all values in memory.
  * A WeakHashMap is used for storage.
  *
  * <p>Created: May 19, 2004 4:40:36 PM
  *
  * @since   lucene 1.4
  */
 
 
  Will it size down? Only if we are not faceting (as in SOLR v.1.3)...
 
  And I am still unsure, Document ID vs. Object Pointer.
 
 
 
 
 
  I don't understand this:

   so with a ton of docs and a few uniques, you get a temp boost in the RAM
   reqs until it sizes it down.

  Sizes down??? Why is it called a Cache, then? And how does SOLR use it if it
  is not a cache?
 
 
 
 
 
 
 
 --
 - Mark
 
 http://www.lucidimagination.com
 
 





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Even in the simplistic scenario where the cache is garbage collected, we still
_need_to_be_able_ to allocate enough RAM for the FieldCache on demand... a
linear dependency on document count...


 




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
FieldCache internally uses WeakHashMap... nothing wrong with that, but... no
amount of Garbage Collection tuning will help if the allocated RAM is not
enough to replace Weak** with Strong**, especially for SOLR faceting... 10%-15%
of CPU taken by GC has been reported...
-Fuad





Lucene FieldCache memory requirements

2009-10-30 Thread Fuad Efendi
Hi,


Can anyone confirm Lucene FieldCache memory requirements? I have 100
million docs with a non-tokenized field country (10 different countries); I
expect it requires an array of (int, long) pairs, with array size 100,000,000,
without any impact from the country field's length;

it requires 600,000,000 bytes: the int is a pointer to the document (Lucene
document ID), and the long is a pointer to the String value...

Am I right, is it 600Mb just for this country (indexed, non-tokenized,
non-boolean) field and 100 million docs? I need to calculate the exact minimum
RAM requirements...

I believe it shouldn't depend on the cardinality (distribution) of the field...

Thanks,
Fuad