On 8/8/06, bo_b <[EMAIL PROTECTED]> wrote:
I have tried indexing a vbulletin message board, containing roughly 7
million posts.
My schema is as follows:
<field name="postid" type="int" indexed="true" stored="true" />
<field name="threadid" type="int" indexed="false" stored="true" />
<field name="username" type="string" indexed="false" stored="true" />
<field name="title" type="string" indexed="false" stored="true" />
<field name="teaser" type="string" indexed="false" stored="true" />
<field name="date" type="date" indexed="true" stored="true" omitNorms="true"/>
<field name="blob" type="text" indexed="true" stored="false" multiValued="true" omitNorms="true"/>
<uniqueKey>postid</uniqueKey>
<copyField source="username" dest="blob"/>
<copyField source="title" dest="blob"/>
I am trying to figure out if there is anything I can do to lower the disk
usage and/or increase sorting speed before we go live with the search. So a
few questions came to mind:
1) I was planning to sort on the date field (i.e. add "; date desc").
But I was wondering if it would be more efficient to sort on postid
instead (since a higher postid in vbulletin = a newer post).
No, they will be roughly the same speed.
What you *could* try to do is always *index* documents in postid/date
order... then sorting would not require any FieldCache entry. That
would require a minor change to Solr (allowing sorting on the Lucene
internal docid, which matches the order in which documents are added
to an index).
2) If we sort on postid instead, would we need to use the int or the sint
type? I assume sint would be faster (?) but perhaps use more storage?
If you need range queries, use sint: SortableIntField values are
indexed so that they sort correctly as strings, which range queries need.
For sorting, both int and sint fields work... the difference is in how
the FieldCache entry is built.
For IntField, an Integer.parseInt(str) needs to be done for each distinct str.
SortableIntField is sorted like strings... the ordinal (order in the
index) is recorded for each distinct value.
So sint will build the FieldCache faster, but the string values will
cause the entry to be larger. After the FieldCache entry is built,
both int and sint should be comparable in speed.
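A minimal sketch of what switching postid to sint might look like, if range queries on it turn out to be needed. It assumes the schema already defines a type named "sint" backed by solr.SortableIntField (the Solr example schema does, but check your own); the omitNorms attribute is optional and only reflects that postid is not a full-text field.

```xml
<!-- Assumption: a "sint" type mapped to solr.SortableIntField is defined elsewhere in this schema. -->
<!-- postid is not a full-text field, so its norms can be omitted. -->
<field name="postid" type="sint" indexed="true" stored="true" omitNorms="true"/>
```

Since int and sint write different term values into the index, existing documents would have to be re-indexed after a change like this.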
3) About omitNorms=true, I must admit I don't exactly understand what it does
:) But I read that it would save 1 byte per document.
One byte per document for that indexed field, regardless of whether the
field exists for all documents or not. You lose length normalization
(an increase in score for matches on shorter fields... not needed if
it's not a full-text field anyway), and you lose index-time boosts
(which it doesn't look like you are using).
Since "blob" looks like the body of the post, I think you probably
*want* norms to get the length normalization factors. Probably all
other indexed fields can have omitNorms="true" (including postid).
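Applied to the schema above, that suggestion might look roughly like the following sketch. Only the three indexed fields are shown; the unindexed fields have no norms to omit. Writing omitNorms="false" explicitly on blob is an illustrative choice here, to make the intent visible.

```xml
<!-- Not full-text fields: length normalization is not useful, so drop the norms. -->
<field name="postid" type="int" indexed="true" stored="true" omitNorms="true"/>
<field name="date" type="date" indexed="true" stored="true" omitNorms="true"/>
<!-- Full-text field: keep norms so that matches in shorter posts score higher. -->
<field name="blob" type="text" indexed="true" stored="false" multiValued="true" omitNorms="false"/>
```

At one byte per document, keeping norms on blob would cost on the order of 7 MB for roughly 7 million posts.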
Are there any other
fields I need to add it to in my schema? As far as I understand,
omitNorms=true only makes a difference for indexed=true fields, and doesn't
do anything for int fields?
omitNorms=true will omit norms for *any* indexed field, including int
fields. Deep inside Lucene, all indexed fields are string fields.
-Yonik