Changing schema and reindexing documents

2010-10-06 Thread M.Rizwan
Hi,

I have lots of documents in my solr index.
Now I have a requirement to change its schema and add a new field.

What should I do, so that all the documents keep working after schema
change?

Thanks

Riz


Re: Changing schema and reindexing documents

2010-10-06 Thread Gora Mohanty
On Wed, Oct 6, 2010 at 11:59 AM, M.Rizwan griz...@gmail.com wrote:
 Hi,

 I have lots of documents in my solr index.
 Now I have a requirement to change its schema and add a new field.

 What should I do, so that all the documents keep working after schema
 change?
[...]

You will need to reindex if the schema is changed.

Regards,
Gora


Re: Changing schema and reindexing documents

2010-10-06 Thread Lance Norskog
If you add a field to the schema file and restart Solr, the existing
documents won't have that field. New documents that you index will. If
this is ok, you are safe.

In general, don't change the schema without indexing. You can trip
over the weirdest problems.
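
If the new field only needs to apply going forward, one option (a sketch; the field name and type here are made up) is to declare it with a default value in schema.xml, so documents indexed after the change always carry a value even when the feed omits it. Note the default does not backfill existing documents; as noted above, only a reindex does that:

```xml
<!-- hypothetical new field: new documents get "unknown" when no value
     is supplied; documents indexed before the schema change still lack
     the field entirely until they are reindexed -->
<field name="source_system" type="string" indexed="true" stored="true" default="unknown"/>
```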

On Wed, Oct 6, 2010 at 12:31 AM, Gora Mohanty g...@mimirtech.com wrote:
 On Wed, Oct 6, 2010 at 11:59 AM, M.Rizwan griz...@gmail.com wrote:
 Hi,

 I have lots of documents in my solr index.
 Now I have a requirement to change its schema and add a new field.

 What should I do, so that all the documents keep working after schema
 change?
 [...]

 You will need to reindex if the schema is changed.

 Regards,
 Gora




-- 
Lance Norskog
goks...@gmail.com


Trouble with exception Document [Null] missing required field DocID

2010-10-06 Thread Ahson Iqbal
Hi All

I'm new to the Solr extract request handler. I want to index PDF documents, but when 
I submit a document to Solr using curl I get the following exception:


Document [Null] missing required field DocID

my curl command is like:

curl "http://localhost:8983/solr1/update/extract?literal.DocID=123&fmap.content=Contents&commit=true" -F "myfile=@d:/solr/apache-solr-1.4.0/docs/filename1.pdf"

and here is my schema:

<fields>
  <field name="DocID" type="string" indexed="true" stored="true"/>
  <field name="Contents" type="text" indexed="true" stored="true"/>
  <dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>
</fields>
<uniqueKey>DocID</uniqueKey>


Please help me if I'm missing something.

Regards
Ahsan
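
One thing worth double-checking (an assumption on my part, since the archive garbled the command) is shell quoting: if the URL is not wrapped in quotes, the shell splits the command line at each &, so parameters after the first never reach Solr. Building the URL in code makes the parameters explicit; this is a hypothetical stdlib-only helper, not part of any Solr API:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ExtractUrl {
    // Builds the extract-handler URL; the literal value is URL-encoded so
    // characters like spaces or '&' in IDs cannot corrupt the query string.
    static String build(String base, String docId) {
        try {
            return base + "/update/extract"
                    + "?literal.DocID=" + URLEncoder.encode(docId, "UTF-8")
                    + "&fmap.content=Contents"
                    + "&commit=true";
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);  // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr1", "123"));
    }
}
```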


  

Help needed on indexing Zope CMS content

2010-10-06 Thread Marian Steinbach
Hi!

We are planning to periodically index several MySQL database tables
plus a Zope CMS document tree in Solr.

Indexing the Zope DB seems to be tricky though.

Has anyone here done this and could provide a URL or sample code to a
solution? Something running as a python script would be great, but
different approaches are welcome, too.

I know this is only slightly related to Solr, but since Solr has
spread so widely, I wanted to give it a try here.

Thanks in advance!

Marian


Re: Help needed on indexing Zope CMS content

2010-10-06 Thread Gora Mohanty
On Wed, Oct 6, 2010 at 1:58 PM, Marian Steinbach mar...@sendung.de wrote:
 Hi!

 We are planning to periodically index several MySQL database tables
 plus a Zope CMS document tree in Solr.

 Indexing the Zope DB seems to be tricky though.
[...]

Been a while since I touched Zope, but there seems to be something called
collective.solr available:
* http://pypi.python.org/pypi/collective.solr
* http://www.contentmanagementsoftware.info/plone/collective.solr

Does this not meet your needs?

Regards,
Gora


Experience running Solr on ISCSI

2010-10-06 Thread Thijs

Hi.

Our hardware department is planning on moving some stuff to new machines 
(on our request)
They are suggesting using virtualization (some CISCO solution) on those 
machines and having the 'disk' connected via ISCSI.


Does anybody have experience running a Solr index on an iSCSI drive?
We have already tried NFS, but it slows the indexing process down too 
much (about 12 times slower), so NFS is a no-go. I could have known 
that, since avoiding NFS is recommended in a lot of places. But I 
can't find information about iSCSI.


Does anybody have experience running a Solr index in a virtualized 
environment? Is it resilient enough to keep working when the 
virtual machine is transferred to a different hardware node?


thanks


Re: Solr UIMA integration

2010-10-06 Thread maheshkumar

Hi Tommaso,

I will try the service call outside Solr/UIMA.

And the text I am using is:

FileName: Entity.xml

<add>
<doc>
  <field name="reference">Entity.xml</field>
  <field name="content">Senator Dick Durbin (D-IL), Chicago, March 3, 2007.</field>
  <field name="title">Entity Extraction</field>
</doc>
</add>

and using curl to index it: curl http://localhost:8080/solr/update -F solr.body=@entity.xml

Thanks
Mahesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-UIMA-integration-tp1528253p1642093.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr UIMA integration

2010-10-06 Thread Tommaso Teofili
Hi Mahesh,
the issue here is that you're not sending a field name=text.../field
to Solr from which UIMAUpdateRequestProcessor extracts text to analyze :)
Infact by default UIMAUpdateRequestProcessor extracts text to analyze from
that field and send that value to a UIMA pipeline.
Obviously you could choose to customize this behavior making
UIMAUpdateRequestProcessor read values from each field that is being indexed
in the document or another field.

However this made me realize that in such situations that field value is a
String null and not a null object, as I expected; so line 57 in
UIMAUpdateRequestProcessor should be changed as following to prevent such
errors:
...
if (textFieldValue != null  !.equals(textFieldValue) 
!null.equals(textFieldValue)) {
...
Hope this helps,
Tommaso
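
The fixed guard can be captured as a small standalone helper; this is a hypothetical sketch mirroring the condition above, not the actual UIMAUpdateRequestProcessor source:

```java
public class TextGuard {
    // True only for values that carry real text: rejects null references,
    // empty strings, and the literal string "null" that can arrive when
    // the field was absent from the document.
    static boolean hasAnalyzableText(String textFieldValue) {
        return textFieldValue != null
                && !"".equals(textFieldValue)
                && !"null".equals(textFieldValue);
    }

    public static void main(String[] args) {
        System.out.println(hasAnalyzableText("Senator Dick Durbin"));  // true
        System.out.println(hasAnalyzableText("null"));                 // false
    }
}
```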

2010/10/6 maheshkumar maheshkuma...@gmail.com


 [...]



Re: Help needed on indexing Zope CMS content

2010-10-06 Thread calvinhp


Marian Steinbach-3 wrote:
 
 We are planning to periodically index several MySQL database tables
 plus a Zope CMS document tree in Solr.
 
 Indexing the Zope DB seems to be tricky though.
 
 Has anyone here done this and could provide a URL or sample code to a
 solution? Something running as a python script would be great, but
 different approaches are welcome, too.
 

We implemented SolrIndex for a customer and released it as a ZCatalog
plugin.  You can find it here:

http://pypi.python.org/pypi/alm.solrindex/

It has no dependencies on Plone and is pretty much drop-in: just add it
as an index in your site's catalog.

Cheers,
Calvin

-
-- 
Six Feet Up, Inc. | Sponsor of Plone Conference 2010 (Oct. 25th-31st)
Direct Line: +1 (317) 861-5948 x602
Email: cal...@sixfeetup.com
Try Plone 4 Today at: http://plone4demo.com

How am I doing? Please contact my manager Gabrielle Hendryx-Parker
at gabrie...@sixfeetup.com with any feedback.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-needed-on-indexing-Zope-CMS-content-tp1641160p1642806.html
Sent from the Solr - User mailing list archive at Nabble.com.


phrase query with autosuggest (SOLR-1316)

2010-10-06 Thread mike anderson
It seemed like SOLR-1316 was a little too long to continue the conversation.

Is there support for quotes indicating a phrase query? For example, my
autosuggest query for "mike sha" ought to return "mike shaffer", "mike
sharp", etc. Instead I get suggestions for "mike" and for "sha", resulting
in a collated result like "mike r meyer shaw".

Cheers,
Mike


Re: having problem about Solr Date Field.

2010-10-06 Thread Gora Mohanty
On Wed, Oct 6, 2010 at 9:17 PM, Kouta Osabe kota0919was...@gmail.com wrote:
 Hi, Gora

 Thanks for your advice.

 and then I try to write these codes following your advice.

 Case1
 pub_date column(MySQL) is 2010-09-27 00:00:00.

 I wrote like below.

 SolrJDto info = new SolrJDto();
 TimeZone tz2 = TimeZone.getTimeZone(UTC+9);
 Calendar cal = Calendar.getInstance(tz2);
 // publishDate represent publish_date column on Solr Schema and the
 type is pdate.
 info.publishDate = rs.getDate(publish_date,cal);

 then I got 2010-09-27T00:00:00Z on Solr Admin.
 This result is what I expected.

 Case2
 reg_date column(MySQL) is 2010-09-27 11:22:33.

 I wrote like below.
 TimeZone tz2 = TimeZone.getTimeZone(UTC+9);
 Calendar cal = Calendar.getInstance(tz2);
 info.publishDate = rs.getDate(reg_date,cal);

 then, I got 2010-09-27T02:22:33Z on Solr admin.
 this result is not what i expected.
[...]

It seems like MySQL is doing the UTC conversion for one column
and not for the other. I can think of two possible reasons for
this:
* If they are from different mysql servers, it is possible that the
  timezone is set differently for the two servers. Please see
  http://dev.mysql.com/doc/refman/5.1/en/time-zone-support.html
  for how to set the timezone for mysql. (It is also possible for
  the client connection to set a connection-specific timezone,
  but I do not think that is what is happening here.)
* The type of the columns is different, e.g., one could be a
   DATETIME, and the other a TIMESTAMP. The mysql timezone
   link above also explains how these are handled.

Without going through the above, could you not just set the timezone
for reg_date to UTC to get the result that you expect?

Regards,
Gora.
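
To make the expected conversion concrete, here is a small stdlib-only sketch: a wall-clock time read in a UTC+9 zone, formatted the way Solr renders dates, comes out shifted back nine hours. (One caveat worth knowing: Java's TimeZone accepts custom offset IDs only in the "GMT+9" form; an ID like "UTC+9", as used earlier in the thread, is not understood and silently falls back to GMT.)

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class SolrDate {
    // Formats a Calendar instant in Solr's canonical UTC form,
    // e.g. 2010-09-27T02:22:33Z.
    static String toSolrDate(Calendar cal) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(cal.getTime());
    }

    public static void main(String[] args) {
        // 2010-09-27 11:22:33 wall-clock time in a UTC+9 zone
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT+9"));
        cal.set(2010, Calendar.SEPTEMBER, 27, 11, 22, 33);
        System.out.println(toSolrDate(cal));  // 2010-09-27T02:22:33Z
    }
}
```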


RE: phrase query with autosuggest (SOLR-1316)

2010-10-06 Thread Robert Petersen
My simple but effective solution to that problem was to replace the
white spaces in the items you index for autosuggest with some special
character, then your wildcarding will work with the whole phrase as you
desire.

Index: mike_shaffer
Query: mike_sha*  
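
A client-side sketch of that transformation (hypothetical helper; in practice the join happens at index time and the prefix match is a wildcard query against the index):

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseSuggest {
    // Joins the words of a phrase with '_' so the whole phrase becomes a
    // single token; a wildcard/prefix query on the joined form then
    // matches the phrase as a unit.
    static String joinForIndex(String phrase) {
        return phrase.trim().replaceAll("\\s+", "_");
    }

    // Illustration of what the wildcard query does against the index.
    static List<String> suggest(List<String> indexedTerms, String userInput) {
        String prefix = joinForIndex(userInput);
        List<String> hits = new ArrayList<String>();
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                hits.add(term.replace('_', ' '));  // display form
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> indexed = new ArrayList<String>();
        indexed.add(joinForIndex("mike shaffer"));
        indexed.add(joinForIndex("mike sharp"));
        indexed.add(joinForIndex("mike r meyer"));
        System.out.println(suggest(indexed, "mike sha"));  // [mike shaffer, mike sharp]
    }
}
```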

-Original Message-
From: mike anderson [mailto:saidthero...@gmail.com] 
Sent: Wednesday, October 06, 2010 7:33 AM
To: solr-user@lucene.apache.org
Subject: phrase query with autosuggest (SOLR-1316)

It seemed like SOLR-1316 was a little too long to continue the
conversation.

Is there support for quotes indicating a phrase query? For example, my
autosuggest query for "mike sha" ought to return "mike shaffer", "mike
sharp", etc. Instead I get suggestions for "mike" and for "sha",
resulting in a collated result like "mike r meyer shaw".

Cheers,
Mike


Re: Strategy for re-indexing

2010-10-06 Thread Otis Gospodnetic
Hi,

I don't *think* there is any DIH request queuing going on - each import runs as 
soon as its DIH request arrives.  You need to queue the requests yourself if your 
app/data is such that running multiple imports/deltas in parallel causes problems 
with either hardware or data.
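
Queuing the requests yourself can be as simple as funnelling them through a single-threaded executor on the client side; a sketch of that assumed design (nothing DIH provides):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReindexQueue {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    final List<String> completed = new ArrayList<String>();

    // Each CRUD event enqueues a delta-import request; the single worker
    // thread guarantees imports run one at a time, in arrival order, so a
    // request arriving mid-import waits instead of being rejected or lost.
    void requestImport(final String subsystem) {
        worker.execute(new Runnable() {
            public void run() {
                // here you would hit /dataimport?command=delta-import&...
                synchronized (completed) {
                    completed.add(subsystem);
                }
            }
        });
    }

    void shutdownAndWait() {
        worker.shutdown();
        try {
            worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ReindexQueue q = new ReindexQueue();
        q.requestImport("cms");
        q.requestImport("forum");
        q.requestImport("blog");
        q.shutdownAndWait();
        System.out.println(q.completed);  // [cms, forum, blog]
    }
}
```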

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Allistair Crossley a...@roxxor.co.uk
 To: solr-user@lucene.apache.org
 Sent: Wed, October 6, 2010 10:49:49 AM
 Subject: Strategy for re-indexing
 
 Hi,
 
 I was interested in gaining some insight into how you guys schedule  updates 
for your Solr index (I have a single index).
 
 Right now during development I have added deltaQuery specifications to data 
import entities to control the number of rows being queried on re-indexes.
 
 However in terms  of *when* to reindex we have a lot going on in the system - 
there are 4  sub-systems: custom application data, a CMS, a forum and a blog. 
It's all being  indexed and at any given time there will be users and 
administrators all  updating various parts of the sub-systems. 

 
 For the time being during  development I have been issuing reindexes to the 
data import handler on each  CRUD on any given sub-system. This has been 
working 
fine to be honest. It does  need to be as immediate as possible - a scheduled 
update won't work for us. Even  every 10 minutes is probably not fast enough. 

 
 So I wonder what others  do. Is anyone else in a similar situation?
 
 And what happens if 4 users  generate 4 different requests to the data import 
handler to update for different  types of data?  The DIH will be running 
already 
let's say for request 1,  then request 2 comes in - is it rejected? Or is it 
queued?
 
 I need it to  be queued and serviced because the request 1 re-index may have 
already run its  queries but missed the data added by the user for request 2. 
Same then goes for  the requests 3 and 4. 

 
 Thanks for your consideration,
 
 Allistair


Re: Experience running Solr on ISCSI

2010-10-06 Thread Otis Gospodnetic
Thijs,

The only thing I could find is this: 
http://search-lucene.com/m/VDjIlUc2Ci2/iscsisubj=Lucene+on+NFS+iSCSI

I don't have experience with transferring Solr/Lucene/index to different 
hardware nodes without stopping and persisting things before transfer.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Thijs vonk.th...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wed, October 6, 2010 7:23:33 AM
 Subject: Experience running Solr on ISCSI
 
 [...]


StatsComponent and multi-valued fields

2010-10-06 Thread dbashford

Running 1.4.1.

I'm able to execute stats queries against multi-valued fields, but when
given a facet, the statscomponent only considers documents that have a facet
value as the last value in the field.

As an example, imagine you are running stats on fooCount, and you want to
facet on bar, which is multi-valued.  Two documents...

1)
fooCount = 100
bar = A, B, C

2) 
fooCount = 5
bar = C, B, A

stats.field=fooCountstats=truestats.facet=bar

I would expect to see stats for A, B, and C all with sums of 105.  But what
I'm seeing is stats for C and A with sums of 100 and 5 respectively.

Is this expected behavior?  Something I'm possibly doing wrong?  Is this
just not advisable?

Thanks!

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/StatsComponent-and-multi-valued-fields-tp1644918p1644918.html
Sent from the Solr - User mailing list archive at Nabble.com.


script transformer vs. custom transformer

2010-10-06 Thread Tim Heckman
This might be a dumb question, but should I expect that a custom
transformer written in java will perform better than a javascript
script transformer?

Or does the javascript get compiled to bytecode such that there really
should not be much difference between the two?

Of course, the bigger performance issue I'm dealing with is getting
the data out of the SQL database as quickly as possible, but I'm
curious about the performance implications of a script transform vs.
the same transform done in java.
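
For reference, a custom DIH transformer in Java can be a plain class exposing a transformRow(Map<String,Object>) method, which DIH picks up by convention; a minimal sketch, with a hypothetical "title" column:

```java
import java.util.Map;

public class TrimTransformer {
    // DIH calls transformRow once per row; this sketch trims whitespace
    // from a hypothetical "title" column in place and returns the row.
    public Object transformRow(Map<String, Object> row) {
        Object title = row.get("title");
        if (title instanceof String) {
            row.put("title", ((String) title).trim());
        }
        return row;
    }
}
```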


thanks,
Tim


Re: phrase query with autosuggest (SOLR-1316)

2010-10-06 Thread Jonathan Rochkind
If you use Chantal's suggestion from an earlier thread, involving facets 
and tokenized fields, but not the "tokens" handling -- I think it will 
work. (But that solution requires only one auto-suggest value per 
document.)

There are a bunch of ways people have figured out to do auto-suggest 
without putting it in an entirely separate Solr core. They all have 
their strengths and weaknesses, including a weakness of being 
kind of confusing to implement sometimes. I don't think anyone's come up 
with a general-purpose "works for everything, isn't confusing" solution yet.


Robert Petersen wrote:

[...]


Re: multi level faceting

2010-10-06 Thread Peter Karich
Hi,

there is a solution without the patch. It should be explained here:
http://www.lucidimagination.com/blog/2010/08/11/stumped-with-solr-chris-hostetter-of-lucene-pmc-at-lucene-revolution/

If not, I will do so on 9.10.2010 ;-)

Regards,
Peter.

 I've a similar problem with a project I'm working on now.  I am holding out 
 for either SOLR-64 or SOLR-792 being a bit more mature before I need the 
 functionality but if not I was thinking I could do multi-level faceting by 
 indexing the data as a String like this:

 id: 1
 SHOE: Sneakers|Men|Size 7

 id: 2
 SHOE: Sneakers|Men|Size 8

 id: 3
 SHOE: Sneakers|Women|Size 6

 etc

 and then in the UI, show just up to the first delimiter (you'll have to sum 
 the counts in the UI too).  Once the user clicks on Sneakers, you would 
 then add fq=SHOE:Sneakers|* to the query and then show the values up to the 
 2nd delimiter, etc.  

 Alternatively, if you didn't want to use a wildcard query, you could index 
 each level separately like this:

 id: 1
 SHOE1: Sneakers
 SHOE2: Sneakers|Men
 SHOE3: Sneakers|Men|Size 7

 Then after the user clicks on the 1st level, fq on SHOE1 and show SHOE2, etc. 
  This wouldn't work so well if you had more than a few levels in your 
 hierarchy.

 I haven't actually tried this and like I said I'm hoping I could just use a 
 patch (really I hope 3.x gets released GA with the functionality but I won't 
 hold my breath...)  But I do think this would work in a pinch if need be.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Nguyen, Vincent (CDC/OD/OADS) (CTR) [mailto:v...@cdc.gov] 
 Sent: Tuesday, October 05, 2010 8:22 AM
 To: solr-user@lucene.apache.org
 Subject: RE: multi level faceting

 Just to clarify, the effect I was look for was this.  

 Sneakers
Men (22)
Women (43)

 AFTER a user filters by one of those, they would be presented with a NEW
 facet field such as 

 Sneakers
Men
   Size 7 (10)
   Size 8 (11)
   Size 9 (23)

 Vincent Vu Nguyen

 -Original Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
 Sent: Monday, October 04, 2010 11:44 PM
 To: solr-user@lucene.apache.org
 Subject: Re: multi level faceting

 Hi,

 I *think* this is not what Vincent was after.  If I read the suggestions

 correctly, you are saying to use fq=xfq=y -- multiple fqs.
 But I think Vincent is wondering how to end up with something that will
 let him 
 create a UI with multi-level facets (with a single request), e.g.

 Footwear (100)
   Sneakers (20)
 Men (1)
 Women (19)

   Dancing shoes (10)
 Men (0)
 Women (10)
 ...

 If this is what Vincent was after, I'd love to hear suggestions myself.
 :)

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
   
 From: Jason Brown jason.br...@sjp.co.uk
 To: solr-user@lucene.apache.org
 Sent: Mon, October 4, 2010 11:34:56 AM
 Subject: RE: multi level faceting

 Yes, by adding fq back into the main query you will get results
 
 increasingly  
   
 filtered each time.

 You may run into an issue if you are displaying facet  counts, as the
 
 facet 
   
 part of the query will also obey the increasingly filtered  fq, and so
 
 not 
   
 display counts for other categories anymore from the chosen facet
 
 (depends if 
   
 you need to display counts from a facet once the first value from  the
 
 facet has 
   
 been chosen if you get my drift). Local params are a way to deal  with
 
 this by 
   
 not subjecting the facet count to the same fq restriction (but
 
 allowing the 
   
 search results to obey it).



 -Original  Message-
 From: Nguyen, Vincent (CDC/OD/OADS) (CTR) [mailto:v...@cdc.gov]
 Sent: Mon 04/10/2010  16:34
 To: solr-user@lucene.apache.org
 Subject:  RE: multi level faceting

 Ok.  Thanks for the quick  response.

 Vincent Vu Nguyen
 Division of Science Quality and  Translation
 Office of the Associate Director for Science
 Centers for  Disease Control and Prevention (CDC)
 404-498-6154
 Century Bldg  2400
 Atlanta, GA 30329 


 -Original Message-
 From:  Allistair Crossley [mailto:a...@roxxor.co.uk] 
 Sent: Monday, October  04, 2010 9:40 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: multi level faceting

 I think that is just sending 2 fq facet queries  through. In Solr PHP
 
 I
   
 would do that with, e.g.

 $params['facet'] =  true;
 $params['facet.fields'] = array('Size');
 $params['fq'] =  array('sex' = array('Men', 'Women'));

 but yes i think you'd have to  send through what the current facet
 
 query
   
 is and add it to your next  drill-down

 On Oct 4, 2010, at 9:36 AM, Nguyen, Vincent (CDC/OD/OADS)  (CTR)
 
 wrote:
   
 
 Hi,



 I was wondering  if there's a way to display facet options based on
 previous facet  values.  For example, I've seen many shopping sites
   
 where
 
 a user  can 
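
The level-per-field encoding described in the quoted thread (SHOE1, SHOE2, SHOE3) can be sketched as a small helper that expands a pipe-delimited path into one value per level; the field names follow the example above and are otherwise arbitrary:

```java
public class FacetLevels {
    // "Sneakers|Men|Size 7" -> ["Sneakers", "Sneakers|Men", "Sneakers|Men|Size 7"]
    // i.e. one value per hierarchy level, indexed into SHOE1, SHOE2, SHOE3;
    // after the user picks level N, fq on SHOEN and facet on SHOEN+1.
    static String[] levelValues(String path) {
        String[] parts = path.split("\\|");
        String[] levels = new String[parts.length];
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) prefix.append('|');
            prefix.append(parts[i]);
            levels[i] = prefix.toString();
        }
        return levels;
    }

    public static void main(String[] args) {
        for (String v : levelValues("Sneakers|Men|Size 7")) {
            System.out.println(v);
        }
    }
}
```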

RE: Experience with large merge factors

2010-10-06 Thread Burton-West, Tom
Hi Mike,

.Do you use multiple threads for indexing?  Large RAM buffer size is
also good, but I think perf peaks out maybe around 512 MB (at least
based on past tests)?

We are using Solr, I'm not sure if Solr uses multiple threads for indexing.  We 
have 30 producers each sending documents to 1 of 12 Solr shards on a round 
robin basis.  So each shard will get multiple requests.

Believe it or not, merging is typically compute bound.  It's costly to
decode and re-encode all the vInts.

Sounds like we need to do some monitoring during merging to see what the cpu 
use is and also the io wait during large merges.

Larger merge factor is good because it means the postings are copied 
fewer times, but, it's bad because you could risk running out of
descriptors, and, if the OS doesn't have enough RAM, you'll start to
thin out the readahead that the OS can do (which makes the merge less
efficient since the disk heads are seeking more).

Is there a way to estimate the amount of RAM for the readahead?   Once we start 
the re-indexing we will be running 12 shards on a 16 processor box with 144 GB 
of memory.

Do you do any deleting?
Deletes would happen as a byproduct of updating a record.  This shouldn't 
happen too frequently during re-indexing, but we update records when a document 
gets re-scanned and re-OCR'd.  This would probably amount to a few thousand.


Do you use stored fields and/or term vectors?  If so, try to make
your docs uniform if possible, ie add the same fields in the same
order.  This enables lucene to use bulk byte copy merging under the hood.

We use 4 or 5 stored fields.  They are very small compared to our huge OCR 
field.  Since we construct our Solr documents programmatically, I'm fairly 
certain that they are always in the same order.  I'll have to look at the code 
when I get back to make sure.

We aren't using term vectors now, but we plan to add them as well as a number 
of fields based on MARC (cataloging) metadata in the future.

Tom

Re: Experience with large merge factors

2010-10-06 Thread Otis Gospodnetic
Hi Tom,


 .Do you use multiple threads for indexing?  Large RAM  buffer size is
 also good, but I think perf peaks out maybe around 512 MB (at least
 based on past tests)?
 
 We are using Solr, I'm not  sure if Solr uses multiple threads for indexing.  
We have 30 producers  each sending documents to 1 of 12 Solr shards on a 
round 
robin basis.  So  each shard will get multiple requests.

Solr itself doesn't use multiple threads for indexing, but you can easily do 
that on the client side.  SolrJ's StreamingUpdateSolrServer is the simplest 
thing to use for this.
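
A stdlib-only sketch of that client-side pattern: a producer thread streams documents into a queue and the caller groups them into batches, which is roughly what a streaming Solr client does (with real HTTP requests) under the hood. Names and the batch size here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchedIndexer {
    private static final String POISON = "__END__";

    // A producer thread feeds a queue; the caller drains it and groups
    // documents into batches of batchSize before each would-be add request.
    static List<List<String>> indexInBatches(final List<String> docs, int batchSize) {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (String d : docs) queue.put(d);
                    queue.put(POISON);  // signal end of stream
                } catch (InterruptedException ignored) { }
            }
        });
        producer.start();

        List<List<String>> batches = new ArrayList<List<String>>();
        List<String> batch = new ArrayList<String>();
        try {
            while (true) {
                String doc = queue.take();
                if (POISON.equals(doc)) break;
                batch.add(doc);
                if (batch.size() == batchSize) {  // flush a full batch
                    batches.add(batch);
                    batch = new ArrayList<String>();
                }
            }
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        if (!batch.isEmpty()) batches.add(batch);  // final partial batch
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<String>();
        for (int i = 0; i < 7; i++) docs.add("doc" + i);
        System.out.println(indexInBatches(docs, 3).size());  // 3 (batches of 3+3+1)
    }
}
```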

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Burton-West, Tom tburt...@umich.edu
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Wed, October 6, 2010 9:57:12 PM
 Subject: RE: Experience with large merge factors
 
 [...]