On Apr 2, 2009, at 4:02 AM, Fergus McMenemie wrote:
Grant,

Hmmm, the big difference is made by &overwrite=false. But
can you explain why &overwrite=false makes such a difference?
I am starting off with an empty index, and I have checked the
content: there are no duplicates in the uniqueKey field.

I guess if &overwrite=false then a few checks can be removed
from the indexing process, and if I am confident that my content
contains no duplicates then this is a good speed up.

http://wiki.apache.org/solr/UpdateCSV says that if overwrite
is true (the default) then overwrite documents based on the
uniqueKey. However what will solr/lucene do if the uniqueKey
is not unique and overwrite=false?

overwrite=false means Solr does not issue deletes first, so if you already have a doc with that id, you will end up with two docs with that id. Uniqueness of the id is enforced by Solr, not by Lucene.

Even if you can't guarantee uniqueness, you can still use overwrite=false as a workaround, using the suggestion I gave you in a prior email:
1. Add a new field that is unique for your data source, but is the same for all records in that data source, e.g. type = geonames.txt
2. Before updating, issue a delete by query for the value of that type, which will delete all records with that term
3. Do your indexing with overwrite=false
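
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested recipe: the base URL, the field name "type", and the file path are placeholders, the type field is assumed to be present in the data already, and the stream.file parameter only works if remote streaming is enabled on the Solr instance. The functions build the two requests without sending them.

```python
from urllib.parse import urlencode
from urllib.request import Request
from xml.sax.saxutils import escape

# Assumed base URL of the Solr instance (placeholder, adjust as needed).
SOLR = "http://localhost:8983/solr"


def delete_by_type_request(source_type, base=SOLR):
    """Step 2: build a delete-by-query for every doc from one data source.

    Assumes the schema has a field named "type" (hypothetical name).
    """
    body = "<delete><query>type:%s</query></delete>" % escape(source_type)
    return Request(
        base + "/update",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def csv_load_request(csv_path, base=SOLR):
    """Step 3: build a CSV update request with overwrite=false.

    commit=true is included so that the commit happens as part of the load.
    """
    params = urlencode({
        "overwrite": "false",
        "separator": "\t",       # tab-separated input
        "commit": "true",
        "stream.file": csv_path,  # needs remote streaming enabled
    })
    return Request(base + "/update/csv?" + params)


if __name__ == "__main__":
    # Print what would be sent; pass each Request to
    # urllib.request.urlopen() to actually run it against Solr.
    d = delete_by_type_request("geonames.txt")
    print(d.get_method(), d.full_url, d.data.decode("utf-8"))
    c = csv_load_request("/tmp/geonames.txt")
    print(c.get_method(), c.full_url)
```

The delete would also need a commit before the removed docs disappear from search; here the commit=true on the load is assumed to cover that.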

I should note, however, that the speed difference you are seeing may not be as pronounced as it appears. If I recall correctly, during ApacheCon I commented on how long it takes to shut down your Solr instance when exiting it. That time is in fact Solr doing the work that was put off by not committing earlier and letting all those deletes pile up.

Thus, while it is likely that your older version is still faster due to the new fsync stuff in Lucene, it may not be that much faster. I think you could see this by actually doing commit = true, but I'm not 100% sure.




fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | wc -l
1000000
fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | sort -u | wc -l
1000000
fergus: /usr/bin/head geonames.txt
RC UFI UNI LAT LONG DMS_LAT DMS_LONG MGRS JOG FC DSG PC CC1 ADM1 ADM2 POP ELEV CC2 NT LC SHORT_FORM GENERIC SORT_NAME FULL_NAME FULL_NAME_ND MODIFY_DATE
1 -1307828 60524 12.466667 -69.9 122800 -695400 19PDP0219578323 ND19-14 T MT AA 00 PALUMARGA Palu Marga Palu Marga 1995-03-23
1 -1307756 -1891720 12.5 -70.016667 123000 -700100 19PCP8952982056 ND19-14 P PPLX

PS. do you want me to do some kind of chop through the
different versions to see where the slow down happened
or are you happy you have nailed it?    
--

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
