On Apr 2, 2009, at 4:02 AM, Fergus McMenemie wrote:
Grant,
Hmmm, the big difference is made by &overwrite=false. But
can you explain why &overwrite=false makes such a difference?
I am starting off with an empty index, and I have checked the
content: there are no duplicates in the uniqueKey field.
I guess if &overwrite=false then a few checks can be removed
from the indexing process, and if I am confident that my content
contains no duplicates then this is a good speed up.
http://wiki.apache.org/solr/UpdateCSV says that if overwrite
is true (the default) then overwrite documents based on the
uniqueKey. However, what will Solr/Lucene do if the uniqueKey
is not unique and overwrite=false?
overwrite=false means Solr does not issue deletes first, so if
you already have a doc with that id, you will end up with two docs with
that id. Uniqueness of the id is enforced by Solr, not by Lucene.
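To make the difference concrete, here is a toy in-memory model of the behaviour described above. It is only a sketch: ToyIndex and its methods are hypothetical names, and this is not how Solr or Lucene are implemented internally. It just shows that the store itself accepts duplicates, and that uniqueness comes from a delete issued before each add.

```python
# Toy model only: ToyIndex is a hypothetical class, not Solr or Lucene code.

class ToyIndex:
    def __init__(self):
        # The underlying store is a plain list with no uniqueness constraint,
        # standing in for the index itself.
        self.docs = []

    def add(self, doc, overwrite=True):
        if overwrite:
            # overwrite=true: remove any existing doc with the same id first.
            self.docs = [d for d in self.docs if d["id"] != doc["id"]]
        # overwrite=false: skip the delete and simply append.
        self.docs.append(doc)

    def count(self, doc_id):
        # Number of stored docs carrying this id.
        return sum(1 for d in self.docs if d["id"] == doc_id)
```

Adding the same id twice with the default leaves one doc; adding it again with overwrite=False leaves two.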
Even if you can't guarantee uniqueness, you can still do overwrite =
false as a workaround using the suggestion I gave you in a prior email:
1. Add a new field whose value identifies your data source and is the
same for all records in that data source, e.g. type = geonames.txt
2. Before updating, issue a delete by query for the value of that
type, which will delete all records with that term
3. Do your indexing with overwrite = false
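The three steps above could be sketched roughly as follows. This only builds the two requests and does not send them. The base URL, the field name "type", the use of stream.file (which needs remote streaming enabled on the server), and the helper names are all assumptions for illustration; the type field also has to be populated in the records by whatever means the data source allows.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Assumed location of the Solr instance; adjust for your setup.
SOLR_BASE = "http://localhost:8983/solr"


def delete_by_query_body(field, value):
    """XML body to POST to SOLR_BASE + '/update' (step 2).

    Assumes the value contains no double quotes.
    """
    query = '%s:"%s"' % (field, value)
    return "<delete><query>%s</query></delete>" % escape(query)


def csv_update_url(stream_file, overwrite=False, commit=True):
    """URL for the CSV update handler (step 3), tab-separated input."""
    params = {
        "stream.file": stream_file,   # assumes the file is readable by Solr
        "separator": "\t",            # encoded as %09
        "overwrite": "true" if overwrite else "false",
        "commit": "true" if commit else "false",
    }
    return SOLR_BASE + "/update/csv?" + urlencode(params)
```

For example, delete_by_query_body("type", "geonames.txt") gives the delete message for step 2, and csv_update_url("/data/geonames.txt") gives a URL carrying overwrite=false for step 3 (the path here is made up).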
I should note, however, that the speed difference you are seeing may
not be as pronounced as it appears. If I recall, during ApacheCon I
commented on how long it takes to shut down your Solr instance when
exiting it. That time is in fact Solr doing the work that
was put off by not committing earlier and letting all those deletes
pile up.
Thus, while it is likely that your older version is still faster due
to the new fsync stuff in Lucene, it may not be that much faster. I
think you could see this by actually doing commit = true, but I'm not
100% sure.
fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | wc -l
1000000
fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | sort -u | wc -l
1000000
fergus: /usr/bin/head geonames.txt
RC UFI UNI LAT LONG DMS_LAT DMS_LONG MGRS JOG FC DSG PC CC1 ADM1
ADM2 POP ELEV CC2 NT LC SHORT_FORM GENERIC SORT_NAME FULL_NAME
FULL_NAME_ND MODIFY_DATE
1 -1307828 60524 12.466667 -69.9 122800 -695400 19PDP0219578323
ND19-14 T MT AA 00 PALUMARGA Palu Marga Palu Marga 1995-03-23
1 -1307756 -1891720 12.5 -70.016667 123000 -700100 19PCP8952982056
ND19-14 P PPLX
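The uniqueness check in the transcript above compares a line count with a sort -u count, which shows whether duplicates exist but not which keys they are. A minimal sketch that lists the offending keys, assuming (as in the perl one-liners) that the key is the third tab-separated column; the function name is hypothetical:

```python
from collections import Counter


def duplicate_keys(lines, column=2, sep="\t"):
    """Return {key: count} for every key that appears more than once.

    column=2 is the third field, matching $F[2] in the perl one-liners.
    Lines with too few fields contribute an empty key, as perl would.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        key = fields[column] if len(fields) > column else ""
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Passing an open file object for geonames.txt as lines should return an empty dict when the column is unique; the header row is counted like any other line.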
PS. do you want me to do some kind of chop through the
different versions to see where the slow down happened
or are you happy you have nailed it?
--
===============================================================
Fergus McMenemie Email:fer...@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===============================================================
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search