Unfortunately, the <uniqueKey/> has never changed. The issue can take some time to show itself, although I now think there were logic issues in the way I update documents in my index.

I first do a full purge and reindex of all items without issue. After that, I only index items that have changed or are new since the initial reindex. Over time, however, duplicates start to appear, which is strange because I use a combination of <uniqueKey/> and overwrite="true", which should guarantee uniqueness.

I have been using the /admin/luke lastModified date to find items which have been added/updated after that date, but have just realized that lastModified will only change if I a) reindex everything or b) call optimize, so I have been retrieving items which had already been added to the index. I think explicitly storing the last run time (in a file/db field) will ensure I only retrieve those items which have changed since the last index run. This will also go a long way towards solving the duplication issue.

Thanks again

Hayden

On 11 September 2015 at 19:33, Erick Erickson <erickerick...@gmail.com> wrote:

> OK, this makes no sense whatsoever, so I'm missing something.
>
> commitWithin shouldn't matter at all, there's code to handle multiple
> updates between commits.
>
> I'm _really_ shooting in the dark here, but...
>
> did you perhaps change the <uniqueKey> definition from the default "id"
> to "key" without blowing away the entire data directory in between?
>
> Take a look at your schema file through the Admin/UI browser, is it what
> you expect? And did you reload/restart after the changes?
>
> I could get _some_ duplication by changing the field that was my <uniqueKey>
> then adding more docs. Which makes some sense since some of the Lucene
> segment files were created with one definition and some with another. But that
> doesn't explain why you _keep_ getting more and more duplicates.
>
> But this behavior is fundamental Solr, so I doubt it would have snuck through
> or not generated very loud howls. Which leaves us with wondering what is
> unexpected in your setup. Everything you've shown us looks good, so I'm
> puzzled.
>
> Best,
> Erick
>
> On Fri, Sep 11, 2015 at 9:52 AM, Mr Havercamp <mrhaverc...@gmail.com> wrote:
> > I'm wondering if the commitWithin is causing issues.
> >
> > On 11 September 2015 at 18:52, Mr Havercamp <mrhaverc...@gmail.com> wrote:
> >
> >> Thanks for the suggestions. No, not using MERGEINDEXES nor
> >> MapReduceIndexerTool.
> >>
> >> I've pasted the <add/> XML in case there is something broken there (cut
> >> down for brevity, i.e. the "..."):
> >>
> >> <add overwrite="true" commitWithin="10000"><doc>
> >> <field name="handle_s">123456789/3</field>
> >> <field name="title">Test Submission</field>
> >> <field name="title_sort">Test Submission</field>
> >> <field name="access">1</field>
> >> <field name="parent_id">1</field>
> >> <field name="collection_s">Test Collection</field>
> >> <field name="collection_fc">test collection|||Test Collection</field>
> >> <field name="collection_sort">Test Collection</field>
> >> <field name="dc.contributor.author_fc">young, hayden|||Young, Hayden</field>
> >> <field name="author">Young, Hayden</field>
> >> <field name="dc.contributor.author_sm">Young, Hayden</field>
> >> ...
> >> <field name="key">archive.item.1</field>
> >> ...
> >> </doc></add>
> >>
> >> On 11 September 2015 at 18:06, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Are you by any chance using the MERGEINDEXES
> >>> core admin call? Or using MapReduceIndexerTool?
> >>>
> >>> Neither of those delete duplicates....
> >>>
> >>> This is a fundamental part of Solr though, so it's
> >>> virtually certain that there's some innocent-seeming
> >>> thing you're doing that's causing this...
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>> > On 9/11/2015 9:10 AM, Mr Havercamp wrote:
> >>> >> fieldType def:
> >>> >>
> >>> >> <!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
> >>> >> <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
> >>> >>
> >>> >> It is not SolrCloud.
> >>> >
> >>> > As long as it's not a distributed index, I can't think of any problem
> >>> > those field/type definitions might cause. Even if it were distributed
> >>> > and you had the same document in multiple shards, duplicates should be
> >>> > removed at query time, if each shard has the same schema as the others.
> >>> >
> >>> > I don't have any further ideas. There may be something wrong that I
> >>> > haven't thought of.
> >>> >
> >>> > Thanks,
> >>> > Shawn
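
For reference, here is a rough sketch of the "store the last run time explicitly" idea from the top of this message, in Python. It is only an illustration under assumptions: the state file name, the item layout (a dict with "key" and "modified") and the send_to_solr callback are all hypothetical and stand in for whatever the real indexer uses. The point is that the timestamp comes from the indexer's own record rather than from /admin/luke lastModified.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical location for the indexer's own "last run" record.
STATE_FILE = Path("last_index_run.json")


def read_last_run(path=STATE_FILE):
    """Return the stored time of the previous run, or None if there is none."""
    if not path.exists():
        return None
    return datetime.fromisoformat(json.loads(path.read_text())["last_run"])


def write_last_run(when, path=STATE_FILE):
    """Persist the time of the run that has just completed."""
    path.write_text(json.dumps({"last_run": when.isoformat()}))


def select_changed(items, since):
    """Keep only items modified after `since`; keep everything on a first run."""
    if since is None:
        return list(items)
    return [item for item in items if item["modified"] > since]


def run_incremental_index(items, send_to_solr, path=STATE_FILE, now=None):
    """Index items changed since the last recorded run, then record this run.

    `send_to_solr` is a placeholder callback that would post the <add/> XML.
    """
    # Take the timestamp before selecting items, so anything modified while
    # this run is in progress is picked up again by the next run.
    started = now or datetime.now(timezone.utc)
    changed = select_changed(items, read_last_run(path))
    if changed:
        send_to_solr(changed)
    # Only reached if send_to_solr did not raise, so a failed run is retried.
    write_last_run(started, path)
    return changed
```

Items caught twice around the boundary are harmless as long as every document carries the same <uniqueKey/> value and is sent with overwrite="true", since the later add replaces the earlier one.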