I am using a configurable custom plugin to parse the documents in my crawl.
Occasionally a small number of documents cause the solrindex step to fail. This happens when the rules I have set up for my custom parser produce documents that don't match the Solr schema (multiple occurrences of fields that should occur only once). I could relax the schema, but that is not what I want. Instead I correct the parsing configuration, then re-fetch and re-parse only the documents that caused the problem (not every document in the failed segment, which could be thousands).

Having re-fetched and re-parsed the problem documents, I merge the newly created segment directory with the earlier segment directory. I can then successfully use solrindex to index the documents in the merged segment directory.

My question is about storing the merged segment and disposing of the earlier segment directory. Specifically, I would like to store them in such a way that I can still trace from the fetchdb to find documents in the segmentdb. It seems to me that the best approach would be to delete the old segment directory and replace it with a symbolic link to the merged segment directory. That way it should be possible to trace any document in the fetchdb (both those that were re-fetched and those that were not) to the correct segment.

Does this make sense? Are there any circumstances where the symbolic link would cause problems? Does anyone have a better approach? Is my concern even legitimate? If not, why not?

Thanks
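
P.S. To make the symbolic link idea concrete, here is a minimal Python sketch of the replacement step I have in mind. It is not part of Nutch: the function name `replace_with_symlink` and the `.bak` suffix are my own, and it assumes a local filesystem that supports symbolic links. It moves the old segment aside first, so the old data is deleted only after the link is in place.

```python
import os
import shutil
from pathlib import Path


def replace_with_symlink(old_segment: Path, merged_segment: Path) -> Path:
    """Replace old_segment with a symlink pointing at merged_segment.

    Hypothetical helper, not a Nutch API. Assumes both paths are on a
    local filesystem that supports symbolic links.
    """
    if not merged_segment.is_dir():
        raise FileNotFoundError(f"merged segment not found: {merged_segment}")

    # Move the old segment aside rather than deleting it straight away,
    # so it can be restored if creating the link fails.
    backup = old_segment.with_name(old_segment.name + ".bak")
    os.rename(old_segment, backup)
    try:
        # Use an absolute target so the link does not depend on where
        # the old segment directory happens to live.
        os.symlink(merged_segment.resolve(), old_segment, target_is_directory=True)
    except OSError:
        os.rename(backup, old_segment)  # roll back
        raise

    # The link is in place, so the old data can go.
    shutil.rmtree(backup)
    return old_segment
```

I would call it as `replace_with_symlink(Path(old_dir), Path(merged_dir))`, where both arguments are placeholders for the real segment directories. Anything that looks up the old segment path would then read the merged segment's contents.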

