I am using a configurable custom plugin to parse the documents in my crawl.

 

I occasionally have a situation where a small number of documents cause the
solrindex step to fail.  This happens when the rules I have set up for my
custom parser produce documents that don't match the Solr schema (multiple
occurrences of a field that should occur only once).

 

I could relax the schema, but that is not what I want.  Instead, I correct
the parsing configuration, then re-fetch and re-parse only the documents that
caused the problem (not every document in the failed segment, which could
number in the thousands).

 

Having re-fetched and re-parsed the problem documents, I merge the newly
created segment directory with the earlier segment directory.  I can then
successfully use solrindex to index the documents in the merged segment
directory.

 

My question is about storing the merged segment and disposing of the earlier
segment directory.  Specifically, I would like to store them in such a way
that I can still trace from the fetchdb to find documents in the segmentdb.

 

It seems to me that the best approach would be to delete the old segment
directory and replace it with a symbolic link to the merged segment
directory.  In this way it would seem possible to trace any of the documents
in the fetchdb (both those that were re-fetched and those that were not) to
the correct segment.
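
To make the idea concrete, here is a rough sketch of the swap I have in mind.
It assumes the segments live on a local filesystem where symbolic links are
supported. The function name replace_with_symlink and the ".old" backup
suffix are made up for illustration and are not part of Nutch. The old
segment is moved aside first, so it can be restored if creating the link
fails.

```python
import os
import shutil
from pathlib import Path


def replace_with_symlink(old_segment, merged_segment):
    """Replace old_segment (a real directory) with a symlink to merged_segment.

    Hypothetical helper for illustration only; it is not a Nutch API.
    """
    old_segment = Path(old_segment)
    merged_segment = Path(merged_segment).resolve()

    if not merged_segment.is_dir():
        raise FileNotFoundError(f"merged segment not found: {merged_segment}")
    if old_segment.is_symlink():
        raise ValueError(f"{old_segment} is already a symlink")

    # Move the old segment aside so it can be restored if linking fails.
    backup = old_segment.with_name(old_segment.name + ".old")
    os.rename(old_segment, backup)
    try:
        os.symlink(merged_segment, old_segment, target_is_directory=True)
    except OSError:
        os.rename(backup, old_segment)
        raise

    # The link is in place, so the old segment data can now be discarded.
    shutil.rmtree(backup)
```

After the call, anything that opens the old segment path is transparently
redirected to the merged segment.  Whether every tool that reads the segments
follows symbolic links is exactly the part I am unsure about.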

 

Does this make sense?  Are there any circumstances where the symbolic link
would cause problems?  Does anyone have a better approach?  Is my concern
even legitimate? If not, why not?

 

Thanks
