Re: How could I avoid reindexing same files?

Fergus McMenemie Wed, 08 Apr 2009 03:05:34 -0700

>Hi Fergus,
>
>On Tue, Apr 07, 2009 at 05:06:23PM +0100, Fergus McMenemie wrote:
>> >Thank you much Fergus,
>> >
>> >I was considering implementing a database which would hold a path name
>> >and an MD5 sum of each file.
>> Snap. That is close to what we did. However due to our previous
>> duff full text search engine we had to hold this information in
>> a separate checksums file. Solr is much better at allowing you
>> to add extra meta information as the document is being submitted
>> for indexing.
>> 
>> curl http://localhost...update/extract \
>>    -F "myfi...@file.pdf;ext.literal.id=file.pdf;ext.literal.chksum=XXXXX"
>
>- Great idea, simpler and cleaner!
>
> 
>> >Then as a part of Solr indexing, one could check against the DB if a
>> >file path exists, if Yes, then compare MD5 and only index if different.
>> Using solr you could hold the checksum and pathname as solr fields,
>> then rather than looking up a DB you would look up solr. Having
>> everything in the one place is better for consistency and quality. You
>> could also dump all checksums and pathnames from solr if/when you wanted
>> to validate your folder structure and/or indexes.
>
>- What kind of query could I use with Solr, to check for a specific
>  filename/checksum and get an answer as close to "TRUE or FALSE" as possible?


Some thought needs to be given to this to make sure that
the performance is adequate. But at its simplest:-

curl 'http://localhost.../select?q=id:file.pdf&fl=id,chksum'
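
To turn that lookup into something close to a TRUE/FALSE answer, the
client can compare the returned chksum with a freshly computed MD5 of
the file. The following is only a rough sketch, not something we run
in production. It assumes the select request asked for JSON output
(wt=json) and that the fields are named id and chksum as in the
examples in this thread. The function names md5_of_file,
stored_checksum and needs_indexing are made up for illustration.

```python
import hashlib
import json


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stored_checksum(solr_json):
    """Pull the chksum field out of a Solr JSON select response.

    Returns None when no document matched or the field is absent.
    Assumes the usual {"response": {"docs": [...]}} layout of wt=json.
    """
    docs = json.loads(solr_json).get("response", {}).get("docs", [])
    if not docs:
        return None
    value = docs[0].get("chksum")
    # A multi-valued field comes back as a list; use its first value.
    if isinstance(value, list):
        return value[0] if value else None
    return value


def needs_indexing(path, solr_json):
    """True if the file is new to Solr or its contents have changed."""
    return stored_checksum(solr_json) != md5_of_file(path)
```

A file that Solr has never seen yields no documents, so
needs_indexing() returns True for it as well as for a changed file.
Only an exact checksum match gives False, which is the "skip this
file" case.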
-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
