Chris wrote:

I'm building a crawler that needs to find all the documents in a repository. Once I do the first crawl, how do I go back later and get all the documents that have changed?

I could do a full recrawl, but I was hoping there was a faster way to find the nodes that had been inserted/updated/deleted since the last crawl.

If your use case allows you to register a listener, you can listen for modification events. On the other hand, if you are working from snapshots, you can add a modified-time attribute to all nodes and, when you need to find everything that was updated, select the nodes whose modified-time is later than your last snapshot. This is the same problem as with an RDBMS: how do you select all updated rows from a table ...
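
To illustrate the modified-time idea using the RDBMS comparison, here is a minimal sketch in Python with an in-memory SQLite table. The table name `nodes`, its columns `path` and `modified_time`, the sample rows, and the helper `changed_since` are all hypothetical and chosen only for illustration; they are not part of any repository API.

```python
import sqlite3


def changed_since(conn, last_snapshot):
    """Return paths of nodes whose modified_time is later than last_snapshot."""
    rows = conn.execute(
        "SELECT path FROM nodes WHERE modified_time > ? ORDER BY path",
        (last_snapshot,),
    )
    return [path for (path,) in rows]


# Hypothetical in-memory table standing in for the repository's nodes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (path TEXT PRIMARY KEY, modified_time INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [("/docs/a", 100), ("/docs/b", 250), ("/docs/c", 300)],
)

# Timestamp recorded at the end of the previous crawl (assumed value).
last_snapshot = 200
changed = changed_since(conn, last_snapshot)
print(changed)  # ['/docs/b', '/docs/c']
```

A query like this only sees nodes that still exist, so it covers inserts and updates; deleted nodes would have to be tracked some other way, for example through the listener approach.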

--
Ivan Latysh
[EMAIL PROTECTED]
