Hi anupamk,
On Tue, Mar 18, 2014 at 2:45 AM, <[email protected]> wrote:
>
> While running the two crawlers concurrently I have run into problems:
> Nutch sometimes throws an IOException saying that the ".locked" file
> exists in crawldb while one of the crawl scripts tries to generate
> and/or update crawldb.

Your description of the IOException is not accurate. You will get one of the following two:

  throw new IOException("lock file " + lockFile + " already exists.");

or

  throw new IOException("lock file " + lockFile + " already exists and is a directory.");

It is important to state the difference. The former occurs when we acknowledge that a lock file exists within HDFS (which is essentially what the CrawlDB builds on top of) but DO NOT use force to override the locked state of the FileSystem. This is what I suspect is happening to you. The latter occurs when we acknowledge that a lock file exists within HDFS and use force to override the locked state, BUT the locked file is in fact a directory.

> Why does this happen and what do I do to avoid this?

Locks are implemented when we are writing into HDFS and we do not wish for other clients to be writing to the same file (or directory) at the same time. AFAIK the lock feature only exists for writes, as opposed to client reads; for HDFS reads you would need to use ZooKeeper or Curator. In this case we do not wish for the CrawlDB to be updated by more than one client at a time, so we implement a lock for the (file) destination we are writing to.

You can avoid it by ensuring that only one client attempts to write to the CrawlDB at any one time. The IOException is not bad; it is merely protecting you from potentially corrupting the CrawlDB by doing something like overwriting crawl data. I would suggest you DO NOT use the force switch to override the lock. It is a configuration option which carries risk, and the last thing you want is for your CrawlDB to become corrupted.
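For illustration, the lock-check logic described above can be sketched like this. This is NOT Nutch's actual implementation (which runs against Hadoop's FileSystem API); the class and method names here are hypothetical, and java.nio.file is used so the sketch runs stand-alone. Only the two exception messages mirror the ones quoted above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical, simplified stand-in for the lock check described above.
// Nutch does the equivalent against HDFS via Hadoop's FileSystem.
public class LockCheck {

    public static void checkLock(Path lockFile, boolean force) throws IOException {
        if (Files.exists(lockFile)) {
            if (!force) {
                // A lock is held and we were not asked to override it.
                throw new IOException("lock file " + lockFile + " already exists.");
            }
            if (Files.isDirectory(lockFile)) {
                // Even with force we refuse: the "lock file" is a directory.
                throw new IOException("lock file " + lockFile
                        + " already exists and is a directory.");
            }
            // force == true and it is a plain file: override the lock.
            Files.delete(lockFile);
        }
        Files.createFile(lockFile); // take the lock
    }
}
```

Note the order of the checks: without force you always get the first exception, and the "is a directory" case can only be reached once force is in play, exactly as described above.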
> What does -force mean?

It simply means that we override the lock for the writes to the CrawlDB.
https://wiki.apache.org/nutch/bin/nutch%20updatedb
We really should make it more explicit there why 'CAUTION IS ADVISED'.

> Any information / wiki links / documentation explaining locking and how
> it works would be appreciated.

One last thing: have you also checked out the 'generate.update.crawldb' property in nutch-site.xml? It may come in handy if you have multiple Nutch instances working concurrently.

hth
Lewis
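For reference, enabling that property in nutch-site.xml would look roughly like this (the description text here is paraphrased, not copied from nutch-default.xml; check your Nutch version for the exact wording and default):

```xml
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
  <description>
    Update the CrawlDb after each generate step so that the same URLs are
    not selected again by concurrent or subsequent generate runs.
  </description>
</property>
```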

