Hi anupamk,

On Tue, Mar 18, 2014 at 2:45 AM, <[email protected]> wrote:

>
> While running the two crawler's concurrently I have run into the problems
> and nutch sometimes throws a IOException saying that the ".locked" file
> exists in crawldb. While one of crawl script tries to generate and/or
> update
> crawldb.
>

Your description of the IOException is not quite accurate. You will get one
of the following two:

throw new IOException("lock file " + lockFile + " already exists.");
or
throw new IOException("lock file " + lockFile + " already exists and is a
directory.");

It is important to distinguish between them.
The former occurs when a lock file already exists within HDFS (which is
essentially what the CrawlDB is built on top of) and we DO NOT use force to
override the locked state of the FileSystem. This is what I suspect is
happening to you.
The latter occurs when a lock file already exists within HDFS and we DO use
force to override the locked state, BUT the lock file is in fact a
directory.
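To make the two cases concrete, here is a minimal sketch of the lock-check
logic in plain Java. It uses the local filesystem via java.nio.file instead
of the Hadoop FileSystem API, and the class/method names (LockSketch,
createLockFile) are my own for illustration; only the two exception messages
mirror what Nutch throws.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified illustration of Nutch-style lock acquisition.
// Assumption: local filesystem stands in for HDFS here.
public class LockSketch {

    public static void createLockFile(Path lockFile, boolean force) throws IOException {
        if (Files.exists(lockFile)) {
            if (!force) {
                // Case 1: a lock exists and we refuse to override it.
                throw new IOException("lock file " + lockFile + " already exists.");
            }
            if (Files.isDirectory(lockFile)) {
                // Case 2: force was given, but the "lock file" is a directory.
                throw new IOException("lock file " + lockFile
                        + " already exists and is a directory.");
            }
            // force == true and it is a regular file: remove it and re-acquire.
            Files.delete(lockFile);
        }
        Files.createFile(lockFile);
    }

    public static void main(String[] args) throws IOException {
        Path lock = Files.createTempDirectory("crawldb").resolve(".locked");
        createLockFile(lock, false);            // first client acquires the lock
        try {
            createLockFile(lock, false);        // second client, no force: case 1
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
        createLockFile(lock, true);             // force overrides the stale lock
        System.out.println("reacquired: " + Files.exists(lock));
    }
}
```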


>
> Why does this happen and what do I do to avoid this ?


Well, locks are implemented when we write into HDFS and do not wish for
other clients to be writing to the same file (or directory) at the same
time. AFAIK the lock feature only exists for writes as opposed to client
reads; for coordinating reads on HDFS you would need something like
ZooKeeper or Curator. In this case we do not want the CrawlDB to be updated
by more than one client at a time, so we take a lock on the (file)
destination we are writing to.
You can avoid the exception by ensuring that only one client attempts to
write to the CrawlDB at any one time. The IOException is not a bad thing;
it is merely protecting you from potentially corrupting the CrawlDB, e.g.
by overwriting crawl data. I would suggest you DO NOT use the force switch
to override the lock. It is a configuration option which carries risk, and
the last thing you want is for your CrawlDB to become corrupted.
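If you must keep both crawl scripts running, one simple way to ensure only
one writer touches the CrawlDB at a time is to serialize the updatedb step
behind an external lock. A minimal sketch, assuming a Linux shell with
flock(1) available; the lock path and the echo stand-in for the real
bin/nutch invocation are my own placeholders:

```shell
#!/bin/sh
# Serialize CrawlDB writers: flock blocks until the previous holder exits.
LOCK=/tmp/crawldb.writer.lock

run_updatedb() {
  # Replace echo with the real command, e.g.:
  #   bin/nutch updatedb crawl/crawldb crawl/segments/"$1"
  flock "$LOCK" echo "updatedb for $1"
}

run_updatedb seg1
run_updatedb seg2
```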



> What does -force mean
> ?
>

It simply means that we override the lock for writes to the CrawlDB.
https://wiki.apache.org/nutch/bin/nutch%20updatedb
We really should make it more explicit there why 'CAUTION IS ADVISED'.


> Any information / wiki links / documentation explaining locking and how it
> works would be appreciated.
>
>
One last thing: have you also checked out the 'generate.update.crawldb'
property in nutch-site.xml? It may come in handy if you have multiple Nutch
instances working concurrently.
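For reference, the property would be set in nutch-site.xml like this (the
description text is my own paraphrase of its purpose; check
nutch-default.xml for the authoritative wording):

```xml
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
  <description>If true, the generator updates the CrawlDB after a generate
  run so that URLs it has already handed out are marked, which helps stop
  concurrent generate runs from selecting the same URLs.</description>
</property>
```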

hth
Lewis
