Thanks for updating the tutorial. I tried my setup and the crawl command now
runs, but none of the pages are being crawled.
I created a urls directory inside the local folder and added a new file named
nutch containing the URL, as described in the tutorial.

(I also tried a file named urls inside the nutch/runtime/local directory. The
contents of the urls file are http://lucene.apache.org/nutch/ )

Here's the log:

us137390:local parampreetsethi$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2011-07-12 12:22:12
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 12:22:15, elapsed: 00:00:03
Generator: starting at 2011-07-12 12:22:15
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl


Please help.

Thanks
Param
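
The last log line points at two things to check: the seed list and the URL filters. Below is a minimal Python sketch of both checks. It is not Nutch's code. It assumes that seed files hold one URL per line and that filter rules are lines of the form +regex or -regex where the first rule found in the URL decides and an unmatched URL is rejected. The function names and the restrictive rule set are hypothetical and exist only for illustration.

```python
import re
import tempfile
from pathlib import Path


def load_seeds(seed_dir):
    """Collect the non-blank lines of every regular file in seed_dir."""
    seeds = []
    for path in sorted(Path(seed_dir).iterdir()):
        if path.is_file():
            for line in path.read_text().splitlines():
                line = line.strip()
                if line:
                    seeds.append(line)
    return seeds


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides its fate."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    # Recreate the seed layout from the message in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "nutch").write_text("http://lucene.apache.org/nutch/\n")
        seeds = load_seeds(tmp)

    permissive = parse_rules("+.")  # accept-everything rule quoted in this thread
    # Hypothetical rule set that rejects the seed's host before reaching "+."
    restrictive = parse_rules("-^http://lucene\\.apache\\.org/\n+.")

    for url in seeds:
        print(url, accepts(permissive, url), accepts(restrictive, url))
```

If the seed list comes back empty, or every seed is rejected by the rules in use, a generator that selects 0 records would be the expected result under these assumptions.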

On 7/12/11 5:52 AM, "Julien Nioche" <[email protected]> wrote:

> On 12 July 2011 10:30, Julien Nioche <[email protected]> wrote:
> 
>> 
>> 
>>>>> There seems to be no crawl-urlfilter file indeed. Don't know why it's gone
>>>>> since the crawl command is still there. You can find the file in the 1.2
>>>>> release: http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/
>>>> 
>>>> Crawl-urlfilter has been removed purposefully as it did not add anything
>>>> to the other url filters (automaton | regex) in terms of functionality. By
>>>> default the urlfilters contain (+.) which IIRC was what the
>>>> Crawl-urlfilter used to do.
>>>> 
>>> 
>>> That's reasonable. But now new users are unaware and don't know what to do
>>> with this error message.
>>> 
>> 
>> Yep, the tutorial needs updating indeed
>> 
> 
> done
> 
> 
>> 
>> 
>> 
>>> 
>>>>>> Thanks for a quick reply.
>>>>>> 
>>>>>> I searched in the nutch directory but still do not see that file :(.
>>>>>> Here's the complete file list inside the runtime/local/conf directory.
>>>>>> 
>>>>>> us137390:conf parampreetsethi$ pwd
>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/runtime/local/conf
>>>>>> us137390:conf parampreetsethi$ ls -t
>>>>>> automaton-urlfilter.txt    domain-urlfilter.txt    nutch-default.xml
>>>>>> prefix-urlfilter.txt    solrindex-mapping.xml
>>>>>> configuration.xsl    httpclient-auth.xml    nutch-site.xml
>>>>>> regex-normalize.xml    subcollections.xml
>>>>>> domain-suffixes.xml    log4j.properties    parse-plugins.dtd
>>>>>> regex-urlfilter.txt    suffix-urlfilter.txt
>>>>>> domain-suffixes.xsd    nutch-conf.xsl        parse-plugins.xml
>>>>>> schema.xml tika-mimetypes.xml
>>>>>> 
>>>>>> By the way, I tried deploying the code by checking it out from the svn
>>>>>> repository, but could not build it. I was getting the following error:
>>>>>> 
>>>>>> resolve-default:
>>>>>> [ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
>>>>>> [ivy:resolve] :: loading settings :: file = /Users/parampreetsethi/Documents/workspace/nutch/ivy/ivysettings.xml
>>>>>> [ivy:resolve]
>>>>>> [ivy:resolve] :: problems summary ::
>>>>>> [ivy:resolve] :::: WARNINGS
>>>>>> [ivy:resolve]         module not found: org.apache.gora#gora-core;0.2-incubating
>>>>>> [ivy:resolve]     ==== local: tried
>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/ivys/ivy.xml
>>>>>> [ivy:resolve]       -- artifact org.apache.gora#gora-core;0.2-incubating!gora-core.jar:
>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/jars/gora-core.jar
>>>>>> [ivy:resolve]         module not found: org.apache.gora#gora-sql;0.2-incubating
>>>>>> [ivy:resolve]     ==== local: tried
>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/ivys/ivy.xml
>>>>>> [ivy:resolve]       -- artifact org.apache.gora#gora-sql;0.2-incubating!gora-sql.jar:
>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/jars/gora-sql.jar
>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>> [ivy:resolve]         ::          UNRESOLVED DEPENDENCIES         ::
>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>> [ivy:resolve]         :: org.apache.gora#gora-core;0.2-incubating: not found
>>>>>> [ivy:resolve]         :: org.apache.gora#gora-sql;0.2-incubating: not found
>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>> [ivy:resolve]
>>>>>> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>>>>>> 
>>>>>> BUILD FAILED
>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/build.xml:458: impossible to resolve dependencies:
>>>>>>     resolve failed - see output for details
>>>>>> 
>>>>>> -param
>>>>>> 
>>>>>> On 7/11/11 5:56 PM, "Jerry E. Craig, Jr." <[email protected]> wrote:
>>>>>>> Look down a little further for the
>>>>>>> 
>>>>>>> or
>>>>>>> runtime/local/bin/nutch (version >= 1.3)
>>>>>>> 
>>>>>>> If you download the bin then it's in the runtime directory.
>>>>>>> 
>>>>>>> Jerry E. Craig, Jr.
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Sethi, Parampreet [mailto:[email protected]]
>>>>>>> Sent: Monday, July 11, 2011 2:51 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: Nutch Novice help
>>>>>>> 
>>>>>>> Hi All,
>>>>>>> 
>>>>>>> Sorry for such a naïve question. I downloaded the nutch 1.3 binary today
>>>>>>> and am trying to set it up as mentioned in the Tutorial at
>>>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>>> 
>>>>>>> However, I am not able to find crawl-urlfilter.txt inside the conf
>>>>>>> directory. Is there any other place where I should look for this file?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Param
>>> 
>> 
>> 
>> 
>> --
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> 
> 
> 
