Hey Lewis, Thanks for the quick reply. Looks like I am tangled now =)

I tried the tutorial mentioned at
http://wiki.apache.org/nutch/RunningNutchAndSolr

Step 3 is not working for me. Two of the directories that should exist once
step 3 completes are not created:

crawl/crawldb - Created
crawl/linkdb - not created
crawl/segments - not created
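
For anyone who wants to repeat this check, here is a minimal sketch. The helper name check_crawl_dirs is hypothetical (it is not a Nutch command), and the three directory names are simply the ones listed above:

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): report which of the expected
# crawl subdirectories exist under a given crawl directory.
check_crawl_dirs() {
  crawl_dir=$1
  for sub in crawldb linkdb segments; do
    if [ -d "$crawl_dir/$sub" ]; then
      echo "$crawl_dir/$sub - created"
    else
      echo "$crawl_dir/$sub - not created"
    fi
  done
}

# Usage: run from the directory that contains crawl/
check_crawl_dirs crawl
```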

Also, I changed the URL to http://nutch.apache.org, but I still get the same
log message: "Generator: 0 records selected for fetching, exiting ..."

Looks like I am missing some key step =(.
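
Since the log says to check the seed list and URL filters, one thing worth ruling out is an empty or malformed seed list. This is only a rough sketch: count_seed_urls is a hypothetical helper, not a Nutch command, and it assumes the seeds are plain http or https URLs, one per line, in the urls path passed to the crawl command quoted below:

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): count lines that look like
# http(s) seed URLs in a seed file, or in all files under a seed directory.
count_seed_urls() {
  seed_path=$1
  # find accepts either a single file or a directory; a missing path gives 0.
  find "$seed_path" -type f -exec grep -Eh '^[[:space:]]*https?://' {} + 2>/dev/null \
    | grep -c '' || true
}

# Usage: run from runtime/local
count_seed_urls urls
```

A count of 0 would suggest the seed list itself is the problem. A non-zero count alongside "0 records selected" would point more towards the URL filters, though this check cannot confirm that.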

-param

On 7/12/11 1:37 PM, "lewis john mcgibbney" <[email protected]>
wrote:

> Hi,
> 
> I think you are maybe getting tangled here. Please see the following
> tutorial for Nutch 1.3 [1]
> 
> Please also note that the URL you provided is the old Nutch site and now
> redirects to http://nutch.apache.org
> 
> [1] http://wiki.apache.org/nutch/RunningNutchAndSolr
> 
> On Tue, Jul 12, 2011 at 5:23 PM, Sethi, Parampreet <
> [email protected]> wrote:
> 
>> Thanks for updating the tutorial. I tried my setup, the crawl command is
>> running. But none of the pages are being crawled.
>> I created a urls directory inside the local folder and added a new file
>> named nutch containing the URL, as described in the tutorial.
>> 
>> (I also tried a file named urls inside the nutch/runtime/local directory.
>> The contents of the urls file are http://lucene.apache.org/nutch/ )
>> 
>> Here's the log:
>> 
>> us137390:local parampreetsethi$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>> solrUrl is not set, indexing will be skipped...
>> crawl started in: crawl
>> rootUrlDir = urls
>> threads = 10
>> depth = 3
>> solrUrl=null
>> topN = 50
>> Injector: starting at 2011-07-12 12:22:12
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: finished at 2011-07-12 12:22:15, elapsed: 00:00:03
>> Generator: starting at 2011-07-12 12:22:15
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 50
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: 0 records selected for fetching, exiting ...
>> Stopping at depth=0 - no more URLs to fetch.
>> No URLs to fetch - check your seed list and URL filters.
>> crawl finished: crawl
>> 
>> 
>> Please help.
>> 
>> Thanks
>> Param
>> 
>> On 7/12/11 5:52 AM, "Julien Nioche" <[email protected]> wrote:
>> 
>>> On 12 July 2011 10:30, Julien Nioche <[email protected]> wrote:
>>> 
>>>> 
>>>> 
>>>>>>> There seems to be no crawl-urlfilter file indeed. Don't know why it's
>>>>>>> gone since the crawl command is still there. You can find the file in
>>>>>>> the 1.2 release:
>>>>>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/
>>>>>> 
>>>>>> Crawl-urlfilter has been removed purposefully as it did not add
>>>>>> anything to the other url filters (automaton | regex) in terms of
>>>>>> functionality. By default the urlfilters contain (+.) which IIRC was
>>>>>> what the Crawl-urlfilter used to do.
>>>>>> 
>>>>> 
>>>>> That's reasonable. But now new users are unaware and don't know what
>>>>> to do with this error message.
>>>>> 
>>>> 
>>>> Yep, the tutorial needs updating indeed
>>>> 
>>> 
>>> done
>>> 
>>> 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>>>>> Thanks for a quick reply.
>>>>>>>> 
>>>>>>>> I searched in the nutch directory but still do not see that file :(.
>>>>>>>> Here's the complete file list inside the runtime/local/conf directory.
>>>>>>>> 
>>>>>>>> us137390:conf parampreetsethi$ pwd
>>>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/runtime/local/conf
>>>>>>>> us137390:conf parampreetsethi$ ls -t
>>>>>>>> automaton-urlfilter.txt    domain-urlfilter.txt    nutch-default.xml
>>>>>>>> prefix-urlfilter.txt    solrindex-mapping.xml
>>>>>>>> configuration.xsl    httpclient-auth.xml    nutch-site.xml
>>>>>>>> regex-normalize.xml    subcollections.xml
>>>>>>>> domain-suffixes.xml    log4j.properties    parse-plugins.dtd
>>>>>>>> regex-urlfilter.txt    suffix-urlfilter.txt
>>>>>>>> domain-suffixes.xsd    nutch-conf.xsl        parse-plugins.xml
>>>>>>>> schema.xml tika-mimetypes.xml
>>>>>>>> 
>>>>>>>> By the way, I tried deploying the code by checking it out from the svn
>>>>>>>> repository, but could not build it. I was getting the following error:
>>>>>>>> 
>>>>>>>> resolve-default:
>>>>>>>> [ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
>>>>>>>> [ivy:resolve] :: loading settings :: file = /Users/parampreetsethi/Documents/workspace/nutch/ivy/ivysettings.xml
>>>>>>>> [ivy:resolve]
>>>>>>>> [ivy:resolve] :: problems summary ::
>>>>>>>> [ivy:resolve] :::: WARNINGS
>>>>>>>> [ivy:resolve]         module not found: org.apache.gora#gora-core;0.2-incubating
>>>>>>>> [ivy:resolve]     ==== local: tried
>>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/ivys/ivy.xml
>>>>>>>> [ivy:resolve]       -- artifact org.apache.gora#gora-core;0.2-incubating!gora-core.jar:
>>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/jars/gora-core.jar
>>>>>>>> [ivy:resolve]         module not found: org.apache.gora#gora-sql;0.2-incubating
>>>>>>>> [ivy:resolve]     ==== local: tried
>>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/ivys/ivy.xml
>>>>>>>> [ivy:resolve]       -- artifact org.apache.gora#gora-sql;0.2-incubating!gora-sql.jar:
>>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/jars/gora-sql.jar
>>>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>>>> [ivy:resolve]         ::          UNRESOLVED DEPENDENCIES         ::
>>>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>>>> [ivy:resolve]         :: org.apache.gora#gora-core;0.2-incubating: not found
>>>>>>>> [ivy:resolve]         :: org.apache.gora#gora-sql;0.2-incubating: not found
>>>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>>>> [ivy:resolve]
>>>>>>>> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>>>>>>>> 
>>>>>>>> BUILD FAILED
>>>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/build.xml:458: impossible to resolve dependencies:
>>>>>>>>     resolve failed - see output for details
>>>>>>>> 
>>>>>>>> -param
>>>>>>>> 
>>>>>>>> On 7/11/11 5:56 PM, "Jerry E. Craig, Jr." <[email protected]> wrote:
>>>>>>>>> Look down a little further for the
>>>>>>>>> 
>>>>>>>>> or
>>>>>>>>> runtime/local/bin/nutch (version >= 1.3)
>>>>>>>>> 
>>>>>>>>> If you download the bin then it's in the runtime directory.
>>>>>>>>> 
>>>>>>>>> Jerry E. Craig, Jr.
>>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Sethi, Parampreet [mailto:[email protected]]
>>>>>>>>> Sent: Monday, July 11, 2011 2:51 PM
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: Nutch Novice help
>>>>>>>>> 
>>>>>>>>> Hi All,
>>>>>>>>> 
>>>>>>>>> Sorry for such a naïve question. I downloaded the Nutch 1.3 binary today
>>>>>>>>> and am trying to set it up as described in the tutorial at
>>>>>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>>>>> 
>>>>>>>>> However, I am not able to find crawl-urlfilter.txt inside the conf
>>>>>>>>> directory. Is there any other place where I should look for this file?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Param
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Open Source Solutions for Text Engineering
>>>> 
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> 
>>> 
>>> 
>> 
>> 
> 
