Hello All

I just tried Nutch 2.3.1 and the problem is resolved. It seems that Nutch 2.x is still not stable.
Since I am a novice with Hadoop and Nutch, I have some questions.

1) Can Nutch 2.x be distributed on a Hadoop cluster? (The Nutch-Hadoop documentation only covers 1.x.)

2) When Nutch 2.x is deployed in a local environment, logs are written to hadoop.log. While going through hadoop.log, I saw some information related to MapReduce jobs. Does this mean that intermediate data is kept in HDFS and processed by MapReduce jobs? I would like an in-depth understanding of the Nutch-Hadoop architecture.

Waiting for a response.

Thanks

On Friday 22 January 2016 10:36 AM, harsh wrote:
Hello All
Is this a bug? If yes, should I move to Nutch 2.3.1? Still waiting for a response.
Thanks
On Thursday 21 January 2016 12:45 PM, harsh wrote:
I had configured Nutch 2.3 in a local environment with gora-mongodb (0.6.1). The execution flow is given below.

bin/nutch generate -topN 100 -crawlId nutch_crawler

GeneratorJob: starting at 2016-01-21 12:40:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 100
GeneratorJob: finished at 2016-01-21 12:40:22, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1453360219-812835806 containing 0 URLs


bin/nutch fetch 1453359290-25044495

FetcherJob: starting at 2016-01-21 12:41:34
FetcherJob: batchId: 1453359290-25044495
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread7, activeThreads=0
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
*QueueFeeder finished: total 0 records. Hit by time limit :0*
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=0
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
-finishing thread FetcherThread9, activeThreads=0
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2016-01-21 12:41:47, time elapsed: 00:00:13
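
One thing that stands out when comparing the two runs above: the generate step printed batch id 1453360219-812835806 (containing 0 URLs), while the fetch step was started with batch id 1453359290-25044495. A minimal Python sketch for checking this from the console output follows. The helper names parse_generated and parse_fetch_batch are made up for illustration and are not part of Nutch, and the regular expressions only assume the log wording shown above.

```python
import re

# Lines copied verbatim from the console output above.
generate_line = "GeneratorJob: generated batch id: 1453360219-812835806 containing 0 URLs"
fetch_line = "FetcherJob: batchId: 1453359290-25044495"


def parse_generated(line):
    """Return (batch_id, url_count) from a GeneratorJob summary line."""
    m = re.search(r"generated batch id: (\S+) containing (\d+) URLs", line)
    if m is None:
        raise ValueError("not a GeneratorJob summary line: %r" % line)
    return m.group(1), int(m.group(2))


def parse_fetch_batch(line):
    """Return the batch id that FetcherJob reports it was started with."""
    m = re.search(r"FetcherJob: batchId: (\S+)", line)
    if m is None:
        raise ValueError("not a FetcherJob batchId line: %r" % line)
    return m.group(1)


generated_id, url_count = parse_generated(generate_line)
fetch_id = parse_fetch_batch(fetch_line)

print("generated batch:", generated_id, "with", url_count, "URLs")
print("fetched batch:  ", fetch_id)
if generated_id != fetch_id:
    print("fetch was started with a different batch id than generate printed")
if url_count == 0:
    print("generate selected no URLs, so there was nothing to fetch for that batch")
```

Whether the differing batch id is related to the problem is not established here; the generate step itself reported 0 URLs, which on its own would leave the fetcher with nothing to do.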

However, there was only one document in MongoDB:

db.nutch_crawler_webpage.findOne()
{
    "_id" : "com.thehindu.www:http/",
    "fetchTime" : NumberLong("1453359144150"),
    "fetchInterval" : 2592000,
    "score" : 1,
    "markers" : {
        "dist" : "0",
        "_injmrk_" : "y"
    },
    "metadata" : {
        "_csh_" : BinData(0,"P4AAAA==")
    }
}
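
For reading the document above: Nutch 2.x stores row keys with the host name reversed, and fetchTime is in milliseconds since the epoch. The sketch below is plain Python, not Nutch code. The helper name unreverse_key is hypothetical, it handles only keys of the form reversed.host:scheme[:port]/path, and it assumes fetchInterval is in seconds.

```python
from datetime import datetime, timedelta, timezone


def unreverse_key(key):
    """Turn a reversed-host key such as 'com.example.www:http/path' into a URL.

    Assumes the form reversed.host:scheme[:port]/path.
    """
    host_part, _, rest = key.partition(":")
    slash = rest.find("/")
    if slash == -1:
        scheme_port, path = rest, ""
    else:
        scheme_port, path = rest[:slash], rest[slash:]
    scheme, _, port = scheme_port.partition(":")
    host = ".".join(reversed(host_part.split(".")))
    netloc = host + (":" + port if port else "")
    return "%s://%s%s" % (scheme, netloc, path)


# Values copied from the findOne() output above.
row_key = "com.thehindu.www:http/"
fetch_time_ms = 1453359144150
fetch_interval_s = 2592000

url = unreverse_key(row_key)

# Split into whole seconds and leftover milliseconds to avoid float rounding.
fetch_time = datetime.fromtimestamp(fetch_time_ms // 1000, tz=timezone.utc) + timedelta(
    milliseconds=fetch_time_ms % 1000
)
interval_days = fetch_interval_s / 86400

print(url)
print(fetch_time.isoformat())
print(interval_days, "days")
```

For the document above this gives http://www.thehindu.com/, a fetchTime of 2016-01-21 06:52:24 UTC, and a fetch interval of 30 days. The only markers on the record are dist and _injmrk_, so it appears to be the injected seed URL as stored, with nothing written by a later step.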

Internal links were not found. Why? Where should they be stored?

My nutch-site.xml configuration is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Orkash Crawler</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>/home/user/Downloads/nutch_jars/mongo_nnutch_2/apache-nutch-2.3/build/plugins</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>