Apache Nutch hadoop+hbase+hdfs integration

d.zenin Tue, 10 Mar 2015 15:05:35 -0700

Hi guys,

I successfully configured Nutch 2.x + HBase 0.94 + Hadoop 2.5, but I have a
question about how Nutch runs its "phases" (fetch, parse, etc.). I found in
the Nutch source code that each phase is a set of Hadoop Map/Reduce tasks.


Is my understanding correct that each Map/Reduce task reads its data from
and writes its data to HBase? Do these tasks additionally store their
output in HDFS (for example, does the fetch job's map task store URL
content in HDFS, so that I can see it directly there)?

Is it necessary to have a TaskTracker up and running in order to run
Map/Reduce tasks? I found that execution completes successfully even when
no TaskTracker or NodeManager is running.

My jps output for a successful run:

breedish-mbp:~ zenind$ jps
21664 HQuorumPeer
19657 Launcher
8974 NailgunRunner
21706 HMaster
20918 NameNode
21808 HRegionServer
21114 SecondaryNameNode
21005 DataNode
21847 Jps

Best Regards,
Dzmitry
