> So what's the whole point of supporting Cassandra or other databases (via
> Gora) if Hadoop (HDFS & MR) are both essential? What exactly would Cassandra
> be doing?
Because it's a database, not a Hadoop map or sequence file, which becomes
unhandy once it grows to 100 million or billions of records. Anyway, Nutch 1.x
can crawl billions of pages, it's more actively maintained, and it provides
more features. The only good argument for 2.x would be to integrate/share
crawled data via Cassandra with other components of your infrastructure.
Cassandra stores the data; Hadoop runs the crawler and distributes the job
tasks. You also need a little HDFS storage to hold and distribute the Nutch
program and keep the log files.

Sebastian

On 02/23/2018 10:48 PM, Kaliyug Antagonist wrote:
> So what's the whole point of supporting Cassandra or other databases (via
> Gora) if Hadoop (HDFS & MR) are both essential? What exactly would Cassandra
> be doing?
>
> On 23 Feb 2018 22:41, "Yossi Tamari" <[email protected]> wrote:
>
>> 1 is not true.
>> 2 is true, if we ignore the second part 😊
>> Hadoop is made of two parts: distributed storage (HDFS) and a Map/Reduce
>> framework. Nutch is essentially a collection of Map/Reduce tasks. It relies
>> on Hadoop to distribute these tasks to all participating servers. So if you
>> run in local mode, you can only use one server. If you have a single-node
>> Hadoop, Nutch will be able to fully utilize the server, but it will still
>> be limited to crawling from one machine, which is only sufficient for
>> small/slow crawls.
>>
>>> -----Original Message-----
>>> From: Kaliyug Antagonist [mailto:[email protected]]
>>> Sent: 23 February 2018 23:16
>>> To: [email protected]
>>> Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
>>>
>>> Ohh. I'm a bit confused. Which of the following is true in 'deploy' mode:
>>> 1. Data cannot be stored in Cassandra; HBase is the only way.
>>> 2. Data will be stored in Cassandra, but you need a (maybe just a
>>> single-node) Hadoop cluster anyway, which won't be storing any data but
>>> is there just to make Nutch happy.
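[Editor's note: Yossi's local-vs-deploy distinction maps onto the two launch
scripts in the Nutch 2.x build tree. A sketch follows; the NUTCH_HOME path is
taken from the thread (adjust to your checkout), and the crawl commands are
commented out since they need a real crawl setup.]

```shell
# The two ways to launch a Nutch 2.x crawl. NUTCH_HOME is an assumption
# based on the path shown later in this thread -- adjust to your build tree.
NUTCH_HOME=/home/apache-nutch-2.3.1

# Local mode: everything runs in a single JVM on one machine.
# No Hadoop cluster needed; fine for testing, too slow for large crawls.
#   "$NUTCH_HOME/runtime/local/bin/crawl" urls/ crawl/ 1

# Deploy mode: submits the crawl's Map/Reduce jobs to a Hadoop cluster,
# so the hadoop executable must be on the PATH of the submitting machine.
#   "$NUTCH_HOME/runtime/deploy/bin/crawl" urls/ crawl/ 1
```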
>>>
>>> On 23 Feb 2018 22:08, "Yossi Tamari" <[email protected]> wrote:
>>>
>>>> Hi Kaliyug,
>>>>
>>>> Nutch 2 still requires Hadoop to run; it just allows you to store data
>>>> somewhere other than HDFS.
>>>> The only way to run Nutch without Hadoop is local mode, which is only
>>>> recommended for testing. To do that, run ./runtime/local/bin/crawl.
>>>>
>>>> Yossi.
>>>>
>>>>> -----Original Message-----
>>>>> From: Kaliyug Antagonist [mailto:[email protected]]
>>>>> Sent: 23 February 2018 20:26
>>>>> To: [email protected]
>>>>> Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
>>>>>
>>>>> Windows 10, Nutch 2.3.1, Cassandra 3.11.1
>>>>>
>>>>> I have extracted and built Nutch under Cygwin's home directory.
>>>>>
>>>>> I believe that the Cassandra server is working:
>>>>>
>>>>> INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 -
>>>>> JOINING: Finish joining ring
>>>>> INFO  [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509 -
>>>>> Executing pre-join tasks for: CFS(Keyspace='test', ColumnFamily='test')
>>>>> INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node
>>>>> localhost/127.0.0.1 state jump to NORMAL
>>>>> INFO  [main] 2018-02-23 16:20:43,049 NativeTransportService.java:75 -
>>>>> Netty using Java NIO event loop
>>>>> INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty
>>>>> Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a,
>>>>> netty-codec=netty-codec-4.0.44.Final.452812a,
>>>>> netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a,
>>>>> netty-codec-http=netty-codec-http-4.0.44.Final.452812a,
>>>>> netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a,
>>>>> netty-common=netty-common-4.0.44.Final.452812a,
>>>>> netty-handler=netty-handler-4.0.44.Final.452812a,
>>>>> netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb,
>>>>> netty-transport=netty-transport-4.0.44.Final.452812a,
>>>>> netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a,
>>>>> netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a,
>>>>> netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a,
>>>>> netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
>>>>> INFO  [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting
>>>>> listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
>>>>> INFO  [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not
>>>>> starting RPC server as requested. Use JMX
>>>>> (StorageService->startRPCServer()) or nodetool (enablethrift) to
>>>>> start it
>>>>>
>>>>> I did the following check:
>>>>>
>>>>> apache-cassandra-3.11.1\bin>nodetool status
>>>>> Datacenter: datacenter1
>>>>> ========================
>>>>> Status=Up/Down
>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
>>>>> UN  127.0.0.1  273.97 KiB  256     100.0%            dab932f2-d138-4a1a-acd4-f63cbb16d224  rack1
>>>>>
>>>>> cqlsh connects:
>>>>>
>>>>> apache-cassandra-3.11.1\bin>cqlsh
>>>>>
>>>>> WARNING: console codepage must be set to cp65001 to support utf-8
>>>>> encoding on Windows platforms. If you experience encoding problems,
>>>>> change your console codepage with 'chcp 65001' before starting cqlsh.
>>>>>
>>>>> Connected to Test Cluster at 127.0.0.1:9042.
>>>>> [cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
>>>>> Use HELP for help.
>>>>> WARNING: pyreadline dependency missing. Install to enable tab completion.
>>>>> cqlsh> describe keyspaces
>>>>>
>>>>> system_schema  system_auth  system  system_distributed  test  system_traces
>>>>>
>>>>> I followed the tutorial 'Setting up NUTCH 2.x with CASSANDRA
>>>>> <https://wiki.apache.org/nutch/Nutch2Cassandra>' and added the
>>>>> respective entries in the properties and the XML files.
>>>>>
>>>>> I go to the Cygwin prompt and attempt to crawl.
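[Editor's note: for readers following along, the storage-related entries that
the Nutch2Cassandra tutorial has you add typically look like the sketch below.
This is from memory of that tutorial, not from the thread itself; verify the
property names against your own conf files and the wiki page.]

```xml
<!-- conf/nutch-site.xml: tell Gora to use the Cassandra backend -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>
<!-- conf/gora.properties additionally needs the server address, e.g.:
       gora.cassandrastore.servers=localhost:9160
     and the gora-cassandra dependency must be enabled in ivy/ivy.xml
     before rebuilding with ant. -->
```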
>>>>> Instead of using Cassandra, it asks for Hadoop (HBase, probably):
>>>>>
>>>>> /home/apache-nutch-2.3.1
>>>>> $ ./runtime/deploy/bin/crawl urls/ crawl/ 1
>>>>> No SOLRURL specified. Skipping indexing.
>>>>> which: no hadoop in (<dump of the classpath entries>)
>>>>> Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run
>>>>> in local mode.
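[Editor's note: the final error is the deploy-mode launcher failing to find the
hadoop binary on the PATH. A minimal fix sketch follows; the /opt/hadoop path
is hypothetical, so point HADOOP_HOME at your actual Hadoop install, or switch
to ./runtime/local/bin/crawl as Yossi suggested.]

```shell
# Hypothetical install location -- replace with your real Hadoop directory.
export HADOOP_HOME=/opt/hadoop
# Put the hadoop launcher on the PATH so runtime/deploy/bin/crawl can find it.
export PATH="$HADOOP_HOME/bin:$PATH"
# Then retry the deploy-mode crawl:
#   ./runtime/deploy/bin/crawl urls/ crawl/ 1
```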

