I use Nutch 1.X, so I can't really answer your question. However, the point of
Nutch 2.X is to replace HDFS with other storage options (via Gora). Map/Reduce is still required.
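To make the local-vs-deploy distinction discussed in the thread below concrete, here is a minimal shell sketch. The paths and crawl arguments are illustrative assumptions taken from the thread, not the actual internals of the bin/crawl script:

```shell
#!/bin/sh
# Sketch: choose local vs deploy mode the way the thread describes.
# NUTCH_HOME and the crawl arguments (urls/ crawl/ 1) are assumptions.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-2.3.1}"

if command -v hadoop >/dev/null 2>&1; then
  MODE=deploy   # hadoop on the PATH: Map/Reduce jobs go to the cluster
else
  MODE=local    # no hadoop found: single-JVM local mode, for testing only
fi

echo "$NUTCH_HOME/runtime/$MODE/bin/crawl urls/ crawl/ 1"
```

The deploy-mode script refuses to run without a hadoop executable on the PATH, which is exactly the error reported further down in this thread.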


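For reference, pointing Nutch 2.X at Cassandra is done through Gora configuration. A sketch, with key names as given on the Nutch2Cassandra wiki page referenced below; verify them against your Nutch/Gora version:

```properties
# conf/gora.properties -- sketch, assuming the Gora Cassandra backend
# bundled with Nutch 2.3.1; verify key names for your Gora version.
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160
```

In addition, conf/nutch-site.xml needs a storage.data.store.class property set to org.apache.gora.cassandra.store.CassandraStore, and the Cassandra dependency must be enabled in ivy/ivy.xml before rebuilding.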
> -----Original Message-----
> From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> Sent: 23 February 2018 23:49
> To: user@nutch.apache.org
> Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> 
> So what's the whole point of supporting Cassandra or other databases (via
> Gora) if Hadoop (HDFS and MR) is essential anyway? What exactly would
> Cassandra be doing?
> 
> On 23 Feb 2018 22:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> 
> > 1 is not true.
> > 2 is true, if we ignore the second part 😊
> > Hadoop is made of two parts: distributed storage (HDFS) and a
> > Map/Reduce framework. Nutch is essentially a collection of Map/Reduce
> > tasks. It relies on Hadoop to distribute these tasks to all
> > participating servers. So if you run in local mode, you can only use
> > one server. If you have a single-node Hadoop, Nutch will be able to
> > fully utilize the server, but it will still be limited to crawling
> > from one machine, which is only sufficient for small/slow crawls.
> >
> > > -----Original Message-----
> > > From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> > > Sent: 23 February 2018 23:16
> > > To: user@nutch.apache.org
> > > Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> > >
> > > Ohh, I'm a bit confused. Which of the following is true in 'deploy' mode:
> > > 1. Data cannot be stored in Cassandra; HBase is the only way.
> > > 2. Data will be stored in Cassandra, but you need a (maybe just a
> > > single-node) Hadoop cluster anyway, which won't be storing any data
> > > but is there just to make Nutch happy.
> > >
> > > On 23 Feb 2018 22:08, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> > >
> > > > Hi Kaliyug,
> > > >
> > > > Nutch 2 still requires Hadoop to run; it just allows you to store
> > > > data somewhere other than HDFS.
> > > > The only way to run Nutch without Hadoop is local mode, which is
> > > > only recommended for testing. To do that, run ./runtime/local/bin/crawl.
> > > >
> > > >         Yossi.
> > > >
> > > > > -----Original Message-----
> > > > > From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> > > > > Sent: 23 February 2018 20:26
> > > > > To: user@nutch.apache.org
> > > > > Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
> > > > >
> > > > > Windows 10, Nutch 2.3.1, Cassandra 3.11.1
> > > > >
> > > > > I have extracted and built Nutch under the Cygwin home directory.
> > > > >
> > > > > I believe that the Cassandra server is working:
> > > > >
> > > > > INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 - JOINING: Finish joining ring
> > > > > INFO  [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509 - Executing pre-join tasks for: CFS(Keyspace='test', ColumnFamily='test')
> > > > > INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node localhost/127.0.0.1 state jump to NORMAL
> > > > > INFO  [main] 2018-02-23 16:20:43,049 NativeTransportService.java:75 - Netty using Java NIO event loop
> > > > > INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a, netty-codec=netty-codec-4.0.44.Final.452812a, netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a, netty-codec-http=netty-codec-http-4.0.44.Final.452812a, netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a, netty-common=netty-common-4.0.44.Final.452812a, netty-handler=netty-handler-4.0.44.Final.452812a, netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb, netty-transport=netty-transport-4.0.44.Final.452812a, netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a, netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a, netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a, netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
> > > > > INFO  [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
> > > > > INFO  [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
> > > > >
> > > > > I did the following check:
> > > > >
> > > > > apache-cassandra-3.11.1\bin>nodetool status
> > > > > Datacenter: datacenter1
> > > > > ========================
> > > > > Status=Up/Down
> > > > > |/ State=Normal/Leaving/Joining/Moving
> > > > > --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> > > > > UN  127.0.0.1  273.97 KiB  256     100.0%            dab932f2-d138-4a1a-acd4-f63cbb16d224  rack1
> > > > >
> > > > > cqlsh connects:
> > > > >
> > > > > apache-cassandra-3.11.1\bin>cqlsh
> > > > >
> > > > > WARNING: console codepage must be set to cp65001 to support utf-8 encoding on Windows platforms.
> > > > > If you experience encoding problems, change your console codepage with 'chcp 65001' before starting cqlsh.
> > > > >
> > > > > Connected to Test Cluster at 127.0.0.1:9042.
> > > > > [cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4] Use HELP for help.
> > > > > WARNING: pyreadline dependency missing.  Install to enable tab completion.
> > > > > cqlsh> describe keyspaces
> > > > >
> > > > > system_schema  system_auth  system  system_distributed  test  system_traces
> > > > >
> > > > > I followed the tutorial 'Setting up NUTCH 2.x with CASSANDRA
> > > > > <https://wiki.apache.org/nutch/Nutch2Cassandra>' and added the
> > > > > respective entries in the properties and the XML files.
> > > > >
> > > > > I go to the Cygwin prompt and attempt to crawl. Instead of using
> > > > > Cassandra, it asks for Hadoop (HBase, probably):
> > > > >
> > > > > /home/apache-nutch-2.3.1
> > > > > $ ./runtime/deploy/bin/crawl urls/ crawl/ 1
> > > > > No SOLRURL specified. Skipping indexing.
> > > > > which: no hadoop in (<dump of the classpath entries>)
> > > > > Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> >
> >
