RE: Nutch pointed to Cassandra, yet, asks for Hadoop

Markus Jelsma Fri, 23 Feb 2018 14:03:05 -0800

Hi,

If you want to stay clear of all 2.x caveats, use Nutch 1.x. If you want the 
most stable and feature-rich version, use 1.x. If you want to limit the number 
of moving parts (Gora as a DB abstraction, running and operating a separate DB 
server), use 1.x. If you do not intend to crawl tens of millions of records, 
you are fine running Nutch 1.x locally.
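
As a rough illustration of the local/deploy choice discussed in this thread, 
here is a minimal sketch that picks a runtime directory depending on whether a 
hadoop executable is on the PATH. The helper name choose_runtime and the 
NUTCH_HOME variable are assumptions made for this example, not part of Nutch; 
only the runtime/local and runtime/deploy layout comes from the thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): report which runtime to use.
# "deploy" needs a hadoop executable on the PATH; "local" does not.
choose_runtime() {
  if command -v hadoop >/dev/null 2>&1; then
    echo "deploy"
  else
    echo "local"
  fi
}

# NUTCH_HOME is an assumed variable; it defaults to the current directory.
NUTCH_HOME="${NUTCH_HOME:-.}"
mode="$(choose_runtime)"

# Only print the script path; the crawl arguments differ between versions.
echo "Nutch runtime to use: ${NUTCH_HOME}/runtime/${mode}/bin/crawl"
```

This only reports which script would be used. It does not start a crawl.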


Regards,
Markus
 
-----Original message-----
> From:Kaliyug Antagonist <[email protected]>
> Sent: Friday 23rd February 2018 22:48
> To: [email protected]
> Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> 
> So what's the whole point of supporting Cassandra or other databases (via
> Gora) if both parts of Hadoop (HDFS & MR) are essential? What exactly would
> Cassandra be doing?
> 
> On 23 Feb 2018 22:41, "Yossi Tamari" <[email protected]> wrote:
> 
> > 1 is not true.
> > 2 is true, if we ignore the second part.
> > Hadoop is made of two parts: distributed storage (HDFS) and a Map/Reduce
> > framework. Nutch is essentially a collection of Map/Reduce tasks. It relies
> > on Hadoop to distribute these tasks to all participating servers. So if you
> > run in local mode, you can only use one server. If you have a single-node
> > Hadoop, Nutch will be able to fully utilize the server, but it will still
> > be limited to crawling from one machine, which is only sufficient for
> > small/slow crawls.
> >
> > > -----Original Message-----
> > > From: Kaliyug Antagonist [mailto:[email protected]]
> > > Sent: 23 February 2018 23:16
> > > To: [email protected]
> > > Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> > >
> > > Ohh. I'm a bit confused. Which of the following is true in 'deploy' mode:
> > > 1. Data cannot be stored in Cassandra, HBase is the only way.
> > > 2. Data will be stored in Cassandra, but you need a Hadoop cluster (maybe
> > > just a single node) anyway, which won't be storing any data but is there
> > > just to make Nutch happy.
> > >
> > > On 23 Feb 2018 22:08, "Yossi Tamari" <[email protected]> wrote:
> > >
> > > > Hi Kaliyug,
> > > >
> > > > Nutch 2 still requires Hadoop to run; it just allows you to store data
> > > > somewhere other than HDFS.
> > > > The only way to run Nutch without Hadoop is local mode, which is only
> > > > recommended for testing. To do that, run ./runtime/local/bin/crawl.
> > > >
> > > >         Yossi.
> > > >
> > > > > -----Original Message-----
> > > > > From: Kaliyug Antagonist [mailto:[email protected]]
> > > > > Sent: 23 February 2018 20:26
> > > > > To: [email protected]
> > > > > Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
> > > > >
> > > > > Windows 10 Nutch 2.3.1 Cassandra 3.11.1
> > > > >
> > > > > > I have extracted and built Nutch under Cygwin's home directory.
> > > > >
> > > > > I believe that the Cassandra server is working:
> > > > >
> > > > > INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 -
> > > > > JOINING: Finish joining ring
> > > > > INFO  [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509
> > > > > - Executing pre-join tasks for: CFS(Keyspace='test',
> > > > > ColumnFamily='test')
> > > > > > INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node
> > > > > > localhost/127.0.0.1 state jump to NORMAL
> > > > > > INFO  [main] 2018-02-23 16:20:43,049 NativeTransportService.java:75 -
> > > > > > Netty using Java NIO event loop
> > > > > INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty
> > > > > Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a,
> > > > > netty-codec=netty-codec-4.0.44.Final.452812a,
> > > > > netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a,
> > > > > netty-codec-http=netty-codec-http-4.0.44.Final.452812a,
> > > > > netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a,
> > > > > netty-common=netty-common-4.0.44.Final.452812a,
> > > > > netty-handler=netty-handler-4.0.44.Final.452812a,
> > > > > netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb,
> > > > > netty-transport=netty-transport-4.0.44.Final.452812a,
> > > > > > netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a,
> > > > > netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a,
> > > > > netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a,
> > > > > netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
> > > > > > INFO  [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting
> > > > > > listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
> > > > > > INFO  [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not
> > > > > > starting RPC server as requested. Use JMX
> > > > > > (StorageService->startRPCServer()) or nodetool (enablethrift) to
> > > > > > start it
> > > > >
> > > > > I did the following check:
> > > > >
> > > > > apache-cassandra-3.11.1\bin>nodetool status
> > > > > Datacenter: datacenter1
> > > > > ========================
> > > > > Status=Up/Down
> > > > > |/ State=Normal/Leaving/Joining/Moving
> > > > > > --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> > > > > > UN  127.0.0.1  273.97 KiB  256     100.0%            dab932f2-d138-4a1a-acd4-f63cbb16d224  rack1
> > > > >
> > > > > > cqlsh connects:
> > > > >
> > > > > apache-cassandra-3.11.1\bin>cqlsh
> > > > >
> > > > > > WARNING: console codepage must be set to cp65001 to support utf-8
> > > > > > encoding on Windows platforms.
> > > > > > If you experience encoding problems, change your console codepage
> > > > > > with 'chcp 65001' before starting cqlsh.
> > > > >
> > > > > Connected to Test Cluster at 127.0.0.1:9042.
> > > > > > [cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
> > > > > > Use HELP for help.
> > > > > > WARNING: pyreadline dependency missing.  Install to enable tab
> > > > > > completion.
> > > > > cqlsh> describe keyspaces
> > > > >
> > > > > > system_schema  system_auth  system  system_distributed  test  system_traces
> > > > >
> > > > > > I followed the tutorial 'Setting up NUTCH 2.x with CASSANDRA
> > > > > > <https://wiki.apache.org/nutch/Nutch2Cassandra>' and added the
> > > > > > respective entries in the properties and the xml files.
> > > > >
> > > > > > I go to the Cygwin prompt and attempt to crawl. Instead of using
> > > > > > Cassandra, it asks for Hadoop (HBase, probably):
> > > > >
> > > > > /home/apache-nutch-2.3.1
> > > > > > $ ./runtime/deploy/bin/crawl urls/ crawl/ 1
> > > > > > No SOLRURL specified. Skipping indexing.
> > > > > > which: no hadoop in (<dump of the classpath entries>)
> > > > > > Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in
> > > > > > local mode.
> > > > >
> > > > >
> > > > >
> > > > > <http://www.avg.com/email-
> > > > > signature?utm_medium=email&utm_source=link&utm_campaign=sig-
> > > > > email&utm_content=webmail>
> > > > > Virus-free.
> > > > > www.avg.com
> > > > > <http://www.avg.com/email-
> > > > > signature?utm_medium=email&utm_source=link&utm_campaign=sig-
> > > > > email&utm_content=webmail>
> > > > > <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
> > > >
> > > >
> >
> >
