Re: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-24 Thread Sebastian Nagel
> So what's the whole point of supporting Cassandra or other databases (via
> Gora) if both parts of Hadoop (HDFS & MR) are essential? What exactly would
> Cassandra be doing?

Because it's a database, not a Hadoop map or sequence file, and those files
become unwieldy once they grow to hundreds of millions or billions of records.
That said, Nutch 1.x can crawl billions of pages, is more actively maintained,
and provides more features.

The only good argument to use 2.x would be to integrate/share crawled data
via Cassandra with other components of your infrastructure.

Cassandra stores the data; Hadoop runs the crawler and distributes the job
tasks.
You also need a little HDFS storage to hold and distribute the Nutch program
and to keep the log files.

Sebastian


RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Markus Jelsma
Hi,

If you want to stay clear of all 2.x caveats, use Nutch 1.x. If you want the
most stable and feature-rich version, use 1.x. If you want to limit the number
of wheels (Gora as DB abstraction, running and operating a separate DB server),
use 1.x. If you do not intend to crawl tens of millions of records, you are
fine running Nutch 1.x locally.

Regards,
Markus
 

RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
I use Nutch 1.X, so I can't really answer your question. However, the point of 
Nutch 2.X is to replace HDFS with other storage options. MR is still required.
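
As a rough illustration of that storage swap: Nutch 2.x reaches its backend through Gora, which is configured in conf/gora.properties. The sketch below writes the two properties commonly shown in Nutch 2.x Cassandra setup guides. Treat the property names, the class name, and especially the port as assumptions to check against your own Nutch and Gora versions. CONF_DIR is a hypothetical scratch directory, used so that nothing in a real installation is overwritten.

```shell
#!/bin/sh
# Sketch only: generate a gora.properties that selects Cassandra as the store.
# CONF_DIR is a hypothetical scratch location, not a path Nutch itself defines.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# Assumed property names and values; verify them against the gora.properties
# shipped with your Nutch 2.x build before copying anything.
cat > "$CONF_DIR/gora.properties" <<'EOF'
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160
EOF

# Show what was written so it can be compared with the real conf/ directory.
cat "$CONF_DIR/gora.properties"
```

Whichever port the Gora module expects must actually be listening. The Cassandra log in the original post shows the native CQL transport on 9042 and the Thrift RPC server not started. None of this removes the need for Hadoop MapReduce to run the jobs.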



RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Kaliyug Antagonist
So what's the whole point of supporting Cassandra or other databases (via
Gora) if both parts of Hadoop (HDFS & MR) are essential? What exactly would
Cassandra be doing?


RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
1 is not true.
2 is true, if we ignore the second part 😊
Hadoop is made of two parts: distributed storage (HDFS) and a Map/Reduce 
framework. Nutch is essentially a collection of Map/Reduce tasks. It relies on 
Hadoop to distribute these tasks to all participating servers. So if you run in 
local mode, you can only use one server. If you have a single-node Hadoop, 
Nutch will be able to fully utilize the server, but it will still be limited to 
crawling from one machine, which is only sufficient for small/slow crawls.


RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Kaliyug Antagonist
Ohh. I'm a bit confused. Which of the following is true in 'deploy' mode?
1. Data cannot be stored in Cassandra; HBase is the only way.
2. Data will be stored in Cassandra, but you need a (maybe just single-node)
Hadoop cluster anyway, which won't be storing any data but is there
just to make Nutch happy.


RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
Hi Kaliyug,

Nutch 2 still requires Hadoop to run; it just allows you to store data
somewhere other than HDFS.
The only way to run Nutch without Hadoop is local mode, which is only
recommended for testing. To do that, run ./runtime/local/bin/crawl.

Yossi.
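
The error quoted in the original message comes down to whether a hadoop executable can be found on the PATH. Here is a minimal sketch of that decision, assuming a POSIX shell. The function name pick_runtime and the NUTCH_HOME default are hypothetical, chosen for illustration; they are not part of Nutch's own scripts.

```shell
#!/bin/sh
# Sketch only: choose between the two runtime directories mentioned in this
# thread, depending on whether a `hadoop` executable is visible on the PATH.
# pick_runtime is a hypothetical helper, not something Nutch provides.
pick_runtime() {
  if command -v hadoop >/dev/null 2>&1; then
    echo deploy   # Hadoop found: the deploy-mode script can submit jobs
  else
    echo local    # no Hadoop: fall back to single-machine local mode
  fi
}

# NUTCH_HOME default is an assumption taken from the path shown in the post.
NUTCH_HOME="${NUTCH_HOME:-/home/apache-nutch-2.3.1}"

# Print the crawl script that would be used; nothing is executed here.
echo "$NUTCH_HOME/runtime/$(pick_runtime)/bin/crawl"
```

If the intent is deploy mode, the alternative is to install Hadoop and add HADOOP_HOME/bin to the PATH, as the error message itself suggests.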

> -Original Message-
> From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> Sent: 23 February 2018 20:26
> To: user@nutch.apache.org
> Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
> 
> Windows 10, Nutch 2.3.1, Cassandra 3.11.1
> 
> I have extracted and built Nutch under Cygwin's home directory.
> 
> I believe that the Cassandra server is working:
> 
> INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 - JOINING: Finish joining ring
> INFO  [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509 - Executing pre-join tasks for: CFS(Keyspace='test', ColumnFamily='test')
> INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node localhost/127.0.0.1 state jump to NORMAL
> INFO  [main] 2018-02-23 16:20:43,049 NativeTransportService.java:75 - Netty using Java NIO event loop
> INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a,
> netty-codec=netty-codec-4.0.44.Final.452812a,
> netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a,
> netty-codec-http=netty-codec-http-4.0.44.Final.452812a,
> netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a,
> netty-common=netty-common-4.0.44.Final.452812a,
> netty-handler=netty-handler-4.0.44.Final.452812a,
> netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb,
> netty-transport=netty-transport-4.0.44.Final.452812a,
> netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a,
> netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a,
> netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a,
> netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
> INFO  [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
> INFO  [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
> 
> I did the following check:
> 
> apache-cassandra-3.11.1\bin>nodetool status
> Datacenter: datacenter1
> 
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  273.97 KiB  256     100.0%            dab932f2-d138-4a1a-acd4-f63cbb16d224  rack1
> 
> cqlsh connects:
> 
> apache-cassandra-3.11.1\bin>cqlsh
> 
> WARNING: console codepage must be set to cp65001 to support utf-8 encoding on Windows platforms.
> If you experience encoding problems, change your console codepage with 'chcp 65001' before starting cqlsh.
> 
> Connected to Test Cluster at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
> Use HELP for help.
> WARNING: pyreadline dependency missing.  Install to enable tab completion.
> cqlsh> describe keyspaces
> 
> system_schema  system_auth  system  system_distributed  test  system_traces
> 
> I followed the tutorial 'Setting up NUTCH 2.x with CASSANDRA' and added the
> respective entries in the properties and the XML files.
> 
> I go to the Cygwin prompt and attempt to crawl. Instead of using Cassandra, it
> asks for Hadoop (HBase, probably):
> 
> /home/apache-nutch-2.3.1
> $ ./runtime/deploy/bin/crawl urls/ crawl/ 1
> No SOLRURL specified. Skipping indexing.
> which: no hadoop in ()
> Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.