Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-09-24 Thread Meraj A. Khan
Folks,

As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
cluster.

In order to scale, I need to fetch concurrently with multiple map tasks on
multiple nodes. I think the first step is to generate multiple segments in
the Generate phase so that multiple fetch map tasks can operate in parallel.
To generate multiple segments at Generate time I have made the following
changes, but unfortunately I have been unsuccessful.

To do so, I tweaked bin/crawl: I added the *maxNumSegments* and
*numFetchers* parameters to the call to generate in the *bin/crawl* script,
as can be seen below.


$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter

(Here $numFetchers has a value of 15.)

*generate.max.count*, *generate.count.mode*, and *topN* are all at their
default values, meaning I am not providing any values for them.
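For context, these two properties are read from conf/nutch-site.xml (falling back to nutch-default.xml). A sketch of setting them explicitly; the values shown are, to my knowledge, the defaults (no per-host cap):

```xml
<!-- conf/nutch-site.xml (sketch; values shown are believed to be the defaults) -->
<property>
  <name>generate.max.count</name>
  <!-- -1 means no limit on the number of URLs per host/domain in a segment -->
  <value>-1</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- count URLs per "host" (alternative: "domain") -->
  <value>host</value>
</property>
```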

Also, the crawldb status before the Generate phase is shown below; the
number of unfetched URLs is more than *75 million*, so it is not the case
that there are too few URLs for Generate to produce multiple segments.

 CrawlDB status
 db_fetched=318708
 db_gone=4774
 db_notmodified=2274
 db_redir_perm=2253
 db_redir_temp=2527
 db_unfetched=7524

However, I consistently see this message in the logs during the Generate
phase:

 Generator: jobtracker is 'local', generating exactly one partition.

Is this "one partition" referring to the single segment that is going to be
generated? If so, how do I address it?


I feel like I have exhausted all the options, but I am unable to get the
Generate phase to produce more than one segment at a time.

Can someone let me know if there is anything else I should be trying here?
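One hedged observation on the log line quoted above: the Generator forces a single partition whenever Hadoop reports the job tracker as 'local', i.e. when the job runs in local mode rather than being submitted to the cluster, in which case -numFetchers is effectively ignored. That usually means the Hadoop configuration on the classpath does not point at the cluster. A sketch of the relevant setting, assuming Hadoop 2.x / YARN (the property name differs on Hadoop 1.x, where it is mapred.job.tracker):

```xml
<!-- conf/mapred-site.xml (sketch, assuming Hadoop 2.x / YARN) -->
<property>
  <name>mapreduce.framework.name</name>
  <!-- if this resolves to "local", the Generator emits exactly one partition -->
  <value>yarn</value>
</property>
```

Submitting the crawl via the deploy-mode Nutch job jar, so the cluster configuration is actually picked up, is typically the other half of this.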

*Thanks and any help is much appreciated!*


Re: Apache nutch 1.9 error - Input path does not exist

2014-09-24 Thread atawfik
Hi,

To query the core, you need to provide the full request URL, not just the
core path. For instance, you can retrieve everything with
http://127.0.0.1:8983/solr/collection1/select?q=*:*. You can read more about
this on the Solr site.
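To illustrate how such a select URL is assembled (a minimal sketch; the helper name and defaults are my own, only the host, port, and core name come from this thread):

```python
from urllib.parse import urlencode

def solr_select_url(base, core, query="*:*", rows=10):
    """Build a Solr select URL for the given core and query string."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base.rstrip('/')}/{core}/select?{params}"

# '*:*' is URL-encoded to %2A%3A%2A by urlencode
print(solr_select_url("http://127.0.0.1:8983/solr", "collection1"))
# → http://127.0.0.1:8983/solr/collection1/select?q=%2A%3A%2A&rows=10
```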

If you want to explore the core, you can navigate to 
http://127.0.0.1:8983/solr/#/collection1

Regards
Ameer 
On Sep 24, 2014, at 11:49 PM, gsamsa [via Lucene] 
 wrote:

> Thanks for your answer! 
> 
> I immediately tried it; however, it gives me: 
> 
> 
> 
> Any recommendations on what I am doing wrong? 
> 
> Should I start Nutch with this URL (http://127.0.0.1:8983/solr/collection1/)? 
> 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918p4161001.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Apache nutch 1.9 error - Input path does not exist

2014-09-24 Thread atawfik
Hi,

Your Solr address is wrong. You should include the core name. In your case,
it will be http://127.0.0.1:8983/solr/collection1/

Regards
Ameer



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918p4160991.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Apache nutch 1.9 error - Input path does not exist

2014-09-24 Thread Jonathan Cooper-Ellis
Hello,

It looks like you're confusing the usage of bin/crawl with the old
bin/nutch crawl command. You want to start the crawl like this:

bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

So the script thinks "-solr" is your crawl directory (which does not
exist):

2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/testUser/Desktop/nutch-solr-example/apache-
nutch-1.9/-solr/segments/crawl_generate
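A concrete invocation matching the arguments from the original message might then look like this (a sketch, not verified here: the crawl directory name crawl/ is arbitrary, and the Solr URL includes the collection1 core name suggested elsewhere in this thread):

```shell
# <seedDir>=urls/  <crawlDir>=crawl/  <solrURL>=http://...  <numberOfRounds>=1
bin/crawl urls/ crawl/ http://127.0.0.1:8983/solr/collection1/ 1
```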

Hope that helps!

-jce



On Wed, Sep 24, 2014 at 9:36 AM, gsamsa  wrote:

> Hello guys,
>
> I have installed *apache nutch 1.9* and *solr 3.6.2*, which run on an
> ubuntu
> virtual machine in virtualbox.
>
> *Description of error*
>
>
> I start a crawl like that:
>
> *./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*
>

Apache nutch 1.9 error - Input path does not exist

2014-09-24 Thread gsamsa
Hello guys,

I have installed *Apache Nutch 1.9* and *Solr 3.6.2*, which run on an Ubuntu
virtual machine in VirtualBox.

*Description of error*


I start a crawl like this:

*./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*

However, I get the following error (this is my log from
`nutch/logs/hadoop.log`):

  

2014-09-24 14:39:46,252 INFO  crawl.Injector - Injector: starting at
2014-09-24 14:39:46
2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: crawlDb:
-solr/crawldb
2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: urlDir:
urls
2014-09-24 14:39:46,260 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2014-09-24 14:39:47,263 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-09-24 14:39:47,375 WARN  snappy.LoadSnappy - Snappy native
library not loaded
2014-09-24 14:39:49,076 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
2014-09-24 14:39:49,132 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
2014-09-24 14:39:50,001 INFO  crawl.Injector - Injector: Total
number of urls rejected by filters: 0
2014-09-24 14:39:50,002 INFO  crawl.Injector - Injector: Total
number of urls after normalization: 2
2014-09-24 14:39:50,003 INFO  crawl.Injector - Injector: Merging
injected urls into crawl db.
2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: overwrite:
false
2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: update:
false
2014-09-24 14:39:52,116 INFO  crawl.Injector - Injector: URLs
merged: 2
2014-09-24 14:39:52,136 INFO  crawl.Injector - Injector: Total new
urls injected: 0
2014-09-24 14:39:52,139 INFO  crawl.Injector - Injector: finished at
2014-09-24 14:39:52, elapsed: 00:00:05
2014-09-24 14:39:55,557 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-09-24 14:39:55,571 INFO  crawl.Generator - Generator: starting
at 2014-09-24 14:39:55
2014-09-24 14:39:55,574 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator:
filtering: false
2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator:
normalizing: true
2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: topN:
5
2014-09-24 14:39:58,013 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2014-09-24 14:39:58,044 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2014-09-24 14:39:58,291 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2014-09-24 14:39:58,370 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'generate_host_count', using default
2014-09-24 14:39:58,782 INFO  crawl.Generator - Generator:
Partitioning selected urls for politeness.
2014-09-24 14:39:59,785 INFO  crawl.Generator - Generator: segment:
-solr/segments/20140924143959
2014-09-24 14:40:00,313 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2014-09-24 14:40:01,032 INFO  crawl.Generator - Generator: finished
at 2014-09-24 14:40:01, elapsed: 00:00:05
2014-09-24 14:40:03,462 INFO  fetcher.Fetcher - Fetcher: starting at
2014-09-24 14:40:03
2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher: segment:
-solr/segments
2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher Timelimit
set for : 1411573203467
2014-09-24 14:40:04,207 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-09-24 14:40:04,301 ERROR security.UserGroupInformation -
PriviledgedActionException as:testUser
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist:
file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at
org.apache.ha

RE: DOCUMENTATION - Nutch and Hidden Services

2014-09-24 Thread Markus Jelsma
Hi - this is really awesome! Is there also a way to use different exit nodes 
for different fetchers or queues, or can you instruct it to change exit 
nodes regularly?
Markus

-Original message-
From: Lewis John Mcgibbney
Sent: Wednesday 24th September 2014 4:57
To: user@nutch.apache.org; d...@nutch.apache.org
Subject: DOCUMENTATION - Nutch and Hidden Services

Hi Folks,

I've added a document on crawling hidden-service .onion sites within the
Tor network.

The documentation is available on the Nutch wiki
https://wiki.apache.org/nutch/SetupNutchAndTor 


Hope some folks find this helpful.

Thank you to Roger Dingledine from Tor for his patience and expertise in the 
area.
Best
Lewis

-- 
Lewis