Re: need your support

2010-01-20 Thread Mattmann, Chris A (388J)
Hi Sahar,

Can you post your:


 1.  crawl-urlfilter
 2.  nutch-site.xml

Also how are you running this program below?

I'm CC'ing nutch-user@ so the community can benefit from this thread.
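As a first guess: the LoginExceptions below usually mean Hadoop could not run the Unix whoami/groups commands, which is common when running Nutch 1.0 on Windows without cygwin on the PATH. The usual workaround (a sketch, assuming Hadoop 0.19's hadoop.job.ugi property; the user and group names below are placeholders) is to set the identity explicitly in nutch-site.xml:

<property>
  <name>hadoop.job.ugi</name>
  <!-- comma-separated user name and group name; both are placeholders here -->
  <value>sahar,users</value>
</property>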

Cheers,
Chris



On 1/20/10 1:42 PM, "sahar elkazaz"  wrote:


Dear sir,

I have followed all the steps in your article to run Nutch,

and I use this Java program to access the segments:

 package nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.Summary;
import org.apache.nutch.util.NutchConfiguration;
public class nutch {
  /** For debugging. */
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse("animal", conf);
    Hits hits = bean.search(query, 10);
    System.out.println("Total hits: " + hits.getTotal());
    int length = (int) Math.min(hits.getTotal(), 10);
    Hit[] show = hits.getHits(0, length);
    HitDetails[] details = bean.getDetails(show);
    Summary[] summaries = bean.getSummary(details, query);
    // print the details and summary of each hit
    for (int i = 0; i < length; i++) {
      System.out.println(" " + i + " " + details[i] + "\n" + summaries[i]);
    }
  }
}

When I run it, I get login exceptions like these:

at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:89)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:50)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO searcher.SearchBean: opening indexes in crawl/indexes
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:59)
at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:77)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:51)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO plugin.PluginRepository: Plugins: looking in: 
D:\nutch-1.0\plugins
10/01/20 22:29:28 INFO plugin.PluginRepository: Plugin Auto-activation mode: 
[true]
10/01/20 22:29:28 INFO plugin.PluginRepository: Registered Plugins:
10/01/20 22:29:28 INFO plugin.PluginRepository: the nutch core 
extension points (nutch-extensionpoints)
10/01/20 22:29:28 INFO plugin.PluginRepository: Basic Query Filter 
(query-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository: Basic URL Normalizer 
(urlnormalizer-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository: Html Parse Plug-in 
(parse-html)
10/01/20 22:29:28 INFO plugin.PluginRepository: Basic Indexing Filter 
(index-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository: Site Query Filter 
(

Re: Configuring nutch-site.xml

2010-01-20 Thread MilleBii
Well, it does not really work that way.

If you want to use HDFS, make it run on one node first (pseudo-distributed mode)
and then deploy.
If you have it running in pseudo-distributed mode, it won't use the local
filesystem, which is why I don't understand the remarks in your initial mail.

The Nutch logs are in NUTCH_HOME/logs; look at the hadoop log file, it will tell you
more or less what is happening (see the commands below).
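A quick sketch, assuming the stock Nutch 1.0 log4j setup that writes to logs/hadoop.log:

  cd $NUTCH_HOME
  # follow the crawl while it runs
  tail -f logs/hadoop.log
  # every URL the fetcher requests normally shows up as a "fetching ..." line
  grep fetching logs/hadoop.log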



2010/1/20 Santiago Pérez 

>
> I launch the hdfs because I want to make it work in one computer and when
> it
> works, launching in several as a distributed version.
>
> Which logs do you need to check?
>
>
> MilleBii wrote:
> >
> > Why do you launch hdfs if you don't want use it ?
> >
> > What are the logs saying,  all fetch urls are logged usually ? But
> > nothing is displaid
> >
> > 2010/1/20, Santiago Pérez :
> >>
> >> Hej,
> >>
> >> I am configuring Nutch for just crawling webs in several machines
> >> (currently
> >> I want to test with only one).
> >> Building Nutch with ant was successfully
> >>
> >>bin/hadoop namenode -format
> >>bin/start-all.sh
> >>
> >> They show correct logs
> >>
> >>   bin/hadoop dfs -put urls urls
> >>   bin/hadoop dfs -ls
> >>
> >> They show the urls directory correctly
> >>
> >> But when I launch it the fetcher starts but does not show any message of
> >> parsing and it stops in the second depth. The crawl-urlfilter and
> >> nutch-default are well configured because they work great using local
> >> filesystem (instead of hdfs). I guess it is because nutch-site is empty.
> >>
> >> What should be its content?
> >>
> >> core-site.xml:
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>   fs.default.name
> >>   hdfs://localhost:9000/
> >>   
> >> The name of the default file system. Either the literal string
> >> "local" or a host:port for NDFS.
> >>   
> >> 
> >>
> >> 
> >>
> >>
> >> ---
> >>
> >> hdfs-site.xml:
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>   dfs.name.dir
> >>   /root/filesystem/name
> >> 
> >>
> >> 
> >>   dfs.data.dir
> >>   /root/filesystem/data
> >> 
> >>
> >> 
> >>   dfs.replication
> >>   1
> >> 
> >>
> >> 
> >>
> >>
> >> ---
> >>
> >>
> >> mapred-site.xml:
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>   mapred.job.tracker
> >>   hdfs://localhost:9001/
> >>   
> >> The host and port that the MapReduce job tracker runs at. If
> >> "local", then jobs are run in-process as a single map and
> >> reduce task.
> >>   
> >> 
> >>
> >> 
> >>   mapred.map.tasks
> >>   2
> >>   
> >> define mapred.map tasks to be number of slave hosts
> >>   
> >> 
> >>
> >> 
> >>   mapred.reduce.tasks
> >>   2
> >>   
> >> define mapred.reduce tasks to be number of slave hosts
> >>   
> >> 
> >>
> >> 
> >>   mapred.system.dir
> >>   /root/filesystem/mapreduce/system
> >> 
> >>
> >> 
> >>   mapred.local.dir
> >>   /root/filesystem/mapreduce/local
> >> 
> >>
> >> 
> >> --
> >> View this message in context:
> >>
> http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27245750.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > -MilleBii-
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27248860.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
-MilleBii-


Redundancy issue in crawling

2010-01-20 Thread Ken Ken

Hello,

I am trying to save as much memory, CPU, and bandwidth as possible in case I have 
millions of URLs to crawl.


Here is my crawl-urlfilter.txt in the conf/ directory:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
-.


Assume my urls/seed file contains millions of URLs to fetch, crawl, generate, 
etc., and I don't want to go through millions of lines to review each one.
Let's say I have these lines in my urls/seed file (a file with millions of URLs):


http://apache.org
http://subdomain1.apache.org
http://subdomain2.apache.org
http://subdomain3.apache.org
http://subdomain4.apache.org
http://subdomain5.apache.org

Correct me if I am wrong, but if 'http://apache.org' already crawls the subdomains 
because of '+^http://([a-z0-9]*\.)*apache.org/' in crawl-urlfilter.txt, then wouldn't 
those sites be crawled more than once if I also have the subdomain lines in my 
urls/seed file?

I don't mind one or two duplicates, but it would waste a lot of CPU, memory, and 
bandwidth as my seed list continues to grow.

If that is indeed the issue, has anyone thought of a way to filter out all the 
subdomains in his/her urls/seed file?

I'm trying to find a way (maybe there is a better method) to search each line for 
more than one dot ".". If there is no other way, does anyone know how to find 
which lines have more than one "." in them using unix commands 
(vi, awk, sed)?
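Something along these lines is what I have in mind, as a rough sketch (it assumes one URL per line, only looks at the hostname, and would also flag 'www.' hosts, so it is not a complete answer):

  # print seed lines whose hostname contains more than one dot
  awk -F/ '{ if (split($3, a, ".") > 2) print }' urls/seed

  # roughly the same idea with grep
  grep -E '^https?://[^/]*\.[^/]*\.' urls/seed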

Thank you very much



  

Re: Configuring nutch-site.xml

2010-01-20 Thread Santiago Pérez

I launch HDFS because I want to make it work on one computer first and then, 
when it works, launch it on several machines as a distributed version.

Which logs do I need to check?


MilleBii wrote:
> 
> Why do you launch hdfs if you don't want use it ?
> 
> What are the logs saying,  all fetch urls are logged usually ? But
> nothing is displaid
> 
> 2010/1/20, Santiago Pérez :
>>
>> Hej,
>>
>> I am configuring Nutch for just crawling webs in several machines
>> (currently
>> I want to test with only one).
>> Building Nutch with ant was successfully
>>
>>bin/hadoop namenode -format
>>bin/start-all.sh
>>
>> They show correct logs
>>
>>   bin/hadoop dfs -put urls urls
>>   bin/hadoop dfs -ls
>>
>> They show the urls directory correctly
>>
>> But when I launch it the fetcher starts but does not show any message of
>> parsing and it stops in the second depth. The crawl-urlfilter and
>> nutch-default are well configured because they work great using local
>> filesystem (instead of hdfs). I guess it is because nutch-site is empty.
>>
>> What should be its content?
>>
>> core-site.xml:
>>
>> 
>>
>> 
>>
>> 
>>
>> 
>>   fs.default.name
>>   hdfs://localhost:9000/
>>   
>> The name of the default file system. Either the literal string
>> "local" or a host:port for NDFS.
>>   
>> 
>>
>> 
>>
>>
>> ---
>>
>> hdfs-site.xml:
>>
>> 
>>
>> 
>>
>> 
>>
>> 
>>   dfs.name.dir
>>   /root/filesystem/name
>> 
>>
>> 
>>   dfs.data.dir
>>   /root/filesystem/data
>> 
>>
>> 
>>   dfs.replication
>>   1
>> 
>>
>> 
>>
>>
>> ---
>>
>>
>> mapred-site.xml:
>>
>> 
>>
>> 
>>
>> 
>>
>> 
>>   mapred.job.tracker
>>   hdfs://localhost:9001/
>>   
>> The host and port that the MapReduce job tracker runs at. If
>> "local", then jobs are run in-process as a single map and
>> reduce task.
>>   
>> 
>>
>> 
>>   mapred.map.tasks
>>   2
>>   
>> define mapred.map tasks to be number of slave hosts
>>   
>> 
>>
>> 
>>   mapred.reduce.tasks
>>   2
>>   
>> define mapred.reduce tasks to be number of slave hosts
>>   
>> 
>>
>> 
>>   mapred.system.dir
>>   /root/filesystem/mapreduce/system
>> 
>>
>> 
>>   mapred.local.dir
>>   /root/filesystem/mapreduce/local
>> 
>>
>> 
>> --
>> View this message in context:
>> http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27245750.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> -MilleBii-
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27248860.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Configuring nutch-site.xml

2010-01-20 Thread MilleBii
Why do you launch HDFS if you don't want to use it?

What are the logs saying? All fetched URLs are usually logged, yet
nothing is displayed?

2010/1/20, Santiago Pérez :
>
> Hej,
>
> I am configuring Nutch for just crawling webs in several machines (currently
> I want to test with only one).
> Building Nutch with ant was successfully
>
>bin/hadoop namenode -format
>bin/start-all.sh
>
> They show correct logs
>
>   bin/hadoop dfs -put urls urls
>   bin/hadoop dfs -ls
>
> They show the urls directory correctly
>
> But when I launch it the fetcher starts but does not show any message of
> parsing and it stops in the second depth. The crawl-urlfilter and
> nutch-default are well configured because they work great using local
> filesystem (instead of hdfs). I guess it is because nutch-site is empty.
>
> What should be its content?
>
> core-site.xml:
>
> 
>
> 
>
> 
>
> 
>   fs.default.name
>   hdfs://localhost:9000/
>   
> The name of the default file system. Either the literal string
> "local" or a host:port for NDFS.
>   
> 
>
> 
>
>
> ---
>
> hdfs-site.xml:
>
> 
>
> 
>
> 
>
> 
>   dfs.name.dir
>   /root/filesystem/name
> 
>
> 
>   dfs.data.dir
>   /root/filesystem/data
> 
>
> 
>   dfs.replication
>   1
> 
>
> 
>
>
> ---
>
>
> mapred-site.xml:
>
> 
>
> 
>
> 
>
> 
>   mapred.job.tracker
>   hdfs://localhost:9001/
>   
> The host and port that the MapReduce job tracker runs at. If
> "local", then jobs are run in-process as a single map and
> reduce task.
>   
> 
>
> 
>   mapred.map.tasks
>   2
>   
> define mapred.map tasks to be number of slave hosts
>   
> 
>
> 
>   mapred.reduce.tasks
>   2
>   
> define mapred.reduce tasks to be number of slave hosts
>   
> 
>
> 
>   mapred.system.dir
>   /root/filesystem/mapreduce/system
> 
>
> 
>   mapred.local.dir
>   /root/filesystem/mapreduce/local
> 
>
> 
> --
> View this message in context:
> http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27245750.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
-MilleBii-


Configuring nutch-site.xml

2010-01-20 Thread Santiago Pérez

Hej,

I am configuring Nutch just for crawling websites on several machines (currently
I want to test with only one). 
Building Nutch with ant was successful.

   bin/hadoop namenode -format
   bin/start-all.sh

These show correct logs.

  bin/hadoop dfs -put urls urls
  bin/hadoop dfs -ls

These show the urls directory correctly.

But when I launch the crawl, the fetcher starts but does not show any parsing
messages, and it stops at the second depth. The crawl-urlfilter and
nutch-default files are well configured because they work great using the local
filesystem (instead of HDFS). I guess it is because nutch-site.xml is empty. 

What should its content be?
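(From the documentation I gather that nutch-site.xml only needs to hold overrides of nutch-default.xml; a minimal sketch of what I am considering, assuming Nutch 1.0 requires at least http.agent.name to be set before the fetcher will run, with a placeholder agent name:)

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>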

core-site.xml:

<?xml version="1.0"?>
<configuration>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000/</value>
  <description>
    The name of the default file system. Either the literal string 
    "local" or a host:port for NDFS.
  </description>
</property>

</configuration>



---

hdfs-site.xml:

<?xml version="1.0"?>
<configuration>

<property>
  <name>dfs.name.dir</name>
  <value>/root/filesystem/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/root/filesystem/data</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

</configuration>


---


mapred-site.xml:

<?xml version="1.0"?>
<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://localhost:9001/</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If 
    "local", then jobs are run in-process as a single map and 
    reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/root/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/root/filesystem/mapreduce/local</value>
</property>

</configuration>
-- 
View this message in context: 
http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27245750.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch 1.0 slow crawls

2010-01-20 Thread MilleBii
Well, politeness can still be the problem... If, for instance, you are
crawling blogs like wordpress or blogspot, they all have different URLs
but the same IP, so the fetcher will wait.
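For reference, a sketch of the politeness-related settings that would go in nutch-site.xml (the property names exist in stock Nutch 1.0; the values here are only illustrative):

<property>
  <name>fetcher.threads.per.host</name>
  <!-- how many fetcher threads may request from the same host at once -->
  <value>1</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <!-- seconds the fetcher waits between requests to the same host -->
  <value>5.0</value>
</property>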



2010/1/20, axi :
>
> Hi to all,
> I'm a novice user of Nutch, I have it on a debian machine, and I have probe
> the latest release of nutch 1.0 with very slow results in crawling, I have a
> 10 Megabytes/s connection and it only crawls at 300 Kb/s with peaks of 1
> Mb/s. I tweaked everything, dns, linux tcp settings, thread numbers, java
> conf etc.. but anything have effect. There are a lot of spin waiting threads
> there, only 10-20 of them working and I have injected 1M different hosts, so
> politeness is not the problem. I swithched back to 0.9 nutch, and then it
> works like a charm at good speeds 5-6 Mb/s with the bottleneck on machine
> cpu.
>
> ¿Are this issues solved on dev version of nutch or why is this happens?
>
> Thanks in advance,
>
> --
> View this message in context:
> http://old.nabble.com/Nutch-1.0-slow-crawls-tp27243302p27243302.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
-MilleBii-


Re: Nutch 1.0 slow crawls

2010-01-20 Thread Julien Nioche
Hi,

See https://issues.apache.org/jira/browse/NUTCH-721 for an
explanation. This has been fixed in the SVN version.

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2010/1/20 axi :
>
> Hi to all,
> I'm a novice user of Nutch, I have it on a debian machine, and I have probe
> the latest release of nutch 1.0 with very slow results in crawling, I have a
> 10 Megabytes/s connection and it only crawls at 300 Kb/s with peaks of 1
> Mb/s. I tweaked everything, dns, linux tcp settings, thread numbers, java
> conf etc.. but anything have effect. There are a lot of spin waiting threads
> there, only 10-20 of them working and I have injected 1M different hosts, so
> politeness is not the problem. I swithched back to 0.9 nutch, and then it
> works like a charm at good speeds 5-6 Mb/s with the bottleneck on machine
> cpu.
>
> ¿Are this issues solved on dev version of nutch or why is this happens?
>
> Thanks in advance,
>
> --
> View this message in context: 
> http://old.nabble.com/Nutch-1.0-slow-crawls-tp27243302p27243302.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


Nutch 1.0 slow crawls

2010-01-20 Thread axi

Hi to all,
I'm a novice user of Nutch. I have it on a debian machine, and I have tried
the latest release, Nutch 1.0, with very slow crawling results: I have a
10 Megabytes/s connection and it only crawls at 300 Kb/s with peaks of 1
Mb/s. I tweaked everything (dns, linux tcp settings, thread numbers, java
conf, etc.) but nothing has any effect. There are a lot of spin-waiting threads,
with only 10-20 of them working, and I have injected 1M different hosts, so
politeness is not the problem. I switched back to Nutch 0.9, and then it
works like a charm at good speeds of 5-6 Mb/s with the bottleneck on machine
cpu.

Are these issues solved in the dev version of Nutch, or why does this happen?

Thanks in advance,
 
-- 
View this message in context: 
http://old.nabble.com/Nutch-1.0-slow-crawls-tp27243302p27243302.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: How to change url score?

2010-01-20 Thread Julien Nioche
Hi,

The SVN version of Nutch has new functionality in the Injector
which allows you to specify the score of a URL (see
https://issues.apache.org/jira/browse/NUTCH-655).
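If I read NUTCH-655 correctly, the Injector now accepts optional tab-separated metadata after each seed URL, so a seed line and re-injection would look roughly like this (the nutch.score key name is taken from the JIRA issue, so double-check it against your checkout):

  http://xxx.com/    nutch.score=10.0

  bin/nutch inject crawl/crawldb urls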

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2010/1/20 xiao yang :
> I'm crawling a group of web sites for some time. Now I want to add a new
> site: http://xxx.com
> Here is the process:
>
> 1. put xxx.com into a file: urls, and put it on Hadoop
> 2. run bin/nutch crawl urls -dir crawl -depth 5 -threads 1 -topN 1000
>
> However, the newly added site is not crawled for its score is too low.
>
> URL: http://xxx.com/
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun Jan 17 14:59:08 CST 2010
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
>
> How can I change the score manually so this site will be included in the
> next crawl round?
>
> Thanks!
> Xiao
>


How to change url score?

2010-01-20 Thread xiao yang
I have been crawling a group of web sites for some time. Now I want to add a new
site: http://xxx.com
Here is the process:

1. put xxx.com into a file: urls, and put it on Hadoop
2. run bin/nutch crawl urls -dir crawl -depth 5 -threads 1 -topN 1000

However, the newly added site is not crawled because its score is too low.

URL: http://xxx.com/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Jan 17 14:59:08 CST 2010
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null

How can I change the score manually so this site will be included in the
next crawl round?

Thanks!
Xiao