Thread-safety issues with Nutch language detector

2010-03-28 Thread asaf halfon
Hi, I'm new to Nutch. I'm using the latest version (1.0),
and I'm getting these errors a lot:

java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
    at java.util.HashMap$ValueIterator.next(HashMap.java:822)
    at org.apache.nutch.analysis.lang.NGramProfile.normalize(NGramProfile.java:249)
    at org.apache.nutch.analysis.lang.NGramProfile.analyze(NGramProfile.java:216)
    at org.apache.nutch.analysis.lang.LanguageIdentifier.identify(LanguageIdentifier.java:387)
    at org.apache.nutch.analysis.lang.LanguageIdentifier.identify(LanguageIdentifier.java:367)
    at translation.apache.nutch.NutchLanguageIdentifier.detect(NutchLanguageIdentifier.java:25)


and

java.lang.NullPointerException
    at org.apache.nutch.analysis.lang.NGramProfile.getSorted(NGramProfile.java:266)
    at org.apache.nutch.analysis.lang.LanguageIdentifier.identify(LanguageIdentifier.java:388)
    at org.apache.nutch.analysis.lang.LanguageIdentifier.identify(LanguageIdentifier.java:367)
    at translation.apache.nutch.NutchLanguageIdentifier.detect(NutchLanguageIdentifier.java:25)

what should I do?
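
Both traces point at the same root cause: NGramProfile keeps plain HashMaps, and one thread is iterating them while another thread modifies them, which is what happens when a single LanguageIdentifier instance is shared between threads. Below is a minimal workaround sketch; it assumes the Nutch 1.0 language-identifier plugin exposes a LanguageIdentifier(Configuration) constructor and an identify(String) method, and the wrapper class and its detect() method are purely illustrative, not part of Nutch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.analysis.lang.LanguageIdentifier;
    import org.apache.nutch.util.NutchConfiguration;

    // Sketch: give every thread its own LanguageIdentifier so that
    // NGramProfile's HashMaps are never read and modified concurrently.
    public class PerThreadLanguageDetector {

        private final ThreadLocal<LanguageIdentifier> identifier;

        public PerThreadLanguageDetector(final Configuration conf) {
            this.identifier = new ThreadLocal<LanguageIdentifier>() {
                @Override
                protected LanguageIdentifier initialValue() {
                    // one private instance per calling thread
                    return new LanguageIdentifier(conf);
                }
            };
        }

        public String detect(String text) {
            // identify(String) only ever touches this thread's instance
            return identifier.get().identify(text);
        }

        public static void main(String[] args) {
            PerThreadLanguageDetector d =
                    new PerThreadLanguageDetector(NutchConfiguration.create());
            System.out.println(d.detect("This is clearly an English sentence."));
        }
    }

The simpler (but slower) alternative is to synchronize every call into one shared instance, at the cost of serializing language detection across threads.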


Re: Fw: Blocked nutch spider accessing pages

2007-12-11 Thread Ricardo J. Méndez
Nutch-agent is a mailing list related to the usage of Nutch as a search
agent, not a person.  The reason your messages are showing up on Nabble is
because they're being sent to a public list that is indexed by many sites.


-- Ricardo


Re: Fw: Blocked nutch spider accessing pages

2007-12-10 Thread Martin Kuen
Hi,

a few things should be said in order to clarify the situation:

1. Nutch is NOT A SERVICE. Nutch is a free software project which is
subject to the Apache 2.0 license.
2. Nutch can be seen as a TOOLKIT for building a search application. To
create the search index a spider (the Nutch spider) may be used.
3. The software (the Nutch spider) forces a given user to supply a
customized agent name using a configuration file. Without modifying
the source code it is not possible to advertise only Nutch as the
agent name; it would be something like me/Nutch or you/Nutch. (A sketch
of the relevant nutch-site.xml entries follows this list.)
4. It is a pity that somebody is using this software in this way.
However, if this is bothering you that much, you will have to take
steps against the person/party sending the spider to your domain
(IP address?).
5. Unfortunately the Nutch spider is sometimes (too often) used as a
site-scraping tool. The spider can be used without the search/index
capabilities of Nutch.
6. A properly configured Nutch robot will obey your robots.txt file.
By "properly" I mean configured as intended.
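
For reference, here is a rough sketch of the kind of conf/nutch-site.xml overrides point 3 refers to; the property names come from 0.9-era nutch-default.xml, and all values shown are placeholders, not a recommendation:

    <?xml version="1.0"?>
    <!-- conf/nutch-site.xml: identify your crawler; all values are placeholders -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>YourCrawlerName</value>
        <description>The name part of the advertised agent string,
        e.g. YourCrawlerName/Nutch-0.9.</description>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://example.org/crawler-info.html</value>
        <description>A page describing your crawl, so site owners know whom to contact.</description>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>crawler-admin@example.org</value>
        <description>A contact address shown to webmasters.</description>
      </property>
    </configuration>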

citation:
Thank you for your reply regarding the above and for any additional
information you can supply regarding steps that can be taken to block
Nutch once and for all from spidering my domain.
Well you could take your server offline ;). I really don't want to
insult you, but that's the only solution. Next time somebody will
modify wget to show the same kind of misbehaviour. Nutch is like
giving TNT-sticks to children (quote).

citation:
Please advise me how to block it permanently from my domain or I will
seek avenues to report your spider for its intrusive behaviour to the
major search engines, possibly resulting in your domain's removal from
their listings.
I really don't want to comment on that one . . .


However, regarding your site - I want to point out something:
citation:
ALL pages on blue-candy.com are copyright protected. Copying of any
page for any use is not allowed.
First, looking at your front page I found the following meta tag:
<meta name="robots" content="index,follow">
Ahm . . . well, this invites any robot to index and copy your site's content. At
least add noarchive to it, i.e. <meta name="robots" content="index,follow,noarchive">
(the page is still indexed, but a cached copy of the page itself is not stored).
Second, you should add a rel="nofollow" attribute to a link if you want a robot
not to follow it (similarly for images . . . )

You're not alone:
http://johannburkard.de/blog/www/spam/this-much-nutch-is-too-much-nutch.html

The people developing Nutch are serious people. Sorry that you are the
victim of some . . . well . . . script kiddie. Probably some people
are more comfortable with modifying existing, well-behaved code than
with using a mouse (or a download manager?).

I am just an individual and cannot/must not speak for the Apache
organisation. I am not affiliated in any way with them. This is just
my own private opinion.


Just my two cents,

Martin


PS: I hope your request for removal of your messages is approved



On Dec 7, 2007 9:55 PM, bluebrit [EMAIL PROTECTED] wrote:
 Hi,

 I just saw that my emails to you appear on this page 
 http://www.nabble.com/Fw:-Blocked-nutch-spider-accessing-pages-t4877480.html

 It was not my intent for these emails to be made available for everybody. 
 This was a personal email between myself and you and was considered private.

 Please remove all details regarding me from your database and remove the 
 emails from the above domain and relevant pages.

 For your additional information in response to your reply on this page 
 however, you state.

 Hi

 I am sory for your bandwidth consumption but I do not think that is my
 spider. You are right about robot.txt files. I did not read them because I
 do not know existence of that file. Thank you for advice. But I never send
 my spider to your domain. I crawled Only small amount of host and none of
 them about your host. How can you be sure about that is my spider. Please
 warn me if the problem continues and you sure about the spider.

 Thank you

 I can be sure it is your spider because my log file detects you and gives me 
 a link with a live url that I can follow. The link is live in this email at 
 the moment and it was in the past emails but I see you have removed the link 
 on the above page.

  Nutch 608+203 3.02 MB 07 Dec 2007 - 01:57


 If this isn't you, then you should have a safeguard in place that stops this 
 happening or gives each user a distinct number that has to remain in the 
 software for it to work. At least that way a method of tracking the users 
 would be available.

 I also think your comment regarding your not knowing of the existence of 
 robots.txt files is a bit unrealistic, as you obviously have the knowledge to use 
 software such as this.

 At the end of the day, there is no reason for software such as this to ignore 
 a robots.txt file unless it is attempting something it shouldn't.

 Please take note of my above request to remove these emails from your pages 
 as I can see a future problem coming from this where my address is harvested 
 and I end up

Fw: Blocked nutch spider accessing pages

2007-11-26 Thread bluebrit
I sent the original email below to you two weeks ago and have had no reply, and as you 
can see my domain is still being crawled by your spider.
Please advise me how to block it permanently from my domain or I will seek 
avenues to report your spider for its intrusive behaviour to the major search 
engines, possibly resulting in your domain's removal from their listings.


  Nutch 1733+29 10.79 MB 23 Nov 2007 - 11:06 



regards
owner blue-candy.com


- Original Message - 
From: bluebrit 
To: nutch-agent@lucene.apache.org 
Sent: Monday, November 12, 2007 12:43 PM
Subject: Blocked nutch spider accessing pages


Hello, I am writing this email to you because of the following.

Blocked spider in robots.txt found in log file.

User-agent: Nutch
Disallow: /

To date this month Nutch has appeared in my site log an unreasonable number of 
times, bearing in mind it is supposed to be blocked. It is obvious that your 
spider is not reading the robots.txt file, and as my domain contains a copyright 
warning, can I assume you will be able to ensure that your spider, or the user 
of your spider, will stop the repeated visits and possible copying of text / 
graphics as well?

Below is a copy of log entries from the last six months which, although not large 
in bandwidth usage, do constitute a problem as they seem to show increasing 
demand.

  NutchCVS 588+8 2.37 MB 28 Jun 2007 - 06:43 

  Nutch 807+21 2.25 MB 28 Jun 2007 - 15:46 


  Nutch 324+223 1.35 MB 31 Jul 2007 - 04:11 


  Nutch 105+18 657.46 KB 31 Aug 2007 - 18:41 

  NutchCVS 712+12 2.73 MB 15 Aug 2007 - 00:38 


  Nutch 42+13 315.86 KB 30 Sep 2007 - 04:34 


  Nutch 30+12 182.74 KB 24 Oct 2007 - 19:56 


  Nutch 977+15 6.87 MB 08 Nov 2007 - 22:41 


My domain is http://www.blue-candy.com

Please note this is an adult domain and ALL of the images / video clips are 
also copyright protected by the sponsoring companies.

Thank you for your reply regarding the above and for any additional information 
you can supply regarding steps that can be taken to block Nutch once and for 
all from spidering my domain.

Regards
Owner blue-candy.com



Blocked nutch spider accessing pages

2007-11-14 Thread bluebrit
Hello, I am writing this email to you because of the following.

Blocked spider in robots.txt found in log file.

User-agent: Nutch
Disallow: /

To date this month Nutch has appeared in my site log an unreasonable number of 
times, bearing in mind it is supposed to be blocked. It is obvious that your 
spider is not reading the robots.txt file, and as my domain contains a copyright 
warning, can I assume you will be able to ensure that your spider, or the user 
of your spider, will stop the repeated visits and possible copying of text / 
graphics as well?

Below is a copy of log entries from the last six months which, although not large 
in bandwidth usage, do constitute a problem as they seem to show increasing 
demand.

  NutchCVS 588+8 2.37 MB 28 Jun 2007 - 06:43 

  Nutch 807+21 2.25 MB 28 Jun 2007 - 15:46 


  Nutch 324+223 1.35 MB 31 Jul 2007 - 04:11 


  Nutch 105+18 657.46 KB 31 Aug 2007 - 18:41 

  NutchCVS 712+12 2.73 MB 15 Aug 2007 - 00:38 


  Nutch 42+13 315.86 KB 30 Sep 2007 - 04:34 


  Nutch 30+12 182.74 KB 24 Oct 2007 - 19:56 


  Nutch 977+15 6.87 MB 08 Nov 2007 - 22:41 


My domain is http://www.blue-candy.com

Please note this is an adult domain and ALL of the images / video clips are 
also copyright protected by the sponsoring companies.

Thank you for your reply regarding the above and for any additional information 
you can supply regarding steps that can be taken to block Nutch once and for 
all from spidering my domain.

Regards
Owner blue-candy.com



Latest step by Step Installation guide for dummies: Nutch 0.9.

2007-10-23 Thread Peter Wang
Hi,

I have finished a detailed "Latest Step by Step Installation Guide for
Dummies: Nutch 0.9":
http://www.thechristianlife.com/z/NutchGuideForDummies.htm

Please add this link to the homepage if it is good enough to be shared.
Thanks!

Peter


Re: New to nutch, seem to be problems

2007-08-30 Thread misc


Hello-

   One more important piece of data about the problems that I am having. 
After waiting a really long time, I learned that fetch is not hung up; it 
was just really slow.  It took only a few hours to go through all the 
urls (the corresponding lines for each url appear in hadoop.log, and 
all the content was loaded).  Then it took 24 hours of waiting before the 
phrase "fetcher done" appeared.  Then fetch returned.  Why would fetch hang 
after the crawl was done, before returning?


   Looking at the code it would seem that some of the fetcher threads must 
be stuck for a long time.  Don't these time out?


   thanks


- Original Message - 
From: misc [EMAIL PROTECTED]

To: nutch-agent@lucene.apache.org
Sent: Wednesday, August 29, 2007 5:31 PM
Subject: Re: New to nutch, seem to be problems




Hello-

   I will reply to my own post with new findings and observations.  About 
the slowness of generate, I just don't believe that it should take many 
hours to generate a list (of any size) on a database that is a couple million 
entries large.  I could do the equivalent on plain text lists using grep, sort, 
and uniq in just minutes.  I *must* be doing something wrong.


   I dug into it today.  Could someone correct me if I am wrong on any of 
this?  I couldn't find any written information about this anywhere.


   1. The generate seems to be broken into three phases, each a separate 
mapreduce command.  The first phase runs through all the urls in the 
crawldb and throws out any that aren't eligible for crawling (by 
crawl date).


   2. The second phase partitions the urls by hostname and ranks them according 
to frequency.  It also cuts out repeat requests to a host if the number is 
too high (set by a parameter), and then sorts the urls by frequency.  (A 
rough sketch of this host-partitioning idea follows item 3 below.)


   3. The third phase updates the database with the information that the 
url is being crawled and should not be handed out to anyone else.
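
   As a rough illustration of what the second phase's partitioning buys (this is 
not Nutch's actual partitioning code, just the idea), mapping every url of a 
given host to the same partition is what makes a per-host cap enforceable:

    import java.net.URL;

    // Sketch of partition-by-host: all urls of one host land in the
    // same partition, so a per-host limit can be applied there.
    public class HostPartitionSketch {

        static int partitionFor(String url, int numPartitions) throws Exception {
            String host = new URL(url).getHost().toLowerCase();
            // stable, non-negative hash of the host name
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(partitionFor("http://example.org/a.html", 4));
            System.out.println(partitionFor("http://example.org/b.html", 4)); // same partition
        }
    }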


   By observing what was going on, I could see that the first phase seems 
to take a couple of hours.  I can change the log level of nutch to debug 
and see all the rejected urls being generated, and it does seem to be 
slow, a couple per second (my db has about 200k crawled things, and about 
200 uncrawled, so about 1 in 10 should be rejected).  How can nutch 
only be going at a rate of about 20 per second?  This is way too slow.


   I also looked to see if DNS lookups were slowing me down, but as far as 
I can tell they are not: first, because the first phase doesn't even do DNS, yet 
is slow, and second, because I used Wireshark to look for DNS lookups and 
found none.


   Can someone tell me the expected time for generate to run?  6 hours is 
too long!


   thanks
   -J


- Original Message - 
From: misc [EMAIL PROTECTED]

To: nutch-agent@lucene.apache.org
Sent: Tuesday, August 28, 2007 6:27 PM
Subject: New to nutch, seem to be problems


Hello-

   My configuration and stats are at the end of this email.  I have set up 
nutch to crawl 100,000 urls.  The first pass (of 100,000 items) went well, 
but problems started after this.


   1. Generate takes many hours to complete.  It doesn't matter whether I 
generate 1 million or 1000 items, it takes about 5 hours to complete.  Is 
this normal?


   2. Fetch works great, until it is done.  It then freezes up 
indefinitely.  It can fetch 100 pages in about 12 hours, and all the 
fetched content is in /tmp, but then it just sits there, not returning to 
the command line.  I have let it sit for about 12 hours and eventually 
broke down and cancelled it.  If I try to update the database it of course 
fails.


   3. Fetch2 runs very slowly: even though I am using 80 threads, I only 
download about one object every few seconds (1 every 5 or 10 seconds).  From 
the log, I can see that almost always 79 or 80 threads are spinWaiting.


   4. I can't tell if fetch2 freezes like fetch does, as I haven't been 
able to wait the many days it will take to go through a full fetch with 
fetch2.


Configuration:

   Core Duo 2.4 GHz, 1 GB RAM, 750 GB hard drive.

   The ethernet connection is a dedicated 1 Gbit link to the web, so 
certainly that isn't the problem.


   I have tested on nutch 0.9 and the newest daily build from 2007-08-28.

   I seeded with urls from the Open Directory, 100,000 of them.  I first ran a pass 
to load all 100,000, then took topN=1million (10 times larger than the 
first set of urls).  The first pass had no problem; the second pass (and 
beyond) is where the problems began.







New to nutch, seem to be problems

2007-08-28 Thread misc
Hello-

My configuration and stats are at the end of this email.  I have set up 
nutch to crawl 100,000 urls.  The first pass (of 100,000 items) went well, but 
problems started after this.

1. Generate takes many hours to complete.  It doesn't matter whether I 
generate 1 million or 1000 items, it takes about 5 hours to complete.  Is this 
normal?

2. Fetch works great, until it is done.  It then freezes up indefinitely.  
It can fetch 100 pages in about 12 hours, and all the fetched content is in 
/tmp, but then it just sits there, not returning to the command line.  I have 
let it sit for about 12 hours and eventually broke down and cancelled it.  If I 
try to update the database it of course fails.

3. Fetch2 runs very slowly: even though I am using 80 threads, I only 
download about one object every few seconds (1 every 5 or 10 seconds).  From the 
log, I can see that almost always 79 or 80 threads are spinWaiting.

4. I can't tell if fetch2 freezes like fetch does, as I haven't been able 
to wait the many days it will take to go through a full fetch with fetch2.

Configuration:

Core Duo 2.4 GHz, 1 GB RAM, 750 GB hard drive.

The ethernet connection is a dedicated 1 Gbit link to the web, so 
certainly that isn't the problem.

I have tested on nutch 0.9 and the newest daily build from 2007-08-28.

I seeded with urls from the Open Directory, 100,000 of them.  I first ran a pass to 
load all 100,000, then took topN=1million (10 times larger than the first 
set of urls).  The first pass had no problem; the second pass (and beyond) is 
where the problems began.



Help with nutch

2007-03-28 Thread james redden

When I run nutch as a standard user the crawl barfs at 893 MB (1021 MB of disk
space used in total). Java jumps to 100% CPU time at this point. I have no
quotas or limits on this user account. A limitation in Java? Has anyone seen
this issue before?

Thanks in advance.