Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Michael Wechner

Doug Cutting wrote:
http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html 



well, I think incrediBILL has a point: people might really start
excluding bots from their servers if it becomes too much. What might
help is if incrediBILL offered an index of his site, which should be
smaller than the site itself. I am not sure whether a standard for
something like this exists. Basically, the bot would ask the server
whether an index exists, where it is located, and what date it is from,
and then the bot would either download the index or otherwise start
crawling the site.
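
Just to make that concrete, such a convention could look something like
this (purely hypothetical; as far as I know no such standard exists):

# robots.txt
Site-Index: http://www.example.com/site.idx
Site-Index-Date: 2006-06-14

The bot would download the index if the date is newer than its local
copy, and otherwise fall back to crawling.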

Michi

--
Michael Wechner
Wyona  -  Open Source Content Management  -  Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED]  [EMAIL PROTECTED]
+41 44 272 91 61



search speed

2006-06-15 Thread anton
I am using DFS. My index contains 3,706,249 documents. At present a
search takes from 2 to 4 seconds (I tested with a query of 3 search
terms). Tomcat runs on a box with dual Opteron 2.4 GHz CPUs and 16 GB
of RAM. I think search is very slow now.
Can we make search faster?
What factors influence search speed?





RE: search speed

2006-06-15 Thread Gal Nitzan
Hi,

DFS is too slow for search.

What we did was extract the index, linkdb, and segments to the local
FS, i.e. to the hard disk. Each machine has 2x300 GB HDs in RAID.

bin/hadoop dfs -get index /nutch/index
bin/hadoop dfs -get linkdb /nutch/linkdb
bin/hadoop dfs -get segments /nutch/segments

When we run out of disk space for the segments on one web server, we
add another web server, use mergesegs to split the segments, and use
the distributed search, wired up as sketched below.
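
For completeness, this is roughly how we tie the distributed search
together (from memory, so double-check the command and property names
against the docs; ports and paths are ours):

# on each search node, serve the local copies:
bin/nutch server 9999 /nutch

# on the web front end, point searcher.dir (nutch-site.xml) at a
# directory containing a search-servers.txt listing the nodes:
#   node1 9999
#   node2 9999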

HTH


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 10:09 AM
To: nutch-dev@lucene.apache.org
Subject: search speed

I am using DFS. My index contains 3,706,249 documents. At present a
search takes from 2 to 4 seconds (I tested with a query of 3 search
terms). Tomcat runs on a box with dual Opteron 2.4 GHz CPUs and 16 GB
of RAM. I think search is very slow now.
Can we make search faster?
What factors influence search speed?







RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Gal Nitzan
In my company we changed the default, and many others probably did the
same. However, we must not ignore the behavior of irresponsible Nutch
users, and for that reason the use of the default agent name should be
blocked in code, along the lines of the sketch below.
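
Something like this would do it (a sketch only: the http.agent.name
property is real, but the check itself and the shipped default value
"NutchCVS" are my assumptions):

// refuse to fetch with an unset or default agent name
String agent = conf.get("http.agent.name", "").trim();
if (agent.length() == 0 || "NutchCVS".equalsIgnoreCase(agent)) {
  throw new RuntimeException(
      "http.agent.name must identify your own crawler");
}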

Just my 2 cents.


-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 9:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Doug Cutting wrote:

http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html 


well, I think incrediBILL has a point: people might really start
excluding bots from their servers if it becomes too much. What might
help is if incrediBILL offered an index of his site, which should be
smaller than the site itself. I am not sure whether a standard for
something like this exists. Basically, the bot would ask the server
whether an index exists, where it is located, and what date it is from,
and then the bot would either download the index or otherwise start
crawling the site.

Michi

-- 
Michael Wechner
Wyona  -  Open Source Content Management  -  Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED]  [EMAIL PROTECTED]
+41 44 272 91 61





[jira] Assigned: (NUTCH-306) DistributedSearch.Client liveAddresses concurrency problem

2006-06-15 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-306?page=all ]

Sami Siren reassigned NUTCH-306:


Assign To: Sami Siren

 DistributedSearch.Client liveAddresses concurrency problem
 --

  Key: NUTCH-306
  URL: http://issues.apache.org/jira/browse/NUTCH-306
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.7, 0.8-dev
 Reporter: Grant Glouser
 Assignee: Sami Siren
 Priority: Critical
  Attachments: DistributedSearch.java-patch

 Under heavy load, hits returned by DistributedSearch.Client can become out of 
 sync with the Client's live server list.
 DistributedSearch.Client maintains an array of live search servers 
 (liveAddresses).  This array is updated at intervals by a watchdog thread.  
 When the Client returns hits from a search, it tracks which hits came from 
 which server by saving an index into the liveAddresses array (as Hit.indexNo).
 The problem occurs when the search servers cannot service some remote 
 procedure calls before the client times out (due to heavy load, for example). 
  If the Client returns some Hits from a search, and then the array of 
 liveAddresses changes while the Hits are still being used, the indexNos for 
 those Hits can become invalid, referring to different servers than the Hit 
 originated from (or no server at all!).
 Symptoms of this problem include:
 - ArrayIndexOutOfBoundsException (when the array of liveAddresses shrinks, a 
 Hit from the last server in liveAddresses in the previous update cycle now 
 has an indexNo past the end of the array)
 - IOException: read past EOF (suppose a hit comes back from server A with a 
 doc number of 1000.  Then the watchdog thread updates liveAddresses and now 
 the Hit looks like it came from server B, but server B only has 900 
 documents.  Trying to get details for the hit will read past EOF in server 
 B's index.)
 - Of course, you could also get a silent failure in which you find a hit on 
 server A, but the details/summary are fetched from server B.  To the user, it 
 would simply look like an incorrect or nonsense hit.
 We have solved this locally by removing the liveAddresses array.  Instead, 
 the watchdog thread updates an array of booleans (same size as the array of 
 defaultAddresses) that indicate whether that address responded to the latest 
 call from the watchdog thread.  Hit.indexNo is then always an index into the 
 complete array of defaultAddresses, so it is stable and always valid.  
 Callers of getDetails()/getSummary()/etc. must still be aware that these 
 methods may return null when the corresponding server is unable to respond.
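 A rough sketch of that approach in Java (illustrative names only, not
 the actual patch):

 import java.net.InetSocketAddress;

 // the watchdog flips liveness flags in place, so Hit.indexNo always
 // indexes the fixed defaultAddresses array and can never drift to a
 // different server
 class LivenessTable {
   private final InetSocketAddress[] defaultAddresses;
   private final boolean[] live; // written only by the watchdog thread

   LivenessTable(InetSocketAddress[] addresses) {
     this.defaultAddresses = addresses;
     this.live = new boolean[addresses.length];
   }

   // watchdog: record which servers answered the latest status call
   synchronized void update(boolean[] responded) {
     System.arraycopy(responded, 0, live, 0, live.length);
   }

   // search path: null means the server is currently down, so callers
   // of getDetails()/getSummary() must be prepared for it
   synchronized InetSocketAddress addressFor(int indexNo) {
     return live[indexNo] ? defaultAddresses[indexNo] : null;
   }
 }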

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-122) block numbers need a better random number generator

2006-06-15 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]
 
Sami Siren resolved NUTCH-122:
--

Resolution: Invalid

This is more related to Hadoop.

 block numbers need a better random number generator
 ---

  Key: NUTCH-122
  URL: http://issues.apache.org/jira/browse/NUTCH-122
  Project: Nutch
 Type: Bug

   Components: fetcher, indexer, searcher
 Versions: 0.8-dev
 Reporter: Paul Baclace
  Attachments: MersenneTwister.java, MersenneTwister.java

 In order to support billions of block numbers, a better PRNG than 
 java.util.Random is needed.  To reach billions with low probability of 
 collision, 64 bit random numbers are needed (the Birthday Problem is the 
 model for the number of bits needed; the result is that twice as many bits 
 are needed as the number of bits to count the expected number of items.) The 
 built-in java.util.Random keeps only 48 bits of state which is only 
 sufficient for 2^24 items.  Using repeated calls to or more than one instance 
 of Random does not increase its total entropy.  
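 As a back-of-the-envelope illustration (my arithmetic, not code from
 the attachment), the birthday bound is P(collision) ~= n*(n-1)/2^(bits+1):

 public class BirthdayBound {
   public static void main(String[] args) {
     double n = 1e9; // one billion ids
     System.out.println("48 bits: " + prob(n, 48)); // far above 1: certain
     System.out.println("64 bits: " + prob(n, 64)); // ~0.027: acceptable
   }
   // expected collision probability for n ids of the given bit width;
   // values above 1 simply mean a collision is effectively certain
   static double prob(double n, int bits) {
     return n * (n - 1) / 2.0 / Math.pow(2.0, bits);
   }
 }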
   Analysis
   - util.Random is a linear congruential generator (LCG) identical to drand48.
   - util.Random keeps 48 bits of state and gangs together 2 consecutive
     values to return 64 bit values.
   - LCGs suffer from periodicity in the low-order bits, which would make
     modular binning less than random (low-order bits could mean the least
     significant byte).
   - LCGs have periods in the range 10^6 to 10^9 when using 32 bit words,
     a range of poor to fair.
   - seed = (0x5DEECE66DL * seed + 0xBL) & ((1L << 48) - 1);
     the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time.
   - Results of the Birthday Spacings Test look good.
  References
   http://www.math.utah.edu/~beebe/java/random/README
   http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
 Recommended alternative: MersenneTwister
   - Matsumoto and Nishimura (1998).
   - Longest period of any known generator: 2^19937, or about 10^6001.
   - A period that exceeds the number of unique values seems ideal;
     obviously a period shorter than the number of unique values (as with
     util.Random) is a problem.
   - Faster than java.util.Random (Random was recently tweaked, however).
   - Excellent result for the Diehard Birthday Spacings Test.
   - Can be seeded with up to 624 32-bit integers.
 Doug Cutting wrote on nutch-dev:
  It just occurred to me that perhaps we could simply use sequential block 
  numbering. 
   All block ids are generated centrally on the namenode.  
 Response from Paul Baclace:
 I'm not sure what the advantage of sequential block numbers would be,
 since long-period PRNG block numbering does not even need to store
 its state; it can just pick a new starting place.
 Sequential block numbering does have the downside that picking a datanode 
 based on (BlockNum % DataNodeCount) would devolve into round robin.  Any 
 attempt to pass the sequence through a hash ends up becoming a random number 
 generator.
 Sequential numbering provides contiguous numbers, but after G.C. that would 
 be lost, so no advantage there.
 When human beings eyeball block numbers, many with small differences are more 
 likely to be misread than many that are totally different.
 If block numbering is sequential, then there is a temptation to use 32 bits 
 instead of 64, but 32 bits leads to wrap-around and uh oh. 
 FSNamesystem uses Random to help pick a target datanode, but it could just 
 use the randomness of block numbers.
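 To see the round-robin point concretely (illustrative arithmetic, not
 Nutch code): with sequential ids, every DataNodeCount-th block lands
 on the same node.

 for (long block = 0; block < 9; block++)
   System.out.print(block % 3 + " "); // prints: 0 1 2 0 1 2 0 1 2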

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-187) Cannot start Nutch datanodes on Windows outside of a cygwin environment because of DF

2006-06-15 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-187?page=all ]
 
Sami Siren closed NUTCH-187:


Resolution: Won't Fix

closed as requested

 Cannot start Nutch datanodes on Windows outside of a cygwin environment  
 because of DF
 --

  Key: NUTCH-187
  URL: http://issues.apache.org/jira/browse/NUTCH-187
  Project: Nutch
 Type: Improvement

   Components: ndfs
 Versions: 0.8-dev
  Environment: Windows
 Reporter: Dominik Friedrich
 Priority: Minor
  Attachments: DF.diff

 Currently you cannot start Nutch datanodes on Windows outside of a cygwin 
 environment because it relies on the df command to read the free disk space.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Paul Sutter
I think that Nutch has to solve the problem: if you leave the problem to the
websites, they're more likely to cut you off than they are to implement
their own index storage scheme. Besides, they'd get it wrong, have stale
data, etc.

Maybe what is needed is brainstorming on a shared crawling scheme
implemented in Nutch. Maybe something based on a BitTorrent-like protocol?

incrediBILL seems to have a pretty good point.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Doug Cutting wrote:

http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html 


well, I think incrediBILL has a point: people might really start
excluding bots from their servers if it becomes too much. What might
help is if incrediBILL offered an index of his site, which should be
smaller than the site itself. I am not sure whether a standard for
something like this exists. Basically, the bot would ask the server
whether an index exists, where it is located, and what date it is from,
and then the bot would either download the index or otherwise start
crawling the site.

Michi

-- 
Michael Wechner
Wyona  -  Open Source Content Management  -  Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED]  [EMAIL PROTECTED]
+41 44 272 91 61



[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-15 Thread Chris A. Mattmann (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12416379 ] 

Chris A. Mattmann commented on NUTCH-258:
-

 Thanks for this patch Chris - even if it is now outdated by NUTCH-303 :-(
 Since Nutch no longer uses the deprecated Hadoop LogFormatter, there is
 no longer a logSevere check in the code.

Oh Jerome. You're always trying to scoop me on stuff! ;)


 But I'm not sure all of these severe logs should really be marked as
 severe (the fatal level is used now).

Agreed. Let's review the places in the patch where severe errors are logged, 
and then remove/add as deemed necessary. 


 So, what I suggest is to review all the fatal logs and check if they are 
 really fatal for the whole process. 

Agreed. I'll get on this right away.

 And finally, why not simply throw a RuntimeException that will be
 caught by the Fetcher if something really goes wrong?

Because we don't want one RuntimeException killing all subsequent fetching 
tasks. See the previous discussions on this by Andrzej, Scott, and me. 
Basically it boils down to ensuring that LOG.severe and its associated 
checking mechanism are scoped to the particular fetching task that is 
executing: we believed that the best way to do that would be to use the 
Hadoop Configuration (which is task specific). Make sense?

Okey dokey, I'll work on an updated patch and submit for review soon (I won't 
specify an exact date, because I'm always late ;) ).
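
Roughly the shape I have in mind (a sketch only; the configuration key
is made up, not taken from any patch):

import org.apache.hadoop.conf.Configuration;

// record "something severe happened" in the task-specific
// Configuration rather than in a JVM-wide static, so one bad task
// cannot poison every later fetch in the same JVM
class SevereTracker {
  private static final String KEY = "fetcher.logged.severe"; // hypothetical

  static void markSevere(Configuration conf) {
    conf.set(KEY, "true");
  }

  static boolean hasLoggedSevere(Configuration conf) {
    return conf.getBoolean(KEY, false);
  }
}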


 Once Nutch logs a SEVERE log item, Nutch fails forevermore
 --

  Key: NUTCH-258
  URL: http://issues.apache.org/jira/browse/NUTCH-258
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
  Environment: All
 Reporter: Scott Ganyo
 Assignee: Chris A. Mattmann
 Priority: Critical
  Attachments: NUTCH-258.Mattmann.060906.patch.txt, dumbfix.patch

 Once a SEVERE log item is written, Nutch shuts down any fetching
 forevermore. This is from the run() method in Fetcher.java:

 public void run() {
   synchronized (Fetcher.this) {activeThreads++;} // count threads

   try {
     UTF8 key = new UTF8();
     CrawlDatum datum = new CrawlDatum();

     while (true) {
       if (LogFormatter.hasLoggedSevere()) // something bad happened
         break;                            // exit the fetch loop

 Notice the last two lines. Because LogFormatter stores this flag in a
 static, this prevents Nutch from ever fetching again once it is hit.
 (Also note that LogFormatter.hasLoggedSevere() is also checked in 
 org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
 This must be fixed or Nutch cannot be run as any kind of long-running
 service. Furthermore, I believe it is a poor decision to rely on a
 logging event to determine the state of the application - this could
 have any number of side effects that would be extremely difficult to
 track down. (As it already has for me.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira