Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
Doug Cutting wrote: http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html

Well, I think incrediBILL has a point: people might really start excluding bots from their servers if it becomes too much. What might help is if incrediBILL offered an index of the site, which should be smaller than the site itself. I am not sure whether a standard exists for something like this. Basically the bot would ask the server whether an index exists, where it is located and what date it is from, and then the bot would decide to download the index or otherwise start crawling the site.

Michi

--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61
search speed
I am using DFS. My index contains 3706249 documents. Currently a search takes from 2 to 4 seconds (I tested with a query of 3 search terms). Tomcat is started on a box with dual Opteron 2.4 GHz CPUs and 16 GB RAM. I think search is very slow now. How can we make search faster? What factors influence search speed?
RE: search speed
Hi, DFS is too slow for search. What we did was extract the segments to the local FS, i.e. to the hard disk. Each machine has 2x300 GB HDs in RAID.

bin/hadoop dfs -get index /nutch/index
bin/hadoop dfs -get linkdb /nutch/linkdb
bin/hadoop dfs -get segments /nutch/segments

When we run out of disk space for the segments on one web server, we add another web server, use mergesegs to split the segments, and use the distributed search. HTH

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 15, 2006 10:09 AM
To: nutch-dev@lucene.apache.org
Subject: search speed
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
In my company we changed the default, and many others probably did the same. However, we must not ignore the behavior of irresponsible users of Nutch, and for that reason use of the default must be blocked in code. Just my 2 cents.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 15, 2006 9:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
[jira] Assigned: (NUTCH-306) DistributedSearch.Client liveAddresses concurrency problem
[ http://issues.apache.org/jira/browse/NUTCH-306?page=all ] Sami Siren reassigned NUTCH-306: Assign To: Sami Siren

DistributedSearch.Client liveAddresses concurrency problem
Key: NUTCH-306
URL: http://issues.apache.org/jira/browse/NUTCH-306
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7, 0.8-dev
Reporter: Grant Glouser
Assignee: Sami Siren
Priority: Critical
Attachments: DistributedSearch.java-patch

Under heavy load, hits returned by DistributedSearch.Client can become out of sync with the Client's live server list. DistributedSearch.Client maintains an array of live search servers (liveAddresses). This array is updated at intervals by a watchdog thread. When the Client returns hits from a search, it tracks which hits came from which server by saving an index into the liveAddresses array (as Hit.indexNo). The problem occurs when the search servers cannot service some remote procedure calls before the client times out (due to heavy load, for example). If the Client returns some Hits from a search, and the liveAddresses array then changes while the Hits are still in use, the indexNos for those Hits can become invalid, referring to different servers than the ones the Hits originated from (or to no server at all!). Symptoms of this problem include:
- ArrayIndexOutOfBoundsException (when the array of liveAddresses shrinks, a Hit from the last server in liveAddresses in the previous update cycle now has an indexNo past the end of the array)
- IOException: read past EOF (suppose a hit comes back from server A with a doc number of 1000. Then the watchdog thread updates liveAddresses and the Hit now looks like it came from server B, but server B has only 900 documents. Trying to get details for the hit will read past EOF in server B's index.)
- Of course, you could also get a silent failure in which you find a hit on server A but the details/summary are fetched from server B. To the user, it would simply look like an incorrect or nonsense hit.
We have solved this locally by removing the liveAddresses array. Instead, the watchdog thread updates an array of booleans (same size as the array of defaultAddresses) that indicate whether that address responded to the latest call from the watchdog thread. Hit.indexNo is then always an index into the complete array of defaultAddresses, so it is stable and always valid. Callers of getDetails()/getSummary()/etc. must still be aware that these methods may return null when the corresponding server is unable to respond. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
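The local fix described above can be sketched as follows. Class and method names here are illustrative, not the actual patch: the key point is that a Hit's indexNo always indexes the fixed defaultAddresses array, and the watchdog thread only flips entries in a parallel boolean array, so an indexNo can never drift to a different server.

```java
import java.net.InetSocketAddress;

/** Sketch of the fix (illustrative names, not the actual patch):
 *  Hit.indexNo always points into the fixed defaultAddresses array;
 *  the watchdog thread only updates a parallel boolean array. */
public class LivenessTable {
    private final InetSocketAddress[] defaultAddresses; // never changes
    private final boolean[] live;                       // watchdog-updated

    public LivenessTable(InetSocketAddress[] addresses) {
        this.defaultAddresses = addresses;
        this.live = new boolean[addresses.length];      // all false initially
    }

    /** Called by the watchdog thread after each health-check cycle. */
    public synchronized void setLive(int indexNo, boolean responded) {
        live[indexNo] = responded;
    }

    /** Resolves a Hit's indexNo: the same server the Hit came from,
     *  or null if that server is currently down. Callers of
     *  getDetails()/getSummary() must handle the null case. */
    public synchronized InetSocketAddress resolve(int indexNo) {
        return live[indexNo] ? defaultAddresses[indexNo] : null;
    }
}
```

Because the array of addresses is immutable, a stale indexNo degrades to a null lookup rather than a read from the wrong server's index.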
[jira] Resolved: (NUTCH-122) block numbers need a better random number generator
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ] Sami Siren resolved NUTCH-122: Resolution: Invalid. this is more related to hadoop

block numbers need a better random number generator
Key: NUTCH-122
URL: http://issues.apache.org/jira/browse/NUTCH-122
Project: Nutch
Type: Bug
Components: fetcher, indexer, searcher
Versions: 0.8-dev
Reporter: Paul Baclace
Attachments: MersenneTwister.java, MersenneTwister.java

In order to support billions of block numbers, a better PRNG than java.util.Random is needed. To reach billions with a low probability of collision, 64-bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that twice as many bits are needed as the number of bits to count the expected number of items). The built-in java.util.Random keeps only 48 bits of state, which is sufficient for only 2^24 items. Using repeated calls to, or more than one instance of, Random does not increase its total entropy.

Analysis: util.Random is a linear congruential generator (LCG) identical to drand48. util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64-bit values. LCGs suffer from periodicity in the low-order bits, which would make modular binning less than random ("low-order bits" here could mean the least significant byte). LCGs have periods in the range 10^6 to 10^9 when using 32-bit words, a range of poor to fair. The update step is:

seed = (0x5DEECE66DL * seed + 0xBL) & ((1L << 48) - 1);

The origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time. Results of the Birthday Spacings Test look good. References: http://www.math.utah.edu/~beebe/java/random/README http://www.pierssen.com/arcview/upload/esoterica/randomizer.html

Recommended alternative: MersenneTwister, Matsumoto and Nishimura (1998). Longest period of any known generator, 2^19937-1, or about 10^6001.
A period that exceeds the number of unique values seems ideal; obviously a period shorter than the number of unique values (as with util.Random) is a problem. Faster than java.util.Random (Random was recently tweaked, however). Excellent result for the Diehard Birthday Spacings Test. Can be seeded with up to 624 32-bit integers.

Doug Cutting wrote on nutch-dev: It just occurred to me that perhaps we could simply use sequential block numbering. All block ids are generated centrally on the namenode.

Response from Paul Baclace: I'm not sure what the advantage of sequential block numbers would be, since long-period PRNG block numbering does not even need to store its state, just pick a new starting place. Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin. Any attempt to pass the sequence through a hash ends up becoming a random number generator. Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there. When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different. If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wrap-around and uh oh. FSNamesystem uses Random to help pick a target datanode, but it could just use the randomness of block numbers.
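The birthday-bound argument in this issue can be checked in a few lines. The helper names below are hypothetical (not Hadoop or Nutch code), and SecureRandom merely stands in for any long-period 64-bit generator such as the MersenneTwister attached to the issue: with ~2^30 (a billion) blocks, a 64-bit id space expects well under one collision, while a 48-bit space expects thousands.

```java
import java.security.SecureRandom;

/** Illustration of the Birthday Problem bound discussed above.
 *  Helper names are hypothetical; SecureRandom stands in for a
 *  long-period generator like the proposed MersenneTwister. */
public class BlockIds {
    private static final SecureRandom RNG = new SecureRandom();

    /** Expected number of colliding pairs when drawing k random ids
     *  uniformly from a space of 2^bits values: k*(k-1)/2 / 2^bits. */
    public static double expectedCollisions(long k, int bits) {
        return (double) k * (k - 1) / 2.0 / Math.pow(2.0, bits);
    }

    /** A fresh 64-bit block id. */
    public static long newBlockId() {
        return RNG.nextLong();
    }
}
```

For k = 2^30 ids, expectedCollisions(k, 64) is about 0.03, while expectedCollisions(k, 48) is about 2048, which is the reason the issue asks for 64-bit ids: twice as many bits as are needed to count the items.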
[jira] Closed: (NUTCH-187) Cannot start Nutch datanodes on Windows outside of a cygwin environment because of DF
[ http://issues.apache.org/jira/browse/NUTCH-187?page=all ] Sami Siren closed NUTCH-187: Resolution: Won't Fix. closed as requested

Cannot start Nutch datanodes on Windows outside of a cygwin environment because of DF
Key: NUTCH-187
URL: http://issues.apache.org/jira/browse/NUTCH-187
Project: Nutch
Type: Improvement
Components: ndfs
Versions: 0.8-dev
Environment: Windows
Reporter: Dominik Friedrich
Priority: Minor
Attachments: DF.diff

Currently you cannot start Nutch datanodes on Windows outside of a cygwin environment because it relies on the df command to read the free disk space.
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
I think that Nutch has to solve the problem: if you leave the problem to the websites, they're more likely to cut you off than they are to implement their own index storage scheme. Besides, they'd get it wrong, have stale data, etc. Maybe what is needed is brainstorming on a shared crawling scheme implemented in Nutch. Maybe something based on a BitTorrent-like protocol? incrediBILL seems to have a pretty good point.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 15, 2006 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12416379 ] Chris A. Mattmann commented on NUTCH-258:

> Thanks for this patch Chris - even if it is now outdated by NUTCH-303 :-( Since Nutch no longer uses the deprecated Hadoop LogFormatter, there is no longer a logSevere check in the code.

Oh Jerome. You're always trying to scoop me on stuff! ;)

> But I'm not sure all these log severe calls should be marked as severe (the fatal level is used now).

Agreed. Let's review the places in the patch where severe errors are logged, and then remove/add as deemed necessary.

> So, what I suggest is to review all the fatal logs and check if they are really fatal for the whole process.

Agreed. I'll get on this right away.

> And finally, why not simply throw a RuntimeException that will be caught by the Fetcher if something really goes wrong?

Because we don't want one RuntimeException killing all subsequent fetching tasks. See the previous discussions on this by Andrzej, Scott, and me. Basically it boils down to ensuring that LOG.severe and its associated checking mechanism are associated with the context of a particular fetching task that executes: we believed that the best way to do that would be to use the Hadoop Configuration (which is task-specific). Make sense? Okey dokey, I'll work on an updated patch and submit it for review soon (I won't specify an exact date, because I'm always late ;) ).

Once Nutch logs a SEVERE log item, Nutch fails forevermore
Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Assignee: Chris A. Mattmann
Priority: Critical
Attachments: NUTCH-258.Mattmann.060906.patch.txt, dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore.
This is from the run() method in Fetcher.java:

public void run() {
  synchronized (Fetcher.this) { activeThreads++; }  // count threads
  try {
    UTF8 key = new UTF8();
    CrawlDatum datum = new CrawlDatum();
    while (true) {
      if (LogFormatter.hasLoggedSevere())  // something bad happened
        break;                             // exit

Notice the last 2 lines. This will prevent Nutch from ever fetching again once it is hit, as LogFormatter stores this data in a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side effects that would be extremely difficult to track down. (As it already has for me.)
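The direction discussed in the comments above can be sketched as follows. The class and method names are illustrative, not the actual patch (the real proposal stores the flag in the task-specific Hadoop Configuration): the point is simply that the "severe error" flag is scoped to one fetching task instead of being JVM-wide static state, so one poisoned task cannot disable all future fetching.

```java
/** Sketch of a per-task error flag (illustrative names, not the
 *  actual patch). Replaces the JVM-wide static check in
 *  LogFormatter.hasLoggedSevere(), which disables fetching forever
 *  once any task logs a severe error. */
public class FetchTaskState {
    // volatile so fetcher threads see the watchdog's update promptly
    private volatile boolean severeLogged = false;

    /** Record a severe error; this poisons only this task. */
    public void logSevere(String msg) {
        System.err.println("SEVERE: " + msg);
        severeLogged = true;
    }

    /** The fetcher loop checks this instead of the global static,
     *  so a fresh task starts with a clean flag. */
    public boolean hasLoggedSevere() {
        return severeLogged;
    }
}
```

A long-running service then creates one FetchTaskState per fetch task; a severe error aborts that task's loop while leaving subsequent tasks unaffected.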