Re: Strange search results
Hello,

In my experience it is very important to use anchor text and to give it quite a high boost. It allows me to return http://www.aa.com when a user searches for American Airlines -- without anchor text this was impossible to achieve: a lot of sites (spam or not) with "american airlines" in the URL and title were returned first. So in my opinion anchor text is important for result quality, along with some techniques to identify spam sites, so that the effect of anchor-text spamming is greatly reduced. Generic anchor texts like "Click here!", "here", "Click to open in a new window", etc. are quite easy to remove during indexing. We spent some time cleaning our index of unwanted pages, and I think it was time well spent.

Regards,
Piotr

On 8/3/05, Chirag Chaman [EMAIL PROTECTED] wrote:

Howie,

Concur with Andy on both points. Unfortunately, there is no way to go back and remove either of these values without reindexing, so let me save you the trouble if you were thinking of changing the Similarity class as a workaround.

IMO, the problem with anchors is that you either need to get them all or not get them at all -- getting just a few anchors can give you really bad results, since anchors like "click here" will give a high score to pages that don't contain either of those terms. Another approach is to change the boost of anchors to 0.05 in the properties file, giving them a very, very low boost.

Regarding the norm -- this is computed at index time for each field. We've changed the indexing code so that it's always 1.

HTH,
CC

-----Original Message-----
From: Andy Liu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 03, 2005 8:00 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Strange search results

The fieldNorm is lengthNorm * document boost. The final value is rounded, which is why you're getting such clean numbers for your fieldNorm. If you're finding that these pages have too high a boost, you can lower indexer.score.power in your conf file.
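Andy's formula can be sketched as plain arithmetic. This is only an illustration (the class and method names are made up, not Nutch's actual code); note also that Lucene stores the final norm quantized into a single byte, which is why the decoded values come out so coarse:

```java
public class NormDemo {
    /** Lucene's default lengthNorm: 1 / sqrt(number of tokens in the field). */
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    /** Document boost as Nutch derives it: page score raised to indexer.score.power. */
    static float docBoost(float pageScore, float scorePower) {
        return (float) Math.pow(pageScore, scorePower);
    }

    /** fieldNorm before encoding: lengthNorm * document boost. */
    static float fieldNorm(int numTokens, float pageScore, float scorePower) {
        return lengthNorm(numTokens) * docBoost(pageScore, scorePower);
    }
}
```

For example, a one-token anchor field (lengthNorm = 1.0) on a page with score 16 and a power of 0.5 gives a fieldNorm of 4.0 -- exactly the kind of clean whole number Howie describes below -- while lowering the power to 0.25 would cut that boost to 2.0.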
As for your problem in #2, look at the explain page to see how your search result got there. Maybe there's a high score for an anchor match -- anchor text doesn't show up in the text of the page, so maybe that's it.

Andy

On 8/3/05, Howie Wang [EMAIL PROTECTED] wrote:

Hi,

I've been noticing some strange search results recently. I seem to be hitting two issues:

1. The fieldNorm for certain terms is unusually high for certain sites, for anchors and titles, and the values are usually whole numbers (4.0, 5.0, etc.). I find this strange, since the lengthNorm used to calculate it is very unlikely to produce an integer: it's either 1/sqrt(numTokens) or 1/log(e + numTokens). Where is 5.0 coming from?

2. I'm getting hits for sites that don't contain ANY of the terms in my search. This is exacerbated by issue #1, since the fieldNorm boosts these pages to the top of the results. I thought it might be because of my changes for stemming, but this happens for search terms that stemming doesn't change at all.

Anyone run into something like this? Any ideas on how to start debugging?

Thanks,
Howie
Re: Strange search results
Hello, In my experience it is very important to use anchor text, giving it quite a high boost. It allows me to return http://www.aa.com when a user searches for American Airlines -- without anchor text this was impossible to achieve: a lot of sites (spam or not) with "american airlines" in the URL and title were returned first.

I see your point. I'm probably in a different situation than most of you, since my site is a pretty small niche. I practically hand-pick the exact pages I want to crawl, so I don't have to worry about spam. In my case anchor text doesn't buy me much, and neither does the URL. I switched to using just content and title, and I'm getting much better results now.

Thanks,
Howie
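Piotr's suggestion earlier in the thread -- stripping generic anchors like "Click here!" during indexing -- could be sketched with a simple stop-list. This is only an illustration, not Nutch code; the class name and the phrase list are made up, and a real list would be built from what actually shows up in the index:

```java
import java.util.Locale;
import java.util.Set;

public class AnchorFilter {
    // Hypothetical stop-list of generic anchor phrases; extend from real index data.
    private static final Set<String> GENERIC = Set.of(
        "click here", "here", "click to open in a new window",
        "read more", "more", "link");

    /** Returns true if the anchor text carries no useful relevance signal. */
    public static boolean isGeneric(String anchor) {
        if (anchor == null) return true;
        // Normalize: lower-case, drop punctuation, collapse whitespace.
        String norm = anchor.toLowerCase(Locale.ROOT)
                            .replaceAll("[^a-z0-9 ]", " ")
                            .replaceAll("\\s+", " ")
                            .trim();
        return norm.isEmpty() || GENERIC.contains(norm);
    }
}
```

An indexing filter would then simply skip anchors for which isGeneric returns true, so that "American Airlines" still contributes to http://www.aa.com while "Click here!" contributes nothing.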
Re: near-term plan
I was using a nightly build that Piotr had given me, nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of that nature). I tested it on the Windows platform with 5 machines running it: 2 quad P3 Xeons at 100 Mbit, 1 Pentium 4 3 GHz with Hyper-Threading, 1 AMD Athlon XP 2600+, and 1 Athlon 64 3500+, all with 1 GB or more of RAM. Now I have my big server, and if you have worked on NDFS since the beginning of July I'll test it again; my big server's HD array is very fast, 200+ MB/s, so it will be able to fully saturate gigabit better.

The P4 and the 2 AMD machines are hooked into the switch at gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit, which has a gigabit uplink to my gigabit switch. So both Xeons would be constantly saturated at 11 MB/s, while the P4 was able to reach higher speeds of 50-60 MB/s with its internal RAID 0 array (dual 120 GB drives). My main PC (the Athlon 64 3500+) was the namenode, a datanode, and also the NDFS client.

I could not get Nutch to work properly with NDFS. It was set up correctly and it kind of worked, but it would crash the namenode when I was trying to fetch segments in the NDFS filesystem, or index them, or do much of anything. So I copied all my segment directories, indexes, and content (whatever it was, 1.8 GB) plus some DVD images onto NDFS. My primary machine and Nutch run off 10k RPM disks in RAID 0 (2x36 GB Raptors), which can output about 120 MB/s sustained.

Here is what I found out (on Windows): if I don't start a datanode on the namenode machine, with the conf pointing to 127.0.0.1 instead of its outside IP, the namenode will not copy data to the other machines. If I am running a datanode on the namenode, data will replicate from that datanode to the other 3 datanodes. I tried a hundred ways to make it work with an independent namenode, without luck.

The way I saw data go across my network: I would put data into NDFS, the namenode would request a datanode, find the internal one, and copy data to it first. While that datanode was still copying data from my other HDs into chunks on the RAID array, it would replicate to the P4 via gigabit at 50-60 MB/s, and then it would replicate from the P4 to the Xeons, kind of alternating between them. I only had replication at the default of 2, and I had about 100 GB to copy in, so the copy onto the internal RAID array finished fairly quickly. Then replication to the P4 finished, and the Xeons got a little bit of data, but not nearly as much as the P4. My guess is it only needs 2 copies: the first copy was the datanode on the internal machine, the second was the P4 datanode. The Xeons only had a smaller connection, so they didn't receive chunks as fast as the P4 could, and the P4 had enough space for all the data, so it worked out. I should have set replication to 4.

The AMD Athlon XP 1900+ was running Linux (SuSE 9.3), and it would crash the namenode on Windows if I connected it as a datanode, so that one didn't get tested. I was able to put out 50-60 MB/s to one machine, but it would not replicate data to multiple machines at the same time, it seemed. I would have thought it would output to the Xeons at the same time as the P4 -- give the Xeons 20% of the data and the P4 80%, or something of that nature -- but it could be that they just weren't fast enough to request data before the P4 was receiving its 32 MB chunks every half second.

The good news: CPU usage was only at 50% on my AMD 3500+, and that was while it was copying data to the internal datanode from the NDFS client off another internal HD, running the namenode, and running the datanode internally. Does it now work with a separate namenode?

I'm getting ready to run Nutch on Linux full time, if I can ever get the damn driver for my HighPoint 2220 RAID card to work with SuSE -- any SuSE; the drivers don't work with dual-core CPUs or something??? They are working on it; for now I'm stuck with Fedora 4 until they fix it. So it's not ready for testing yet. I'll let you know when I can test it in a full Linux environment.

Wow, that was a long one!!!

-Jay
[jira] Created: (NUTCH-78) German texts on website
German texts on website
-----------------------
Key: NUTCH-78
URL: http://issues.apache.org/jira/browse/NUTCH-78
Project: Nutch
Type: Improvement
Components: searcher
Reporter: Matthias Jaekle
Priority: Minor
Attachments: de.properties.tgz

The German properties files with the texts presented on the website were incomplete or contained spelling errors. Please find the corrected files attached.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-78) German texts on website
[ http://issues.apache.org/jira/browse/NUTCH-78?page=all ]

Matthias Jaekle updated NUTCH-78:
---------------------------------
Attachment: de.properties.tgz (anchors_de.properties, cached_de.properties, explain_de.properties, search_de.properties, text_de.properties)
Crawling directly from URL and Questions about using the index
Hi,

since my first experiments were successful, I'm now starting to integrate Nutch into my website visualisation tool. So here are my first questions:

1. I put a class into my project that works similarly to the main class of CrawlerTool.java. This works fine if you have written the URLs into a file like urls. But now I want to crawl a site directly, which means: instead of using

WebDBInjector.main(prependFileSystem(fs, nameserver, new String[] { db, "-urlfile", rootUrlFile }));

I'd like to trigger crawling for, let's say, a String url, which is the wanted URL and not a file that URLs are listed in. How could this be done? Or the other solution: can I have a Nutch process running that checks a certain file for new URLs and does the crawling and indexing for me, so that I can add URLs to that file (acting like a queue) from my program?

2. Using the given .war file for searching the index after crawling works fine. But where do I have to look to find out how it works? I'm now starting Tomcat from the crawled directory where the segments are, and searching works fine. But I'd like to implement searching in my application (I used Lucene directly before), so it would be interesting to know how the war works.

That's it for the moment. Thanks for any kind of help.

Greetings,
Nils
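Since WebDBInjector expects a file of seed URLs, one low-tech bridge for question 1 is to write the single URL into a temporary file and inject that. A minimal sketch (the class and method names here are made up, and a real version would clean up the temp file after injection):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SeedHelper {
    /** Writes a single URL into a temp file and returns its path,
     *  suitable for passing where a -urlfile argument is expected. */
    public static Path seedFileFor(String url) {
        try {
            Path tmp = Files.createTempFile("nutch-seed", ".txt");
            Files.write(tmp, List.of(url));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back the first line, for checking what was written. */
    public static String firstLine(Path p) {
        try {
            return Files.readAllLines(p).get(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The path returned by seedFileFor(url) would then be passed in place of rootUrlFile. The same trick covers the queue idea: a small loop that watches the queue file, writes its new lines to a seed file, and kicks off inject/fetch/index.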
fetching redirect bug?
Suppose we have to fetch 3 pages:

Page A is http://something/login.php
Page B is http://yyy/rrr/ which, when fetched, redirects to page A
Page C is http://yyy/ttt/ which, when fetched, redirects to page A

When fetching A, B, and C, the fetcher will fetch A, B, A, C, A. Is there any way to prevent the refetching of page A?
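One way to avoid the refetch would be to remember which URLs have already been fetched in the current pass and skip a redirect target that is already in the set. A sketch of the bookkeeping only -- this is not the actual Fetcher code, and deduplicating this way means A's content gets recorded just once, for the first fetch:

```java
import java.util.HashSet;
import java.util.Set;

public class RedirectDedup {
    private final Set<String> fetched = new HashSet<>();

    /** True if the URL still needs fetching; false if it was fetched already.
     *  Called both for queued URLs and for redirect targets. */
    public boolean shouldFetch(String url) {
        return fetched.add(url); // Set.add returns false for duplicates
    }
}
```

In the A, B, C example above, the fetcher would fetch A, fetch B, skip B's redirect to A, fetch C, and skip C's redirect to A.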