Re: Strange search results

2005-08-05 Thread Piotr Kosiorowski
Hello,
In my experience it is very important to use anchor text and to give it
quite a high boost. It allows me to return http://www.aa.com when a user
searches for "American Airlines" - without anchor text this was
impossible to achieve, because a lot of sites (spam or not) with "american
airlines" in the URL and title were returned first.

So in my opinion, for result quality it is important to use anchor
text and also to apply some techniques for identifying spam sites, so that
the effect of anchor-text spamming is greatly reduced. Generic anchor
texts like "Click here!", "here", or "Click to open in a new window" are
quite easy to filter out during indexing (a rough sketch below).
We spent some time cleaning our index of unwanted pages, and I
think it was time well spent.
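
A minimal sketch of that kind of filtering, purely for illustration - the
class name and the stop list here are made up, this is not Nutch's actual
anchor handling:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class GenericAnchorFilter {
  private static final Set<String> GENERIC = new HashSet<String>(Arrays.asList(
      "click here", "here", "click to open in a new window",
      "read more", "more", "link"));

  /** Returns null if the anchor carries no useful signal. */
  public static String filter(String anchor) {
    if (anchor == null) return null;
    String normalized = anchor.trim().toLowerCase()
        .replaceAll("\\s+", " ").replaceAll("[!.]+$", "");
    return GENERIC.contains(normalized) ? null : anchor;
  }

  public static void main(String[] args) {
    System.out.println(filter("Click here!"));        // null - dropped
    System.out.println(filter("American Airlines"));  // kept
  }
}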

Regards
Piotr



On 8/3/05, Chirag Chaman [EMAIL PROTECTED] wrote:
 Howie,
 
 Concur with Andy on both points -- Unfortunately, there is no way to go
 back and remove either of these values without reindexing, so let me save
 you the trouble if you were thinking of changing the similarity class as a
 workaround.
 
 IMO, the problem with anchors is that you either need to get them all or
 not get them at all -- indexing just a few anchors can give you really bad
 results, because anchors like "click here" will give a high score to pages
 that don't contain any of the query terms.  Another approach is to go into
 the properties file and change the anchor boost to something like 0.05,
 thus giving anchors a very, very low boost (see the sketch below).
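
Assuming the anchor boost meant here is the query.anchor.boost property
from nutch-default.xml (check the exact property name in your version), the
override in nutch-site.xml would look roughly like this:

<property>
  <name>query.anchor.boost</name>
  <value>0.05</value>
  <description>Boost applied to query matches in anchor text.</description>
</property>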
 
 Regarding the norm -- it is computed at index time for each field. We've
 changed the indexing code so that it's always 1.
 
 HTH,
 CC
 
 
 -Original Message-
 From: Andy Liu [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 03, 2005 8:00 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: Strange search results
 
 The fieldNorm is lengthNorm * document boost.  The final value is rounded,
 which is why you're getting such clean numbers for your fieldNorm.  If
 you're finding that these pages have too high a boost, you can lower
 indexer.score.power in your conf file.
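
A rough sketch of the arithmetic described above - not the actual Nutch
indexing code, and the input values are made up, it just shows how the
pieces relate:

public class NormSketch {
  public static void main(String[] args) {
    double pageScore = 25.0;    // link-analysis score from the webdb (example value)
    double scorePower = 0.5;    // indexer.score.power from the conf file
    int numTokens = 4;          // tokens in the field, e.g. the title

    double docBoost = Math.pow(pageScore, scorePower);  // document boost at index time
    double lengthNorm = 1.0 / Math.sqrt(numTokens);     // Lucene's default lengthNorm
    double fieldNorm = lengthNorm * docBoost;           // what explain() reports

    // Lucene stores the norm in a single byte, so the value shown in the
    // explanation is a coarsely rounded version of this product.
    System.out.println("fieldNorm before encoding: " + fieldNorm);
  }
}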
 
 As for your problem in #2, look at the explain page to see how that search
 result got there.  Maybe there's a high score for an anchor match.  The
 anchor text doesn't show up in the text of the page, so maybe that's it.
 
 Andy
 
 On 8/3/05, Howie Wang [EMAIL PROTECTED] wrote:
  Hi,
 
  I've been noticing some strange search results recently, and I seem to be
  running into two issues.

  1. The fieldNorm for certain terms is unusually high for certain sites,
  for both anchors and titles, and the values are usually just whole numbers
  (4.0, 5.0, etc.).
  I find this strange since the lengthNorm used to calculate this is
  very unlikely to result in an integer. It's either 1/sqrt(numTokens)
  or 1/log(e+numTokens). Where is 5.0 coming from?
 
  2. I'm getting hits for sites that don't contain ANY of the terms in
  my search. This is exacerbated by issue #1, since the fieldNorm boosts
  these pages to the top of the results. I thought it might be because of
  my changes for stemming, but this happens for search terms that are
  not changed by stemming at all.
 
  Anyone run into something like this? Any ideas on how to start debugging?
 
  Thanks,
  Howie



Re: Strange search results

2005-08-05 Thread Howie Wang

Hello,
In my experience it is very important to use anchor text and to give it
quite a high boost. It allows me to return http://www.aa.com when a user
searches for "American Airlines" - without anchor text this was
impossible to achieve, because a lot of sites (spam or not) with "american
airlines" in the URL and title were returned first.


I see your point. I'm probably in a different situation than most of you,
since my site is a pretty small niche. I practically hand-pick the exact
pages that I want to crawl, so I don't have to worry about spam. In
my case, the anchor text doesn't buy me much, and neither does
the URL. I switched to just using content and title, and I'm getting much
better results now.

Thanks,
Howie




Re: near-term plan

2005-08-05 Thread webmaster
I was using a nightly build that Piotr had given me, the nutch-nightly.jar
(actually it was nutch-dev0.7.jar or something of that nature). I tested it
on the Windows platform with 5 machines running it: 2 quad P3 Xeons at
100 Mbit, 1 Pentium 4 3 GHz with hyperthreading, 1 AMD Athlon XP 2600+, and
1 Athlon 64 3500+, all with 1 GB or more of RAM. Now I have my big server,
and if you have worked on NDFS since the beginning of July I'll test it
again; my big server's HD array is very fast, 200+ MB/s, so it will be able
to saturate gigabit better.

The P4 and the 2 AMD machines are hooked into the switch at gigabit, and the
2 Xeons are hooked into my other switch at 100 Mbit, but it has a gigabit
uplink to my gigabit switch, so both Xeons would constantly be saturated at
11 MB/s, while the P4 was able to reach higher speeds of 50-60 MB/s with its
internal RAID 0 array (dual 120 GB drives). My main PC (Athlon 64 3500+) was
the namenode, a datanode, and also the NDFS client. I could not get Nutch to
work properly with NDFS; it was set up correctly and it kind of worked, but
it would crash the namenode when I was trying to fetch segments into the
NDFS filesystem, index them, or do much of anything. So I copied all my
segment directories, indexes, content and whatever - it was 1.8 GB - plus
some DVD images onto NDFS. My primary machine and Nutch run off 10k RPM
disks in RAID 0 (2x36 GB Raptors) that can put out about 120 MB/s sustained.

Here is what I found out (in Windows): if I don't start a datanode on the
namenode, with the conf pointing to 127.0.0.1 instead of its outside IP, the
namenode will not copy data to the other machines. If instead I am running a
datanode on the namenode, data will replicate from that datanode to the
other 3 datanodes. I tried this a hundred ways to make it work with an
independent namenode, without luck.

The way I saw data go across my network was: I would put data into NDFS, the
namenode would request a datanode, find the internal datanode, and copy data
to it. While that datanode was still copying data from my other HDs into
chunks on the RAID array, it would replicate to the P4 via gigabit at
50-60 MB/s, and then it would replicate from the P4 to the Xeons, kind of
alternating between them, as I only had replication at the default of 2 and
I had about 100 GB to copy in. So the copy onto the internal RAID array
would finish fairly quickly, then replication to the P4 finished, and the
Xeons got a little bit of data, but not nearly as much as the P4. My guess
is it only needs 2 copies, and the first copy was the datanode on the
internal machine, the second was the P4 datanode. The Xeons only had a
smaller connection, so they didn't receive chunks as fast as the P4 could,
and the P4 had enough space for all the data, so it worked out. I should
have set replication to 4.

The AMD Athlon XP 1900+ was running Linux (SUSE 9.3), and it would crash the
namenode on Windows if I connected it as a datanode, so that one didn't get
tested. I was able to put out 50-60 MB/s to 1 machine, but it would not
replicate data to multiple machines at the same time, it seemed. I would
have thought it would output to the Xeons at the same time as the P4 - give
the Xeons 20% of the data and the P4 80%, or something of that nature - but
it could be that they just aren't fast enough to request data before the P4
was receiving its 32 MB chunks every half second?

The good news: CPU usage was only at 50% on my AMD 3500+, and that was while
it was copying data to the internal datanode from the NDFS client from
another internal HD, running the namenode, and running the datanode
internally. Does it now work with a separate namenode? I'm getting ready to
run Nutch on Linux full time, if I can ever get the damn driver for my
HighPoint 2220 RAID card to work with SUSE - any SUSE; the drivers don't
work with dual-core CPUs or something??? They are working on it; for now I'm
stuck with Fedora 4 until they fix it. So it's not ready for testing yet.
I'll let you know when I can test it in a full Linux environment.

Wow, that was a long one!!!
-Jay


[jira] Created: (NUTCH-78) German texts on website

2005-08-05 Thread Matthias Jaekle (JIRA)
German texts on website
---

 Key: NUTCH-78
 URL: http://issues.apache.org/jira/browse/NUTCH-78
 Project: Nutch
Type: Improvement
  Components: searcher  
Reporter: Matthias Jaekle
Priority: Minor
 Attachments: de.properties.tgz

The German properties files with the texts presented on the website were
incomplete or contained misspellings.
Please find the corrected files attached.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-78) German texts on website

2005-08-05 Thread Matthias Jaekle (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-78?page=all ]

Matthias Jaekle updated NUTCH-78:
-

Attachment: de.properties.tgz

anchors_de.properties, cached_de.properties, explain_de.properties, 
search_de.properties, text_de.properties

 German texts on website
 ---

  Key: NUTCH-78
  URL: http://issues.apache.org/jira/browse/NUTCH-78
  Project: Nutch
 Type: Improvement
   Components: searcher
 Reporter: Matthias Jaekle
 Priority: Minor
  Attachments: de.properties.tgz

 The German properties files with the texts presented on the website were
 incomplete or contained misspellings.
 Please find the corrected files attached.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Crawling directly from URL and Questions about using the index

2005-08-05 Thread Nils Hoeller
Hi,

since my first experiments were successful, I'm now actually starting to
integrate Nutch into my Website Visualisation Tool.

So now I've got my first questions:

1. I put a class into my project that works similarly to the
CrawlerTool.java main class.
This works fine if you have written the URLs into a file
like "urls". But now I want to crawl a site directly,
which means:

instead of using

WebDBInjector.main(prependFileSystem(fs, nameserver, new String[] { db,
-urlfile, rootUrlFile }));

I'd like to start crawling for, let's say, a String url which
is the wanted URL, and not a file that the URLs are in.

How could this be done?
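
One simple workaround (just a sketch, on the assumption that the injector
only accepts a file of URLs as in the call above; SingleUrlInject and the
commented-out argument layout are made up, adjust them to your version):
write the single URL to a temporary seed file and pass that file where the
-urlfile is expected.

import java.io.File;
import java.io.FileWriter;

public class SingleUrlInject {
  public static void main(String[] args) throws Exception {
    String url = "http://www.example.com/";          // the one URL to crawl
    File seed = File.createTempFile("seed", ".txt");
    FileWriter out = new FileWriter(seed);
    try {
      out.write(url + "\n");
    } finally {
      out.close();
    }
    // Then hand the temp file to the injector where a URL file is expected:
    // WebDBInjector.main(new String[] { "db", "-urlfile", seed.getPath() });
  }
}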

Or the other solution:

Can I have a Nutch process running that checks a certain file for
new URLs and does the crawling and indexing for me,
so that I can add URLs to that file (acting like a queue) from my
program?
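
For the queue idea, a minimal polling loop could look roughly like this -
just a sketch, not an existing Nutch tool; the file name, the interval and
the crawlAndIndex() hook are placeholders for your own calls into Nutch:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UrlQueueWatcher {
  public static void main(String[] args) throws Exception {
    Set<String> seen = new HashSet<String>();
    File queue = new File("url-queue.txt");   // plain text file, one URL per line
    while (true) {
      if (queue.exists()) {
        List<String> fresh = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(queue));
        String line;
        while ((line = in.readLine()) != null) {
          line = line.trim();
          if (line.length() > 0 && seen.add(line)) {
            fresh.add(line);                  // only URLs we haven't seen yet
          }
        }
        in.close();
        if (!fresh.isEmpty()) {
          crawlAndIndex(fresh);               // placeholder for inject/fetch/index
        }
      }
      Thread.sleep(60000);                    // poll once a minute
    }
  }

  static void crawlAndIndex(List<String> urls) {
    System.out.println("Would crawl: " + urls);
  }
}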


2.

Using the given .war file for searching the index after crawling works
fine. But where do I have to look to find out how it works?
I'm currently starting Tomcat from the crawled directory where
the segments are, and searching works fine.
But I'd like to implement searching in my application (I used Lucene
directly before), so it would be interesting to know how the .war works.
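
The search webapp is essentially a thin JSP layer over the NutchBean class
in org.apache.nutch.searcher. A rough sketch of programmatic use follows -
the method names are recalled from the 0.7-era API, so verify them against
NutchBean and search.jsp in your version before relying on them:

import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;

public class EmbeddedSearch {
  public static void main(String[] args) throws Exception {
    // NutchBean locates the index and segments the same way the webapp does,
    // which is why starting Tomcat from the crawl directory works.
    NutchBean bean = new NutchBean();
    Query query = Query.parse("american airlines");
    Hits hits = bean.search(query, 10);          // top 10 hits
    System.out.println("total hits: " + hits.getTotal());
    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      System.out.println(details.getValue("title")
          + " - " + details.getValue("url"));
    }
  }
}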


That's it for the moment.
Thanks for any kind of help.

Greetings, Nils



fetching redirect bug?

2005-08-05 Thread EM
Suppose we have to fetch 3 pages. 

Page A is http://something/login.php 
Page B is http://yyy/rrr/ which, when fetched, redirects to page A
Page C is http://yyy/ttt/ which, when fetched, redirects to page A

When fetching A, B, and C, the fetcher will actually fetch:
A
B
A
C
A

Is there any way to prevent the refetching of page A?
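
One way to avoid the repeats - a sketch only, not how the Nutch fetcher is
actually structured; the class here is made up for illustration - is to keep
a set of URLs already fetched in the current run and skip any redirect
target that has been seen before:

import java.util.HashSet;
import java.util.Set;

public class RedirectDedup {
  private final Set<String> fetched = new HashSet<String>();

  /** Returns true only the first time a URL is requested in this run. */
  public boolean shouldFetch(String url) {
    return fetched.add(url);   // add() returns false if the URL was already seen
  }

  public static void main(String[] args) {
    RedirectDedup dedup = new RedirectDedup();
    String[] requests = {
        "http://something/login.php",   // A
        "http://yyy/rrr/",              // B, redirects to A
        "http://something/login.php",   // A again (redirect target)
        "http://yyy/ttt/",              // C, redirects to A
        "http://something/login.php" }; // A again
    for (String url : requests) {
      System.out.println(url + " -> "
          + (dedup.shouldFetch(url) ? "fetch" : "skip"));
    }
  }
}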