interesting paper with competing index systems

2006-01-19 Thread Byron Miller
http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf

Anyone have any further details on this?




wildcard matches not working?

2006-01-19 Thread cilquirm . 20552126
I can't seem to get wildcard matches (test*) to work in my index using
the default Nutch search application.

Is there something I'm missing?


I'm using Nutch built from trunk, with a patch applied that keeps content
inside htdig-noindex boundaries from being indexed.

Thanks in advance for any help,
-a


Re: XP/Cygwin setup problems

2006-01-19 Thread Pashabhai
Hi,

  You get that error when you run the earlier 0.7 Nutch tutorial against a
0.8-dev build of Nutch.

  Use the tutorial for 0.8-dev instead:
http://wiki.media-style.com/display/nutchDocu/quick+tutorial+for+nutch+0.8+and+later

  Or add the following property to nutch-site.xml:

<property>
  <name>mapred.input.dir</name>
  <value>C:/cygwin/usr/local/src/nutch-nightly/conf</value>
  <description>The proxy port.</description>
</property>


P

Hi all,

Having some problems getting Nutch to run on
XP/Cygwin.
This is regarding the nutch-2006-01-17 nightly build.

Intranet crawl

When I do this (after making urls file, etc.):

   bin/nutch crawl urls -dir cdir -depth 2 > log

I get this in the log:

060117 114833 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/crawl-tool.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
060117 114834 crawl started in: cdir
060117 114834 rootUrlDir = urls
060117 114834 threads = 10
060117 114834 depth = 2
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/crawl-tool.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
060117 114834 Injector: starting
060117 114834 Injector: crawlDb: cdir\crawldb
060117 114834 Injector: urlDir: urls
060117 114834 Injector: Converting injected urls to
crawl db entries.
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/crawl-tool.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
060117 114834 Running job: job_krj0e1
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
060117 114834 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
060117 114835 parsing
\tmp\nutch\mapred\local\localRunner\job_krj0e1.xml
060117 114835 parsing
file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
java.io.IOException: No input directories specified
in: NutchConf: nutch-default.xml , mapred-default.xml
, \tmp\nutch\mapred\local\localRunner\job_krj0e1.xml ,
nutch-site.xml
at
org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
at
org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
at
org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
060117 114835  map 0%
java.io.IOException: Job failed!
at
org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at
org.apache.nutch.crawl.Injector.inject(Injector.java:102)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Exception in thread "main"

I see that:

nutch-site.xml is empty
mapred-default.xml is empty


Whole Web setup 

When I do this: (after mkdirs)

bin/nutch admin db -create
 
I get this at the prompt:

Exception in thread "main"
java.lang.NoClassDefFoundError: admin

I don't speak Java, so I'm not sure what it's saying.


Please help.

TIA.







Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Mike Smith
Hi Florent,

I did some more testing. Here are the results:

I have 3 machines, P4 with 1 GB RAM each. All three are data nodes and one is
also the namenode. I started from 8 seed urls and tried to see the effect of a
depth-1 crawl for different configurations.

The number of unfetched pages changes with the configuration:

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of thread per host: 2
http.timeout: 10 sec
---
6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of thread per host: 20
http.timeout: 10 sec
---
18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of thread per host: 20
http.timeout: 10 sec
---
37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of thread per host: 20
http.timeout: 10 sec
---
34000 pages fetched


--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of thread per host: 100
http.timeout: 20 sec
---
52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of thread per host: 100
http.timeout: 20 sec
---
57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of thread per host: 20
http.timeout: 20 sec
---
6 pages fetched



Do you have any idea why pages are missing from the fetcher without any
log entries or exceptions? It seems to really depend on the number of reduce
tasks!
Thanks, Mike
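
For reference, the knobs varied in these configurations correspond to
properties along the following lines (a sketch assuming the 0.8-dev property
names in mapred-default.xml and nutch-site.xml; values shown are those of
Configuration 1):

<property>
  <name>mapred.map.tasks</name>
  <value>3</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>40</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>

<property>
  <name>http.timeout</name>
  <!-- http.timeout is specified in milliseconds, so 10 sec = 10000 -->
  <value>10000</value>
</property>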



On 1/17/06, Mike Smith [EMAIL PROTECTED] wrote:

 I've experienced the same effect. When I decrease the number of map/reduce
 tasks, I can fetch more web pages, but increasing them increases the number
 of unfetched pages. I also get some java.net.SocketTimeoutException: Read
 timed out exceptions in my datanode log files. But those timeout problems
 couldn't cause this many missing pages! I agree the problem is probably
 somewhere in the fetcher.

 Mike


 On 1/17/06, Florent Gluck [EMAIL PROTECTED] wrote:
 
  I'm having the exact same problem.
  I noticed that changing the number of map/reduce tasks gives me
  different DB_fetched results.
  Looking at the logs, a lot of urls are actually missing.  I can't find
  their trace *anywhere* in the logs (whether on the slaves or the
  master).  I'm puzzled.  Currently I'm trying to debug the code to see
  what's going on.
  So far, I noticed the generator is fine, so the issue must lie further
  in the pipeline (fetcher?).
 
  Let me know if you find anything regarding this issue. Thanks.
 
  --Flo
 
  Mike Smith wrote:
 
  Hi,
  
  I have set up four boxes using MapReduce and everything goes smoothly. I
  have fed about 8 seed URLs to begin with and I have crawled to depth 2.
  Only 1900 pages (about 300 MB of data) were fetched and the rest is marked
  as db_unfetched.
  Does anyone know what could be wrong?
  
  This is the output of (bin/nutch readdb h2/crawldb -stats):
  
  060115 171625 Statistics for CrawlDb: h2/crawldb
  060115 171625 TOTAL urls:   99403
  060115 171625 avg score:1.01
  060115 171625 max score:7.382
  060115 171625 min score:1.0
  060115 171625 retry 0:  99403
  060115 171625 status 1 (DB_unfetched):  97470
  060115 171625 status 2 (DB_fetched):1933
  060115 171625 CrawlDb statistics: done
  
  Thanks,
  Mike
  
  
  
 
 



Re: interesting paper with competing index systems

2006-01-19 Thread Byron Miller
That's exactly how I felt. No mention of the JVM/platform
or the options (or versions) used. I've just been
bombarded by someone (who I can probably assume
works on or uses the aforementioned program) asking me
why I use Lucene on all of my projects.

The paper hardly seems academic, even though it appears
that is what they're going for.

Thanks again for the quick follow-up.

--- Doug Cutting [EMAIL PROTECTED] wrote:

 Byron Miller wrote:
 

http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf
  
  Anyone have any further details on this?
 
 The first author of the paper is also the founder of the company which
 sells the software described, so these benchmarks should not be
 considered entirely objective.

 That's not to say that IXE is not faster than Lucene; it might well be.
 But they do not list any JVM details, the Lucene version, or any Lucene
 options.  Chances are, with a few informed tweaks, one could improve
 Lucene's performance on this benchmark.  Chances are also that IXE was
 configured for optimal performance on this benchmark, since it was
 performed by the authors of IXE.

 Also note that this is a micro-benchmark, designed to highlight their
 skip implementation.  A better comparison would average times from a log
 of real user queries.

 Please feel free to try to obtain the IXE software and perform
 benchmarks of your own.

 Doug
 



Re: How do I control log level with MapReduce?

2006-01-19 Thread Doug Cutting

Chris Schneider wrote:
I'm trying to bring up a MapReduce system, but am confused about how to 
control the logging level. It seems like most of the Nutch code is still 
logging the way it used to, but the -logLevel parameter that was getting 
passed to each tool's main() method no longer exists (not that these 
main methods are getting called by Crawl.java, of course). Previously, 
if -logLevel was omitted, each tool would set its logLevel field to 
INFO, but those fields no longer exist either. The result seems to be 
that the logging level defaults all the way back to the LogFormatter, 
which sets all of its handlers to FINEST.


I was sort of expecting there to be a new configuration property 
(perhaps a job configuration property?) that would control the logging 
level, but I don't see anything like this. Any guidance would be greatly 
appreciated.


There is no config property to control logging level.  That would be a 
useful addition, if someone wishes to contribute it.


In the meantime, Nutch uses Java's built-in logging mechanism. 
Instructions for configuring that are in:


http://java.sun.com/j2se/1.4.2/docs/api/java/util/logging/LogManager.html

Doug


Re: interesting paper with competing index systems

2006-01-19 Thread Jérôme Charron
 Not only that, they mention that each test was run _twice_ to get an
 average score. With hotspot JVMs this is ridiculous, you need to run at
 least a dozen or more cycles so that the hotspots are recompiled... This
 alone discredits the results in my eyes.

Yes, running a load or performance test with only two runs really makes
no sense with a JVM.
On some complex telco systems we noticed that a JVM only becomes hot after
many minutes at a sustained 100 req/s or more
(so running only two tests is REALLY RIDICULOUS).

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Can't index some pages

2006-01-19 Thread Doug Cutting

Michael Plax wrote:

Question summary:
Q: How can I set up the crawler to index an entire web site?

I'm trying to run a crawl with the command from the tutorial:

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.

3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 > crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawled/db -stats
   output:
  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 155526 No FS indicated, using default:local
  Stats for [EMAIL PROTECTED]
  ---
  Number of pages: 63
  Number of links: 3906
6. I get fewer pages than I expected.


This is a common question, but there's not a common answer.  The problem 
could be that urls are blocked by your url filter, or by 
http.max.delays, or something else.
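
If http.max.delays is the suspect, one quick experiment is to raise it in
nutch-site.xml; a minimal sketch, assuming the property name and semantics
from nutch-default.xml (the shipped default is 3):

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>Number of times a fetcher thread will wait for a busy
  host before giving up on a URL.</description>
</property>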


What might help is if the fetcher and crawl db printed more detailed 
statistics.  In particular, the fetcher could categorize failures and 
periodically print a list of failure counts by category.  The crawl db 
updater could also list the number of urls that are filtered.


In the meantime, please examine the logs, particularly watching for 
errors while fetching.


Doug


Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Florent Gluck
Hi Mike,

Your different tests are really interesting, thanks for sharing!
I didn't do as many tests. I changed the number of fetch threads and the
number of map and reduce tasks and noticed that it gave me quite
different results in terms of pages fetched.
Then I wanted to see if this issue would still happen when running the
crawl (single pass) on one single machine running everything locally,
without NDFS.
So I injected 5 urls and got 2315 urls fetched.  I couldn't find a
trace of most of the urls in the logs.
I noticed that if I put a counter at the beginning of the while(true)
loop in the run method of Fetcher.java, I don't end up with 5!
After some poking around, I noticed that if I comment out the line doing
the page fetch, ProtocolOutput output = protocol.getProtocolOutput(key,
datum);, then I get 5.
There seems to be something really wrong with that.  It seems to mean
that some threads are dying without notification in the http protocol
code (if that makes any sense).
I then decided to switch to the old http protocol plugin, protocol-http
(in nutch-default.xml), instead of protocol-httpclient.
With the old protocol I got 5 as expected.

The following bug seems to be very similar to what we are encountering:
http://issues.apache.org/jira/browse/NUTCH-136
Check out the latest comment.  I'm gonna remove line 211 and run some
tests to see how it behaves (with protocol-http and protocol-httpclient).

I'll let you know what I find out,
--Florent





Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Doug Cutting

Florent Gluck wrote:

I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient
With the old protocol I got 5 as expected.


There have been a number of complaints about unreliable fetching with 
protocol-httpclient, so I've switched the default back to protocol-http.


Doug
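
For anyone who wants this behaviour without picking up the new default, the
plugin list is controlled by plugin.includes in nutch-site.xml. A sketch only:
copy the plugin.includes value from your own nutch-default.xml and swap
protocol-httpclient for protocol-http; the list below is illustrative, not
the shipped default:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>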


please help: Recovered from failed datanode connection

2006-01-19 Thread Gal Nitzan
 I am getting more and more of these messages.

 Though it seems there is no side effect.

 Any info would be appreciated.

 G.
 
 
 
 
 




Re: Error at end of MapReduce run with indexing

2006-01-19 Thread Doug Cutting

Matt Zytaruk wrote:
I am having this same problem during the reduce phase of fetching, and 
am now seeing:

060119 132458 Task task_r_obwceh timed out.  Killing.


That is a different problem: a different timeout.  This happens when a
task does not report status for too long; it is then assumed to be hung.



Will the jobtracker restart this job?


It will retry that task up to three times.

If so, if I change the ipc timeout 
in the config, will the tasktracker read in the new value when the job 
restarts?


The ipc timeout is not the relevant timeout.  The task timeout is what's 
involved here.  And, no, at present I think the tasktracker only reads 
this when it is started, not per job.


Doug
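
If you do want to adjust the task timeout Doug describes, it is a mapred
configuration property. The name and default below are assumptions based on
the mapred-default.xml of that era, so verify them against your own build:

<property>
  <name>mapred.task.timeout</name>
  <!-- assumed name/default: milliseconds a task may go without reporting
       status before it is considered hung and killed -->
  <value>600000</value>
</property>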


Re: Can't index some pages

2006-01-19 Thread Matt Kangas
Doug, would it make sense to print a LOG.info() message every time  
the fetcher bumps into one of these db.max limits? This would help  
users find out when they need to adjust their configuration.


I can prepare a patch if it seems sensible.

--Matt

On Jan 19, 2006, at 5:34 PM, Michael Plax wrote:


Thank you very much,

I changed db.max.outlinks.per.page and db.max.anchor.length to 200
and I got the whole web site indexed.

This particular web site has more than 100 outbound links per page.

Michael
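
In nutch-site.xml terms, Michael's change looks like this (property names and
the value 200 are taken from his message above):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>200</value>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>200</value>
</property>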

- Original Message - From: Steven Yelton  
[EMAIL PROTECTED]

To: nutch-user@lucene.apache.org
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages



Is it not catching all the outbound links?

db.max.outlinks.per.page

I think the default is 100.  I had to bump it up significantly to  
index a reference site...


Steven

Michael Plax wrote:


Hello,

Question summary:
Q: How can I set up the crawler to index an entire web site?

I'm trying to run a crawl with the command from the tutorial:

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 > crawl.log

4. Crawling is finished
5. I run: bin/nutch readdb crawled/db -stats
  output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 155526 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 ---
 Number of pages: 63
 Number of links: 3906
6. I get less pages than I have expected.

What I did:
0. I read http://www.mail-archive.com/nutch- 
[EMAIL PROTECTED]/msg02458.html

1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear - I do get
that page indexed.

   output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 162103 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 ---
 Number of pages: 64
 Number of links: 3906
This page appears at depth 3 from index.html.
Q: How can I set up the crawler to index the entire web site?

Thank you
Michael

P.S.
I have attached configuration files


urls

http://www.totallyfurniture.com/index.html



crawl-urlfilter.txt

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:,  mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$


# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]


# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/


# skip everything else
-.





--
Matt Kangas / [EMAIL PROTECTED]




Re: Can't index some pages

2006-01-19 Thread Doug Cutting

Matt Kangas wrote:
Doug, would it make sense to print a LOG.info() message every time the 
fetcher bumps into one of these db.max limits? This would help users 
find out when they need to adjust their configuration.


I can prepare a patch if it seems sensible.


Sure, this is sensible.  But it's not done by the fetcher; it happens when
the links are read, during the db update.


Doug


RE: interesting paper with competing index systems

2006-01-19 Thread Fuad Efendi
Another interesting tool for performing linguistic analysis on natural-language
data:

http://www.alias-i.com/lingpipe/
- is it really an indexing engine?

They are using the NekoHTML parser.


-Original Message-
From: Byron Miller 
http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf
Anyone have any further details on this?






Re: getOutlinks doesn't work properly

2006-01-19 Thread Matt Kangas
Good call. That's another limit where it'd be nice to see a log  
message when it's exceeded. I'll try to add a patch to NUTCH-182  
tomorrow for this.


--Matt

On Jan 19, 2006, at 11:39 PM, Fuad Efendi wrote:


<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is larger than zero, content longer than it will be
  truncated; otherwise (zero or negative), no truncation at all.
  </description>
</property>

(default is 65536)



-Original Message-
From: Jack Tang

Hi

Please change the value of the db.max.outlinks.per.page property (default is
100) to, say, 1000.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process
  for a page.
  </description>
</property>

/Jack

On 1/20/06, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote:

  Hi everyone,

  I found that the getOutlinks function in html-parser/DOMContentUtils.java
doesn't work correctly in some cases. An example is this website:
http://blog.donews.com/boyla/. The function returns only 170 records, while
in fact the page contains a lot more (Firefox returns 356 links!).

  When I compare the hyperlink list with the one returned by Firefox, the
orders are exactly identical, meaning that the 170th link of the getOutlinks
function is the same as the 170th link in Firefox. Therefore, it seems that
the algorithm is correct, but there is some bug around. There is no
threshold at this point, since the max outlinks parameter is applied at the
updatedb step. Even when I increase the max outlinks to 1000, the situation
still remains.

  Any suggestions are very much appreciated.

  Regards,
  Giang





--
Keep Discovering ... ...
http://www.jroller.com/page/jmars




--
Matt Kangas / [EMAIL PROTECTED]




org.apache.nutch.indexer.IndexMerger (Nutch 0.7)

2006-01-19 Thread Chun Wei Ho
Hi,

Could anyone let me know definitively whether the
IndexMerger(NutchFileSystem nfs, File[] segments, File outputIndex,
File localWorkingDir)
merge operation merges the segments and overwrites any existing index
at outputIndex, or merges the segments into the existing index at
outputIndex?

If it overwrites, is there another method to merge segments into an
existing index without needing to copy the existing index to a
temporary area and specify it as one of the input segments?

Thanks. I am using Nutch 0.7

Regards,
CW