Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Lyndon Maydwell
I've been having problems with the merge portion of the script too. My solution was to check the success status of the merge ( $? ), and if it failed, try again, or wait until next time. nutch_bin/nutch mergesegs $merged_segment -dir $segments if [ $? -ne 0 ] then echo merging segments
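The retry approach described above can be sketched as a small shell helper. This is only an illustration: `retry_cmd`, the attempt count and the delay are assumptions made for the example, not part of Nutch or of the original script. Only the `mergesegs` invocation and the `$merged_segment` and `$segments` variables come from the post.

```shell
#!/bin/sh
# retry_cmd is a hypothetical helper (not part of Nutch).
# Usage: retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits with status 0 or MAX_ATTEMPTS is reached.
retry_cmd() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Success: stop retrying and report success to the caller.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
        # Pause only if another attempt will follow.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay_seconds"
        fi
    done
    # Every attempt failed; the caller can decide to wait until next time.
    return 1
}

# Example call, using the merge command from the post
# (3 attempts and a 60-second delay are arbitrary choices):
#   retry_cmd 3 60 nutch_bin/nutch mergesegs "$merged_segment" -dir "$segments" \
#       || echo "merging segments failed, will try again next run"
```

Checking the exit status directly in the `if` avoids reading `$?` on a later line, where another command could have overwritten it.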

Blank result page

2007-09-20 Thread balachanthar palanivelu
Dear Nutch users, I have a problem with Nutch results: sometimes it gives me a blank page without any error, but when I look at the log file I see some errors. I don't understand how to solve it; I have tried everything I could think of, so I thought of asking you. I am using fedora 6 + tomcat5 + jre6+

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Tomislav Poljak
Hi, I had the same problem using the re-crawl scripts from the wiki. They all work fine with Nutch versions up to and including 0.9, but with nutch-1.0-dev (from trunk) they break at the merge of indexes. The reason is that merge in nutch-0.9 (from re-crawl scripts): bin/nutch merge crawl/indexes

Re: maybe dumb question about nutch index and segments file

2007-09-20 Thread Martin Kuen
hi, regarding hit summaries: The summaries are generated at search time. This is necessary, since different queries will generate different summaries (and different terms will be highlighted). The parsed text is stored in the various segments/timestamp folders. I don't know which directory it

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Alexis Votta
Hi Tomislav and Nutch users, I could not solve the problem with your instructions. I crawled two times. The re-crawl generated crawl/NEWindexes; crawl/indexes was generated in the first crawl. I merged them with: bin/nutch merge crawl/index crawl/indexes/ crawl/NEWindexes/ Now search.jsp is showing

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Alexis Votta
I have merged the old as well as the new segments into the segments dir. The same error still occurs. On 9/20/07, Tomislav Poljak [EMAIL PROTECTED] wrote: Hi Alexis, I think that your problem is not so much in index (or merging indexes) but in segments, because if you look at the exception you will see

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Susam Pal
We can do two things to solve this problem. SOLUTION 'A' 1. Once the 'depth' loop is complete, merge the segments in 'crawl/segments/'. ('crawl/segments/' will have one merged segment of the past plus all the segments generated in the depth loop, one for each iteration of the loop.) They are now

Nutch Dedup Question

2007-09-20 Thread karthik085
Hi, I am a little confused about what exactly dedup does. a. Does dedup delete duplicate documents from Index and Segments? b. Is there a way that we could delete duplicated documents for two segments? Let me know. Thanks.

Indexing Process

2007-09-20 Thread Jeff Maki
Hello everyone, I'm not going to post my config files, so as not to spam you all, but I have a general question: I'm trying to index the pages of a website (obviously), and I've created a special page with a link to all the pages I want to index. I then pointed Nutch to this special link page. I set

cached page not showing images

2007-09-20 Thread Joseph M.
I am having a problem with cached pages: images are not showing in them. How can I make images show in them? I am new to Nutch and having difficulties. Please help me show images in cached pages.

Re: Nutch Dedup Question

2007-09-20 Thread Andrzej Bialecki
karthik085 wrote:
> Hi, I am a little confused about what exactly dedup does.
> a. Does dedup delete duplicate documents from Index and Segments?
Only from the index.
> b. Is there a way that we could delete duplicated documents for two segments?
bin/nutch mergesegs
-- Best regards, Andrzej

Re: Nutch Dedup Question

2007-09-20 Thread karthik085
Thanks - that's much clearer. Andrzej Bialecki wrote: karthik085 wrote: Hi, I am a little confused about what exactly dedup does. a. Does dedup delete duplicate documents from Index and Segments? Only from the index. b. Is there a way that we could delete duplicated documents

Re: cached page not showing images

2007-09-20 Thread Susam Pal
See NUTCH-281. https://issues.apache.org/jira/browse/NUTCH-281 On 9/20/07, Joseph M. [EMAIL PROTECTED] wrote: I am having a problem with cached pages: images are not showing in them. How can I make images show in them? I am new to Nutch and having difficulties. Please help me show images

Changing HTTP/1.0 to HTTP/1.1

2007-09-20 Thread Joseph M.
Nutch uses an HTTP/1.0 GET request. If I change the Java code in HttpResponse.java to reqStr.append(" HTTP/1.1\r\n"); will it cause any problems?

Newbie questions about filter, bandwidth, NTLM and threads

2007-09-20 Thread Bent Hugh
I have some newbie questions.
- There are two filters, crawl-urlfilter.txt and regex-urlfilter.txt. Which one should be configured under which conditions?
- Is it possible to see how much bandwidth a Nutch crawl consumes?
- Can the Nutch bot do NTLM authentication for websites in a domain?
- Is there

Re: Indexing Process

2007-09-20 Thread Carl Cerecke
Look in nutch-default.xml. The properties db.max.outlinks.per.page and http.content.limit might need to have their values increased. Cheers, Carl. Jeff Maki wrote: Hello everyone, I'm not going to post my config files, so as not to spam you all, but I have a general question: I'm trying to
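Before raising the two properties named in the reply, it can help to see what they are currently set to. The helper below is a rough sketch, not a Nutch tool: `show_property` is a hypothetical name, and it assumes the usual layout in which each `<property>` holds a `<name>` element followed by a `<value>` element.

```shell
#!/bin/sh
# show_property is a hypothetical helper (not part of Nutch).
# Usage: show_property CONFIG_FILE PROPERTY_NAME
# Prints the <value> that follows the matching <name> element, if any.
show_property() {
    awk -v name="$2" '
        # Literal (non-regex) match on the name element.
        index($0, "<name>" name "</name>") { found = 1 }
        # First <value> at or after the matching name line.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Example calls (the conf/ path is an assumption about the install layout):
#   show_property conf/nutch-default.xml db.max.outlinks.per.page
#   show_property conf/nutch-default.xml http.content.limit
```

Overrides are normally placed in nutch-site.xml rather than edited into nutch-default.xml, so the same helper can be pointed at that file to confirm which value was set there.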

Policy of merging patches

2007-09-20 Thread Bent Hugh
I was browsing the Nutch JIRA. As far as I can see, some patches are merged into trunk and some are merged into hudson - Nutch-Nightly. This is pretty confusing to me as a user. Which branch should I check out if I want the latest Nutch with cutting-edge features and the fewest open issues?