hi all

2006-11-03 Thread kauu
Hi, I have a problem now. I want to crawl the pages whose URLs contain ...item_detail, but I must crawl from www..com, and if I set rules in crawl-urlfilter.txt, I can't get the pages I want at all. So what do I need to do now? Should I do something with regex-urlfilter.txt?
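For readers following this thread, here is a minimal Python sketch of how an ordered list of "+regex" / "-regex" rules behaves when the first matching rule decides the outcome. It only imitates that rule style and is not Nutch code. The host www.example.com is a placeholder (the original message elides the real host), and the helper names parse_rules and accepts are made up for illustration. One possible cause of the symptom, offered as an assumption only, is that a filter accepting nothing but item_detail URLs also rejects the start page, so no links are ever followed.

```python
import re

# Hypothetical rules in the "+regex / -regex" style of a URL filter file.
# www.example.com is a placeholder; the original message elides the real host.
RULES = r"""
# accept the item detail pages
+item_detail
# accept the start host so that its links can be followed
+^http://www\.example\.com/
# reject everything else
-.
"""

def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs, keeping their order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: treat the URL as rejected

rules = parse_rules(RULES)
```

With these rules, accepts(rules, "http://www.example.com/") is True, so the start page can be fetched and its links discovered, while a URL on an unrelated host without item_detail falls through to the final "-." rule and is rejected.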

Amazon S3 and EC2

2006-11-03 Thread Zaheed Haque
Hi: I am just wondering if any of you have tried running Nutch on Amazon EC2 and saving crawl data on Amazon S3? Could you please tell us about your experience. EC2 is in closed beta so I haven't been able to try it. Regards, Zaheed

Re: Amazon S3 and EC2

2006-11-03 Thread kauu
That's very good if it works. On 11/3/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: I am just wondering if any of you have tried running Nutch on Amazon EC2 and saving crawl data on Amazon S3? Could you please tell us about your experience. EC2 is in closed beta so I haven't been able to try

Re: Amazon S3 and EC2

2006-11-03 Thread Andrzej Bialecki
kauu wrote: That's very good if it works. On 11/3/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: I am just wondering if any of you have tried running Nutch on Amazon EC2 and saving crawl data on Amazon S3? Could you please tell us about your experience. EC2 is in closed beta so I haven't

.7x -> .8x

2006-11-03 Thread Josef Novak
Hi, Very short question (hopefully). Is it possible to get bin/nutch fetch to print a log of the pages being downloaded to the command terminal? I have been using 0.7.2 up until now; in that version the fetch command outputs errors and the names of urls that the fetcher is attempting to

whoops

2006-11-03 Thread Josef Novak
Hi again, Never mind my previous mail. I found the log files. However, they don't seem to explain why, although I have many, many entries like this: 2006-11-03 21:00:55,710 INFO fetcher.Fetcher - fetching http://www.viaveneto.com.mx/ my segments and data files are only about 73k. And do not
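To compare the number of fetch attempts in the log with what ends up in the segments, the URLs can be pulled out of lines like the one quoted above. The following Python sketch assumes every fetch entry has the same shape as that single quoted line; the regular expression and the function name fetched_urls are illustrative only and are not part of Nutch.

```python
import re

# Pattern modelled on the one log line quoted in the message above.
# It assumes all fetch entries share that layout, which may not hold.
FETCH_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"INFO\s+fetcher\.Fetcher - fetching (?P<url>\S+)\s*$"
)

def fetched_urls(lines):
    """Collect the URL from every line that looks like a fetch entry."""
    urls = []
    for line in lines:
        match = FETCH_LINE.match(line)
        if match:
            urls.append(match.group("url"))
    return urls

sample_log = [
    # the line quoted in the message
    "2006-11-03 21:00:55,710 INFO fetcher.Fetcher - fetching http://www.viaveneto.com.mx/",
    # a made-up line that is not a fetch entry, to show that it is skipped
    "this line is not a fetch entry",
]
urls = fetched_urls(sample_log)
print(len(urls), "fetch attempts found")
```

In practice the lines would come from the log file, for example via open(path) with a path of your own; counting distinct URLs with len(set(urls)) shows how many different pages were attempted.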

Use and configuration of RegexUrlNormalize

2006-11-03 Thread Javier P. L.
Hi, I am using Nutch to crawl news sites. I have a problem with one of them, which publishes its URLs with &amp; instead of &. I discovered the URL normalizer and the regex-normalize.xml configuration file. Unfortunately I did not find many examples of how to use the regular

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Andrzej Bialecki
Javier P. L. wrote: Hi, I am using Nutch to crawl news sites. I have a problem with one of them, which publishes its URLs with &amp; instead of &. I discovered the URL normalizer and the regex-normalize.xml configuration file. Unfortunately I did not find many examples of how

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Josef Novak
<regex> <pattern>(.*)\&amp;amp;TEXTO(.*)</pattern> <substitution>$1&amp;TEXTO$2</substitution> </regex> I'm not a hundred percent sure, but perhaps you need another escape character in your substitution? <substitution>$1&amp;TEXTO$2</substitution> --> <substitution>$1\&amp;TEXTO$2</substitution> The examples in
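The intent of such a rule can be tried outside Nutch before editing regex-normalize.xml. The Python sketch below assumes the problem URLs contain the literal text &amp; where a plain & is meant, as the original question describes. The example URL is hypothetical. Python writes group references as \1 where the Java-style rule in the thread uses $1, and the XML layer of escaping in the configuration file is not modelled here at all.

```python
import re

# Counterpart of the rule discussed in the thread: capture what surrounds
# "&amp;TEXTO" and put it back with a plain "&".
RULE_PATTERN = re.compile(r"(.*)&amp;TEXTO(.*)")
RULE_SUBSTITUTION = r"\1&TEXTO\2"  # Java-style rules would write $1 and $2

def apply_rule(url):
    """Apply the single pattern/substitution pair to a URL."""
    return RULE_PATTERN.sub(RULE_SUBSTITUTION, url)

def unescape_all_ampersands(url):
    """More general variant: replace every literal "&amp;" with "&"."""
    return url.replace("&amp;", "&")

# Hypothetical URL of the kind described in the question.
broken = "http://news.example.com/story?id=1&amp;TEXTO=abc"
fixed = apply_rule(broken)
print(fixed)
```

The general variant does not depend on a specific parameter name such as TEXTO, which may be preferable if the site escapes every ampersand in its links.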

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Stefan Neufeind
Javier P. L. wrote: Hi, I am using Nutch to crawl news sites. I have a problem with one of them, which publishes its URLs with &amp; instead of &. I discovered the URL normalizer and the regex-normalize.xml configuration file. Unfortunately I did not find many examples about

Re : Urgent : Fetcher aborts with hung threads

2006-11-03 Thread Aïcha
Hi, I don't know why, but I have had no answer on the 3 forums where I sent my problem. As the Fetcher freeze occurs every time I try to fetch my file system, I can't imagine that I am the only one who has this problem, and as I said in my last e-mail, I found many mails about

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Andrzej Bialecki
Josef Novak wrote: <regex> <pattern>(.*)\&amp;amp;TEXTO(.*)</pattern> <substitution>$1&amp;TEXTO$2</substitution> </regex> I'm not a hundred percent sure, but perhaps you need another escape character in your substitution? <substitution>$1&amp;TEXTO$2</substitution> --> <substitution>$1\&amp;TEXTO$2</substitution>

Newbie question - syntax error on bin/nutch

2006-11-03 Thread Kevin Dewalt
My apologies if this question is answered in the wiki or listserv archives; I searched both extensively and cannot find an answer. I followed the directions for the .8.x tutorial at http://lucene.apache.org/nutch/tutorial8.html and the wiki entry GettingNutchRunningWithWindows,

map-reduce takes too long before/after fetching

2006-11-03 Thread AJ Chen
I'm using Nutch 0.9-dev to crawl the web on one Linux server. With the default Hadoop configuration (local file system, no distributed crawling), the Generator and Fetcher spend a disproportionate amount of time on map-reduce operations. For example: 2006-11-01 20:32:44,074 INFO crawl.Generator - Generator:

Re: Re : Urgent : Fetcher aborts with hung threads

2006-11-03 Thread Dennis Kubes
The reason no one answered is that this has been answered a couple of times before. If you search this mailing list for fetcher slowness or fetcher hung threads, you will get answers. You can also take a look at NUTCH-344. This problem has come up before and there are patches which

Re: .7x -> .8x

2006-11-03 Thread Tomi NA
2006/11/3, Josef Novak [EMAIL PROTECTED]: Hi, Very short question (hopefully). Is it possible to get bin/nutch fetch to print a log of the pages being downloaded to the command terminal? I have been using 0.7.2 up until now; in that version the fetch command outputs errors and the names of

Plain Explanation for NutchAnalysis.jj

2006-11-03 Thread Josef Novak
Hi, I was wondering if anyone knew of a resource, or could concisely explain, how the javacc-generated default nutch analyzer goes about tokenizing text. What I'm really looking for is a plain, nuts'n'bolts explanation of what gets tokenized, and what doesn't. I searched the web for a while