date:20061103

hi all

2006-11-03 Thread kauu

Hi, I have a problem. I want to crawl the pages whose URLs contain ...item_detail, but I must start the crawl from www..com, and if I set rules in crawl-urlfilter.txt, I can't get the pages I want at all. What do I need to do now? Should I do something with regex-urlfilter.txt?
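
A rough sketch of what may be happening in the question above, not a definitive answer. Nutch's regex URL filter files hold one rule per line, a `+` or `-` followed by a regular expression, and the first rule that matches a URL decides whether it is kept; a URL that matches no rule is dropped. If the only accepting rule is for item_detail, the start page is rejected, so the crawler never reaches the detail pages. The Python below only simulates that first-match-wins logic. The host `www.example.com`, the paths, and the helper `accepts` are hypothetical, and this is not Nutch code.

```python
import re

def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule; no match means reject."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# Hypothetical rule sets; www.example.com stands in for the real site.
only_details = [("+", r"item_detail")]
site_wide = [("+", r"^http://www\.example\.com/")]

seed = "http://www.example.com/"
detail = "http://www.example.com/shop/item_detail?id=1"

print(accepts(only_details, seed))    # the seed itself is filtered out
print(accepts(only_details, detail))
print(accepts(site_wide, seed))
print(accepts(site_wide, detail))
```

Under these assumptions, one option would be to accept the whole site while crawling and pick out the item_detail pages afterwards.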

Amazon S3 and EC2

2006-11-03 Thread Zaheed Haque

Hi: I am just wondering if any of you have tried running Nutch on Amazon EC2 and saving crawl data to Amazon S3? Could you please tell us about your experience? EC2 is in closed beta, so I haven't been able to try it. Regards, Zaheed

Re: Amazon S3 and EC2

2006-11-03 Thread kauu

That's very good if it works. On 11/3/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: I am just wondering if any of you have tried running Nutch on Amazon EC2 and saving crawl data to Amazon S3? Could you please tell us about your experience? EC2 is in closed beta, so I haven't been able to try

Re: Amazon S3 and EC2

2006-11-03 Thread Andrzej Bialecki

kauu wrote: That's very good if it works. On 11/3/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: I am just wondering if any of you have tried running Nutch on Amazon EC2 and saving crawl data to Amazon S3? Could you please tell us about your experience? EC2 is in closed beta, so I haven't

.7x - .8x

2006-11-03 Thread Josef Novak

Hi, Very short question (hopefully). Is it possible to get bin/nutch fetch to print a log of the pages being downloaded to the command terminal? I have been using 0.7.2 up until now; in that version the fetch command outputs errors and the names of urls that the fetcher is attempting to

whoops

2006-11-03 Thread Josef Novak

Hi again, Never mind my previous mail. I found the log files. However, they don't seem to explain why, while I have many, many entries like this: 2006-11-03 21:00:55,710 INFO fetcher.Fetcher - fetching http://www.viaveneto.com.mx/ my segments and data files are only about 73k. And do not

Use and configuration of RegexUrlNormalize

2006-11-03 Thread Javier P. L.

Hi, I am using Nutch to crawl news sites. I have a problem with one of them, which publishes its URLs with &amp; instead of &. I discovered the URL normalizer and the regex-normalize.xml configuration file. Unfortunately I did not find many examples about how to use the regular

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Andrzej Bialecki

Javier P. L. wrote: Hi, I am using Nutch to crawl news sites. I have a problem with one of them, which publishes its URLs with &amp; instead of &. I discovered the URL normalizer and the regex-normalize.xml configuration file. Unfortunately I did not find many examples about how

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Josef Novak

<regex> <pattern>(.*)\&amp;amp;TEXTO(.*)</pattern> <substitution>$1&amp;TEXTO$2</substitution> </regex> I'm not a hundred percent sure, but perhaps you need another escape character in your substitution? <substitution>$1&amp;TEXTO$2</substitution> --> <substitution>$1\&amp;TEXTO$2</substitution> The examples in
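
To make the rule under discussion concrete, here is a small sketch in Python rather than Nutch's own Java and XML. It assumes the goal is to turn the HTML entity `&amp;` in a URL back into a plain `&`. The sample URL and the function names `fix_one` and `fix_all` are made up for illustration. Note that Python writes group references as `\1` where the Nutch file uses `$1`.

```python
import re

def fix_one(url):
    # Same shape as the rule in the thread: two greedy groups around one "&amp;TEXTO".
    return re.sub(r"(.*)&amp;TEXTO(.*)", r"\1&TEXTO\2", url)

def fix_all(url):
    # Simpler alternative: replace every "&amp;" regardless of what follows it.
    return re.sub(r"&amp;", "&", url)

sample = "http://news.example.com/story?id=7&amp;TEXTO=abc&amp;page=2"  # hypothetical URL
print(fix_one(sample))
print(fix_all(sample))
```

The first function only repairs the `&amp;` that sits directly in front of TEXTO, so a rule of that shape leaves other escaped ampersands in the URL untouched; the second repairs all of them.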

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Stefan Neufeind

Javier P. L. wrote: Hi, I am using Nutch to crawl news sites. I have a problem with one of them, which publishes its URLs with &amp; instead of &. I discovered the URL normalizer and the regex-normalize.xml configuration file. Unfortunately I did not find many examples about

Re : Urgent : Fetcher aborts with hung threads

2006-11-03 Thread Aïcha

Hi, I don't know why, but I have had no answer on the 3 forums where I posted my problem. As the Fetcher freeze occurs every time I try to fetch my file system, I can't imagine that I am the only one who has this problem, and as I said in my last e-mail, I found many mails about

Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Andrzej Bialecki

Josef Novak wrote: <regex> <pattern>(.*)\&amp;amp;TEXTO(.*)</pattern> <substitution>$1&amp;TEXTO$2</substitution> </regex> I'm not a hundred percent sure, but perhaps you need another escape character in your substitution? <substitution>$1&amp;TEXTO$2</substitution> --> <substitution>$1\&amp;TEXTO$2</substitution>

Newbie question - syntax error on bin/nutch

2006-11-03 Thread Kevin Dewalt

My apologies if this question is answered in the wiki or listserv archives; I searched both extensively and cannot find an answer. I followed the directions for the .8.x tutorial at http://lucene.apache.org/nutch/tutorial8.html and the wiki entry GettingNutchRunningWithWindows,

map-reduce takes too long before/after fetching

2006-11-03 Thread AJ Chen

I'm using Nutch 0.9-dev to crawl the web on 1 Linux server. With the default Hadoop configuration (local file system, no distributed crawling), the Generator and Fetcher spend a disproportionate amount of time on map-reduce operations. For example: 2006-11-01 20:32:44,074 INFO crawl.Generator - Generator:

Re: Re : Urgent : Fetcher aborts with hung threads

2006-11-03 Thread Dennis Kubes

The reason no one answered is because it has been answered before a couple of times. If you do a search on this mailing list for fetcher slowness or fetcher hung threads you will get answers. You can also take a look at NUTCH-344. This problem has come up before and there are patches which

Re: .7x - .8x

2006-11-03 Thread Tomi NA

2006/11/3, Josef Novak [EMAIL PROTECTED]: Hi, Very short question (hopefully). Is it possible to get bin/nutch fetch to print a log of the pages being downloaded to the command terminal? I have been using 0.7.2 up until now; in that version the fetch command outputs errors and the names of

Plain Explanation for NutchAnalysis.jj

2006-11-03 Thread Josef Novak

Hi, I was wondering if anyone knew of a resource, or could concisely explain, how the javacc-generated default nutch analyzer goes about tokenizing text. What I'm really looking for is a plain, nuts'n'bolts explanation of what gets tokenized, and what doesn't. I searched the web for a while

17 matches
