Hi,
I have a problem. I want to crawl pages whose URLs contain ...item_detail, but I
must start the crawl from www..com. If I set rules in crawl-urlfilter.txt, I
can't get the pages I want at all.
So what do I need to do now?
Should I do something with regex-urlfilter.txt?
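For what it's worth, one common approach is to keep the fetch filters broad enough that the crawler can still reach the listing pages that link to the item pages, and restrict only afterwards. A sketch of conf/crawl-urlfilter.txt, where www.example.com stands in for the elided host (the real hostname and item pattern are not given above):

```
# conf/crawl-urlfilter.txt (sketch; www.example.com is a hypothetical host)
# Allow everything on the start host, so the crawler can traverse the
# listing pages that eventually link to the item_detail pages.
+^http://www\.example\.com/
# Reject everything else.
-.
```

If the filter admits only URLs containing item_detail, the crawl dies at the seed, because the intermediate pages needed to discover those URLs are rejected before they are ever fetched.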
Hi:
I am just wondering if any of you have tried running Nutch on Amazon
EC2 and saving the crawl data on Amazon S3? Could you please tell us
about your experience? EC2 is a closed beta, so I haven't been able to
try it.
Regards,
Zaheed
That's very good if it works.
On 11/3/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
I am just wondering if any of you have tried running Nutch on Amazon
EC2 and saving the crawl data on Amazon S3? Could you please tell us
about your experience? EC2 is a closed beta, so I haven't been able to
try
kauu wrote:
That's very good if it works.
On 11/3/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
I am just wondering if any of you have tried running Nutch on Amazon
EC2 and saving the crawl data on Amazon S3? Could you please tell us
about your experience? EC2 is a closed beta, so I haven't
Hi,
A very short question (hopefully): is it possible to get bin/nutch
fetch to print a log of the pages being downloaded to the command
terminal? I have been using 0.7.2 up until now; in that version the
fetch command outputs errors and the names of the URLs that the fetcher is
attempting to
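In 0.8.x the fetcher logs through log4j rather than straight to stdout, so one possible approach (a sketch only; "stdout" is an appender name defined here, not something present in the stock conf/log4j.properties) is to attach a console appender to the Fetcher's logger:

```
# conf/log4j.properties (sketch): mirror Fetcher log lines to the terminal.
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %p %c - %m%n

# Route the fetcher's INFO messages to that appender.
log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,stdout
```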
Hi again,
Never mind my previous mail; I found the log files. However, they
don't seem to explain why, while I have many, many entries like this:
2006-11-03 21:00:55,710 INFO fetcher.Fetcher - fetching
http://www.viaveneto.com.mx/
My segments and data files are only about 73k. And do not
Hi,
I am using Nutch to crawl news sites. I have a problem with one of
them that publishes its URLs with &amp; instead of &. I discovered the
use of the URL normalizer and the regex-normalize.xml configuration
file. Unfortunately I did not find many examples of how to use
the regular
Javier P. L. wrote:
Hi,
I am using Nutch to crawl news sites. I have a problem with one of
them that publishes its URLs with &amp; instead of &. I discovered the
use of the URL normalizer and the regex-normalize.xml configuration
file. Unfortunately I did not find many examples of how
<regex>
  <pattern>(.*)\&amp;amp;TEXTO(.*)</pattern>
  <substitution>$1&amp;TEXTO$2</substitution>
</regex>
I'm not a hundred percent sure, but perhaps you need another escape
character in your substitution?
<substitution>$1&amp;TEXTO$2</substitution>
--
<substitution>$1\&amp;TEXTO$2</substitution>
The examples in
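For context, the XML escaping is what usually trips people up here: inside regex-normalize.xml a literal & must be written &amp;, so matching the text &amp; in a URL requires writing &amp;amp; in the pattern. A minimal Python sketch of what the decoded rule does (the URL and its parameter values are hypothetical, modeled on the thread):

```python
import re

# Decoded form of the rule above: the XML text "&amp;amp;TEXTO" means the
# regex matches the literal characters "&amp;TEXTO" inside a URL.
pattern = r"(.*)&amp;TEXTO(.*)"
substitution = r"\1&TEXTO\2"

# Hypothetical URL of the shape described in the thread.
url = "http://news.example.com/item?id=42&amp;TEXTO=abc"
print(re.sub(pattern, substitution, url))
# http://news.example.com/item?id=42&TEXTO=abc
```

The normalizer then feeds each URL through such pattern/substitution pairs in order, so the double-escaped &amp; collapses back to a plain &.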
Javier P. L. wrote:
Hi,
I am using Nutch to crawl news sites. I have a problem with one of
them that publishes its URLs with &amp; instead of &. I discovered the
use of the URL normalizer and the regex-normalize.xml configuration
file. Unfortunately I did not find many examples about
Hi,
I don't know why, but I have had no answer on the three forums where I sent my
problem.
As the Fetcher freeze occurs every time I try to fetch my file
system, I can't imagine that I am the only one who has this problem, and as I
said in my last e-mail, I found many mails about
Josef Novak wrote:
<regex>
  <pattern>(.*)\&amp;amp;TEXTO(.*)</pattern>
  <substitution>$1&amp;TEXTO$2</substitution>
</regex>
I'm not a hundred percent sure, but perhaps you need another escape
character in your substitution?
<substitution>$1&amp;TEXTO$2</substitution>
--
<substitution>$1\&amp;TEXTO$2</substitution>
My apologies if this question is answered in the wiki or listserv
archives; I searched both extensively and cannot find an answer.
I followed the directions for the 0.8.x tutorial at
http://lucene.apache.org/nutch/tutorial8.html and the wiki entry
GettingNutchRunningWithWindows,
I'm using nutch 0.9-dev to crawl the web on one Linux server. With the default hadoop
configuration (local file system, no distributed crawling), the Generator
and Fetcher spend a disproportionate amount of time on map-reduce operations.
For example:
2006-11-01 20:32:44,074 INFO crawl.Generator - Generator:
The reason no one answered is that it has been answered a couple of
times before. If you search this mailing list for fetcher slowness or
fetcher hung threads you will find answers. You can also take a look
at NUTCH-344. This problem has come up before and there are
patches which
2006/11/3, Josef Novak [EMAIL PROTECTED]:
Hi,
A very short question (hopefully): is it possible to get bin/nutch
fetch to print a log of the pages being downloaded to the command
terminal? I have been using 0.7.2 up until now; in that version the
fetch command outputs errors and the names of
Hi,
I was wondering if anyone knew of a resource, or could concisely
explain, how the javacc-generated default nutch analyzer goes about
tokenizing text. What I'm really looking for is a plain, nuts'n'bolts
explanation of what gets tokenized, and what doesn't. I searched the
web for a while