Hey there ;)
Can anybody help me out: why does Nutch simply not find any outlinks for the
following page?
http://www.soccer-forum.de/
When I copy a link out of the site onto my test page, the crawler finds the
added outlink. But within this page it simply doesn't find any outlinks.
Any suggestions?
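One way to see what the parser actually extracts for that page (assuming a Nutch 1.x install; the exact class name and the -dumpText flag may differ between versions) is to run ParserChecker directly against the URL:

```
# hedged example: dump the parsed text and outlinks for the problem page
bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.soccer-forum.de/
```

If no outlinks show up here, the problem is in parsing; if they do, look at the URL filters and normalizers instead.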
Uhhh,
I was fiddling with almost every config file, robots.txt etc. ... and in the
end it's so easy ;)
Thanks a lot.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-Crawler-simple-doesnt-find-outlinks-tp3880377p3880397.html
Sent from the Nutch - User mailing list archive
Hey,
I have the same problem: "no URLs to fetch" for a couple of URLs, and no clue
how to fix it. Did you solve your problem in the meantime?
Cheers, Philipp
I have this site: http://www.soccer-forum.de/
When I put it into my browser it's fine. No robots.txt, no redirect... but
simply no URLs to fetch. Where can I find the reason for not fetching?
What do you mean by "seed list"?
Is there a way to disable all filters at once?
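Not with a single switch, as far as I know, but as a quick debugging test you can make regex-urlfilter.txt accept everything by replacing its rules with one catch-all line (a debugging sketch only; don't crawl the open web like this):

```
# accept anything (temporary, to find out which filter rejects your seeds)
+.
```

If the seeds get fetched with this in place, reintroduce the original rules one by one to find the one that drops them.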
I still have that problem. I can't crawl http://www.ostsee-zeitung.de/ with
Nutch. There is a redirect to index.phtml. When I simply copy that file to
my own webserver (localhost/index.phtml) it gets crawled. When I use a
different crawler like SearchBlox, it works with http://www.ostsee-zeitung.de
as
I thought the regex-normalizer was handling that. But when I added SID to it,
nothing happened.
<regex>
  <pattern>([;_]?((?i)l|j|bv_)?((?i)SID|sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>
What file do I have to change? Can you give me an example?
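For reference, rules like the one quoted above live in conf/regex-normalize.xml and are applied by the urlnormalizer-regex plugin (which has to be listed in plugin.includes). As a rough sanity check of what such a rule is meant to do, here is a simplified shell approximation with sed; this is not Nutch's Java regex, and the URL and session id are made up:

```shell
# Simplified sketch of the session-id rule: strip a PHPSESSID/SID/sessionid
# query parameter from a URL (sed -E, not the Java regex Nutch uses).
url='http://www.ostsee-zeitung.de/index.phtml?PHPSESSID=0123456789abcdef'
clean=$(printf '%s\n' "$url" | sed -E 's/[?&;](PHPSESSID|SID|sid|sessionid)=[^&#;]*//')
echo "$clean"
```

If the normalized form still carries the session id, the plugin is probably not enabled rather than the pattern being wrong.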
Is there a problem with a standalone question mark?
How can I check a source for possible errors? Could you check that URL in
your Nutch environment?
--
View this message in context:
http://lucene.472066.n3.nabble.com/urls-won-t-get-crawled-tp3650610p3842207.html
Sent from the Nutch - User mailing list archive
It seems that there is no problem with the parser. When I use ParserChecker I
get the following result:
Is there still a redirect problem if I get a result with ParserChecker for
the specified URL?
---------
Url
---------
http://www.lequipe.fr/Football/
---------
ParseData
---------
Thanks for the reply.
I tried your recommended solution, but it still does not crawl it. When I
run the crawler main method standalone with the URL
http://www.lequipe.fr/Football/ it works fine: outlinks, content...
everything that was expected. Even if I use
@Julien Nioche-4: believe me, I did check the site in my browser first. I
can't see any redirect. I already set the Nutch config property
http.redirect.max to 3 to avoid that problem.
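For reference, that override goes into conf/nutch-site.xml; a minimal sketch of the property:

```xml
<!-- follow up to three redirects immediately instead of queueing them -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```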
Seriously, when I type that URL into my browser I see the expected page. When
I inspect the site with Google Chrome
Hi there,
I'm desperate over two URLs:
http://www.lequipe.fr/Football/
http://www.ostsee-zeitung.de/nachrichten
Everything seems OK: robots.txt and meta tags allow me to crawl their
pages. I've been struggling with nutch-default.xml for a while without any
solution.
When I copy the source HTML to
I'm using this command to crawl:
bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3
I don't have a crawl-urlfilter.txt. I'm using regex-urlfilter.txt with the
following content:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we
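A quick way to see which seeds a rule like -^(file|ftp|mailto): would drop is a shell sketch with grep (an approximation, not Nutch's actual URLFilter chain; the mailto/ftp URLs below are made-up examples):

```shell
# Sketch: emulate the "-^(file|ftp|mailto):" filter line with grep -Ev.
# Only URLs that survive the filter are printed.
printf '%s\n' \
  'http://www.lequipe.fr/Football/' \
  'mailto:someone@example.org' \
  'ftp://ftp.example.org/pub/' \
| grep -Ev '^(file|ftp|mailto):'
```

Running your actual seed list through the remaining regex rules the same way usually reveals which line rejects a URL.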
Hi,
my concern is to use the Nutch HtmlParser as a standalone application. For
that I followed the instructions in RunNutchInEclipse. Now I have a
working Eclipse project which I can use to start the plugin in question in a
standalone application (running the main class in HtmlParser.java). Now,
what configuration do I have to provide to index the fully
parsed content of a page?
Thanks for the help
Jepse
--
View this message in context:
http://lucene.472066.n3.nabble.com/Content-field-does-not-provied-fully-parsed-text-Why-tp3493471p3493471.html
Sent from the Nutch - User mailing list archive
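In case it helps later readers: whether the fully parsed text reaches the index mostly depends on which plugins are enabled. A sketch of the relevant property for conf/nutch-site.xml; the value shown is close to the Nutch 1.x default, so treat it as an assumption, not your exact setting:

```xml
<!-- make sure an HTML parser and the basic indexing filter are enabled -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```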