Nutch simply doesn't crawl webpages

2012-04-03 Thread jepse
Hey there ;) Can anybody help me out: why does Nutch simply not find any outlinks for the following page? http://www.soccer-forum.de/ When I copy a link out of the site onto my own test site, the crawler finds the added outlink. But within this page it simply doesn't find any outlinks. Any suggestions?
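
A quick way to see what the parser actually extracts for a page is Nutch's ParserChecker; a sketch assuming a Nutch 1.x install where bin/nutch exposes the parsechecker alias (otherwise the class org.apache.nutch.parse.ParserChecker can be run directly):

    bin/nutch parsechecker -dumpText http://www.soccer-forum.de/

If the outlinks are missing here too, the problem is in the parse step itself rather than in the crawl loop or URL filters.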

Re: Nutch simply doesn't crawl webpages

2012-04-03 Thread jepse
Uhhh, I was digging through almost every config file, robots.txt, etc. ... and in the end it's so easy ;) Thanks a lot.

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
Hey, I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue how to fix that. Did you solve your problem in the meantime? Cheers, Philipp

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
I have this site: http://www.soccer-forum.de/ When I put it into my browser it's fine: no robots.txt, no redirect... but simply no URLs to fetch. Where can I find the reasons for it not fetching? What do you mean by "seed list"? Is there a way to disable all filters at once?
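
There is no single switch, but for debugging it is common to temporarily make the regex URL filter accept everything; a minimal sketch for conf/regex-urlfilter.txt (note this only neutralizes the urlfilter-regex plugin, not any other filter plugins enabled in plugin.includes):

    # debug only: accept every URL
    +.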

Re: urls won't get crawled

2012-03-20 Thread jepse
I still have that problem: I can't crawl http://www.ostsee-zeitung.de/ with Nutch. There is a redirect to index.phtml. When I simply copy that file to my own web server (localhost/index.phtml) it gets crawled. When I use a different crawler like SearchBlox, it works with http://www.ostsee-zeitung.de as ...
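
If the fetcher stalls at the redirect, the usual first check is the http.redirect.max property, which defaults to 0 in nutch-default.xml (redirects are recorded for a later fetch round instead of being followed immediately). A sketch of the override in conf/nutch-site.xml:

    <property>
      <name>http.redirect.max</name>
      <value>3</value>
      <description>Follow up to 3 redirects immediately instead of
      deferring them to a later crawl cycle.</description>
    </property>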

Re: urls won't get crawled

2012-03-20 Thread jepse
I thought the regex normalizer was handling that, but when I added SID to it, nothing happened:

    <regex>
      <pattern>([;_]?((?i)l|j|bv_)?((?i)SID|sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
      <substitution>$4</substitution>
    </regex>

What file do I have to change? Can you give me an example?
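
Two contextual notes. First, the (?i) flag already makes the sid alternative case-insensitive, so adding SID to the group changes nothing; if the rule has no effect at all, the more likely cause is that it is not being loaded. Rules like this live in conf/regex-normalize.xml and only take effect when the urlnormalizer-regex plugin is active; a hedged sketch of the relevant plugin.includes fragment in conf/nutch-site.xml (the exact list varies per installation; this mirrors a typical Nutch 1.x default):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>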

Re: urls won't get crawled

2012-03-20 Thread jepse
Is there a problem with a standalone question mark? How can I evaluate a source for possible errors? Could you check that URL in your Nutch environment?

Re: urls won't get crawled

2012-01-18 Thread jepse
It seems there is no problem with the parser. When I use ParserChecker I get the result below. Is there still a redirect problem if I get a result from ParserChecker for the specified URL?

    - Url ---
    http://www.lequipe.fr/Football/
    - ParseData - ...

Re: urls won't get crawled

2012-01-17 Thread jepse
Thanks for the reply. I tried your recommended solution, but it still does not crawl it. When I run the crawler's main method standalone with the URL http://www.lequipe.fr/Football/ it works fine: outlinks, content... everything as expected. Even if I use ...

Re: urls won't get crawled

2012-01-17 Thread jepse
@Julien Nioche-4: believe me, I did check the site in my browser first. I can't see any redirect. I already set the Nutch config property http.redirect.max to 3 to avoid that problem. Seriously, when I type that URL into my browser I see the expected page. When I inspect the site with Google Chrome ...
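
Browsers follow redirects silently, and some sites redirect only for non-browser user agents, so it helps to look at the raw response headers the way a crawler would. A sketch using curl; the user agent string is a made-up stand-in for whatever http.agent.name is configured to:

    curl -sI -A 'MyNutchCrawler' http://www.lequipe.fr/Football/

A 3xx status line plus a Location header confirms a server-side redirect that the browser would never show.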

urls won't get crawled

2012-01-11 Thread jepse
Hi there, I'm desperate about two URLs: http://www.lequipe.fr/Football/ and http://www.ostsee-zeitung.de/nachrichten Everything seems OK: robots.txt and meta tags allow me to crawl their pages. I've been struggling with nutch-default.xml for a while without any solution. When I copy the source HTML to ...

Re: urls won't get crawled

2012-01-11 Thread jepse
I'm using this command to crawl:

    bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3

I don't have a crawl-urlfilter.txt. I'm using regex-urlfilter.txt with the following content:

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we ...
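
When a run like this ends with "No URLs to fetch", one way to check whether the seeds were even injected is to dump crawldb statistics; a sketch assuming the -dir crawl layout from the command above:

    bin/nutch readdb crawl/crawldb -stats

If the total URL count is zero, the seeds never made it past injection (URL filters or normalizers rejected them); if URLs are present but none are in status db_unfetched, the generator has nothing eligible to fetch.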

HtmlParser parse-html-plugin

2011-12-22 Thread jepse
Hi, my goal is to use the Nutch HtmlParser as a standalone application. Therefore I followed the instructions in RunNutchInEclipse. Now I have a working Eclipse project which I can use to run the desired plugin as a standalone application (running the main class in HtmlParser.java). Now I ...
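
The message is cut off, but for anyone attempting the same thing: rather than invoking the plugin's main class directly, the parse can be driven through Nutch 1.x's ParseUtil, which dispatches to parse-html (or parse-tika) by content type. A minimal sketch; the URL and HTML bytes are made-up stand-ins, and it assumes the Nutch conf directory and plugins are on the classpath as the RunNutchInEclipse setup provides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.metadata.Metadata;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.parse.ParseUtil;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class StandaloneHtmlParse {
        public static void main(String[] args) throws Exception {
            // Made-up stand-ins for a fetched page.
            String url = "http://www.example.com/";
            byte[] html = "<html><body><a href=\"/page\">page</a></body></html>"
                    .getBytes("UTF-8");

            // Loads nutch-default.xml / nutch-site.xml from the classpath.
            Configuration conf = NutchConfiguration.create();

            // Wrap the raw bytes the same way the fetcher would.
            Content content = new Content(url, url, html, "text/html",
                    new Metadata(), conf);

            // ParseUtil picks the parser plugin (parse-html here) by content type.
            ParseResult result = new ParseUtil(conf).parse(content);
            Parse parse = result.get(url);

            System.out.println("text: " + parse.getText());
            for (Outlink link : parse.getData().getOutlinks()) {
                System.out.println("outlink: " + link.getToUrl());
            }
        }
    }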

Content field does not provide fully parsed text. Why?

2011-11-09 Thread jepse
... What configuration do I have to provide to index the fully parsed content of a page? Thanks for the help. Jepse
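
Truncated parse text is often just the fetcher's content cap: http.content.limit in nutch-default.xml trims downloads to 65536 bytes by default, so the parser never sees the rest of a large page. Whether that is the cause here is an assumption, but the override in conf/nutch-site.xml would look like this (-1 disables the limit):

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>Do not truncate fetched content (default is 65536 bytes).</description>
    </property>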