Hey there ;)
Can anybody help me out: why does Nutch simply not find any outlinks for the
following page?
http://www.soccer-forum.de/
When I copy a link out of the site onto my test page, the crawler finds the
added outlink. But within this page it simply doesn't find any outlinks.
Any suggestions?
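One way to see what the parser actually extracts for that page (assuming a Nutch 1.x install; the exact class name and the -dumpText flag may differ between versions) is to run ParserChecker directly against the URL:

```
# hedged example: dump the parsed text and outlinks for the problem page
bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.soccer-forum.de/
```

If no outlinks show up here, the problem is in parsing; if they do, look at the URL filters and normalizers instead.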
Uhhh,
I was fiddling with almost every config file, robots.txt etc. ... and in the
end it's so easy ;)
Thanks a lot.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-Crawler-simple-doesnt-find-outlinks-tp3880377p3880397.html
Sent from the Nutch - User mailing list archive
Hey,
I have the same problem: "no URLs to fetch" for a couple of URLs, and no clue
how to fix it. Did you solve your problem in the meantime?
Cheers, Philipp
I have this site: http://www.soccer-forum.de/
When I put it into my browser it's fine. No robots.txt, no redirect... but
simply no URLs to fetch. Where can I find the reason for not fetching?
What do you mean by "seed list"?
Is there a way to disable all filters at once?
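Not with a single switch, as far as I know, but as a quick debugging test you can make regex-urlfilter.txt accept everything by replacing its rules with one catch-all line (a debugging sketch only; don't crawl the open web like this):

```
# accept anything (temporary, to find out which filter rejects your seeds)
+.
```

If the seeds get fetched with this in place, reintroduce the original rules one by one to find the one that drops them.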
I still have that problem. I can't crawl http://www.ostsee-zeitung.de/ with
Nutch. There is a redirect to index.phtml. When I simply copy that file to
my own webserver (localhost/index.phtml) it gets crawled. When I use a
different crawler like SearchBlox, it works with http://www.ostsee-zeitung.de
as
I thought the regex-normalizer was handling that. But when I added SID to it,
nothing happened.
<regex>
  <pattern>([;_]?((?i)l|j|bv_)?((?i)SID|sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>
What file do I have to change? Can you give me an example?
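For reference, rules like the one quoted above live in conf/regex-normalize.xml and are applied by the urlnormalizer-regex plugin (which has to be listed in plugin.includes). As a rough sanity check of what such a rule is meant to do, here is a simplified shell approximation with sed; this is not Nutch's Java regex, and the URL and session id are made up:

```shell
# Simplified sketch of the session-id rule: strip a PHPSESSID/SID/sessionid
# query parameter from a URL (sed -E, not the Java regex Nutch uses).
url='http://www.ostsee-zeitung.de/index.phtml?PHPSESSID=0123456789abcdef'
clean=$(printf '%s\n' "$url" | sed -E 's/[?&;](PHPSESSID|SID|sid|sessionid)=[^&#;]*//')
echo "$clean"
```

If the normalized form still carries the session id, the plugin is probably not enabled rather than the pattern being wrong.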
Is there a problem with a standalone question mark?
How can I check a source for possible errors? Could you check that URL in
your Nutch environment?
--
View this message in context:
http://lucene.472066.n3.nabble.com/urls-won-t-get-crawled-tp3650610p3842207.html
Sent from the Nutch - User mailing list archive
It seems that there is no problem with the parser. When I use ParserChecker I
get the following result:
Is there still a redirect problem if I get a result with ParserChecker for
the specified URL?
---------
Url
---------
http://www.lequipe.fr/Football/
---------
ParseData
---------
Thanks for the reply.
I tried your recommended solution, but it still does not crawl it. When I
run the crawler main method standalone with the URL
http://www.lequipe.fr/Football/ it works fine: outlinks, content...
everything that was expected. Even if I use
@Julien Nioche-4: believe me, I did check the site in my browser first. I
can't see any redirect. I already set the Nutch config property
http.redirect.max to 3 to avoid that problem.
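For reference, that override goes into conf/nutch-site.xml; a minimal sketch of the property:

```xml
<!-- follow up to three redirects immediately instead of queueing them -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```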
Seriously, when I type that URL into my browser I see the expected page. When
I inspect the site with Google Chrome
Hi there,
I'm desperate over two URLs:
http://www.lequipe.fr/Football/
http://www.ostsee-zeitung.de/nachrichten
Everything seems OK: robots.txt and meta tags allow me to crawl their
pages. I've been struggling with nutch-default.xml for a while without any
solution.
When I copy the source HTML to
I'm using this command to crawl:
bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3
I don't have a crawl-urlfilter.txt. I'm using regex-urlfilter.txt with the
following content:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we
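A quick way to see which seeds a rule like -^(file|ftp|mailto): would drop is a shell sketch with grep (an approximation, not Nutch's actual URLFilter chain; the mailto/ftp URLs below are made-up examples):

```shell
# Sketch: emulate the "-^(file|ftp|mailto):" filter line with grep -Ev.
# Only URLs that survive the filter are printed.
printf '%s\n' \
  'http://www.lequipe.fr/Football/' \
  'mailto:someone@example.org' \
  'ftp://ftp.example.org/pub/' \
| grep -Ev '^(file|ftp|mailto):'
```

Running your actual seed list through the remaining regex rules the same way usually reveals which line rejects a URL.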
Hi,
my concern is to use the Nutch HtmlParser as a standalone application. For
that I followed the instructions in RunNutchInEclipse. Now I have a
working Eclipse project which I can use to start the plugin in question in a
standalone application (running the main class in HtmlParser.java). Now,
what configuration do I have to provide to index the fully
parsed content of a page?
Thanks for the help
Jepse
--
View this message in context:
http://lucene.472066.n3.nabble.com/Content-field-does-not-provied-fully-parsed-text-Why-tp3493471p3493471.html
Sent from the Nutch - User mailing list archive
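In case it helps later readers: whether the fully parsed text reaches the index mostly depends on which plugins are enabled. A sketch of the relevant property for conf/nutch-site.xml; the value shown is close to the Nutch 1.x default, so treat it as an assumption, not your exact setting:

```xml
<!-- make sure an HTML parser and the basic indexing filter are enabled -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```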