Re: Hi
Did you check crawl-urlfilter.txt? All the domain names that you'd like to crawl have to be mentioned there, e.g.:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*mersin\.edu\.tr/
+^http://([a-z0-9]*\.)*tubitak\.gov\.tr/

Also check the property db.ignore.external.links in nutch-default.xml. It should be set to false.

2010/5/5 Zehra Göçer zgocer...@hotmail.com:
I have problems with Nutch. My project is link analysis. I crawled www.mersin.edu.tr, analysed the linkdb, and saw everything about the mersin.edu.tr links. But I also have to find links to other sites, for example www.tubitak.gov.tr, and I cannot find them. How do I find these links? Please help.
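The accept patterns above can be sanity-checked outside Nutch. This is a minimal sketch using grep's POSIX ERE, which is close enough to Java's regex dialect for these particular patterns; the sample URLs are made up for illustration.

```shell
# Combine the two accept rules from crawl-urlfilter.txt into one ERE and
# try it against a few sample URLs. First two should match, the third not.
pattern='^http://([a-z0-9]*\.)*(mersin\.edu\.tr|tubitak\.gov\.tr)/'

for url in \
    'http://www.mersin.edu.tr/page' \
    'http://www.tubitak.gov.tr/index' \
    'http://www.example.com/'; do
  if printf '%s\n' "$url" | grep -qE "$pattern"; then
    echo "ACCEPT $url"
  else
    echo "REJECT $url"
  fi
done
```

If a URL you expect to crawl prints REJECT here, the filter file is the likely culprit; if it prints ACCEPT but still isn't fetched, look at db.ignore.external.links and the other filter files instead.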
AbstractMethodError for cyberneko parser
Hi, I am running the latest version of Nutch. While crawling one particular site I get an AbstractMethodError in the cyberneko plugin for all of its pages during the fetch. As I understand it, this happens because of a difference between the runtime and compile-time versions. However, I am running it afresh after an ant clean. Any suggestions would be helpful. Btw, I am using Java version 1.6.0_18 in a Windows environment. The same trace is logged for every page:

java.lang.AbstractMethodError: org.cyberneko.html.HTMLScanner.getCharacterOffset()I
        at org.apache.xerces.xni.parser.XMLParseException.init(Unknown Source)
        at org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:673)
        at org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportError(HTMLConfiguration.java:662)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2404)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2360)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2267)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
Re: AbstractMethodError for cyberneko parser
Replacing the current xercesimpl.jar with the one from Nutch 1.0 seems to fix the problem.

On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch harrynu...@gmail.com wrote:
Hi, I am running the latest version of Nutch. While crawling one particular site I get an AbstractMethodError in the cyberneko plugin for all of its pages during the fetch. As I understand it, this happens because of a difference between the runtime and compile-time versions. However, I am running it afresh after an ant clean. Any suggestions would be helpful. Btw, I am using Java version 1.6.0_18 in a Windows environment.
Re: AbstractMethodError for cyberneko parser
Thanks Julien. I have changed nutch-site.xml to have only parse-(tika) instead of parse-(text|html|js|tika) in the plugin.includes property. It works now, as it doesn't pick up any other parser besides Tika.

On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
Hi Harry, could you try using parse-tika instead and see if you are getting the same problem? I gather from your email that you are using Nutch 1.1 or the SVN version, so parse-tika should be used by default. Have you deactivated it? Thanks, Julien

On 21 April 2010 11:58, Harry Nutch harrynu...@gmail.com wrote:
Replacing the current xercesimpl.jar with the one from Nutch 1.0 seems to fix the problem.

-- DigitalPebble Ltd http://www.digitalpebble.com
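For reference, the nutch-site.xml override described in this thread looks roughly like the sketch below. The surrounding plugin list is illustrative only; the exact default plugin.includes value varies between Nutch releases, so copy your release's default from nutch-default.xml and replace just the parse-(...) group.

```xml
<!-- nutch-site.xml: restrict parsing to Tika only. Values before and after
     the parse-(tika) group are illustrative; use your release's defaults. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(tika)|index-basic|index-anchor|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```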
Re: Format of the Nutch Results
I think you need to specify the individual segment, e.g.:

bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816 dumpSegmentDirectory

On Wed, Apr 21, 2010 at 9:38 PM, nachonieto3 jinietosanc...@gmail.com wrote:
Thank you a lot! Now I'm working on that, but I have some more doubts. I'm not able to run the readseg command. I've been consulting some help forums, and the basic syntax is readseg followed by the path of the segments directory. I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments. The directory named crawl-20100420112025 is the one where the segments are stored. So I'm trying to execute the command using these, but none of them work:

readseg d/nutch-0.9/crawl-20100420112025/segments
readseg crawl-20100420112025/segments
readseg crawl-20100420112025

What am I doing wrong? When I try to execute, I get "bash: readseg: command not found". Any idea? Thank you in advance.
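The "command not found" error above comes from invoking readseg as a standalone command: it is a subcommand of the bin/nutch script, so it must be run as "bin/nutch readseg". A small sketch, with an illustrative crawl directory name, that prints one dump command per segment:

```shell
# readseg is a subcommand of bin/nutch, not its own binary. This loop
# prints (rather than runs) a dump command for each segment directory
# under an illustrative crawl directory.
CRAWL=crawl-20100420112025

for seg in "$CRAWL"/segments/*; do
  [ -d "$seg" ] || continue   # skip if the glob matched nothing
  echo "bin/nutch readseg -dump $seg dump_$(basename "$seg")"
done
```

Pipe the printed lines to sh, or drop the echo, once the paths look right.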
Re: Format of the Nutch Results
Try bin/nutch on the console. It will give you a list of commands. You can use them to read the crawl data, e.g. bin/nutch readdb ...

On Mon, Apr 19, 2010 at 11:36 PM, nachonieto3 jinietosanc...@gmail.com wrote:
I have a doubt. How are the final results of Nutch stored? I mean, in which format is the information contained in the analysed links stored? I understood that Nutch needs the information in plain text to parse it, but in which format is it finally stored? I know it is stored in segments, but how can I access this information in order to convert it to plain text? Is that possible? Thank you in advance.
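To expand on the answer: the crawl output is not plain text. crawldb/, linkdb/, and segments/ hold Hadoop MapFile/SequenceFile data, and the bin/nutch read* subcommands dump it as text. The paths and output directory names below are illustrative:

```shell
# Typical text-dump commands for a Nutch 1.x crawl directory named "crawl".
# Stored in a variable and printed so they can be reviewed before running.
cmds='bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump dump_db
bin/nutch readlinkdb crawl/linkdb -dump dump_links
bin/nutch readseg -dump crawl/segments/20100422092816 dump_seg'
printf '%s\n' "$cmds"
```

Each -dump variant writes plain-text part files into the named output directory.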
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul jos...@neocodesoftware.com wrote:
After getting this email, I tried commenting out this line in regex-urlfilter.txt:

#-[...@=]

but it didn't help; I still get the same message: no URLs to fetch.

regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-[...@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.

crawl-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
# we don't want to skip
#-[...@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://([a-z0-9]*\.)*fmforums.com/
# skip everything else
-.

arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
What is in your regex-urlfilter.txt?

-----Original Message-----
From: joshua paul [mailto:jos...@neocodesoftware.com]
Sent: Wednesday, 21 April 2010 9:44 AM
To: nutch-user@lucene.apache.org
Subject: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

Nutch says "No URLs to fetch - check your seed list and URL filters" when trying to index fmforums.com. I am using this command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- the urls directory contains urls.txt, which contains http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/

Note: my Nutch setup indexes other sites fine. For example, I am using the same command where urls.txt contains http://dispatch.neocodesoftware.com and crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/, and Nutch generates a good crawl. How can I troubleshoot why Nutch says "No URLs to fetch"?
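One way to troubleshoot is to replay the filter rules against the seed URL offline: Nutch applies them in order and the first match decides. The sketch below mirrors the rules quoted above, assuming the skip rule is the stock -[?*!@=] (the "-[...@=]" shown above appears garbled in the archive); the slash-repeat rule is omitted because ERE backreferences aren't portable to grep.

```shell
# Replay URL-filter rules in order against a seed URL; the first matching
# rule's sign (+ accept / - reject) is the verdict.
url='http://www.fmforums.com/'
verdict=''

check() {  # $1 = + or -, $2 = regex; record the first rule that matches
  [ -n "$verdict" ] && return 0
  if printf '%s\n' "$url" | grep -qE "$2"; then verdict="$1 $2"; fi
}

check - '[?*!@=]'                               # skip probable query URLs
check + '^http://([a-z0-9]*\.)*fmforums\.com/'  # accept fmforums.com
check - '.'                                     # skip everything else

echo "first matching rule: $verdict"
```

For this seed the verdict is "+", i.e. the filters accept it, so an empty fetch list points elsewhere, e.g. robots.txt as suggested above.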
Re: nutch 1.1 crawl d/n complete issue
I am new to Nutch and still trying to figure out the code flow; however, as a workaround for issue #1, after the crawl finishes you could run the linkdb and index commands separately from cygwin:

$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/20100415163946 crawl/segments/20100415164106

This seems to work for me. You may have already tried this workaround, but just in case. -Harry

On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius mgris...@comcast.net wrote:
Two observations using the Nutch 1.1 nightly build nutch-2010-04-14_04-00-47:

1) Previously I was using Nutch 1.0 to crawl successfully, but had problems with parse-pdf. I decided to try Nutch 1.1 with parse-tika, which appears to parse all of the 'problem' PDFs that parse-pdf could not handle. The crawldb and segments directories are created and appear to be valid. However, the overall crawl does not finish now:

nutch crawl urls/urls -dir crawl -depth 10
...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting
...
Exception in thread main java.lang.NullPointerException
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)

Nutch 1.0 would complete like this:

nutch crawl urls/urls -dir crawl -depth 10
...
Generator: 0 records selected for fetching, exiting
...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done merging indexes to: crawl/index
Adding file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-0
done merging
crawl finished: crawl

Any ideas?
2) If there is a space in any component directory, then $NUTCH_OPTS is invalid and causes this problem:

m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin$ nutch crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
Exception in thread main java.lang.NoClassDefFoundError: folder/nutch-2010-04-14_04-00-47/logs
Caused by: java.lang.ClassNotFoundException: folder.nutch-2010-04-14_04-00-47.logs
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs. Program will exit.

Obviously the workaround is to rename 'untitled folder' to 'untitledFolderWithNoSpaces'. Thanks, any help would be appreciated with issue #1 above. -m.
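The error in issue #2 is classic shell word splitting: an unquoted $NUTCH_OPTS expansion splits at the space in "untitled folder", so java receives "folder/nutch-2010-04-14_04-00-47/logs" as its main-class argument. A minimal reproduction of the splitting itself (paths illustrative, not the actual nutch script):

```shell
# Demonstrate why an unquoted variable containing a space breaks argument
# passing: the expansion splits into two words at the space.
DIR='/home/mag/Desktop/untitled folder/logs'
OPT="-Dhadoop.log.dir=$DIR"

set -- $OPT                    # unquoted: word splitting applies
echo "unquoted -> $# words"    # prints: unquoted -> 2 words

set -- "$OPT"                  # quoted: stays a single argument
echo "quoted   -> $# words"    # prints: quoted   -> 1 words
```

Quoting the expansions in bin/nutch (e.g. "$NUTCH_OPTS" handled as an array or quoted string) would make paths with spaces work; renaming the directory, as noted above, sidesteps it.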