Re: Nutch didn't (fail) to create new segment dir
The logs say this: Generator: 0 records selected for fetching, exiting ... This is because there are no urls that the generator could pass to form a segment. Injector: total number of urls injected after normalization and filtering: 0 Inject did NOT add anything to the crawldb. Check if you are over-filtering the input urls. Also it would be good to verify that the urls you are injecting are valid. From the logs it looks like there were just 4 urls in the seeds file. Thanks, Tejas On Fri, Feb 14, 2014 at 4:43 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi, From what I know, nutch generate will create a new segment directory every round nutch is running. I have a problem (it never happened before) where nutch won't create a new segment. It always only fetches and parses the latest segment. - from the logs: 2014-02-15 07:20:02,036 INFO fetcher.Fetcher - Fetcher: segment: /opt/searchengine/nutch/BappenasCrawl/segments/20140205213835 Even though I repeat the processes (generate fetch parse update) many times. What should I check in the configuration of nutch? Any hints to solve this problem? I use nutch 1.7. And here is the part of the hadoop log file: http://pastebin.com/kpi48gK6 Thank you. -- wassalam, [bayu]
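A quick way to check whether the seed urls survive filtering before suspecting the generator, sketched under the assumption of a 1.x checkout (paths hypothetical; URLFilterChecker ships with Nutch 1.x and reads urls from stdin):

bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt
# a url prefixed with '+' passes the filters, '-' means it was rejected
bin/nutch readdb crawl/crawldb -stats
# after a successful inject, db_unfetched should be greater than 0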
Re: HTML tag filtering
That means that there were changes to the source files since the patch was created. You need to manually add the changes from the patch to the source files. Thanks, Tejas On Thu, Feb 13, 2014 at 12:02 AM, Markus Källander markus.kallan...@nasdaqomx.com wrote: Hi, Trying to run the patch command and get this error: $ patch -p0 < blacklist_whitelist_plugin.patch (Stripping trailing CRs from patch; use --binary to disable.) patching file src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistIndexer.java (Stripping trailing CRs from patch; use --binary to disable.) patching file src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistParser.java (Stripping trailing CRs from patch; use --binary to disable.) patching file src/plugin/index-blacklist-whitelist/README.txt (Stripping trailing CRs from patch; use --binary to disable.) patching file src/plugin/index-blacklist-whitelist/build.xml (Stripping trailing CRs from patch; use --binary to disable.) patching file src/plugin/index-blacklist-whitelist/ivy.xml (Stripping trailing CRs from patch; use --binary to disable.) patching file src/plugin/index-blacklist-whitelist/plugin.xml (Stripping trailing CRs from patch; use --binary to disable.) patching file src/plugin/build.xml Hunk #1 FAILED at 62 (different line endings). 1 out of 1 hunk FAILED -- saving rejects to file src/plugin/build.xml.rej Any hints? I am trying to patch the source for the tagged 1.7 release. Markus Källander Mobile +46 73 622 0547 -Original Message- From: Tejas Patil [mailto:tejas.patil...@gmail.com] Sent: den 13 februari 2014 01:40 To: user@nutch.apache.org Subject: Re: HTML tag filtering On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander markus.kallan...@nasdaqomx.com wrote: Hi, The patch seems to fulfil my needs, but how do I use it with Nutch 1.7? From your local trunk checkout, run these commands in shell: wget https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch patch -p0 < blacklist_whitelist_plugin.patch ant clean runtime Now you have successfully applied that patch to your local copy of the nutch codebase. That patch is old and I am not sure if it would compile correctly, so you may have to look in the codebase and tweak it. Thanks, Tejas Is the patch not released yet? Markus Källander Mobile +46 73 622 0547 -Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] Sent: den 11 februari 2014 17:44 To: user@nutch.apache.org Subject: Re: HTML tag filtering Hi Markus, in short, you have to write a parse filter plugin which does the following in the filter(...) method: 1. traverse the DOM tree and construct a clean text by skipping certain content. See o.a.n.utils.NodeWalker and o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the parse-html plugin) 2. then replace the old plain text in ParseResult by the new clean text Maybe this issue can help (there is also a patch but I'm not sure whether it's working and fulfills your needs): https://issues.apache.org/jira/browse/NUTCH-585 Sebastian On 02/11/2014 04:24 PM, Markus Källander wrote: Hi, How do I skip indexing of HTML tags with certain ids or css classes? I am using Nutch 1.7. Thanks Markus
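When a hunk is rejected, the failed changes land in a .rej file next to the target; one way to finish the job by hand (commands assume a standard checkout):

cat src/plugin/build.xml.rej    # shows the hunk that could not be applied
vi src/plugin/build.xml         # re-apply the +/- lines manually around line 62
ant clean runtime               # rebuild to confirm the plugin compiles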
Re: sizing guide
On Wed, Feb 12, 2014 at 11:08 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Thanks for your reply. I started off the PoC with Nutch-MySQL. Planned to move to Nutch 2.1 with HBase once I get a fair idea about Nutch. For our use case, I need to crawl large documents for around 100 web sites weekly, and our functionality demands crawling on a daily or even hourly basis to extract specific information from around 20 different hosts. Say, we need to extract product details from a retailer's site. In that case, we need to recrawl the pages to get the latest information. As you mentioned, I can batch delete the crawled html data once I extract the information from it. I can expect the crawled data roughly to be around 1 TB (could be deleted on a scheduled basis) If you process the data as soon as it is available, then you might not need to have 1 TB.. unless Nutch gets that much data in a single fetch cycle. Will this sizing be fine for a Nutch installation in production? 4 node Hadoop cluster with 2 TB storage each 64 GB RAM each 10 GB heap Looks fine. You need to monitor the crawl for the first week or two so as to know if you need to change this setup. Apart from that, I need to do HBase data sizing to store the product details (which would be around 400 GB of data) can I use the same HBase cluster to store the extracted data where Nutch is running Yes you can. HBase is a black box to me and it would have a bunch of its own configs which you could tune. Can you please let me know your suggestions or recommendations. Thanks and Regards Deepa Devi Jayaveer Mobile No: 9940662806 Tata Consultancy Services Mailto: deepa.jayav...@tcs.com Website: http://www.tcs.com Experience certainty. IT Services Business Solutions Consulting From: Tejas Patil tejas.patil...@gmail.com To: user@nutch.apache.org user@nutch.apache.org Date: 02/13/2014 05:58 AM Subject: Re: sizing guide If you are looking for the specific Nutch 2.1 + MySQL combination, I think that there won't be any on the project wiki. There is no perfect answer for this as it depends on these factors (this list may go on): - Nature of the data that you are crawling: small html files or large documents. - Is it a continuous crawl or a few levels? - Are you re-crawling urls? - How big is the crawl space? - Is it an intranet crawl? How frequently are the pages changed? Nutch 1.x would be a perfect fit for prod level crawls. If you still want to use Nutch 2.x, it would be better to switch to some other datastore (eg. HBase). Below are my experiences with two use cases wherein Nutch 1.x was used in prod: (A) Targeted crawl of a single host In this case I wanted to get the data crawled quickly and didn't bother about the updates that would happen to the pages. I started off with a five node Hadoop cluster but later did the math that it won't get my work done in a few days (remember that you need to have a delay between successive requests which the server agrees on, else your crawler is banned). Later I bumped the cluster to 15 nodes. The pages were HTML files with size roughly 200k. The crawled data roughly needed 200GB and I had storage of about 500GB. (B) Open crawl of several hosts The configs and memory settings were driven by the prod hardware. I had a 4 node hadoop cluster with 64 GB RAM each. 4 GB heap was configured for every hadoop job with the exception of the generate job, which needed more heap (8-10 GB). There was no need to store the crawled data and every batch was deleted as soon as it was processed. 
That said, the disks had a capacity of 2 TB. Thanks, Tejas On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Hi, Am using Nutch 2.1 with MySQL. Is there a sizing guide available for Nutch 2.1? Are there any recommendations that could be given on sizing memory, CPU and disk space for crawling. Thanks and Regards Deepa Devi Jayaveer Mobile No: 9940662806 Tata Consultancy Services Mailto: deepa.jayav...@tcs.com Website: http://www.tcs.com Experience certainty. IT Services Business Solutions Consulting
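For the per-job heap numbers discussed above, the usual knob on a Hadoop 1.x cluster is the child JVM options; a hedged mapred-site.xml sketch (the value is the 4 GB figure from the thread, not a recommendation):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>

A heavier job like generate can then be launched with a larger one-off override, e.g. -D mapred.child.java.opts=-Xmx8192m on the command line.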
Re: how can I download the source code of Nutch's dependency jars
Have you tried this? http://java.dzone.com/articles/ivy-how-retrieve-source-codes Thanks, Tejas On Wed, Feb 12, 2014 at 12:43 AM, Gavin 274614...@qq.com wrote: Maven can do this. How can I do this with ivy? -- Original -- From: 274614348;274614...@qq.com; Date: Wed, Feb 12, 2014 04:37 PM To: useruser@nutch.apache.org; Subject: how can I download the source code of Nutch's dependency jars How can I download the dependency jars' source code with ivy? Thanks a lot!
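For reference, the usual Ivy trick (which that article describes) is to request the Maven sources classifier explicitly via the maven namespace; a hedged ivy.xml fragment, with log4j as a stand-in dependency:

<ivy-module version="2.0" xmlns:m="http://ant.apache.org/ivy/maven">
  ...
  <dependency org="log4j" name="log4j" rev="1.2.17">
    <artifact name="log4j" type="jar"/>
    <artifact name="log4j" type="source" ext="jar" m:classifier="sources"/>
  </dependency>
  ...
</ivy-module>

After the next resolve, the -sources jars should land in the ivy cache alongside the binaries.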
Re: sizing guide
If you are looking for the specific Nutch 2.1 + MySQL combination, I think that there won't be any on the project wiki. There is no perfect answer for this as it depends on these factors (this list may go on): - Nature of the data that you are crawling: small html files or large documents. - Is it a continuous crawl or a few levels? - Are you re-crawling urls? - How big is the crawl space? - Is it an intranet crawl? How frequently are the pages changed? Nutch 1.x would be a perfect fit for prod level crawls. If you still want to use Nutch 2.x, it would be better to switch to some other datastore (eg. HBase). Below are my experiences with two use cases wherein Nutch 1.x was used in prod: (A) Targeted crawl of a single host In this case I wanted to get the data crawled quickly and didn't bother about the updates that would happen to the pages. I started off with a five node Hadoop cluster but later did the math that it won't get my work done in a few days (remember that you need to have a delay between successive requests which the server agrees on, else your crawler is banned). Later I bumped the cluster to 15 nodes. The pages were HTML files with size roughly 200k. The crawled data roughly needed 200GB and I had storage of about 500GB. (B) Open crawl of several hosts The configs and memory settings were driven by the prod hardware. I had a 4 node hadoop cluster with 64 GB RAM each. 4 GB heap was configured for every hadoop job with the exception of the generate job, which needed more heap (8-10 GB). There was no need to store the crawled data and every batch was deleted as soon as it was processed. That said, the disks had a capacity of 2 TB. Thanks, Tejas On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Hi, Am using Nutch 2.1 with MySQL. Is there a sizing guide available for Nutch 2.1? Are there any recommendations that could be given on sizing memory, CPU and disk space for crawling. Thanks and Regards Deepa Devi Jayaveer Mobile No: 9940662806 Tata Consultancy Services Mailto: deepa.jayav...@tcs.com Website: http://www.tcs.com Experience certainty. IT Services Business Solutions Consulting
Re: HTML tag filtering
On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander markus.kallan...@nasdaqomx.com wrote: Hi, The patch seems to fulfil my needs, but how do I use it with Nutch 1.7? From your local trunk checkout, run these commands in shell: wget https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch patch -p0 < blacklist_whitelist_plugin.patch ant clean runtime Now you have successfully applied that patch to your local copy of the nutch codebase. That patch is old and I am not sure if it would compile correctly, so you may have to look in the codebase and tweak it. Thanks, Tejas Is the patch not released yet? Markus Källander Mobile +46 73 622 0547 -Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] Sent: den 11 februari 2014 17:44 To: user@nutch.apache.org Subject: Re: HTML tag filtering Hi Markus, in short, you have to write a parse filter plugin which does the following in the filter(...) method: 1. traverse the DOM tree and construct a clean text by skipping certain content. See o.a.n.utils.NodeWalker and o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the parse-html plugin) 2. then replace the old plain text in ParseResult by the new clean text Maybe this issue can help (there is also a patch but I'm not sure whether it's working and fulfills your needs): https://issues.apache.org/jira/browse/NUTCH-585 Sebastian On 02/11/2014 04:24 PM, Markus Källander wrote: Hi, How do I skip indexing of HTML tags with certain ids or css classes? I am using Nutch 1.7. Thanks Markus
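To make Sebastian's recipe concrete, here is a minimal, hypothetical HtmlParseFilter along those lines (class name and the hard-coded blacklist rules are invented; the signature matches the Nutch 1.x extension point, but treat this as a sketch, not the plugin from NUTCH-585):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TagBlacklistParseFilter implements HtmlParseFilter {
  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuilder text = new StringBuilder();
    walk(doc, text);
    Parse parse = parseResult.get(content.getUrl());
    // replace the parse text; keep the parse data (outlinks, metadata) untouched
    parseResult.put(content.getUrl(), new ParseText(text.toString().trim()),
        parse.getData());
    return parseResult;
  }

  // depth-first walk that skips blacklisted elements and collects text nodes
  private void walk(Node node, StringBuilder text) {
    if (node.getNodeType() == Node.ELEMENT_NODE && isBlacklisted((Element) node))
      return; // skip this subtree entirely
    if (node.getNodeType() == Node.TEXT_NODE)
      text.append(node.getNodeValue()).append(' ');
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++)
      walk(children.item(i), text);
  }

  private boolean isBlacklisted(Element e) {
    // assumption: hard-coded rules; a real plugin would read these from config
    return "sidebar".equals(e.getAttribute("id"))
        || e.getAttribute("class").contains("ad");
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}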
Re: Nutch 2.2.1 Build stuck while trying to access http://ant.apache.org/ivy/
This has to do more with ant and nothing to do with nutch. Here is a wild idea: Grab a linux box without any internet restrictions, download nutch onto it and build it. In the user home, there would be a hidden directory .ivy2 which is the local ivy cache. Create a tarball of the same and scp it over to your work machine, extract it in the home directory and then run the nutch build. PS: I have never done this for ivy, but I have for maven and it worked. ~tejas On Fri, Feb 7, 2014 at 2:18 PM, A Laxmi a.lakshmi...@gmail.com wrote: Hi, I am having issues building Nutch 2.2.1 behind my company firewall. My build gets stuck here: [ivy:resolve] :: loading settings :: file = ~/nutchtest/nutch/ivy/ivysettings.xml When I contacted the hosting admin, they said - Ant is trying to download files from the internet and it will have problems with our firewalls. You will either have to download the files yourself and then scp/sftp them to the machine. Unfortunately we don't have an http proxy. From further digging, I could see Ant is trying to access this link: http://ant.apache.org/ivy/. Could anyone please advise what I should do to make Ant compile Nutch without accessing the internet? I can download the required files from http://ant.apache.org/ivy/ and scp/sftp them to the server but I am not sure what files to download and where to put them? Thanks for your help!!
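A shell sketch of the cache transfer described above (host names and paths hypothetical):

# on the unrestricted box
cd ~/nutch && ant runtime            # populates the ~/.ivy2 cache
tar czf ivy2-cache.tar.gz -C ~ .ivy2
scp ivy2-cache.tar.gz user@work-machine:
# on the firewalled work machine
tar xzf ivy2-cache.tar.gz -C ~
cd nutch && ant runtime              # resolves from the local cache, no internet needed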
Re: Strange: Nutch didn't crawl level 2 (depth 2) pages
On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi Tejas, It works, and it's great! :) After reconfiguring and many rounds of generate, fetch, parse, update, the pages on the 2nd level are being crawled. One question: is it fine and correct if I modify my current crawler+indexing script into this pseudo (skeleton): # example number of levels / depth (loop) LOOP=4 nutch-inject() loop [<= $LOOP] { nutch-generate() nutch-fetch(a_segment) nutch-parse(a_segment) nutch-updatedb(a_segment) } nutch-solrindex() I don't think that this should be a problem. Remember to pass all the segments generated in the crawl loop to the solrindex job using the -dir option. Thank you! On Mon, Jan 27, 2014 at 3:46 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: OK I will apply it first and update the result. Thanks.- On Sun, Jan 26, 2014 at 11:01 PM, Tejas Patil tejas.patil...@gmail.com wrote: Please copy this at the end (but above the end tag '</configuration>') in your $NUTCH/conf/nutch-site.xml:

<property>
  <name>http.content.limit</name>
  <value>9</value>
</property>
<property>
  <name>http.timeout</name>
  <value>2147483640</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>9</value>
</property>

Please check if the url got fetched correctly after every round: For the first round with seed as http://bappenas.go.id, after the updatedb job, run these to check if they are in the crawldb. The first url must be db_fetched while the second one must be db_unfetched: bin/nutch readdb YOUR_CRAWLDB -url http://bappenas.go.id/ bin/nutch readdb YOUR_CRAWLDB -url http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/ Now crawl for the next depth. After the updatedb job, check if the second url got fetched using the same command again, ie. bin/nutch readdb YOUR_CRAWLDB -url http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/ Note that if there was any redirection, you need to look for the target url in the redirection chain and use that url ahead for debugging. Verify that the content you got for that url had the text Liberal Party in the parsed output using this command: bin/nutch readseg -get LATEST_SEGMENT http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/ For larger segments, you might get an OOM error. So in that case, take the entire segment dump using: bin/nutch readseg -dump LATEST_SEGMENT OUTPUT After all this is verified and everything looks good from the crawling side, run solrindex and check if you get the query results. If not, then there was a problem while indexing the stuff. Thanks, Tejas On Sun, Jan 26, 2014 at 9:09 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi, I just realized that my nutch didn't crawl the articles/pages (depth 2) which are shown on the frontpage. 
My target URL is: http://bappenas.go.id As shown on that frontpage (top right, below the slider banners) there is a text link: Kerjasama Pembangunan Indonesia-Australia Setelah PM Tony Abbot and its URL: http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/?kid=1390691937 I tried to search with the keyword Liberal Party (with quotes), which appears on the page linked above, but got no result :( Following is the search link queried: http://bappenas.go.id/index.php/bappenas_search/result?q=%22Liberal+Party%22 I use this individual script to crawl:
===
# Defines env variables
export JAVA_HOME=/opt/searchengine/jdk1.7.0_45
export PATH=$JAVA_HOME/bin:$PATH
NUTCH=/opt/searchengine/nutch
# Start by injecting the seed url(s) to the nutch crawldb:
$NUTCH/bin/nutch inject $NUTCH/BappenasCrawl/crawldb $NUTCH/urls/seed.txt
# Generate fetch list
$NUTCH/bin/nutch generate $NUTCH/BappenasCrawl/crawldb $NUTCH/BappenasCrawl/segments
# last segment
export SEGMENT=$NUTCH/BappenasCrawl/segments/`ls -tr $NUTCH/BappenasCrawl/segments|tail -1`
# Launch the crawler!
$NUTCH/bin/nutch fetch $SEGMENT -noParsing
# Parse the fetched content:
$NUTCH/bin/nutch parse $SEGMENT
# We need to update the crawl database to ensure that for all future crawls, Nutch only checks the already crawled pages, and only fetches new and changed pages.
$NUTCH/bin/nutch updatedb $NUTCH/BappenasCrawl/crawldb $SEGMENT -filter -normalize
# Indexing our crawl DB with solr
$NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments
===
I run this script daily
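A hedged bash version of the looped skeleton discussed above, built from the same commands as the one-round script (paths as in the thread; the -dir flag hands all the loop's segments to solrindex at the end):

#!/bin/bash
NUTCH=/opt/searchengine/nutch
CRAWL=$NUTCH/BappenasCrawl
LOOP=4
$NUTCH/bin/nutch inject $CRAWL/crawldb $NUTCH/urls/seed.txt
for ((i = 1; i <= LOOP; i++)); do
  $NUTCH/bin/nutch generate $CRAWL/crawldb $CRAWL/segments
  SEGMENT=$CRAWL/segments/$(ls -tr $CRAWL/segments | tail -1)  # newest segment
  $NUTCH/bin/nutch fetch $SEGMENT -noParsing
  $NUTCH/bin/nutch parse $SEGMENT
  $NUTCH/bin/nutch updatedb $CRAWL/crawldb $SEGMENT -filter -normalize
done
# index every segment produced by the loop
$NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ $CRAWL/crawldb -dir $CRAWL/segments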
Re: Email and blogs crawling
Nutch has these protocols implemented: http, https, ftp, file. As long as you get links to your documents in those schemes, Nutch will do the crawl. Thanks, Tejas On Tue, Jan 28, 2014 at 10:07 PM, rashmi maheshwari maheshwari.ras...@gmail.com wrote: I could crawl internet webpages and local directory folders to some extent. How do I implement email and intranet blog crawling? -- Rashmi Be the change that you want to see in this world!
Re: Order of robots file
Hi Markus, I am trying to understand the problem you described. You meant that with the original Nutch robots parsing code, the robots file below allowed your crawler to crawl stuff: User-agent: * Disallow: / User-agent: our_crawler Allow: / But now that you started using the change from NUTCH-1031 [0] (ie. delegation of robots parsing to crawler commons), it blocked your crawler. To make things work, you had to change your robots file to this: User-agent: our_crawler Allow: / User-agent: * Disallow: / Did I understand the problem correctly? [0] : https://issues.apache.org/jira/browse/NUTCH-1031 Thanks, Tejas On Fri, Jan 24, 2014 at 7:29 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I am attempting to merge some Nutch changes back to our own. We aren't using Nutch's CrawlerCommons impl but the old stuff. But because of recording of response time and rudimentary SSL support I decided to move it back to our version. Suddenly I realized a local crawl does not work anymore, it seems to be because of the order of the robots definitions. For example: User-agent: * Disallow: / User-agent: our_crawler Allow: / does not allow our crawler to fetch URLs. But User-agent: our_crawler Allow: / User-agent: * Disallow: / does! This was not the case before; anyone here aware of this? By design? Or is it a flaw? Thanks Markus
Re: Order of robots file
I am working on the scenario you just pointed out. By Apache Nutch, do you mean the current codebase with CC or the version before that? CC differs from the original nutch code, as CC has kind of a greedy approach wherein it tries to get a match / mismatch after every line it sees from the robots file. At the time I was working on the delegation of robots parsing to Crawler commons (CC), I remember that there was a difference in the semantics of the original parsing code and CC's implementation for multiple robots agents. Here was my observation at that time: https://issues.apache.org/jira/browse/NUTCH-1031?focusedCommentId=13558217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13558217 ~tejas On Fri, Jan 24, 2014 at 8:11 PM, Markus Jelsma markus.jel...@openindex.io wrote: Tejas, the problem exists in Apache Nutch as well. We'll take localhost as example, with the following config and robots.txt # cat /var/www/robots.txt User-agent: * Disallow: / User-agent: nutch Allow: / config:

<property>
  <name>http.agent.name</name>
  <value>Mozilla</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>5.0</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>nutch,*</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>compatible; NutchCrawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>+http://example.org/</value>
</property>

URL: http://localhost/ Version: 7 Status: 3 (db_gone) Fetch time: Mon Mar 10 15:36:48 CET 2014 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 3888000 seconds (45 days) Score: 0.0 Signature: null Metadata: _pst_=robots_denied(18), lastModified=0 Can you confirm? -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Friday 24th January 2014 15:29 To: user@nutch.apache.org Subject: RE: Order of robots file Hi, sorry for being unclear. You understand correctly, I had to change the robots.txt order and put our crawler ABOVE User-Agent: *. I have tried a unit test for lib-http to demonstrate the problem but it does not fail. I think I did something wrong in merging the code base. I'll look further. markus@midas:~/projects/apache/nutch/trunk$ svn diff src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java

Index: src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
===
--- src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java (revision 1560984)
+++ src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java (working copy)
@@ -50,6 +50,23 @@
       + "" + CR
       + "User-Agent: *" + CR
       + "Disallow: /foo/bar/" + CR; // no crawl delay for other agents
+
+  private static final String ROBOTS_STRING_REVERSE =
+      "User-Agent: *" + CR
+      + "Disallow: /foo/bar/" + CR // no crawl delay for other agents
+      + "" + CR
+      + "User-Agent: Agent1 #foo" + CR
+      + "Disallow: /a" + CR
+      + "Disallow: /b/a" + CR
+      + "#Disallow: /c" + CR
+      + "Crawl-delay: 10" + CR // set crawl delay for Agent1 as 10 sec
+      + "" + CR
+      + "" + CR
+      + "User-Agent: Agent2" + CR
+      + "Disallow: /a/bloh" + CR
+      + "Disallow: /c" + CR
+      + "Disallow: /foo" + CR
+      + "Crawl-delay: 20" + CR;

   private static final String[] TEST_PATHS = new String[] {
       "http://example.com/a",
@@ -80,6 +97,29 @@
   /**
    * Test that the robots rules are interpreted correctly by the robots rules parser. 
    */
+  public void testRobotsAgentReverse() {
+    rules = parser.parseRules(testRobotsAgent, ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, SINGLE_AGENT);
+
+    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
+      assertTrue("testing on agent (" + SINGLE_AGENT + "), and path "
+          + TEST_PATHS[counter] + " got " + rules.isAllowed(TEST_PATHS[counter]),
+          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
+    }
+
+    rules = parser.parseRules(testRobotsAgent, ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, MULTIPLE_AGENTS);
+
+    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
+      assertTrue("testing on agents (" + MULTIPLE_AGENTS + "), and path "
+          + TEST_PATHS[counter] + " got " + rules.isAllowed(TEST_PATHS[counter]),
+          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
+    }
+  }
+
+  /**
+   * Test that the robots rules are interpreted correctly by the robots rules parser.
+   */
   public void testRobotsAgent() {
     rules = parser.parseRules(testRobotsAgent, ROBOTS_STRING.getBytes(), CONTENT_TYPE, SINGLE_AGENT); -Original
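For anyone wanting to reproduce the order sensitivity outside Nutch, a small standalone sketch against crawler-commons (parseContent is the entry point Nutch delegates to; exact behavior depends on the crawler-commons version, so treat the output as something to verify, not a given):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsOrderCheck {
  public static void main(String[] args) throws Exception {
    String disallowFirst = "User-agent: *\nDisallow: /\n\nUser-agent: nutch\nAllow: /\n";
    String allowFirst = "User-agent: nutch\nAllow: /\n\nUser-agent: *\nDisallow: /\n";
    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    for (String robots : new String[] { disallowFirst, allowFirst }) {
      BaseRobotRules rules = parser.parseContent("http://localhost/robots.txt",
          robots.getBytes("UTF-8"), "text/plain", "nutch");
      // prints whether the root URL is fetchable under each ordering
      System.out.println(rules.isAllowed("http://localhost/"));
    }
  }
}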
Re: WrongRegionException after updatedb
This is tied to HBase and not Nutch. It would be beneficial if you get a complete stack trace and post it over the HBase user group too. ~tejas On Thu, Jan 23, 2014 at 6:49 PM, cervenkovab cervenko...@gmail.com wrote: I ran generate-fetch-parse, and after update I got this exception. 2014-01-23 13:40:56,905 ERROR store.HBaseStore - Failed 747 actions: WrongRegionException: 747 times, servers with issues: server.eu:43556, 2014-01-23 13:40:56,905 ERROR store.HBaseStore - [Ljava.lang.StackTraceElement;@12101d00 Can you please help me understand where the problem can be? using versions: Nutch 2.2.1, HBase 0.90.6 -- View this message in context: http://lucene.472066.n3.nabble.com/WrongRegionException-after-updatedb-tp4112982.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to Get Links With Nutch
Correct me if I am wrong: You want the anchor text and the outlink. Right ? If you crawl the seed url for depth 1 using Nutch 1.x and then get a segment dump of the segment generated after crawl, it should have that information. On Wed, Jan 22, 2014 at 9:46 PM, Teague James teag...@insystechinc.comwrote: I am trying to use Nutch to crawl a site and return all of the links that are on a page. As a simple example, the page might look like this if its address were www.example.com and each of the items in [brackets] were links of some sort - relative or full URLs: Article 1 text blah blah blah [Read more] Download [Article 1 PDF] Article 2 text blah blah blah [Read more] Download [Article 2 PDF] In partnership with [Some Partner] [Home]|[Articles]|[Contact Us] What I want to get is a list of all the links and destination URLs, something like: [Read more] /article1 [Article 1 PDF] /pdfs/article1.pdf [Read more] /article2 [Article 2 PDF] /pdfs/article2.pdf [Some Partner] www.somepartner.com [Home] /home [Articles] /articles [Contact Us] /contact us Note that a lot of the links are relative. I don't care whether I can get only the relative /article1 or the full www.example.com/article1 and I do not necessarily need Nutch to go to each of those links and crawl them. I just want Nutch to report on all of the links on the page. Can anyone offer me any advice on how to accomplish this?
Re: How to Get Links With Nutch
On Wed, Jan 22, 2014 at 10:33 PM, Teague James teag...@insystechinc.com wrote: Tejas, Thanks for your response, that is exactly correct. Ultimately I want to be able to index the Nutch crawl with Solr to make it all searchable. After doing my crawl, I use: bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/ But I do not get all of the anchors. I get some anchors as a comma delimited list in the anchor field. The HTML parser in Nutch is doing this. I do not get any of the outlinks. I think that the indexer only indexes the parsed content from the segments and the outlinks won't be included. That's why you see that happening. I did a dump with readdb of the crawldb and found that the links I want are there. I will take a look at doing a segment dump as you suggest. OK. Look for the outlinks section in the segment dump. Will that make these outlinks available to Solr or are there additional steps I need to take? I am not sure if there is a better way than this: Write your own HTML parser (or tweak the one provided with Nutch) which would just emit the outlinks along with their anchor text. -Original Message- Correct me if I am wrong: You want the anchor text and the outlink. Right? If you crawl the seed url to depth 1 using Nutch 1.x and then get a segment dump of the segment generated by the crawl, it should have that information. On Wed, Jan 22, 2014 at 9:46 PM, Teague James teag...@insystechinc.com wrote: I am trying to use Nutch to crawl a site and return all of the links that are on a page. As a simple example, the page might look like this if its address were www.example.com and each of the items in [brackets] were links of some sort - relative or full URLs: Article 1 text blah blah blah [Read more] Download [Article 1 PDF] Article 2 text blah blah blah [Read more] Download [Article 2 PDF] In partnership with [Some Partner] [Home]|[Articles]|[Contact Us] What I want to get is a list of all the links and destination URLs, something like: [Read more] /article1 [Article 1 PDF] /pdfs/article1.pdf [Read more] /article2 [Article 2 PDF] /pdfs/article2.pdf [Some Partner] www.somepartner.com [Home] /home [Articles] /articles [Contact Us] /contact us Note that a lot of the links are relative. I don't care whether I can get only the relative /article1 or the full www.example.com/article1 and I do not necessarily need Nutch to go to each of those links and crawl them. I just want Nutch to report on all of the links on the page. Can anyone offer me any advice on how to accomplish this?
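To pull just the outlinks and anchors out of a segment, a readseg dump restricted to the parse_data part is usually enough (segment name hypothetical; the -no* flags suppress the other segment parts):

bin/nutch readseg -dump crawl/segments/20140122123456 segdump \
  -nocontent -nofetch -nogenerate -noparse -noparsetext
# parse_data records list entries like: outlink: toUrl: /article1 anchor: Read more
grep "outlink:" segdump/dump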
Request for reviewing HostDb and Sitemap features
Hi, Is anyone interested in reviewing or trying out the patch for these new features ? I have recently updated [0] and [1] and would like to hear back comments on the same. [0] : https://issues.apache.org/jira/browse/NUTCH-1325 [1] : https://issues.apache.org/jira/browse/NUTCH-1465 Thanks, Tejas
Re: The problem caused by failed with: java.io.IOException: unzipBestEffort returned null
I tried to debug the issue. You are hit by https://issues.apache.org/jira/browse/NUTCH-1647 Thanks, Tejas On Fri, Dec 27, 2013 at 8:54 AM, yan wang dayank...@gmail.com wrote: Hi, guys Yesterday, I tried to crawl a website (a Chinese website) with some seed links like this: http://www.ccgp.gov.cn/cggg/dfbx/gkzb/default_4.shtml but the crawl process failed because of a problem shown as follows: fetching http://www.ccgp.gov.cn/cggg/dfbx/gkzb/default_4.shtml (queue crawl delay=5000ms) fetch of http://www.ccgp.gov.cn/cggg/dfbx/gkzb/default_4.shtml failed with: java.io.IOException: unzipBestEffort returned null At first, I used nutch-1.5.1 to crawl the website and had the above problem, then I changed to nutch-1.7 to try it again but it failed again. Now, I totally have no idea how to handle the problem! I would really appreciate any feedback! -Yan Wang
Re: Using ParseUtils in MR job (not as part of nutch crawl)
On Sun, Dec 22, 2013 at 4:39 AM, Amit Sela am...@infolinks.com wrote: Hi all, I'm trying to use the nutch ParseUtil to parse nutch Content with parse-tika and parse-html By nutch content, do you mean a nutch segment? Please try using the 'bin/nutch parse' command instead. but I keep getting: RuntimeException: x point org.apache.nutch.parse.Parser not found This smells like some problem in loading the plugins. I'm running this in a MR job outside of the nutch crawl jobs, and when I run it in the IDE I have to add the build/ directory to the project classpath in order to solve it. The bin/nutch script generates the appropriate classpath before invoking the class. You can get the value of the CLASSPATH formed by the script and try to get the same in the IDE. Glad that you found a way around. I hoped distributing the apache-nutch-1.7.jar (version I use) to the data nodes' classpath directories would help, I even added parse-plugins.xml but it won't do... I hope that you were running from runtime/deploy for distributed mode. No need to distribute the jar. Hadoop does that for you. Even the configs are inside the runtime/deploy/apache-nutch-1.XX-.job file. Anyone managed that? Thanks, Amit.
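If the plugins must load outside bin/nutch (in an IDE or a custom MR driver), pointing plugin.folders at an absolute path is the usual workaround; a hedged sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

Configuration conf = NutchConfiguration.create();
// bin/nutch resolves the default relative "plugins" entry via the classpath;
// without the script, hand it the absolute location of the built plugins
conf.set("plugin.folders", "/home/me/nutch/build/plugins");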
Re: Crawling a specific site only
You need to provide the topN parameter to run Generate; you can't skip it. What I meant was: set its value to more than 2000. Note: The max allowable value for topN is (2^63)-1. Don't exceed that. Thanks, Tejas On Wed, Dec 18, 2013 at 2:14 AM, Vangelis karv karvouni...@hotmail.com wrote: Thanks for the support guys! I'll crawl again with generate.count.mode=host and generate.max.count=-1. Although, if I don't set -topN in the nutch script it won't let me run GeneratorJob. Subject: RE: Crawling a specific site only From: markus.jel...@openindex.io To: user@nutch.apache.org Date: Wed, 18 Dec 2013 09:38:04 + Increase it to a reasonably high value or don't set it at all; it will then attempt to crawl as much as it can. Also check generate.count.mode and generate.max.count. -Original message- From:Vangelis karv karvouni...@hotmail.com Sent: Wednesday 18th December 2013 9:56 To: user@nutch.apache.org Subject: RE: Crawling a specific site only Can you be a little more specific about that, Tejas? Date: Tue, 17 Dec 2013 23:32:46 -0800 Subject: Re: Crawling a specific site only From: tejas.patil...@gmail.com To: user@nutch.apache.org You should bump up the value of topN instead of setting it to 2000. That would make a lot of the urls eligible for fetching. Thanks, Tejas On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv karvouni...@hotmail.com wrote: Markus and Wang thank you very much for your fast responses. I forgot to mention that I use nutch 2.2.1 and mysql. Both the DomainFilter and ignore.external.links ideas are awesome! What really bothers me is that dreaded -topN. I really want to live without it! :) I hate it when I open my database and I see that I have for example 2000 links unfetched, which means they are not parsed (useless), and only 2000 fetched. Subject: Re: Crawling a specific site only From: wangyi1...@gmail.com To: user@nutch.apache.org Date: Tue, 17 Dec 2013 18:53:55 +0800 HI Just set <name>db.ignore.external.links</name> <value>true</value> and run the crawl script several times; the default number of pages to be added is 50,000. Is that right? Wang -Original Message- From: Vangelis karv karvouni...@hotmail.com Reply-to: user@nutch.apache.org To: user@nutch.apache.org user@nutch.apache.org Subject: Crawling a specific site only Date: Tue, 17 Dec 2013 12:15:00 +0200 Hi again! My goal is to crawl a specific site. I want to crawl all the links that exist under that site. For example, if I decide to crawl http://www.uefa.com/, I want to parse all its inlinks (photos, videos, htmls etc) and not only the best scoring urls for this site = topN. So, my question here is: how can we tell Nutch to crawl everything in a site and not only the pages that have the best score?
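One way to get the "everything under one site" behavior without fighting topN is to pin the crawl to the host in conf/regex-urlfilter.txt (the pattern is hypothetical, and it must sit above the catch-all reject rule):

# accept anything on uefa.com and its subdomains
+^https?://([a-z0-9-]+\.)*uefa\.com/
# reject everything else
-.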
Re: Memory leak when crawling repeatedly?
You should use the bin/crawl script instead of directly invoking Crawl() Thanks, Tejas On Tue, Dec 17, 2013 at 7:04 AM, yann yann1...@yahoo.com wrote: Thanks Julien, I will give it a try and report back. Is there sample code in trunk on what to replace the Crawl() with? Yann -- View this message in context: http://lucene.472066.n3.nabble.com/Memory-leak-when-crawling-repeatedly-tp4106960p4107114.html Sent from the Nutch - User mailing list archive at Nabble.com.
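For reference, the replacement is the shell driver shipped with 1.x rather than a Java entry point; a typical invocation (paths and Solr URL hypothetical):

bin/crawl urls/dev crawls/dev http://localhost:8983/solr/ 2
# args: <seedDir> <crawlDir> <solrURL> <numberOfRounds>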
Re: Crawling a specific site only
You should bump up the value of topN instead of setting it to 2000. That would make a lot of the urls eligible for fetching. Thanks, Tejas On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv karvouni...@hotmail.com wrote: Markus and Wang thank you very much for your fast responses. I forgot to mention that I use nutch 2.2.1 and mysql. Both the DomainFilter and ignore.external.links ideas are awesome! What really bothers me is that dreaded -topN. I really want to live without it! :) I hate it when I open my database and I see that I have for example 2000 links unfetched, which means they are not parsed (useless), and only 2000 fetched. Subject: Re: Crawling a specific site only From: wangyi1...@gmail.com To: user@nutch.apache.org Date: Tue, 17 Dec 2013 18:53:55 +0800 HI Just set <name>db.ignore.external.links</name> <value>true</value> and run the crawl script several times; the default number of pages to be added is 50,000. Is that right? Wang -Original Message- From: Vangelis karv karvouni...@hotmail.com Reply-to: user@nutch.apache.org To: user@nutch.apache.org user@nutch.apache.org Subject: Crawling a specific site only Date: Tue, 17 Dec 2013 12:15:00 +0200 Hi again! My goal is to crawl a specific site. I want to crawl all the links that exist under that site. For example, if I decide to crawl http://www.uefa.com/, I want to parse all its inlinks (photos, videos, htmls etc) and not only the best scoring urls for this site = topN. So, my question here is: how can we tell Nutch to crawl everything in a site and not only the pages that have the best score?
Re: Memory leak when crawling repeatedly?
Did you see the logs and figure out from the stack trace which portion of the code is responsible for the OOM? Thanks, Tejas On Mon, Dec 16, 2013 at 9:32 AM, yann yann1...@yahoo.com wrote: Hi guys, I'm writing a server / REST API for Nutch, but I'm running into a memory leak issue. I simplified the problem down to this: crawling a site repeatedly (as below) will eventually run out of memory; when looking at the running JVM with VisualVM, the permGen space grows indefinitely at a constant rate, until it runs out and the application crashes. I suspect there is a memory leak in Nutch or in Hadoop, as I would expect the code below not to grow its memory footprint indefinitely. The code:

while (true) {
  Configuration configuration = NutchConfiguration.create();
  String crawlArg = "config/urls/dev -dir crawls/dev -threads 5 -depth 2 -topN 100";
  ToolRunner.run(configuration, new Crawl(), MiscUtils.tokenize(crawlArg));
}

Anything I can do on my side to fix this? Thanks for all comments, Yann -- View this message in context: http://lucene.472066.n3.nabble.com/Memory-leak-when-crawling-repeatedly-tp4106960.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch with YARN (aka Hadoop 2.0)
I am not able to locate the logs either to confirm that. Can you please let me know how to retrieve logs from Nutch on Hadoop? You should have seen the jobs over the Hadoop UI, and for each mapper and reducer you can view the logs in the browser by clicking on the job. YARN supposed to be backward compatible? It is. Nutch has not migrated fully to support the new MapReduce API from which YARN sprang. On Mon, Dec 9, 2013 at 3:11 PM, S.L simpleliving...@gmail.com wrote: Isn't YARN supposed to be backward compatible? Sent from my HTC Inspire™ 4G on ATT - Reply message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org user@nutch.apache.org Cc: d...@nutch.apache.org d...@nutch.apache.org Subject: Nutch with YARN (aka Hadoop 2.0) Date: Mon, Dec 9, 2013 3:54 am I don't think Nutch has been fully ported to the new mapreduce API, which is a prerequisite for running it on Hadoop 2. I can't think of a reason why the performance would be any different with YARN. Julien On 9 December 2013 06:42, Tejas Patil tejas.patil...@gmail.com wrote: Has anyone tried out running Nutch over YARN? If so, were there any performance gains with the same? Thanks, Tejas -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: Nutch Hadoop Job plugins property
When you run Nutch over Hadoop, ie. deploy mode, you use the job file (apache-nutch-1.X.job). This is nothing but a big fat zip file containing (you can unzip it and verify yourself): (a) all the nutch classes compiled, (b) config files and (c) dependent jars. When hadoop launches map-reduce jobs for nutch: 1. This nutch job file is copied over to the node where your task is executed (say, a map task), 2. It is unpacked 3. Nutch gets the nutch-site.xml and nutch-default.xml and loads the configs. 4. By default, plugin.folders is set to plugins, which is a relative path. It searches for the plugin classes in the classpath under a directory named plugins. 5. The plugins directory is under a directory named classes which is in the classpath (this is inside the extracted job file). Now, the required plugin classes are loaded from here and everything runs fine. In short: Leave it as it is. It should work over Hadoop by default. Thanks, Tejas On Mon, Dec 9, 2013 at 4:54 PM, S.L simpleliving...@gmail.com wrote: What should the plugins property be set to when running Nutch as a Hadoop job? I just created a deploy mode jar by running the ant script. I see that the value of the plugins property is being copied and used from the configuration into the hadoop job. While it seems to be getting the plugins directory because Hadoop is being run on the same machine, I am sure it will fail when moved to a different machine. How should I set the plugins property so that it is relative to the hadoop job? Thanks
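You can sanity-check the job file contents yourself (the file name depends on your build):

unzip -l runtime/deploy/apache-nutch-1.7.job | grep -m 5 plugins
# should list classes/plugins/... entries, which is where the relative
# plugin.folders value resolves to once the job file is unpacked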
Re: Nutch with YARN (aka Hadoop 2.0)
On Mon, Dec 9, 2013 at 4:50 PM, S.L simpleliving...@gmail.com wrote: It is. Nutch has not migrated fully to support the new MapReduce API from which YARN was sprung. Yes, Nutch 1.7 is using the old API; does this mean that Nutch 1.7 is incompatible with Hadoop 2.2? I am thinking it is compatible in its current state and that's why they are calling Hadoop 2.2 backward compatible. Isn't that the case? Well, I have not run it with the latest 2.X hadoop version, but you are right about backward compatibility. They would just not make the old code break unless there is some strong reason. Having said that, if Nutch keeps using the old deprecated hadoop API, you won't get any of the benefits of the changes done in newer map reduce versions. PS: Here is the relevant nutch jira for upgrading to the new Hadoop API: https://issues.apache.org/jira/browse/NUTCH-1219 Thanks, Tejas On Mon, Dec 9, 2013 at 6:22 PM, Tejas Patil tejas.patil...@gmail.com wrote: I am not able to locate the logs either to confirm that. Can you please let me know how to retrieve logs from Nutch on Hadoop? You should have seen the jobs over the Hadoop UI, and for each mapper and reducer you can view the logs in the browser by clicking on the job. YARN supposed to be backward compatible? It is. Nutch has not migrated fully to support the new MapReduce API from which YARN sprang. On Mon, Dec 9, 2013 at 3:11 PM, S.L simpleliving...@gmail.com wrote: Isn't YARN supposed to be backward compatible? Sent from my HTC Inspire™ 4G on ATT - Reply message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org user@nutch.apache.org Cc: d...@nutch.apache.org d...@nutch.apache.org Subject: Nutch with YARN (aka Hadoop 2.0) Date: Mon, Dec 9, 2013 3:54 am I don't think Nutch has been fully ported to the new mapreduce API, which is a prerequisite for running it on Hadoop 2. I can't think of a reason why the performance would be any different with YARN. Julien On 9 December 2013 06:42, Tejas Patil tejas.patil...@gmail.com wrote: Has anyone tried out running Nutch over YARN? If so, were there any performance gains with the same? Thanks, Tejas -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: load plugin from jar file
I am not familiar with Clojure at all. The Nutch plugin loading code is tricky, and hacking it to invoke Clojure code directly would be significant work. My feeling is that if you cook up a plugin in java and then call Clojure code through this java wrapper, this might work. Thanks, Tejas On Mon, Dec 9, 2013 at 5:03 PM, Olle Romo oller...@metasound.ch wrote: Hi All, According to NUTCH-609 there was some talk about allowing plugins to load from jar files. Looks like that was from a few years ago. Can I write a plugin in Clojure and have it load ok? Best, Olle
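If the wrapper route is taken, one hypothetical shape for it is a URLFilter that hands off to a Clojure function via clojure.lang.RT (the script name, namespace and function are invented; RT is the older interop entry point, which is what would have been available at the time):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class ClojureUrlFilter implements URLFilter {
  private Configuration conf;

  static {
    try {
      // assumption: my_filter.clj is bundled inside the plugin jar
      clojure.lang.RT.loadResourceScript("my_filter.clj");
    } catch (Exception e) {
      throw new RuntimeException("could not load Clojure script", e);
    }
  }

  // URLFilter contract: return the url to keep it, null to reject it
  public String filter(String url) {
    Object out = clojure.lang.RT.var("my.filter", "filter-url").invoke(url);
    return out == null ? null : out.toString();
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}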
Re: Nutch Hadoop Job plugins property
On Mon, Dec 9, 2013 at 6:07 PM, S.L simpleliving...@gmail.com wrote: Thanks for a great reply! Right now I have 4 urls in my seed file with domains d1, d2, d3, d4. I see that when the nutch job is run on Hadoop it's only picking up URLs for d4; there does not seem to be any parallelism. I would recommend you to run all phases of nutch INDIVIDUALLY and look into the logs for the generate and fetch phases. Set the log level for generate to DEBUG. One possible reason: All urls of host 'd4' had higher scores than the other ones. This is less likely to cause this issue as your topN value is large. I am running the Nutch job using the following command. bin/hadoop jar /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000 -topN 3 I am not sure, but I think that the crawl command is deprecated. You might have to use the 'bin/crawl' script instead. On Mon, Dec 9, 2013 at 8:16 PM, Tejas Patil tejas.patil...@gmail.com wrote: When you run Nutch over Hadoop, ie. deploy mode, you use the job file (apache-nutch-1.X.job). This is nothing but a big fat zip file containing (you can unzip it and verify yourself): (a) all the nutch classes compiled, (b) config files and (c) dependent jars. When hadoop launches map-reduce jobs for nutch: 1. This nutch job file is copied over to the node where your task is executed (say, a map task), 2. It is unpacked 3. Nutch gets the nutch-site.xml and nutch-default.xml and loads the configs. 4. By default, plugin.folders is set to plugins, which is a relative path. It searches for the plugin classes in the classpath under a directory named plugins. 5. The plugins directory is under a directory named classes which is in the classpath (this is inside the extracted job file). Now, the required plugin classes are loaded from here and everything runs fine. In short: Leave it as it is. It should work over Hadoop by default. Thanks, Tejas On Mon, Dec 9, 2013 at 4:54 PM, S.L simpleliving...@gmail.com wrote: What should the plugins property be set to when running Nutch as a Hadoop job? I just created a deploy mode jar by running the ant script. I see that the value of the plugins property is being copied and used from the configuration into the hadoop job. While it seems to be getting the plugins directory because Hadoop is being run on the same machine, I am sure it will fail when moved to a different machine. How should I set the plugins property so that it is relative to the hadoop job? Thanks
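The DEBUG switch for the generate phase looks like this (logger name per the conf/log4j.properties layout shipped with Nutch 1.x):

log4j.logger.org.apache.nutch.crawl.Generator=DEBUG,cmdstdout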
Re: Unsuccessful fetch/parse of large page with many outlinks
I think that you narrowed it down, and most probably it's some bug/incompatibility of the HTTP library which nutch uses to talk with the server. Were both the servers where you hosted the url running IIS 6.0? If yes, then there is more :) Thanks, Tejas On Mon, Dec 9, 2013 at 3:32 PM, Iain Lopata ilopa...@hotmail.com wrote: Out of ideas at this point. I can retrieve the page with Curl I can retrieve the page with Wget I can view the page in my browser I can retrieve the page by opening a socket from a PHP script I can retrieve the page with nutch if I move the page to another host But any page I try to fetch from www.friedfrank.com with Nutch reads just 198 bytes and then closes the stream. Debug code inserted in HttpResponse and WireShark both show that this is the case. Could someone else please try to fetch a page from this host with your config? My suspicion is that it is related to this host being on IIS 6.0, with this problem being a potential cause: http://support.microsoft.com/kb/919797 -Original Message- From: Iain Lopata [mailto:ilopa...@hotmail.com] Sent: Monday, December 09, 2013 7:36 AM To: user@nutch.apache.org Subject: RE: Unsuccessful fetch/parse of large page with many outlinks Parses 652 outlinks from the ebay url without any difficulty. Didn't want to change the title and thereby break this thread, but at this point, and as stated in my last post, I am reasonably confident that for some reason the InputReader in HttpResponse.java sees the stream as closed after reading only 198 bytes. Why, I do not know. -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: Sunday, December 08, 2013 11:44 PM To: user@nutch.apache.org Subject: Re: Unsuccessful fetch/parse of large page with many outlinks I faced a similar problem with this page http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 when I was running Nutch from within eclipse. I was able to crawl all the outlinks successfully when I ran nutch as a jar outside of eclipse; at that point it was considered to be an issue with running it in eclipse. Can you please try this URL with your setup? It has at least 600+ outlinks. On Sun, Dec 8, 2013 at 10:07 PM, Iain Lopata ilopa...@hotmail.com wrote: Some further analysis - no solution. The pages in question do not return a Content-Length header. Since http.content.limit is set to -1, http-protocol sets the maximum read length to 2147483647. At line 231 of HttpResponse.java the loop: for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes)) executes once and once only and returns a stream of just 198 bytes. No exceptions are thrown. So, I think, the question becomes: why would this connection close before the end of the stream? It certainly seems to be server specific, since I can retrieve the file successfully from a different host domain. -Original Message- From: Tejas Patil [mailto:tejas.patil...@gmail.com] Sent: Sunday, December 08, 2013 2:29 PM To: user@nutch.apache.org Subject: Re: Unsuccessful fetch/parse of large page with many outlinks debug code that I have inserted in a custom filter shows that the file that was retrieved is only 198 bytes long. I am assuming that this code did not hinder the crawler. A better way to see the content would be to take a segment dump [0] and then analyse it. Also, turn on DEBUG mode of the log4j for the http protocol classes and fetcher class. attempted to crawl it from that site and it works fine, retrieving all 597KB and parsing it successfully. 
You mean that you ran a nutch crawl with the problematic url as a seed and used the EXACT same config on both machines. One machine gave perfect content and the other one did not. Note that using the EXACT same config over these 2 runs is important. the page has about 350 characters of LineFeeds, CarriageReturns and spaces No way. The HTTP request gets a byte stream as response. Also, had it been the case that LF or CR chars create a problem, then it would hit nutch irrespective of which machine you run nutch from... but that's not what your experiments suggest. [0] : http://wiki.apache.org/nutch/bin/nutch_readseg On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata ilopa...@hotmail.com wrote: I do not know whether this would be a factor, but I have noticed that the page has about 350 characters of LineFeeds, CarriageReturns and spaces before the <!DOCTYPE declaration. Could this be causing a problem for http-protocol in some way? However, I can't explain why the same file with the same LF, CR and whitespace would read correctly from a different host. -Original Message- From: Iain Lopata [mailto:ilopa...@hotmail.com] Sent: Sunday, December 08, 2013 12:06 PM To: user@nutch.apache.org Subject: Unsuccessful fetch/parse of large page
Re: Unsuccessful fetch/parse of large page with many outlinks
debug code that I have inserted in a custom filter shows that the file that was retrieved is only 198 bytes long. I am assuming that this code did not hinder the crawler. A better way to see the content would be to take a segment dump [0] and then analyse it. Also, turn on DEBUG mode of the log4j for the http protocol classes and fetcher class. attempted to crawl it from that site and it works fine, retrieving all 597KB and parsing it successfully. You mean that you ran a nutch crawl with the problematic url as a seed and used the EXACT same config on both machines. One machine gave perfect content and the other one did not. Note that using the EXACT same config over these 2 runs is important. the page has about 350 characters of LineFeeds, CarriageReturns and spaces No way. The HTTP request gets a byte stream as response. Also, had it been the case that LF or CR chars create a problem, then it would hit nutch irrespective of which machine you run nutch from... but that's not what your experiments suggest. [0] : http://wiki.apache.org/nutch/bin/nutch_readseg On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata ilopa...@hotmail.com wrote: I do not know whether this would be a factor, but I have noticed that the page has about 350 characters of LineFeeds, CarriageReturns and spaces before the <!DOCTYPE declaration. Could this be causing a problem for http-protocol in some way? However, I can't explain why the same file with the same LF, CR and whitespace would read correctly from a different host. -Original Message- From: Iain Lopata [mailto:ilopa...@hotmail.com] Sent: Sunday, December 08, 2013 12:06 PM To: user@nutch.apache.org Subject: Unsuccessful fetch/parse of large page with many outlinks I am running Nutch 1.6 on Ubuntu Server. I am experiencing a problem with one particular webpage. If I use parsechecker against the problem url, the output shows (host name changed to example.com): fetching: http://www.example.com/index.cfm?pageID=12 text/html parsing: http://www.example.com/index.cfm?pageID=12 contentType: text/html signature: a9c640626fcad48caaf3ad5f94bea446 - Url --- http://www.example.com/index.cfm?pageID=12 - ParseData - Version: 5 Status: success(1,0) Title: Outlinks: 0 Content Metadata: Date=Sun, 08 Dec 2013 17:32:33 GMT Set-Cookie=CFTOKEN=96208061;path=/ Content-Type=text/html; charset=UTF-8 Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0 Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 However, this page has 3775 outlinks. If I run a crawl with this page as a seed, the log file shows that the file was fetched successfully, but debug code that I have inserted in a custom filter shows that the file that was retrieved is only 198 bytes long. For some reason the file seems to be truncated or otherwise corrupted. I can retrieve the file with wget and can see that the file is 597KB. I copied the file that I retrieved with wget to another web server and attempted to crawl it from that site and it works fine, retrieving all 597KB and parsing it successfully. This would suggest that my current configuration does not have a problem processing this large file. I have checked the robots.txt file on the original host and it allows retrieval of this web page. Other relevant configuration settings may be:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>http.timeout</name>
  <value>6</value>
  <description></description>
</property>

Any ideas on what to check next?
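A quick header comparison often narrows this class of problem down; a hedged shell check (nothing Nutch-specific, just standard curl flags):

curl -sv -o /dev/null http://www.friedfrank.com/ 2>&1 | \
  grep -i -e 'content-length' -e 'transfer-encoding' -e 'connection'
# a response with neither Content-Length nor chunked Transfer-Encoding relies
# on the server closing the connection to mark the end of the body, which is
# exactly where an IIS 6.0 bug like KB919797 would bite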
Nutch with YARN (aka Hadoop 2.0)
Has anyone tried out running Nutch over YARN? If so, were there any performance gains with the same? Thanks, Tejas
Re: Manipulating Nutch 2.2.1 scoring system
Hi Vangelis, You can write your own implementation of scoring and make nutch use it via a plugin. - Go through [0] to understand how to write a custom plugin - For scoring, your class should implement the ScoringFilter interface [1]. [0] : http://wiki.apache.org/nutch/WritingPluginExample [1] : http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/scoring/ScoringFilter.java Thanks, Tejas On Sat, Dec 7, 2013 at 4:32 AM, Talat UYARER talat.uya...@agmlab.com wrote: Hi Olle, I don't know of any existing working diagram for Nutch 1.x. But if you can make one, we will be glad. :) Talat On 06-12-2013 14:00, Olle Romo wrote: Hi Talat, I'm at an early stage of learning Nutch and your diagram is _very_ helpful. Would you happen to have a diagram for 1.x too? Or is there not much difference at the architecture level? Best, Olle On Dec 5, 2013, at 9:07 PM, Talat UYARER talat.uya...@agmlab.com wrote: Hi Vangelis, I drew a Nutch Software Architecture diagram. Maybe it can be of help to you. https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/edit?usp=sharing Talat On 05-12-2013 19:09, Vangelis karv wrote: It is clear that OPICScoringFilter does all the work of creating the scores for the urls. I just wanted to know if it is possible to implement another scoring function and if there are any available on the Internet. Does anybody know where exactly in the code Nutch calls that function (Generator, Fetcher, Parser)?
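Registering such a class follows the same pattern as any plugin; a hypothetical plugin.xml (ids, names and the implementation class are all invented for illustration):

<plugin id="scoring-custom" name="Custom Scoring Filter"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <library name="scoring-custom.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.example.scoring.custom" name="Custom Scoring"
             point="org.apache.nutch.scoring.ScoringFilter">
    <implementation id="CustomScoringFilter" class="org.example.CustomScoringFilter"/>
  </extension>
</plugin>

The plugin is then activated by adding its id to the plugin.includes property.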
Re: Manipulating Nutch 2.2.1 scoring system
I think that the CrawlDatum of each url contains its score. You can get a crawldb dump and see that. Tejas On Sat, Dec 7, 2013 at 1:37 PM, Ing. Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: How can I send the linkrank or opic scoring into solr/hbase? - Original Message - From: Tejas Patil tejas.patil...@gmail.com To: user@nutch.apache.org Sent: Saturday, December 7, 2013 12:44:16 Subject: Re: Manipulating Nutch 2.2.1 scoring system Hi Vangelis, You can write your own implementation of scoring and make Nutch use it via a plugin. - Go through [0] to understand how to write a custom plugin - For scoring, your class should implement the ScoringFilter interface [1]. [0] : http://wiki.apache.org/nutch/WritingPluginExample [1] : http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/scoring/ScoringFilter.java Thanks, Tejas On Sat, Dec 7, 2013 at 4:32 AM, Talat UYARER talat.uya...@agmlab.com wrote: Hi Olle, I don't know of any working diagram for Nutch 1.x, but if you can make one, we will be glad. :) Talat On 06-12-2013 14:00, Olle Romo wrote: Hi Talat, I'm at an early stage of learning Nutch and your diagram is _very_ helpful. Would you happen to have a diagram for 1.x too? Or is there not much difference at the architecture level? Best, Olle On Dec 5, 2013, at 9:07 PM, Talat UYARER talat.uya...@agmlab.com wrote: Hi Vangelis, I drew a Nutch software architecture diagram. Maybe it can help you. https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/ edit?usp=sharing Talat On 05-12-2013 19:09, Vangelis karv wrote: It is clear that OPICScoringFilter does all the work of creating the scores for the urls. I just wanted to know if it is possible to implement another function for scoring and if there are any available on the Internet. Does anybody know where exactly in the code Nutch calls that function (Generator, Fetcher, Parser)?
Re: Cannot run program /bin/ls: java.io.IOException: error=11, Resource temporarily unavailable
java.io.IOException: error=11, Resource temporarily unavailable can be attributed to two reasons: Too many processes are already running ( http://stackoverflow.com/questions/8384000/java-io-ioexception-error-11) OR Too many files open ( http://stackoverflow.com/questions/15494749/java-lang-outofmemoryerror-unable-to-create-new-native-thread-for-big-data-set ) Coming to your load/environment: Nutch / Hadoop won't leave behind stale processes or open files which would pile up after running several rounds. After every round, it is expected to clean things up. To verify this, have a script that captures #live-processes and #open-file-handles periodically while Nutch is running (a one-liner for this is sketched after this thread). Thanks, Tejas On Sat, Dec 7, 2013 at 2:19 PM, Martin Aesch martin.ae...@googlemail.comwrote: Dear Jon, dear nutchers, thanks - do you by chance remember more about the background of that problem? I am using nutch-1.7 and currently have the same issue while parsing. Nutch-1.7 in pseudo-distributed mode, 32GB total, 768M per mapper/reducer task, 8G for hadoop, 2G for nutch. 6 mappers, max total in segment 10 URLs. According to the log, each URL takes 0-1 ms to be parsed. Suddenly, the 1-minute load of the machine goes up to 200 and higher; I could even see load averages (1/5/15) of 200/500/1200. But there is moderate CPU usage, low IO-wait and ~50 percent idle. Currently, I am running under the same conditions, but with only 10k URLs per segment. Up to now, for 30 generate-fetch-parse-update cycles, no problem. I am already a veteran of ulimit problems and set the values (ulimit -n: 25, ulimit -u 32) very high. Now I am out of ideas. Any ideas, suggestions? Cheers, Martin -Original Message- From: Jon Uhal jonu...@gmail.com Reply-to: user@nutch.apache.org To: user@nutch.apache.org Subject: Cannot run program /bin/ls: java.io.IOException: error=11, Resource temporarily unavailable Date: Wed, 20 Nov 2013 16:47:33 -0500 I just wanted to leave this here since it took me way too long to figure out. For some people, this might be an obvious problem, but since it wasn't to me, I want to make sure anyone else that gets this can have this answer. I kept getting the following error when I was running a crawl. For me, it was consistently happening, but I couldn't find any similar issues or solutions on the typical sites. The closest thing I could find was this: http://www.nosql.se/2011/10/hadoop-tasktracker-java-lang-outofmemoryerror/ Below is the error I was seeing. This is just one of several exceptions that would happen during the parse, but in the end the parse step would have too many errors and exceed the Nutch error limit.
13/11/20 20:14:19 INFO parse.ParseSegment: ParseSegment: segment: test/segments/20131120201240 13/11/20 20:14:20 INFO mapred.FileInputFormat: Total input paths to process : 2 13/11/20 20:14:21 INFO mapred.JobClient: Running job: job_201311202006_0017 13/11/20 20:14:22 INFO mapred.JobClient: map 0% reduce 0% 13/11/20 20:14:34 INFO mapred.JobClient: map 40% reduce 0% 13/11/20 20:14:36 INFO mapred.JobClient: map 50% reduce 0% 13/11/20 20:14:36 INFO mapred.JobClient: Task Id : attempt_201311202006_0017_m_01_0, Status : FAILED java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program /bin/ls: java.io.IOException: error=11, Resource temporarily unavailable at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at org.apache.hadoop.util.Shell.runCommand(Shell.java:200) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.util.Shell.execCommand(Shell.java:461) at org.apache.hadoop.util.Shell.execCommand(Shell.java:444) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:712) at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:448) at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:431) at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267) at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124) at org.apache.hadoop.mapred.Child$4.run(Child.java:260) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.io.IOException: java.io.IOException: error=11, Resource temporarily unavailable at java.lang.UNIXProcess.<init>(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:453) ... 15 more at
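A crude way to capture the counters suggested above (live process count and open file handles) while a crawl runs; lsof can itself be slow on a busy box, so treat this as a sketch:

while true; do echo "$(date) procs=$(ps -e --no-headers | wc -l) fds=$(lsof 2>/dev/null | wc -l)"; sleep 60; done >> nutch-load.log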
Re: Possible Bug Nutch 1.7 Crawl Script
Thanks for reporting that. I think this point was brought up over at [0] but was left off. Could you try this out and tell us if it works? SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1` (Presumably the extra columns of the long-format listing are what the sed/egrep pipeline trips over.) [0] : https://issues.apache.org/jira/browse/NUTCH-1087?focusedCommentId=13554353page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13554353 On Thu, Sep 5, 2013 at 5:18 AM, Scheffel, Aaron aaron.schef...@washpost.com wrote: The 1.7 crawl script has the following line 125 SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e s/ /\\n/g | egrep 20[0-9]+ | sort -n | tail -n 1` There is an ls -l there which is causing the script to behave badly (not work), at least on OS X. A simple ls seems to fix it. -Aaron
Re: nutch data from HBase to Oracle
This question is orthogonal to Nutch. You could write code that reads data from HBase and then writes it to Oracle. On Thu, Sep 5, 2013 at 10:51 AM, A Laxmi a.lakshmi...@gmail.com wrote: Since HBase was the only stable datastore option recommended for nutch 2.2.1, I wanted to know if it is possible to migrate/move crawled data stored in HBase to Oracle. The reason I am interested in having the data available in Oracle is that I want to utilize some of the features of the Oracle db, such as triggers, which HBase cannot offer. Please let me know if it is possible to move crawled data from HBase to Oracle?
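A minimal sketch of such a copy job, using the HBase 0.90-era client API and plain JDBC. The table name "webpage" matches the Nutch 2.x default, but the Oracle table, its columns, and the connection string are made-up placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseToOracle {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webpage");  // default Nutch 2.x table name
    Connection oracle = DriverManager.getConnection(
        "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");  // placeholder DSN
    PreparedStatement insert = oracle.prepareStatement(
        "INSERT INTO crawled_pages (row_key, raw_row) VALUES (?, ?)");  // hypothetical table
    ResultScanner scanner = table.getScanner(new Scan());
    try {
      for (Result row : scanner) {
        insert.setString(1, Bytes.toString(row.getRow()));
        // For a real migration you would pull specific columns out of the
        // Result instead of flattening the whole row to a string.
        insert.setString(2, row.toString());
        insert.executeUpdate();
      }
    } finally {
      scanner.close();
      table.close();
      oracle.close();
    }
  }
}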
Re: Reply: Aborting with 10 hung threads?
Could you check if urls make it to the crawldb through the inject operation? They might be getting filtered out by the regex urlfilter. Run your urls against it: bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter On Fri, Aug 30, 2013 at 1:42 AM, Jonathan.Wei 252637...@qq.com wrote: I ran bin/nutch readdb -stats. It returns this message: hadoop@nutch1:/data/projects/clusters/apache-nutch-2.2/runtime/local$ bin/nutch readdb -stats WebTable statistics start Statistics for WebTable: jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=449839104, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1146, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=914496, FILE_BYTES_WRITTEN=1036378}, File Output Format Counters ={BYTES_WRITTEN=98 TOTAL urls: 0 WebTable statistics: done jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=449839104, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1146, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=914496, FILE_BYTES_WRITTEN=1036378}, File Output Format Counters ={BYTES_WRITTEN=98 TOTAL urls: 0 Why are there 0 urls? The urls file has 216 urls! -- Original Message -- From: kaveh minooie [via Lucene] ml-node+s472066n408744...@n3.nabble.com Sent: Friday, August 30, 2013, 4:30 PM To: 基勇 252637...@qq.com Subject: Re: Aborting with 10 hung threads? so fetch does hang when there is nothing for it to fetch. the most likely thing that has happened here is that your inject command did not go through successfully. you can check it by looking into your hbase and seeing if the webpage table has been created and has values (the urls that you injected) in it. alternatively you can just run 'nutch readdb -stats' and see what you get. if there is nothing there, double-check your config files. On 08/30/2013 12:12 AM, Jonathan.Wei wrote: And I ran bin/nutch inject urls and bin/nutch generate -topN 250. I checked HBase: no data at all! Where is the problem?
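To test seed urls against the regex filter as suggested above, the filter reads urls from stdin and echoes each one back prefixed with + (accepted) or - (rejected); a quick check might look like this (the seed path is illustrative):

cat urls/seed.txt | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter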
Re: How nutch2.2 to parse rss?
AFAIK, the RSS plugin in 2.x isn't migrated; I mean its code was copied from the 1.x trunk and would need modifications to get things working with 2.x. That's why it was disabled in the build file. On Thu, Aug 29, 2013 at 6:34 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Jonathan, This has been a long outstanding issue IIRC. I have not used Nutch for feed crawling for a while if I am honest, and I honestly can't recall when and if I have done it with 2.x. You will see in [0] that by default the plugin is not actually initialized. So for starters you should uncomment the various targets within this file [0] to get it working and to have it cleaned up etc. You can then try building... but I have a feeling that it will not build. Please check on our Jira for issues related to this... there may be patches but I am not sure. Kiran did some work a while back IIRC on getting the following plugins to compile and run: <ant dir="feed" target="deploy"/> <ant dir="parse-ext" target="deploy"/> <ant dir="parse-swf" target="deploy"/> <ant dir="parse-zip" target="deploy"/> But there is more work to be done. Please keep us updated on this one. Sorry for the late reply. [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/build.xml On Thu, Aug 29, 2013 at 1:29 AM, Jonathan.Wei 252637...@qq.com wrote: Hello everybody! I want to use Nutch 2.2 to parse RSS! But Nutch 2.x is different from Nutch 1.x, so I don't know how to parse RSS! Can you help me? Using the crawl command to grab 24 URLs, the result suggests Aborting with 10 hung threads. The log content is: 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 466 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 461 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 455 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 450 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 444 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 439 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 434 0 kb/s, 13 URLs in 1 queues Aborting with 10 hung threads. What causes this? How can I fix it? Thank you! -- Lewis
Re: How nutch2.2 to parse rss?
The 1.x RSS plugin works after this jira ( https://issues.apache.org/jira/browse/NUTCH-1494). There is an open jira ( https://issues.apache.org/jira/browse/NUTCH-1515) for its 2.x counterpart. On Thu, Aug 29, 2013 at 8:10 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: yeah there is work to be done here for sure. there must be an issue open for this? On Thursday, August 29, 2013, Jonathan.Wei 252637...@qq.com wrote: Thanks! I'll try it! But I have a feeling that it will not build either, because some class files are not found in nutch2.2, for example ParseData and ParseResult! Thank you! -- Original Message -- From: lewis john mcgibbney [via Lucene] ml-node+s472066n4087394...@n3.nabble.com Sent: Friday, August 30, 2013, 9:34 AM To: 基勇 252637...@qq.com Subject: Re: How nutch2.2 to parse rss? Hi Jonathan, This has been a long outstanding issue IIRC. I have not used Nutch for feed crawling for a while if I am honest, and I honestly can't recall when and if I have done it with 2.x. You will see in [0] that by default the plugin is not actually initialized. So for starters you should uncomment the various targets within this file [0] to get it working and to have it cleaned up etc. You can then try building... but I have a feeling that it will not build. Please check on our Jira for issues related to this... there may be patches but I am not sure. Kiran did some work a while back IIRC on getting the following plugins to compile and run: <ant dir="feed" target="deploy"/> <ant dir="parse-ext" target="deploy"/> <ant dir="parse-swf" target="deploy"/> <ant dir="parse-zip" target="deploy"/> But there is more work to be done. Please keep us updated on this one. Sorry for the late reply. [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/build.xml On Thu, Aug 29, 2013 at 1:29 AM, Jonathan.Wei [hidden email] wrote: Hello everybody! I want to use Nutch 2.2 to parse RSS! But Nutch 2.x is different from Nutch 1.x, so I don't know how to parse RSS! Can you help me? Using the crawl command to grab 24 URLs, the result suggests Aborting with 10 hung threads. The log content is: 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 466 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 461 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 455 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 450 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 444 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 439 0 kb/s, 13 URLs in 1 queues 0/10 spinwaiting/active, 11 pages, 0 errors, 0.0 0 pages/s, 434 0 kb/s, 13 URLs in 1 queues Aborting with 10 hung threads. What causes this? How can I fix it? Thank you! -- Lewis
Re: Issues Running Nutch 1.7 in Eclipse-- Please Help
The logs say: Generator: 0 records selected for fetching, exiting ... Stopping at depth=1 - no more URLs to fetch. Please get the segment dump and analyse it for the outlinks extracted. Also check your filters. On Sun, Aug 18, 2013 at 8:02 PM, S.L simpleliving...@gmail.com wrote: Hello All, I am running Nutch 1.7 in Eclipse and I start out with the Crawl job with the following settings. Main Class: org.apache.nutch.crawl.Crawl Arguments: urls -dir crawl -depth 10 -topN 10 In the urls directory I have only one URL, http://www.ebay.com, and I expect the whole website to be crawled; however, I get the following log output and the crawl seems to stop after a few urls are fetched. I use the nutch-default.xml and have already set http.content.limit to -1 in it, as mentioned in the other message in this mailing list. However, the crawl stops after a few URLs are fetched; please see the log below and advise. I am running Eclipse on CentOS 6.4/Nutch 1.7 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. solrUrl is not set, indexing will be skipped... crawl started in: crawl rootUrlDir = urls threads = 10 depth = 10 solrUrl=null topN = 10 Injector: starting at 2013-08-18 22:48:45 Injector: crawlDb: crawl/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Injector: total number of urls rejected by filters: 1 Injector: total number of urls injected after normalization and filtering: 1 Injector: Merging injected urls into crawl db. Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02 Generator: starting at 2013-08-18 22:48:47 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 10 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130818224849 Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03 Fetcher: starting at 2013-08-18 22:48:51 Fetcher: segment: crawl/segments/20130818224849 Using queue mode : byHost Fetcher: threads: 10 Fetcher: time-out divisor: 2 QueueFeeder finished: total 1 records + hit by time limit :0 Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Fetcher: throughput threshold: -1 Fetcher: throughput threshold retries: 5 fetching http://www.ebay.com/ (queue crawl delay=5000ms) -finishing thread FetcherThread, activeThreads=8 -finishing thread FetcherThread, activeThreads=8 -finishing thread FetcherThread, activeThreads=2 -finishing thread FetcherThread, activeThreads=3 -finishing thread FetcherThread, activeThreads=4 -finishing thread FetcherThread, activeThreads=5 -finishing thread FetcherThread, activeThreads=6 -finishing thread FetcherThread, activeThreads=7 -finishing thread FetcherThread, activeThreads=1 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05 ParseSegment: starting at 2013-08-18 22:48:56 ParseSegment: segment: crawl/segments/20130818224849 Parsed (15ms):http://www.ebay.com/ ParseSegment: finished at 2013-08-18 22:48:57, elapsed: 00:00:01 CrawlDb update: starting at 2013-08-18 22:48:57 CrawlDb update: db: crawl/crawldb CrawlDb update: segments: [crawl/segments/20130818224849] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: 404 purging: false CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2013-08-18 22:48:58, elapsed: 00:00:01 Generator: starting at 2013-08-18 22:48:58 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 10 Generator: jobtracker is 'local', generating exactly one partition. Generator: 0 records selected for fetching, exiting ... Stopping at depth=1 - no more URLs to fetch. LinkDb: starting at 2013-08-18 22:48:59 LinkDb: linkdb: crawl/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: internal links will be ignored. LinkDb: adding segment:
Re: Issues Running Nutch 1.7 in Eclipse-- Please Help
As I said earlier, take a dump of the segment. Use the dump option of the command here (http://wiki.apache.org/nutch/bin/nutch_readseg). Once that's done, in the output file, look for the entry for the parent link (http://www.ebay.com/) and check its status, what was crawled, what was parsed, and whether any outlinks were extracted. On Mon, Aug 19, 2013 at 2:22 PM, S.L simpleliving...@gmail.com wrote: How can I analyze the segment dump from Nutch? It has a number of folders, it seems. Can you please let me know which specific folder in the segments folder I need to look into? Also, the index and the data files are not exactly text files, so it is hard to make any sense out of them. I am using the default regex-filter that comes with Nutch 1.7; I have not changed that. Thank You. On Mon, Aug 19, 2013 at 4:07 AM, Tejas Patil tejas.patil...@gmail.com wrote: The logs say: Generator: 0 records selected for fetching, exiting ... Stopping at depth=1 - no more URLs to fetch. Please get the segment dump and analyse it for the outlinks extracted. Also check your filters. On Sun, Aug 18, 2013 at 8:02 PM, S.L simpleliving...@gmail.com wrote: Hello All, I am running Nutch 1.7 in Eclipse and I start out with the Crawl job with the following settings. Main Class: org.apache.nutch.crawl.Crawl Arguments: urls -dir crawl -depth 10 -topN 10 In the urls directory I have only one URL, http://www.ebay.com, and I expect the whole website to be crawled; however, I get the following log output and the crawl seems to stop after a few urls are fetched. I use the nutch-default.xml and have already set http.content.limit to -1 in it, as mentioned in the other message in this mailing list. However, the crawl stops after a few URLs are fetched; please see the log below and advise. I am running Eclipse on CentOS 6.4/Nutch 1.7 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. solrUrl is not set, indexing will be skipped... crawl started in: crawl rootUrlDir = urls threads = 10 depth = 10 solrUrl=null topN = 10 Injector: starting at 2013-08-18 22:48:45 Injector: crawlDb: crawl/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Injector: total number of urls rejected by filters: 1 Injector: total number of urls injected after normalization and filtering: 1 Injector: Merging injected urls into crawl db. Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02 Generator: starting at 2013-08-18 22:48:47 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 10 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130818224849 Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03 Fetcher: starting at 2013-08-18 22:48:51 Fetcher: segment: crawl/segments/20130818224849 Using queue mode : byHost Fetcher: threads: 10 Fetcher: time-out divisor: 2 QueueFeeder finished: total 1 records + hit by time limit :0 Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Using queue mode : byHost Fetcher: throughput threshold: -1 Fetcher: throughput threshold retries: 5 fetching http://www.ebay.com/ (queue crawl delay=5000ms) -finishing thread FetcherThread, activeThreads=8 -finishing thread FetcherThread, activeThreads=8 -finishing thread FetcherThread, activeThreads=2 -finishing thread FetcherThread, activeThreads=3 -finishing thread FetcherThread, activeThreads=4 -finishing thread FetcherThread, activeThreads=5 -finishing thread FetcherThread, activeThreads=6 -finishing thread FetcherThread, activeThreads=7 -finishing thread FetcherThread, activeThreads=1 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05 ParseSegment: starting at 2013-08-18 22:48:56 ParseSegment: segment: crawl/segments
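Concretely, for the segment named in the log above, the dump and a look at the seed's entry might go like this (SegmentReader writes a plain-text file named dump into the output directory in local mode; adjust if your version differs):

bin/nutch readseg -dump crawl/segments/20130818224849 segdump
grep -A 20 "http://www.ebay.com/" segdump/dump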
Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404
Hi Lewis, Can you try the patch attached over here: https://issues.apache.org/jira/browse/NUTCH-1483 Thanks, Tejas On Tue, Aug 6, 2013 at 7:24 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Now using Nutch trunk 1.8-SNAPSHOT HEAD Back at this tonight. When attempting to fetch file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice two slashes) which contains loads of HTML files, I get the error as below. Fetcher: throughput threshold retries: 5 -finishing thread FetcherThread, activeThreads=1 org.apache.nutch.protocol.file.FileError: File Error: 404 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703) fetch of file://home/law/Downloads/asf/solr-4.3.1/example/e001 failed with: org.apache.nutch.protocol.file.FileError: File Error: 404 -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: finished at 2013-08-06 18:59:00, elapsed: 00:00:02 I then deleted the crawldb and changed the seed URL to file:/home/law/Downloads/asf/solr-4.3.1/example/e001 (notice one slash) But when I eventually get fetching after a few rounds of generate, fetch, parse, updatedb, I am landed with fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html (queue crawl delay=500ms) org.apache.nutch.protocol.file.FileError: File Error: 404 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703) fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404 fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html (queue crawl delay=500ms) org.apache.nutch.protocol.file.FileError: File Error: 404 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703) fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404 Same as before... this happens with every single URL in the directory I am trying to crawl. Any advice here please? Thanks Lewis -- Lewis
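For context on the slashes: in a file: URI, the component between the second and third slash is an authority (host) field, so the two forms in this thread parse quite differently:

file://home/law/Downloads/...  -> authority "home", path "/law/Downloads/..."
file:/home/law/Downloads/...   -> no authority, path "/home/law/Downloads/..."

The canonical local form is file:///home/law/Downloads/... (empty authority), which is why the two-slash form cannot resolve against the local filesystem.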
Re: Way to fetch only new sites
Nutch 2.1 officially had support for MySQL as a datastore. There were a lot of issues reported with MySQL, and so in the newer versions, i.e. 2.2.X, the MySQL support is removed. I would recommend using HBase as it's the most stable backend amongst all the supported ones. On Thu, Aug 1, 2013 at 7:01 AM, Jayadeep Reddy jayad...@ehealthaccess.comwrote: Thank you Julien, Will get hbase and try to crawl. On Thu, Aug 1, 2013 at 7:10 PM, A Laxmi a.lakshmi...@gmail.com wrote: Julien - whatever you are saying about Nutch 2.x and SQL - does it apply to the recent release 2.2.1 as well? On Thu, Aug 1, 2013 at 9:38 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: If you are using Nutch 2.x then you are actually accessing the SQL storage via Apache GORA. The SQL backend in GORA does not work and it is not advised to use it. If you want to use Nutch 2 then use a different backend like HBase or Cassandra, or use Nutch 1.x On 1 August 2013 14:32, Jayadeep Reddy jayad...@ehealthaccess.com wrote: No Julien, using MySQL On Thu, Aug 1, 2013 at 7:00 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: What GORA backend are you using? On 1 August 2013 14:03, Jayadeep Reddy jayad...@ehealthaccess.com wrote: I am using Nutch 2.1. Every time I run a crawl from the dmoz directory, my existing crawled pages in the database are fetched again (taking a long time). Is there a way to crawl only new sites? Thank you -- Jayadeep Reddy.S, M.D C.E.O e Health Access Pvt.Ltd www.ehealthaccess.com Hyderabad-Chennai-Banglore http://www.youtube.com/watch?v=0k5LX8mw6Sk -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: Duplicate Fetches for Fetch Job
1.x has speculative execution turned off (Fetcher.java:1328: job.setSpeculativeExecution(false);) but 2.x doesn't. It makes sense to do the same there; I don't see any good reason not to have it in 2.x. Could you open a jira for this and upload a patch? On Wed, Jul 24, 2013 at 11:40 PM, Talat UYARER talat.uya...@agmlab.comwrote: Hi, We are using Nutch for high-volume crawls. We noticed that the FetcherJob ReduceTask fetches some websites multiple times for long-lasting queues. I have discovered that the reason for this is the mapred.reduce.tasks.speculative.execution setting in Hadoop, which is true by default. I suggest this value should be false for FetcherJob. What do you think? Talat
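A sketch of the proposed 2.x fix, placed wherever FetcherJob sets up its Hadoop job (the variable name currentJob is illustrative); setting the property by name is equivalent to the setSpeculativeExecution(false) call quoted from 1.x:

// disable speculative execution for the fetch reducers, mirroring 1.x
currentJob.getConfiguration().setBoolean(
    "mapred.reduce.tasks.speculative.execution", false);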
Re: Nutch 2.2.1 - scripts crawl and nutch
bin/nutch: lets you run individual Nutch commands separately. bin/crawl: calls the bin/nutch script with the commands required for a typical crawl cycle. This makes life easy for users, as you can run a crawl without knowing the internal phases (and thus commands) of Nutch. On Fri, Jul 12, 2013 at 8:09 AM, A Laxmi a.lakshmi...@gmail.com wrote: Hello, I have installed Nutch 2.2.1 without any issues. However, I found two scripts, crawl and nutch, instead of the single nutch script of earlier releases. Could anyone tell me why we have two scripts? What is the advantage of using one over the other? Thanks for your help!
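For illustration, the two styles side by side; the argument order of bin/crawl below follows the 2.2.1 script, but double-check it against the usage message the script prints, and the seed dir, crawl id and Solr URL are placeholders:

bin/crawl urls/ crawl1 http://localhost:8983/solr/ 2

bin/nutch inject urls/
bin/nutch generate -topN 50
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

The second block runs one generate-fetch-parse-update round by hand; the crawl script simply loops these phases for you.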
Re: Nutch scalability tests
The second run still shows 1 reducer running, although it shows as 100% complete, so my thought is that it is writing out to disk, though it has been about 30+ minutes. This one reducer's log on the jobtracker, however, is empty. This is weird. There can be an explanation for the first part: the data crawled was large, so dumping would take a lot of time. But as you said there were very few urls, so it should not take 30+ minutes unless you crawled some super large files. Have you checked the job attempts for the job? If there are no logs there then there is something weird going on with your cluster. On Wed, Jul 3, 2013 at 8:32 AM, h b hb6...@gmail.com wrote: oh and yes, generate.max.count is set to 5000 On Wed, Jul 3, 2013 at 8:29 AM, h b hb6...@gmail.com wrote: I dropped my webpage database, restarted with 5 seed urls. The first fetch completed in a few seconds. The second run still shows 1 reducer running, although it shows as 100% complete, so my thought is that it is writing out to disk, though it has been about 30+ minutes. Again, I had 80 reducers. When I look at the logs of these reducers in the Hadoop jobtracker, I see 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues in all of them, which leads me to think that the 79 completed reducers actually fetched nothing, which might explain why this one stuck reducer is working so hard. This may be expected, since I am crawling a single domain. This one reducer's log on the jobtracker, however, is empty. Don't know what to make of that. On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, On Tue, Jul 2, 2013 at 3:53 PM, h b hb6...@gmail.com wrote: So, I tried this with the generate.max.count property set to 5000, rebuilt (ant; ant jar; ant job) and reran fetch. It still appears the same: the first 79 reducers zip through and the last one is crawling, literally... Sorry, I should have been more explicit. This property does not directly affect fetching. It is used when GENERATING fetch lists. Meaning that it needs to be present and acknowledged at the generate phase... before fetching is executed. Besides this, is there any progress being made at all on the last reduce? If you look at your CPU (and heap) for the box this is running on, it is usual to notice high levels for both of these respectively. Maybe this output writer is just taking a good while to write data down to HDFS... assuming you are using 1.x. As for the logs, I mentioned on one of my earlier threads that when I run from the deploy directory, I am not getting any logs generated. I looked for the logs directory under local as well as under deploy, and just to make sure, also in the grid. I do not see the logs directory. So I created it manually under deploy before starting fetch, and still there is nothing in this directory. OK, so when you run Nutch as a deployed job, your logs are present within $HADOOP_LOG_DIR... you can check some logs on the JobTracker WebApp, e.g. you will be able to see the reduce tasks for the fetch job and you will also be able to see varying snippets or all of the log here.
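Some context on why one reducer does all the work here: the fetcher partitions URLs by host (byHost queue mode), so a single-domain crawl funnels every URL into one queue. generate.max.count caps how many URLs per host (or per domain, depending on generate.count.mode) go into one fetch list; it is set in conf/nutch-site.xml, e.g.:

<property>
  <name>generate.max.count</name>
  <value>5000</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>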
Re: Nutch scalability tests
The steps you performed are right. Did you get the log for that one hardworking reducer? It would hint at why the job took so long. Ideally you should get logs for every job and its attempts. If you cannot get the log for that reducer, then I feel that your cluster is having some problem and this needs to be addressed. On Wed, Jul 3, 2013 at 8:47 AM, h b hb6...@gmail.com wrote: Hi Tejas, looks like we were typing at the same time. So anyway, my job ended fine. Just to be sure what I am doing is right, I have cleared the db and started another round again. If I stumble again, I will respond back on this thread. On Wed, Jul 3, 2013 at 8:43 AM, Tejas Patil tejas.patil...@gmail.com wrote: The second run still shows 1 reducer running, although it shows as 100% complete, so my thought is that it is writing out to disk, though it has been about 30+ minutes. This one reducer's log on the jobtracker, however, is empty. This is weird. There can be an explanation for the first part: the data crawled was large, so dumping would take a lot of time. But as you said there were very few urls, so it should not take 30+ minutes unless you crawled some super large files. Have you checked the job attempts for the job? If there are no logs there then there is something weird going on with your cluster. On Wed, Jul 3, 2013 at 8:32 AM, h b hb6...@gmail.com wrote: oh and yes, generate.max.count is set to 5000 On Wed, Jul 3, 2013 at 8:29 AM, h b hb6...@gmail.com wrote: I dropped my webpage database, restarted with 5 seed urls. The first fetch completed in a few seconds. The second run still shows 1 reducer running, although it shows as 100% complete, so my thought is that it is writing out to disk, though it has been about 30+ minutes. Again, I had 80 reducers. When I look at the logs of these reducers in the Hadoop jobtracker, I see 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues in all of them, which leads me to think that the 79 completed reducers actually fetched nothing, which might explain why this one stuck reducer is working so hard. This may be expected, since I am crawling a single domain. This one reducer's log on the jobtracker, however, is empty. Don't know what to make of that. On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, On Tue, Jul 2, 2013 at 3:53 PM, h b hb6...@gmail.com wrote: So, I tried this with the generate.max.count property set to 5000, rebuilt (ant; ant jar; ant job) and reran fetch. It still appears the same: the first 79 reducers zip through and the last one is crawling, literally... Sorry, I should have been more explicit. This property does not directly affect fetching. It is used when GENERATING fetch lists. Meaning that it needs to be present and acknowledged at the generate phase... before fetching is executed. Besides this, is there any progress being made at all on the last reduce? If you look at your CPU (and heap) for the box this is running on, it is usual to notice high levels for both of these respectively. Maybe this output writer is just taking a good while to write data down to HDFS... assuming you are using 1.x. As for the logs, I mentioned on one of my earlier threads that when I run from the deploy directory, I am not getting any logs generated. I looked for the logs directory under local as well as under deploy, and just to make sure, also in the grid. I do not see the logs directory.
So I created it manually under deploy before starting fetch, and still there is nothing in this directory. OK, so when you run Nutch as a deployed job, your logs are present within $HADOOP_LOG_DIR... you can check some logs on the JobTracker WebApp, e.g. you will be able to see the reduce tasks for the fetch job and you will also be able to see varying snippets or all of the log there.
Re: Integration of Apache-nutch and eclipse.
Have you looked at http://wiki.apache.org/nutch/RunNutchInEclipse ? It has recently been updated and has worked for several people on the user group. It has some cool screenshots which should make setting up Nutch with Eclipse easy. On Wed, Jul 3, 2013 at 12:39 AM, Ramakrishna ramakrishna...@dioxe.comwrote: Guys, I'm extremely sorry for posting/asking the same question again. Even after reading many documents, I didn't get how to integrate Nutch and Eclipse. I have apache-nutch-2.2 and eclipse-juno. Please tell me step by step how to integrate Eclipse and Nutch, with documentation. If possible, please send me screenshots of every step from the beginning (File - New - Java Project). Also, I'm working on Windows. My colleague did a simple project without using cygwin/svn, so please don't tell me again to use those tools and don't send any links again; I am already fed up with those things. Thanks in advance.
Re: a plugin extending IndexWriter
From the info you gave, it's hard to tell. Can you look at src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java and compare? Looking at an indexing plugin that works will help you figure out if you missed something. Also, had you added the entry for your plugin into the plugin.includes property in nutch-site.xml before running? The default value is: <property> <name>plugin.includes</name> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value> <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. </description> </property> On Mon, Jul 1, 2013 at 6:59 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: I wrote a plugin implementing IndexWriter. It compiles and creates the jar as requested. However, Nutch does not manage to find my IndexWriter in its constructor: this.indexWriters = (IndexWriter[]) objectCache.getObject(IndexWriter.class.getName()); After this call the indexWriters field is null... Where should I define that? Best regards Benjamin
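For a custom IndexWriter, the plugin's id (the made-up indexer-custom below) has to be appended to that plugin.includes value in nutch-site.xml, and its plugin.xml must declare the IndexWriter extension point, mirroring indexer-solr; a sketch:

<!-- nutch-site.xml: add (or substitute) your plugin id in the value -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-custom|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- src/plugin/indexer-custom/plugin.xml -->
<extension id="org.example.indexer.custom" name="Custom Index Writer"
           point="org.apache.nutch.indexer.IndexWriter">
  <implementation id="CustomIndexWriter" class="org.example.indexer.CustomIndexWriter"/>
</extension>

Without the plugin.includes entry the PluginRepository never loads the extension, which would leave the indexWriters lookup empty, matching the symptom described above.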
Re: Updating the documentation for crawl via 2.x
I think that the wiki page was made with the intention that users already knew about 1.x and would now be switching to 2.x, so it had only the gora and datastore setup steps. I agree with you that it should contain a complete set of steps. @dev: Unless there is any objection or better suggestion, I will get this done in the coming days. On Sun, Jun 30, 2013 at 4:14 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi I think we should update the documentation of the crawl instructions. Currently, the instructions stop at the inject step, and we are supposed to follow the instructions for Nutch 1.x. However, in these instructions the syntax is quite different. For example: bin/nutch generate does not expect crawldb and segments paths etc... I think an update would be very useful. Benjamin
Re: Crawl in Nutch2.2
I think that you are hitting something that one of the users faced a few days back. Can you try the things mentioned here: http://mail-archives.apache.org/mod_mbox/nutch-user/201306.mbox/%3CCAFKhtFwPozH3dokk%2B_bZKqVT81h86aCpQzbL4rR4U3wZ-%2BOmHg%40mail.gmail.com%3E On Sun, Jun 30, 2013 at 5:10 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Thanks a lot for your help; however, I still did not resolve this issue... I attach here the logs after 2 rounds of generate/fetch/parse/updatedb; the DB still contains only the seed urls, nothing more... On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Try each step with a crawlId and see if this provides you with better results. Unless you truncated all data between Nutch tasks, you should be seeing more data in HBase. As Tejas asked... what do the logs say? On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi Lewis, Thanks for your reply. I just set the value: gora.datastore.default=org.apache.gora.hbase.store.HBaseStore I already removed the HBase table in the past. Could that be a cause? Benjamin On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Have you changed from the default MemStore gora storage to something else? On Tuesday, June 25, 2013, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Thanks Tejas. Yes, I checked the logs and no error appears in them. I left http.content.limit and parser.html.impl at their default values... Benjamin On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil tejas.patil...@gmail.com wrote: Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any exception or error messages? Also you might have a look at these configs in nutch-site.xml (default values are in nutch-default.xml): http.content.limit and parser.html.impl On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hello I installed Nutch 2.2 on my linux machine. I defined the seed directory with one file containing: http://en.wikipedia.org/ http://edition.cnn.com/ I ran the following: sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/ After this step, the call -bash-4.1$ sh bin/nutch readdb -stats returns TOTAL urls: 2 status 0 (null):2 avg score: 1.0 Then, I ran the following: bin/nutch generate -topN 10 bin/nutch fetch -all bin/nutch parse -all bin/nutch updatedb bin/nutch generate -topN 1000 bin/nutch fetch -all bin/nutch parse -all bin/nutch updatedb However, the stats call after these steps is still: -bash-4.1$ sh bin/nutch readdb -stats status 5 (status_redir_perm): 1 max score: 2.0 TOTAL urls: 3 avg score: 1.334 Only 3 urls?! What am I missing? thanks Benjamin -- Lewis
Re: nutch2.x in cluster mode ?
I have never used 2.x on prod but this is what I would do: The datastore backend needs to be set up on the cluster. Hadoop must be installed as well. Export all relevant environment variables. The Nutch 2.x source must be downloaded to the master node. Then modify the required configs and run 'ant runtime' to create the Nutch binaries inside NUTCH_HOME/runtime/deploy. Trigger the crawl command from the master node. On Sun, Jun 30, 2013 at 1:16 AM, Tony Mullins tonymullins...@gmail.comwrote: Tejas, that [0] is for Nutch 1.x, which uses HDFS for its data storage, whereas the new Nutch 2.x uses HBase as a backend, which is itself based on Hadoop (HDFS). If I deploy my HBase in cluster mode on 3 different nodes, do I still need to deploy Nutch 2.x on those 3 nodes as well? Could you please add a little more information on nutch2.x + hbase + hadoop? Regards, Khan [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial On Sat, Jun 29, 2013 at 10:50 PM, Tejas Patil tejas.patil...@gmail.com wrote: On Sat, Jun 29, 2013 at 10:36 AM, imran khan imrankhan.x...@gmail.com wrote: Greetings, Is there any guide for setting up nutch2.x in cluster mode ? [0] is a relevant wiki page .. which has not been updated for a long time. I am guessing that you have already tried running in local mode as given in [1]. For cluster mode, have Hadoop 1.2.0 set up and its variables exported, set Nutch configs as per your requirements, run 'ant' and then run nutch commands from $NUTCH_HOME/runtime/deploy And which versions of Hadoop and HBase work well with Nutch 2.x in cluster mode? Use Nutch 2.2 and HBase 0.90.x Regards, Khan [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
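A sketch of that sequence on the master node; the paths are illustrative, and the build target that produces the deployable job artifacts is ant runtime:

export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
cd $NUTCH_HOME
ant runtime
cd runtime/deploy
bin/nutch inject urls/    # now runs as a MapReduce job on the cluster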
Re: Questions/issues with nutch
I am curious to know why you needed the raw html content instead of the parsed text. Search engines are meant to index parsed text, and the amount of data to be stored and indexed shrinks after parsing. On Sat, Jun 29, 2013 at 9:20 PM, h b hb6...@gmail.com wrote: Thanks Tejas, I have just 2 urls in my seed file, and the second run of fetch ran for a few hours. I will verify if I got what I wanted. Regarding the raw html, it's an ugly hack, so I did not really create a patch. But this is what I did in the getParse method of src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java: //text = sb.toString(); text = new String(page.getContent().array()); It would be nice to make this a configuration option in the plugin xml. Another thing I will try soon is to extract the content only up to a specific depth. On Sat, Jun 29, 2013 at 12:49 AM, Tejas Patil tejas.patil...@gmail.com wrote: Yes. Nutch would parse the HTML and extract the content out of it. Tweaking around the code surrounding the parser would have made that happen. If you did something else, would you mind sharing it? The depth is used by the Crawl class in 1.x, which is deprecated in 2.x. Use bin/crawl instead. While running the bin/crawl script, the numberOfRounds option is nothing but the depth till which you want the crawling to be performed. If you want to use the individual commands instead, run generate - fetch - parse - update multiple times. The crawl script internally does the same thing. e.g. if you want to fetch till depth 3, this is how you could do it: inject - (generate - fetch - parse - update) - (generate - fetch - parse - update) - (generate - fetch - parse - update) - solrindex On Fri, Jun 28, 2013 at 7:24 PM, h b hb6...@gmail.com wrote: Ok, I tweaked the code a bit to extract the html as is from the parser, only to realize that it is too much text and too much depth of crawling. So I am looking to see if I can somehow limit the depth. The Nutch 1.x docs mention the -depth parameter. However, I do not see this in the nutch-default.xml under Nutch 2.x. The -topN is used for the number of links per depth. So for Nutch 2.x, where/how do I set the depth? On Fri, Jun 28, 2013 at 11:32 AM, h b hb6...@gmail.com wrote: Ok, so I also got this to work with Solr 4 with no errors; I think the key was not using a crawl id. I had to comment out the updateLog in solrconfig.xml because I got some _version_ related error. My next question is: my solr document, or for that matter even the hbase value of the html content, is not html. It appears that Nutch is extracting text only. How do I retain the html content as is? On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil tejas.patil...@gmail.com wrote: Kewl !! I wonder why org.apache.solr.common.SolrException: undefined field text happens.. Can anybody throw light on this? On Fri, Jun 28, 2013 at 10:45 AM, h b hb6...@gmail.com wrote: Thanks Tejas, I tried these steps. One step I added was updatedb: bin/nutch updatedb Just to be consistent with the doc, and your suggestion on some other thread, I used Solr 3.6 instead of 4.x. I copied the schema.xml from nutch/conf (root level) and started Solr. It failed with SEVERE: org.apache.solr.common.SolrException: undefined field text One of the Google threads suggested I ignore this error, so I ignored it and indexed anyway. So now I got it to work. Playing some more with the queries On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil tejas.patil...@gmail.com wrote: The storage.schema.webpage seems messed up but I don't have ample time now to look into it.
Here is what I would suggest to get things working: [1] Remove all the old data from HBase (I assume that HBase is running while you do this): cd $HBASE_HOME ./bin/hbase shell In the HBase shell, use list to see all the tables, and delete all of those related to Nutch (ones named *webpage). Remove them using the disable and drop commands, e.g. if one of the tables is webpage, you would run: disable 'webpage' drop 'webpage' [2] Run crawl I assume that you have not changed storage.schema.webpage in nutch-site.xml and nutch-default.xml. If yes, revert it to: <property> <name>storage.schema.webpage</name> <value>webpage</value> <description>This value holds the schema name used for Nutch web db. Note that Nutch ignores the value in the gora mapping files, and uses this as the webpage schema name. </description>
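Returning to the raw-HTML hack discussed above: a less intrusive variant would gate it behind a config property read inside HtmlParser's getParse. The property name parser.html.raw below is made up, not a real Nutch config, and the charset handling is simplified (the parser normally converts using the detected encoding):

// Sketch only: inside HtmlParser.getParse(), where the plain text is produced.
// "parser.html.raw" is a hypothetical property name.
boolean keepRawHtml = getConf().getBoolean("parser.html.raw", false);
String text = keepRawHtml
    ? new String(page.getContent().array(), java.nio.charset.StandardCharsets.UTF_8)
    : sb.toString();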
Re: How to use Nutch 2.2.1 with Solr
On Sun, Jun 30, 2013 at 5:57 AM, Hung Nguyen Dang nguyendanghung2...@gmail.com wrote: Hello, I'm new to Nutch, and I find the documentation confusing. Could you please point me to a basic document showing step by step how to configure and run Nutch? Do we need to configure: 1. Hadoop, http://hadoop.apache.org/docs/stable/single_node_setup.html http://hadoop.apache.org/docs/stable/cluster_setup.html 2. HBase http://hbase.apache.org/book/quickstart.html 3. Gora https://wiki.apache.org/nutch/Nutch2Tutorial to run Nutch? Thanks, Nguyen Dang Hung
Re: Questions/issues with nutch
Yes. Nutch would parse the HTML and extract the content out of it. Tweaking around the code surrounding the parser would have made that happen. If you did something else, would you mind sharing it? The depth is used by the Crawl class in 1.x, which is deprecated in 2.x. Use bin/crawl instead. While running the bin/crawl script, the numberOfRounds option is nothing but the depth till which you want the crawling to be performed. If you want to use the individual commands instead, run generate - fetch - parse - update multiple times. The crawl script internally does the same thing. e.g. if you want to fetch till depth 3, this is how you could do it: inject - (generate - fetch - parse - update) - (generate - fetch - parse - update) - (generate - fetch - parse - update) - solrindex On Fri, Jun 28, 2013 at 7:24 PM, h b hb6...@gmail.com wrote: Ok, I tweaked the code a bit to extract the html as is from the parser, only to realize that it is too much text and too much depth of crawling. So I am looking to see if I can somehow limit the depth. The Nutch 1.x docs mention the -depth parameter. However, I do not see this in the nutch-default.xml under Nutch 2.x. The -topN is used for the number of links per depth. So for Nutch 2.x, where/how do I set the depth? On Fri, Jun 28, 2013 at 11:32 AM, h b hb6...@gmail.com wrote: Ok, so I also got this to work with Solr 4 with no errors; I think the key was not using a crawl id. I had to comment out the updateLog in solrconfig.xml because I got some _version_ related error. My next question is: my solr document, or for that matter even the hbase value of the html content, is not html. It appears that Nutch is extracting text only. How do I retain the html content as is? On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil tejas.patil...@gmail.com wrote: Kewl !! I wonder why org.apache.solr.common.SolrException: undefined field text happens.. Can anybody throw light on this? On Fri, Jun 28, 2013 at 10:45 AM, h b hb6...@gmail.com wrote: Thanks Tejas, I tried these steps. One step I added was updatedb: bin/nutch updatedb Just to be consistent with the doc, and your suggestion on some other thread, I used Solr 3.6 instead of 4.x. I copied the schema.xml from nutch/conf (root level) and started Solr. It failed with SEVERE: org.apache.solr.common.SolrException: undefined field text One of the Google threads suggested I ignore this error, so I ignored it and indexed anyway. So now I got it to work. Playing some more with the queries On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil tejas.patil...@gmail.com wrote: The storage.schema.webpage seems messed up but I don't have ample time now to look into it. Here is what I would suggest to get things working: [1] Remove all the old data from HBase (I assume that HBase is running while you do this): cd $HBASE_HOME ./bin/hbase shell In the HBase shell, use list to see all the tables, and delete all of those related to Nutch (ones named *webpage). Remove them using the disable and drop commands, e.g. if one of the tables is webpage, you would run: disable 'webpage' drop 'webpage' [2] Run crawl I assume that you have not changed storage.schema.webpage in nutch-site.xml.
If yes, revert it to: <property> <name>storage.schema.webpage</name> <value>webpage</value> <description>This value holds the schema name used for Nutch web db. Note that Nutch ignores the value in the gora mapping files, and uses this as the webpage schema name. </description> </property> Run crawl commands: bin/nutch inject urls/ bin/nutch generate -topN 5 -noFilter -adddays 0 bin/nutch fetch -all -threads 5 bin/nutch parse -all [3] Perform indexing I assume that you have Solr setup and NUTCH_HOME/conf/schema.xml copied in ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0] for details. Start solr and run the indexing command: bin/nutch solrindex $SOLR_URL -all [0] : http://wiki.apache.org/nutch/NutchTutorial Thanks, Tejas On Thu, Jun 27, 2013 at 1:47 PM, h b hb6...@gmail.com wrote: Ok, so avro did not work quite well for me, I got a test grid with hbase, and I started using that for now. All steps ran without errors and I see my crawled doc in hbase. However, after running the solr integration, and querying solr, I get back nothing. Index files look very tiny. The one thing I noted is a message during almost every step 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs
Re: nutch2.x in cluster mode ?
On Sat, Jun 29, 2013 at 10:36 AM, imran khan imrankhan.x...@gmail.com wrote: Greetings, Is there any guide for setting up nutch2.x in cluster mode ? [0] is a relevant wiki page .. which has not been updated in a long time. I am guessing that you have already tried running in local mode as given in [1]. For cluster mode, have a hadoop 1.2.0 setup with its variables exported, set the nutch configs as per your requirements, run 'ant' and then run the nutch commands from $NUTCH_HOME/runtime/deploy (see the sketch after this thread). And which versions of hadoop / nutch2.x / hbase work well in cluster mode ? Use Nutch 2.2 and HBase 0.90.x Regards, Khan [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
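A rough sketch of that cluster-mode workflow, assuming Hadoop 1.2.0 is installed (the install path is hypothetical) and $NUTCH_HOME points at a 2.x checkout:
  export HADOOP_HOME=/opt/hadoop-1.2.0     # hypothetical install path
  export PATH=$HADOOP_HOME/bin:$PATH
  cd $NUTCH_HOME
  ant                                      # builds runtime/deploy with the job jar
  cd runtime/deploy
  bin/nutch inject urls/                   # now submitted as a MapReduce job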
Re: NUTCH, SOLR and HBase integration
Try using Solr 3.x and follow the steps (4-6) given in http://wiki.apache.org/nutch/NutchTutorial (a sketch of those steps follows this thread). On Thu, Jun 27, 2013 at 11:39 PM, Mariam Salloum mariam.sall...@gmail.com wrote: Hi Tejas, Thanks for your response. I'm using the latest version, solr-4.3.1. On Jun 27, 2013, at 11:10 PM, Tejas Patil tejas.patil...@gmail.com wrote: Which version of SOLR are you using ? It should go well with Solr 3.x http://wiki.apache.org/nutch/NutchTutorial On Thu, Jun 27, 2013 at 11:02 PM, Mariam Salloum mariam.sall...@gmail.com wrote: I'm having problems with integrating SOLR and NUTCH. I have done the following: 1 - Installed/configured NUTCH, SOLR, and HBase. 2 - The crawl script did not work for me, so I'm using the step-by-step commands. 3 - I ran inject, generate, fetch, and parse and all ran successfully. I'm able to see the table in HBase and see the fetch and parse flags set for the entries. 4 - I copied the /conf/schema.xml from the Nutch directory into the SOLR config directory and verified it's using the right schema.xml file. 5 - I made sure that I updated schema.xml to set the indexed and stored properties to true: <field name="content" type="text" stored="true" indexed="true"/> 6 - Finally, I started SOLR and tried running bin/nutch solrindex … SOLR runs without errors (checked the solr.log). However, nothing is loaded to SOLR. It states that the number of documents loaded is 0, and the query *:* returns nothing. What could be the problem? Any ideas will be appreciated. Thanks Mariam
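A sketch of steps 4-6 of that tutorial, assuming Solr 3.6 unpacked at $SOLR_HOME and a 2.x Nutch checkout at $NUTCH_HOME:
  cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/
  cd $SOLR_HOME/example && java -jar start.jar &   # start the example Solr
  cd $NUTCH_HOME/runtime/local
  bin/nutch solrindex http://localhost:8983/solr/ -all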
Re: Questions/issues with nutch
directory structure. Make changes to conf/nutch-site.xml, build the job jar, navigate to runtime/deploy, run the code. It's easier to make the job jar and scripts in deploy available to the job tracker. You also didn't comment on the counters for the inject job. Do you see any? Best Lewis On Wednesday, June 26, 2013, h b hb6...@gmail.com wrote: Here is an example of what I am saying about the config changes not taking effect. cd runtime/deploy cat ../local/conf/nutch-site.xml .. <property> <name>storage.data.store.class</name> <value>org.apache.gora.avro.store.AvroStore</value> </property> . cd ../.. ant job cd runtime/deploy bin/nutch inject urls -crawlId crawl1 . 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class. . So the nutch-site.xml was changed to use AvroStore as the storage class and the job was rebuilt, and I reran inject, the output of which still shows that it is trying to use MemStore. On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: The Gora MemStore was introduced to deal predominantly with test scenarios. This is justified as the 2.x code is pulled nightly and after every commit and tested. It is not thread safe and should not be used (until we fix some issues) for any kind of serious deployment. From your inject task on the job tracker, you will be able to see 'urls_injected' counters which represent the number of urls actually persisted through Gora into the datastore. I understand that HBase is not an option. Gora should also support writing the output into Avro sequence files... which can be pumped into hdfs. We have done some work on this so I suppose that right now is as good a time as any for you to try it out. Set the default datastore to org.apache.gora.avro.store.AvroStore, I think. You can double check by looking into gora.properties. As a note, you should use nutch-site.xml within the top level conf directory for all your Nutch configuration. You should then create a new job jar for use in hadoop by calling 'ant job' after the changes are made (see the sketch after this thread). hth Lewis On Wednesday, June 26, 2013, h b hb6...@gmail.com wrote: The quick responses flowing are very encouraging. Thanks Tejas. Tejas, as I mentioned earlier, in fact I actually ran it step by step. So first I ran the inject command and then readdb with the dump option and did not see anything in the dump files, which leads me to say that the inject did not work. I verified the regex-urlfilter and made sure that my url is not getting filtered. I agree that the second link is about configuring HBase as a storage DB. However, I do not have HBase installed and don't foresee getting it installed any sooner, hence using HBase for storage is not an option, so I am going to have to stick to Gora with the memory store. On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil tejas.patil...@gmail.com wrote: On Wed, Jun 26, 2013 at 9:53 PM, h b hb6...@gmail.com wrote: Thanks for the response Lewis. I did read these links, I mostly followed the first link and tried both the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer exception on solr, so I figured that I should first deal with getting the crawl part to work and then deal with solr indexing. Hence I went back to trying it stepwise. You should try running the crawl using individual commands and see where the problem is. The nutch tutorial which Lewis pointed you to had those commands. Even peeking into the bin/crawl script would also help as it calls the nutch commands.
As for the second link, it is more about using HBase as the store instead of gora. This is not really an option for me yet, because my grid does not have hbase installed yet. Getting it done is not much under my control HBase is one of the datastores supported by Apache Gora. That tutorial speaks about how to configure Nutch (actually Gora) to use HBase as a backend. So, it's wrong to say that the tutorial was about HBase and not Gora. The FAQ link is the one I had not gone through until I checked your response, but I do not find answers to any of my questions (directly/indirectly) in it. Ok On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney -- *Lewis*
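The edit-build-run cycle this thread converges on, sketched; paths are relative to a 2.x checkout and the crawlId is just an example:
  cd $NUTCH_HOME
  $EDITOR conf/nutch-site.xml    # edit the top-level conf, not runtime/local/conf
  ant job                        # bakes the new config into the job jar
  cd runtime/deploy
  bin/nutch inject urls -crawlId crawl1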
Re: Id based crawling with nutch2.x/hbase and multiple webpage tables
On Thu, Jun 27, 2013 at 12:24 AM, Tony Mullins tonymullins...@gmail.com wrote: I am grateful for the help the community is giving me and I won't be able to do it without their help. When I was using Cassandra, it only created a single 'webpage' table; whether I ran my jobs without a crawlId (directly from eclipse) or with a crawlId, it always used the same 'webpage' table. This is not the case with HBase, as HBase creates a table like 'crawlId_webpage'. So what I was asking is: is it possible to achieve the same behavior (Cassandra's) with HBase (to make HBase create only a single 'webpage' table even if I give a crawlId to my bin/crawl script)? You can customize your bin/crawl script to get that done (see the sketch after this thread). Currently it passes the crawlId argument to the nutch commands. You can check the usage of those commands and figure out if they accept -all. AFAIK, the fetch and parse commands have an -all param which you can use. Updatedb does not need it as by default it works over all batches. And I think this log is generated due to the same issue I mentioned above : Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'C11_webpage' , assuming they are the same. I have no clue what this is about. I will be looking into this in the coming days. And what do you mean by the status of the URLs ? Those indicate the status of the url. [0] is a shameless plug of my answer over stackoverflow which tells what each status stands for. These are the logs when I run my job for the first time ( Inject - generate - fetch - parse - DBUpdate) and for 2 or 3 depth levels ( generate - fetch - parse - DBUpdate). I always get these: *status:2 (status_fetched)* fetchTime:0 prevFetchTime:0 fetchInterval:0 retriesSinceFetch:0 modifiedTime:0 prevModifiedTime:0 protocolStatus:(null) Thanks again for your help. Tony. [0] : http://stackoverflow.com/questions/16853155/where-can-i-find-documentation-about-nutch-status-codes/16869165#16869165 On Thu, Jun 27, 2013 at 2:33 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: On Wed, Jun 26, 2013 at 4:30 AM, Tony Mullins tonymullins...@gmail.com wrote: Is it possible to crawl with a crawlId but have HBase only create the 'webpage' table without the crawlId prefix, just like Cassandra does? I can't understand this question Tony. And my other problems of DBUpdateJob's exception on some random urls and repeating/mixed html of all urls present in seed.txt are also resolved (disappeared) with the HBase backend. Good Am I supposed to get proper values here or is this the expected output in the ParseFilter plugin ? What is the status of the URLs which have the null or 0 values for the fields you posted? PS. Now I am getting correct HTML in ParseFilter with the HBase backend. Good
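One way to approximate Cassandra's single-table behaviour, sketched: run the individual commands without a crawlId so everything lands in the plain 'webpage' table, using -all where a batch id would otherwise be needed (the -topN value is illustrative):
  bin/nutch inject urls/
  bin/nutch generate -topN 100
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb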
Re: Questions/issues with nutch
On Wed, Jun 26, 2013 at 10:26 PM, h b hb6...@gmail.com wrote: The quick responses flowing are very encouraging. Thanks Tejas. Tejas, as I mentioned earlier, in fact I actually ran it step by step. So first I ran the inject command and then readdb with the dump option and did not see anything in the dump files, which leads me to say that the inject did not work. I verified the regex-urlfilter and made sure that my url is not getting filtered. and you see nothing interesting in the logs. Oh boy... If this happens w/o any config changes over the distribution (apart from http.agent.name), then it should have been reported by now. You might set the loggers to a lower level to get more details (a sketch of the dump-based check appears after this thread). I have a feeling that the reason is that the datastore used is buggy. I agree that the second link is about configuring HBase as a storage DB. However, I do not have HBase installed and don't foresee getting it installed any sooner, hence using HBase for storage is not an option, so I am going to have to stick to Gora with the memory store. Ok. There were Jiras logged regarding the memory store not working correctly (it was in reference to junits failing). Lewis / Renato might have more knowledge about it. Being honest, I doubt that anybody .. out there .. is actually using memstore. HBase seems to be the most cheered backend. On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil tejas.patil...@gmail.com wrote: On Wed, Jun 26, 2013 at 9:53 PM, h b hb6...@gmail.com wrote: Thanks for the response Lewis. I did read these links, I mostly followed the first link and tried both the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer exception on solr, so I figured that I should first deal with getting the crawl part to work and then deal with solr indexing. Hence I went back to trying it stepwise. You should try running the crawl using individual commands and see where the problem is. The nutch tutorial which Lewis pointed you to had those commands. Even peeking into the bin/crawl script would also help as it calls the nutch commands. As for the second link, it is more about using HBase as the store instead of gora. This is not really an option for me yet, because my grid does not have hbase installed yet. Getting it done is not much under my control HBase is one of the datastores supported by Apache Gora. That tutorial speaks about how to configure Nutch (actually Gora) to use HBase as a backend. So, it's wrong to say that the tutorial was about HBase and not Gora. The FAQ link is the one I had not gone through until I checked your response, but I do not find answers to any of my questions (directly/indirectly) in it. Ok On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Hemant, I strongly advise you to take some time to look through the Nutch Tutorial for 1.x and 2.x. http://wiki.apache.org/nutch/NutchTutorial http://wiki.apache.org/nutch/Nutch2Tutorial Also please see the FAQs, which you will find very very useful. http://wiki.apache.org/nutch/FAQ Thanks Lewis On Wed, Jun 26, 2013 at 5:18 PM, h b hb6...@gmail.com wrote: Hi, I am a first time user of nutch. I installed nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a single webpage. I am running nutch step by step. These are the problems I came across - 1. Inject did not work, i.e. the url does not reflect in the webdb (gora-memstore). The way I verify this is after running inject, I run readdb with dump. This created a directory in hdfs with a 0 size part file. 2. config files - This confused me a lot.
When run from the deploy directory, does nutch use the config files from local/conf? Changes made to local/conf/nutch-site.xml did not take effect after editing this file. I had to edit this in order to get rid of the 'http.agent.name' error. I finally ended up hard-coding this in the code, rebuilding and running to keep going forward. 3. how to interpret readdb - Running readdb -stats shows a lot of output but I do not see my url from seed.txt in there. So I do not know if the entry in the webdb actually reflects my seed.txt at all or not. 4. logs - When nutch is run from the deploy directory, the logs/hadoop.log is not generated anymore, not locally, nor on the grid. I tried to make it verbose by changing log4j.properties to DEBUG, but still had no file generated. Any help with this would help me move forward with nutch. Regards Hemant -- *Lewis*
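A sketch of the inject-then-verify cycle discussed above; the output path and part-file glob are illustrative, and the dump lands on HDFS when run in deploy mode:
  bin/nutch inject urls/
  bin/nutch readdb -dump dumpdir           # dump the webpage table to text
  hadoop fs -cat dumpdir/part-* | head     # a non-empty dump means inject worked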
Re: Questions/issues with nutch
Hi Lewis, Thanks for the details. One quickie: by using memstore as the datastore, will the results be persisted across runs ? I mean, after injecting stuff, where would the crawl datums get stored on disk so that the generate phase gets those ? I believe that memstore won't do it and would give up everything once the process ends. On Wed, Jun 26, 2013 at 11:06 PM, Tejas Patil tejas.patil...@gmail.com wrote: ...
Re: Segments / Database in Nutch 2.X
On Thu, Jun 27, 2013 at 3:38 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi I do not see the usage of Segments in nutch 2.x In addition, I do not see a DB path. segments and crawldb are notions in 1.x representing the dirs over the FS which hold the crawler's data (those are nothing but Hadoop's Map files and Sequence files). 2.x leverages datastores to store the crawled data. A table is created in the datastore to hold all the information. In such conditions, how can we run two separate crawls, one starting from url1 and the second from another seed, for example? You could specify different crawlIDs (see the sketch after this thread). Being honest, I have never tried running multiple crawls at the same time with 2.x. It's not seen as a good thing to do, as mentioned by Julien in this thread: http://lucene.472066.n3.nabble.com/Concurrently-running-multiple-nutch-crawls-td3166207.html Benjamin
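A sketch of two crawls kept apart by crawlId; with the HBase backend this creates separate crawlA_webpage and crawlB_webpage tables (the seed dirs and ids are examples):
  bin/nutch inject seeds-a/ -crawlId crawlA
  bin/nutch inject seeds-b/ -crawlId crawlB
  # ... then run generate/fetch/parse/updatedb with the matching -crawlId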
Re: nutch issues
On Thu, Jun 27, 2013 at 4:30 AM, devang pandey devangpande...@gmail.com wrote: I am quite new to nutch. I have crawled a site successfully using nutch 1.2 You should use the latest version (1.7) as it has many bug fixes and enhancements. and extracted a segment dump with the *readseg* command, but the issue is that the dump contains a lot of information other than the url and outlinks; also, if I want to analyse it, a manual approach needs to be adopted. Did you use the general options ? Those are -nocontent ignore content directory -nofetch ignore crawl_fetch directory -nogenerate ignore crawl_generate directory -noparse ignore crawl_parse directory -noparsedata ignore parse_data directory -noparsetext ignore parse_text directory To see the usage, just run bin/nutch readseg w/o any params. It would be really great if there were any utility or plugin which exports links with outlinks in a machine readable format like csv or sql. Please suggest The dump option of the readseg command would give you a dump of the segment in a plain text file which is human readable. You could run some shell commands to convert it into the desired form (see the sketch after this thread).
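A sketch of such a conversion; the segment path is an example, and the exact field labels in the dump vary by version, so check your output before relying on the grep pattern:
  bin/nutch readseg -dump crawl/segments/20130627000000 segdump \
    -nocontent -nofetch -nogenerate -noparse -noparsetext
  grep -E 'URL::|toUrl:' segdump/dump > url-outlinks.txt   # then massage into csv with sed/awk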
Re: Questions/issues with nutch
What is the datastore in gora.properties ? http://wiki.apache.org/nutch/Nutch2Tutorial On Wed, Jun 26, 2013 at 11:37 PM, h b hb6...@gmail.com wrote: Here is an example of what I am saying about the config changes not taking effect. cd runtime/deploy cat ../local/conf/nutch-site.xml .. <property> <name>storage.data.store.class</name> <value>org.apache.gora.avro.store.AvroStore</value> </property> . cd ../.. ant job cd runtime/deploy bin/nutch inject urls -crawlId crawl1 . 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class. . So the nutch-site.xml was changed to use AvroStore as the storage class and the job was rebuilt, and I reran inject, the output of which still shows that it is trying to use MemStore. On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: ...
Re: 2.x Eclipse Breakpoints
Updated the wiki :) On Mon, Jun 24, 2013 at 11:34 PM, Tejas Patil tejas.patil...@gmail.com wrote: This is based on my personal experience and ain't an exhaustive list. I am sure that other folks on @user will have more suggestions. I will put this on the relevant wiki page shortly. FetcherReducer$FetcherThread run() : line 487 : LOG.info("fetching " + fit.url) : line 519 : final ProtocolStatus status = output.getStatus(); GeneratorMapper : map() : line 53 GeneratorReducer : reduce() : line 53 OutlinkExtractor : getOutlinks() : line 84 On Mon, Jun 24, 2013 at 6:44 PM, Prashant Ladha prashant.la...@gmail.com wrote: Hi, Just wanted to share feedback on the Nutch-Eclipse setup. The Debug Nutch in Eclipse section has breakpoints related to Nutch 1.x. If anyone can document the helpful breakpoints for 2.x, that would be helpful. [0] - http://wiki.apache.org/nutch/RunNutchInEclipse P.S.: Is this the right forum for discussing these topics?
Re: FBA / Cookies
AFAIK, Nutch supports proxy authentication but won't do Form based authentication. On Mon, Jun 24, 2013 at 2:57 AM, jonathan_ou...@mcafee.com wrote: Hello there, I'm sure this has been asked before, however looking online and in archived e-mails I cannot find a recent response so… Does Nutch support Form Based Authentication via credentials and stored cookies? I'm aware that there are a number of challenges; detecting when a page request has been redirected to the authentication page, detecting when cookies time out etc. But I was interested to hear if Nutch now supports this feature. Regards *Jonathan Oulds* Software Engineer / PSC – Drive Encryption *McAfee, Inc.*
Re: need legends for fetch reduce jobtracker output
What would be the right place to add a note explaining the Fetcher logs ? I think the FAQ wiki page, under the fetching section [0] ? [0] : http://wiki.apache.org/nutch/FAQ#Fetching On Mon, Apr 22, 2013 at 10:37 PM, kiran chitturi chitturikira...@gmail.com wrote: Yes Lewis. It would be the best way for the permissions right now. I will add Tejas once he shares his wiki uid. On Tue, Apr 23, 2013 at 1:07 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I agree. I can sort this tomorrow. @Kiran, Are we still working on the addition of documentation contributors via the contributors and admin groups since the most recent lockdown? Tejas should be added to both groups. @Tejas please drop one of us your wiki uid whenever it suits. Lewis On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote: Hi Lewis, Thanks !! I have huge respect for those who engineered the Fetcher class (esp. of 1.x) as it's simply *awesome* and a complex piece of code. I can polish my post more so that it comes up to wiki quality. I don't have access to the wiki. Can you provide me the same ? Thanks, Tejas On Mon, Apr 22, 2013 at 8:09 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: hi Tejas, this is a real excellent reply and very useful. it would be really great if we could somehow have this kind of low level information readily available on the Nutch wiki. On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote: Fetcher threads try to get a fetch item (url) from a queue of all the fetch items (this queue is actually a queue of queues. For details see [0]). If a thread doesn't get a fetch-item, it spin-waits for 500ms before polling the queue again. The '*spinWaiting*' count tells us how many threads are in their spin-waiting state at a given instant. The '*active*' count tells us how many threads are currently performing the activities related to the fetch of a fetch-item. This involves sending requests to the server, getting the bytes from the server, parsing, storing etc. '*pages*' is a count of total pages fetched till a given point. '*errors*' is a count of total errors seen. *Next comes pages/s:* The first number comes from this: ((((float)pages.get())*10)/elapsed)/10.0 The second one comes from this: (actualPages*10)/10.0 actualPages holds the count of pages processed in the last 5 secs (when the calculation is done). The first number can be seen as the overall speed for that execution. The second number can be regarded as the instantaneous speed as it just uses the #pages in the last 5 secs when this calculation is done. See lines 818-830 in [0] (a worked example follows this thread). *Next come the kb/s* values, which are computed as follows: ((((float)bytes.get())*8)/1024)/elapsed (((float)actualBytes)*8)/1024 This is similar to that of pages/sec. See lines 818-830 in [0]. '*URLs*' indicates how many urls are pending and '*queues*' indicates the number of queues present. Queues are formed on the basis of hostname or ip depending on the configuration set. See FetcherReducer.java [0] for more details. [0] : http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java?view=markup On Mon, Apr 22, 2013 at 6:09 PM, kaveh minooie ka...@plutoz.com wrote: could someone please tell me one more time, in this line: 0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, 2632 7346 kb/s, 989 URLs in 5 queues reduce what are the two numbers before pages/s and two numbers before kb/s? thanks, -- *Lewis* -- *Lewis* -- Kiran Chitturi http://www.linkedin.com/in/kiranchitturi
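A worked example of the overall pages/s figure from the sample log line above; the elapsed value here is made up purely for illustration (the real one is the job's elapsed runtime in seconds):
  pages=53852; elapsed=13000                           # elapsed is hypothetical
  echo "scale=1; ($pages * 10 / $elapsed) / 10" | bc   # prints 4.1, matching the log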
Re: need legends for fetch reduce jobtracker output
Thanks Lewis. The info on the fetcher log is added to the FAQ page here [0]. [0] : https://wiki.apache.org/nutch/FAQ#What_do_the_numbers_in_the_fetcher_log_indicate_.3F On Sat, Jun 22, 2013 at 12:54 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Sounds great Tejas. Wow this is a late shift. If you can commit your fetcher diagnostics it would be great Tejas. On Saturday, June 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote: ...
Re: Most stable backend for Nutch 2.x
Yup. HBase 0.90.x is the best datastore for 2.x On Sat, Jun 22, 2013 at 8:19 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Imran, HBase 0.90.x thank you Lewis On Saturday, June 22, 2013, imran khan imrankhan.x...@gmail.com wrote: Greetings, I have seen many mails here about people having different issues with different backends with Nutch 2.x. So which backend is most suited/stable with Nutch 2.x, and also which version of that suited/stable backend? Please ignore the 'according to your business requirement' factor, as my most important requirement is that I want to run Nutch 2.x smoothly without any issues. Regards, Khan -- *Lewis*
Re: Nutch 2.x with HBase backend errors
As mentioned in [0], use an older (0.90.x) version of HBase. Unfortunately, the HBase folks have removed the link from the downloads page. You can grab the source code from [1] and build it (see the sketch after this thread). [0] : http://wiki.apache.org/nutch/Nutch2Tutorial [1] : https://svn.apache.org/repos/asf/hbase/tags/0.90.4/ On Fri, Jun 21, 2013 at 11:01 AM, Tony Mullins tonymullins...@gmail.com wrote: On the site http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&redirect=GORA_HBase it's said that: N.B. It's possible to encounter the following exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration; this is caused by the fact that sometimes the hbase TEST jar is deployed in the lib dir. To resolve this just copy the lib over from your installed HBase dir into the build lib dir. (This issue is currently in progress). I have tried copying the HBase 0.94.8 lib into Nutch2.x/build/lib but still get the same error. In my Nutch2.x/build/lib there are older versions of zookeeper and hbase, and if I try to replace them with the newer jars from HBase/lib, then too it doesn't work. Please suggest what else I should do. Thanks, Tony. On Fri, Jun 21, 2013 at 8:12 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi , After getting some errors with the Cassandra backend with Nutch2.x , I am now trying HBase. I have installed HBase 0.94.8 and have also created a sample table in it. After following these links http://wiki.apache.org/nutch/RunNutchInEclipse http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&redirect=GORA_HBase I am getting this error when I try to run my first injector job: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108) at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102) at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161) at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135) at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75 I have noticed one strange thing: while we tell gora.properties that our cassandra is running at localhost:9160 , we don't do any such thing in the case of HBase, just telling it that our default datastore is HBase. So is there any missing step in these tutorials which could cause this exception ? Thanks, Tony.
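A sketch of building 0.90.4 from that tag, assuming Subversion and Maven are installed (HBase 0.90.x builds with Maven):
  svn export https://svn.apache.org/repos/asf/hbase/tags/0.90.4/ hbase-0.90.4
  cd hbase-0.90.4
  mvn package -DskipTests    # built artifacts end up under target/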
Re: A bug in the crawl script in Nutch 1.6
Thanks Joe for pointing it out. There was a jira [0] for this bug and the change is already present in the trunk. [0] : https://issues.apache.org/jira/browse/NUTCH-1500 On Fri, Jun 21, 2013 at 7:11 PM, Joe Zhang smartag...@gmail.com wrote: The new crawl script is quite useful. Thanks for the addition. It comes with a bug, though: Line 169: $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $SEGMENT should be: $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT instead.
Re: confusion over fetch schedule
On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang smartag...@gmail.com wrote: Sorry, Nutch is certainly aware of page modification, and it does capture lastModified. Nutch does capture the last modified field but I am not sure if its value is used further on. I remember that it was not being used for any logic in older versions but I need to confirm whether the code has been modified to take that into account. The real question is, can nutch get the lastModified of a page before fetching, and use it to make fetching decisions (e.g., whether or not to override the default interval)? No. Nutch won't look up the lastModified of a page before fetching its content. On Fri, Jun 21, 2013 at 6:27 PM, Joe Zhang smartag...@gmail.com wrote: If I don't change the default value of db.fetch.interval.default, which is 30 days, does it mean that the URL in the db won't be refetched before the due time even if it has been modified? In other words, is Nutch aware of page modification?
Re: confusion over fetch schedule
I just checked the current code and it seems to me that lastModified (aka Modified time in the CrawlDatum class) is not used for any further logic. If you want to customize the fetch interval for a subset of pages, do as Lewis suggested, i.e. specify a customized fetch interval for the main pages in the inject command [0] (see the sketch after this thread). [0] : http://wiki.apache.org/nutch/bin/nutch_inject On Fri, Jun 21, 2013 at 8:06 PM, Joe Zhang smartag...@gmail.com wrote: Thanks, guys. So, just to confirm, lastModified is not used in the fetching logic at all. Ideally, it should take higher priority than the default interval. This is particularly important for sites such as cnn.com, where the leaf pages don't really change, but the portal page is updated all the time. On Fri, Jun 21, 2013 at 7:40 PM, Tejas Patil tejas.patil...@gmail.com wrote: ...
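A sketch of a per-URL fetch interval set at inject time via seed-file metadata (1.x syntax; the key=value pairs are tab-separated and the interval is in seconds, so the cnn.com line and the 3600 value are just examples):
  # urls/seed.txt -- refetch the portal page hourly:
  # http://www.cnn.com/<TAB>nutch.fetchInterval=3600
  bin/nutch inject crawl/crawldb urls/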
Re: run nutch-1.6 in eclipse
Hi Mustafa, Those steps were added recently and would work for the current nutch trunk. They won't work for nutch 1.6. Please start from step #1, i.e. check out nutch from the repo. If you still face issues, get back to us with precise details about the error and at which step it occurs. Thanks, Tejas On Wed, Jun 19, 2013 at 1:37 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Mustafa, Please read this thoroughly http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists Once we understand from a well detailed email what the problem is, we are more than willing to help. Until then it is really difficult to help you out. Sorry. Lewis On Wed, Jun 19, 2013 at 1:30 PM, Mustafa Elkhiat melkh...@gmail.com wrote: Hi, I want to run nutch-1.6 in eclipse. I followed the guide http://wiki.apache.org/nutch/RunNutchInEclipse but am having errors. Please can anyone help me run nutch in eclipse in detail -- *Lewis*
Re: How to get raw HTML in @override Filter method in ParseFilter class
Hey Jamshaid, We cannot see any screenshot attached. Could you upload it somewhere and share the url ? On Thu, Jun 13, 2013 at 11:25 PM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, Thanks for the prompt reply! I have set a debug point on the following line in the plugin code in eclipse but get a 'source not found' screen when debugging the plugin code in eclipse. Please see the attached screenshot. String content = new String(page.getContent().array()); What might cause this to happen and how can I fix it? Regards, Jamshaid On Thu, Jun 13, 2013 at 8:34 PM, feng lu amuseme...@gmail.com wrote: Hi I checked the ParseFilter interface in Nutch 2.x; it looks like this: Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc); Through this method you can get the raw content of the html page: String content = new String(page.getContent().array()); and get the parsed text through the parse.getText() method. On Thu, Jun 13, 2013 at 11:10 PM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, Since I'm using the nutch 2.2 ParseFilter plugin, I need to extract custom information from the parsed raw html (preferably using JSoup)... but I still couldn't find out how to get the raw html in the @Override filter() method. All the examples I have found are in the Nutch 1.x api and don't work with the new Nutch 2.x api. Thanks in advance! Regards, Jamshaid -- Don't Grow Old, Grow Up... :-)
Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?
sure. I would make a note of it. Do you have any other suggestions that would make the documentation better ? On Thu, Jun 13, 2013 at 10:29 PM, Tony Mullins tonymullins...@gmail.com wrote: Thanks again Tejas. This worked. I think all these configurations should be in the wiki.. so that new users, especially those coming from a non-java background (like me), could benefit from this and create less traffic on the mailing lists :) Thanks. Tony. On Thu, Jun 13, 2013 at 10:38 PM, Tejas Patil tejas.patil...@gmail.com wrote: I can't see the image that you attached. Anyways, if you are running via the command line (ie. runtime/local): set plugin.folders to plugins in NUTCH_HOME/runtime/local/conf/nutch-site.xml. For running from Eclipse, set plugin.folders to the absolute path of the directory where the plugins are generated (ie. NUTCH_HOME/build/plugins) in NUTCH_HOME/conf/nutch-site.xml On Thu, Jun 13, 2013 at 5:38 AM, Tony Mullins tonymullins...@gmail.com wrote: Tejas, I can now successfully run the plugin from the terminal like bin/nutch parsechecker http://www.google.nl But if I try to run my code directly from eclipse, with the main class as 'org.apache.nutch.parse.ParserChecker' and program arguments as 'http://www.google.nl', it fails with the same ClassNotFound exception. Please see the attached image. [image: Inline image 1] I have tried 'ant clean' on my Nutch2.2 source... but same error !!! Could you please help me fix this issue. Thanks, Tony On Thu, Jun 13, 2013 at 2:23 PM, Tony Mullins tonymullins...@gmail.com wrote: Thank you very much Tejas. It worked. :) Just wondering why did you ask me to remove the 'plugin.folders' entry from conf/nutch-site.xml ? And was the problem due to a bad cache/runtime build ? Thank you again !!! Tony. On Thu, Jun 13, 2013 at 1:47 PM, Tejas Patil tejas.patil...@gmail.com wrote: I don't see any attachments with the mail. Anyways, you need to: 1. remove all your changes from conf/nutch-default.xml. Make it in sync with svn. (rm conf/nutch-default.xml; svn up conf/nutch-default.xml) 2. In conf/nutch-site.xml, remove the entry for plugin.folders 3. run ant clean runtime Now try again. On Thu, Jun 13, 2013 at 1:39 AM, Tony Mullins tonymullins...@gmail.com wrote: Hi Tejas, Thanks for pointing out the problem. I have changed the package to kaqqao.nutch.selector and have also modified the package in the java source files as package kaqqao.nutch.selector; But I am still getting the ClassNotFound exception... please see the attached images !!! Please note that I am using the fresh Nutch 2.2 source without any additional patch ... do I need to apply any patch to run this ? Thanks, Tony. On Thu, Jun 13, 2013 at 1:16 PM, Tejas Patil tejas.patil...@gmail.com wrote: The package structure you actually have is: kaqqao.nutch.plugin.selector; In src/plugin/element-selector/plugin.xml you have defined it as: <extension id="kaqqao.nutch.selector.HtmlElementSelectorIndexer" name="Nutch Blacklist and Whitelist Indexing Filter" point="org.apache.nutch.indexer.IndexingFilter"> <implementation id="HtmlElementSelectorIndexer" class="kaqqao.nutch.selector.HtmlElementSelectorIndexer"/> </extension> It ain't the same, and that's why it cannot load that class at runtime. Make it consistent and try again. It worked at my end after changing the package structure to kaqqao.nutch.selector On Wed, Jun 12, 2013 at 11:45 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi Tejas, I am following this example https://github.com/veggen/nutch-element-selector. And now I have tried this example without any changes on my fresh source of Nutch 2.2.
Attached is my patch (change set) on the fresh Nutch 2.2 source. Kindly review it and please let me know if I am missing something. Thanks, Tonny On Thu, Jun 13, 2013 at 11:19 AM, Tejas Patil tejas.patil...@gmail.com wrote: Weird. I would like to have a quick peek into your changes. Maybe you are doing something wrong which is hard to predict and figure out by asking a bunch of questions over email. Can you attach a patch file of your changes ? Please remove the fluff from it and only keep the bare essential things in the patch. Also, if you are working for some company, make sure that attaching some code here is not against your organisational policy. Thanks, Tejas On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins tonymullins...@gmail.com wrote: I have done
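A condensed checklist for the plugin-won't-load symptoms in this thread, sketched as shell comments (the URL is just the example used above):
  # 1. the package in the .java files must match the class= attribute in plugin.xml
  # 2. the plugin must be registered in src/plugin/build.xml
  # 3. its id must appear in plugin.includes in conf/nutch-site.xml
  # 4. plugin.folders: "plugins" for runtime/local, absolute build/plugins path for Eclipse
  ant clean runtime
  runtime/local/bin/nutch parsechecker http://www.google.nl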
Re: Nutch 2.2 - Exception in thread 'main' [org.apache.gora.sql.store.SqlStore]
There was discussion about that a few months back and I am not aware of the exact root cause behind it. See http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-td4040592.html http://lucene.472066.n3.nabble.com/Re-nutch-2-1-with-mysql-different-batch-id-null-td4058698.html There is a Jira to track the same: https://issues.apache.org/jira/browse/NUTCH-1567 On Thu, Jun 13, 2013 at 2:11 PM, Weder Carlos Vieira weder.vie...@gmail.com wrote: mhmmm got it... Tejas, can you please explain to me why, when I put some URLs inside urls/seed.txt, many pages under those urls aren't parsed? Example: Skipping http://wiki.creativecommons.org/Integrate; different batch id (null) Skipping http://wiki.creativecommons.org/LRMI; different batch id (null) Skipping http://wiki.creativecommons.org/Marking; different batch id (null) These pages are examples of many other pages that aren't parsed. Likewise, there are many other pages that I wanted to be read and recorded in the database. Thanks again. On Thu, Jun 13, 2013 at 6:04 PM, Tejas Patil tejas.patil...@gmail.com wrote: Those are all images which won't get parsed by Nutch. On Thu, Jun 13, 2013 at 1:33 PM, Weder Carlos Vieira weder.vie...@gmail.com wrote: I extracted 1 row of these urls returned... It is attached in excel format.
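A commonly suggested workaround for the 'different batch id (null)' skips, not a fix for the root cause tracked in NUTCH-1567, is to run the steps over all batches rather than a single batch id:
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb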
Re: Nutch job fails with the exception Caused by: java.io.IOException: Could not obtain block: blk
Thanks for sharing. This seems to be more towards the HBase end and less about Nutch. On Thu, Jun 13, 2013 at 11:52 PM, vivekvl vive...@yahoo.com wrote: A few references for this issue.. http://hbase.apache.org/book.html#dfs.datanode.max.xcievers http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/ -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-job-fails-with-the-exception-Caused-by-java-io-IOException-Could-not-obtain-block-blk-tp4069322p4070438.html Sent from the Nutch - User mailing list archive at Nabble.com.
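The setting those references converge on, sketched as the property to merge into each datanode's hdfs-site.xml (the value is a common starting point, not a prescription; restart the datanodes afterwards):
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>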
Re: Reference 3rd Party Jar in Nutch 2.x source
Adding things to the classpath in Eclipse won't help. Look into NUTCH_HOME/src/plugin/parse-swf/plugin.xml. That plugin uses an external jar: javaswf.jar. This applies if you want the jar to live with the codebase. If your jar needs to be picked up from a maven repo, you will have to specify it in ivy.xml and plugin.xml. On Fri, Jun 14, 2013 at 12:43 AM, imran khan imrankhan.x...@gmail.com wrote: I need that jar in one of my custom Nutch plugins. On Fri, Jun 14, 2013 at 12:20 PM, imran khan imrankhan.x...@gmail.com wrote: Greetings, I want to add a 3rd party jar file to my existing Nutch 2.x source. I have tried adding it in the Libraries dialog of my Eclipse via 'Add External Jars', but I am still getting package errors on import of that jar in the source file where I want to use it. Could you please help me in adding a 3rd party jar to my Nutch 2.x source in Eclipse. Regards Imran
Re: Reference 3rd Party Jar in Nutch 2.x source
On Fri, Jun 14, 2013 at 1:58 AM, imran khan imrankhan.x...@gmail.com wrote: Ok, for the ivy dependency I will add my 3rd party jar in dependencies; where do I have to define it in plugin.xml ? Look at how the commons-net dependency is defined here: http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-ftp/plugin.xml?revision=1175075&view=markup And for the 2nd option, like the parse-swf plugin, I will just place my jar in the lib directory of the plugin and add its reference in runtime ? Yes. Add the reference in plugin.xml as done here (see the sketch after this thread): http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-swf/plugin.xml?revision=1175075&view=markup Regards, Imran On Fri, Jun 14, 2013 at 1:19 PM, Tejas Patil tejas.patil...@gmail.com wrote: ...
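A sketch of the runtime section of such a plugin.xml, modeled on the parse-swf example; the jar names here are hypothetical placeholders:
  <runtime>
     <library name="myplugin.jar">
        <export name="*"/>
     </library>
     <library name="my-thirdparty.jar"/>
  </runtime>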
Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?
Weird. I would like to have a quick peek into your changes. Maybe you are doing something wrong which is hard to predict and figure out by asking a bunch of questions over email. Can you attach a patch file of your changes ? Please remove the fluff from it and only keep the bare essential things in the patch. Also, if you are working for some company, make sure that attaching some code here is not against your organisational policy. Thanks, Tejas On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins tonymullins...@gmail.com wrote: I have done all this. Created my plugin's ivy.xml, plugin.xml, build.xml. Added the entry in nutch-site.xml and src/plugin/build.xml. But I am still getting PluginRuntimeException: java.lang.ClassNotFoundException Is there any other configuration that I am missing, or is it a Nutch 2.2 issue ? Thanks, Tony. On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil tejas.patil...@gmail.com wrote: Here is the relevant wiki page: http://wiki.apache.org/nutch/WritingPluginExample Although it's old, I think that it will help. On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Tony, you have to register your plugin in src/plugin/build.xml Does your src/plugin/myplugin/plugin.xml properly propagate the jar file, extension point and implementing class? And, finally, you have to add your plugin to the property plugin.includes in nutch-site.xml Cheers, Sebastian On 06/12/2013 07:48 PM, Tony Mullins wrote: Hi, I am trying a simple ParseFilter plugin in Nutch 2.2. I can build it and also src/plugin/build.xml successfully. But its .jar file is not being created in my runtime/local/plugins/myplugin directory. And on running bin/nutch parsechecker http://www.google.nl I get this error: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: com.xyz.nutch.selector.HtmlElementSelectorFilter If I go to MyNutch2.2Source/build/myplugin, I can see the plugin's jar with the test classes directory created there.
Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?
A side note (not related to the problem you faced, but a general rule with nutch): In your patch I saw that you had changes in both nutch-default.xml and nutch-site.xml. Please modify only nutch-site.xml and keep nutch-default.xml as it is.
Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?
The package structure you actually have is: *kaqqao.nutch.plugin.selector* In src/plugin/element-selector/plugin.xml you have defined it as: <extension id="kaqqao.nutch.selector.HtmlElementSelectorIndexer" name="Nutch Blacklist and Whitelist Indexing Filter" point="org.apache.nutch.indexer.IndexingFilter"> <implementation id="HtmlElementSelectorIndexer" class="kaqqao.nutch.selector.HtmlElementSelectorIndexer"/> </extension> It isn't the same, and that's why it cannot load that class at runtime. Make it consistent and try again. It worked at my end after changing the package structure to kaqqao.nutch.selector On Wed, Jun 12, 2013 at 11:45 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi Tejas, I am following this example: https://github.com/veggen/nutch-element-selector. And now I have tried this example, without any changes, on a fresh source of Nutch 2.2. Attached is my patch (change set) on the fresh Nutch 2.2 source. Kindly review it and please let me know if I am missing something. Thanks, Tonny
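To make the fix concrete: whichever direction you pick, plugin.xml and the Java sources must name the same package. Tejas' route renames the Java package to match plugin.xml; a sketch of the consistent pair:

   <extension id="kaqqao.nutch.selector.HtmlElementSelectorIndexer"
              name="Nutch Blacklist and Whitelist Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="HtmlElementSelectorIndexer"
                      class="kaqqao.nutch.selector.HtmlElementSelectorIndexer"/>
   </extension>

together with the declaration package kaqqao.nutch.selector; at the top of HtmlElementSelectorIndexer.java.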
Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?
I don't see any attachments with the mail. Anyways, you need to: 1. remove all your changes from conf/nutch-default.xml and make it in sync with svn (rm conf/nutch-default.xml ; svn up conf/nutch-default.xml) 2. in conf/nutch-site.xml, remove the entry for plugin.folders 3. run ant clean runtime Now try again. On Thu, Jun 13, 2013 at 1:39 AM, Tony Mullins tonymullins...@gmail.com wrote: Hi Tejas, Thanks for pointing out the problem. I have changed the package to kaqqao.nutch.selector and have also modified the package declaration in the java source files to package kaqqao.nutch.selector; But I am still getting the ClassNotFound exception... please see the attached images !!! Please note that I am using fresh Nutch 2.2 source without any additional patch ... do I need to apply any patch to run this? Thanks, Tony.
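Put as a command sequence run from NUTCH_HOME (assuming an svn checkout, as in the steps above):

   rm conf/nutch-default.xml
   svn up conf/nutch-default.xml
   ant clean runtime

Step 2, removing the plugin.folders entry, is a manual edit of conf/nutch-site.xml before the rebuild.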
Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?
I can't see the image that you attached. Anyways, if you are running via the command line (i.e. runtime/local): set plugin.folders to plugins in NUTCH_HOME/runtime/local/conf/nutch-site.xml. For running from Eclipse, set plugin.folders to the absolute path of the directory where the plugins are generated (i.e. NUTCH_HOME/build/plugins) in NUTCH_HOME/conf/nutch-site.xml On Thu, Jun 13, 2013 at 5:38 AM, Tony Mullins tonymullins...@gmail.com wrote: Tejas, I can now successfully run the plugin from the terminal, like bin/nutch parsechecker http://www.google.nl But if I try to run my code directly from Eclipse, with the main class 'org.apache.nutch.parse.ParserChecker' and the program argument 'http://www.google.nl', it fails with the same ClassNotFound exception. Please see the attached image. I have tried 'ant clean' in my Nutch 2.2 source... but same error !!! Could you please help me fix this issue. Thanks, Tony On Thu, Jun 13, 2013 at 2:23 PM, Tony Mullins tonymullins...@gmail.com wrote: Thank you very much Tejas. It worked. :) Just wondering, why did you ask me to remove 'plugin.folders' from conf/nutch-site.xml? And was the problem due to a bad cache/runtime build? Thank you again !!! Tony.
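In config terms, the command-line case above amounts to an entry like this in runtime/local/conf/nutch-site.xml (a sketch; the description text is illustrative):

   <property>
     <name>plugin.folders</name>
     <value>plugins</value>
     <description>Directories to search for plugins, relative or absolute.</description>
   </property>

For the Eclipse case, the value instead becomes an absolute path such as /path/to/nutch/build/plugins (a hypothetical location; use your own checkout's build/plugins directory).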
Re: Nutch 1.6 on CDH4.2.1
How should I define the Hue job so that it recognizes Nutch's .job jar file and/or make the CDH4 Hue consistent with the hadoop/hdfs shell commands? Could you try posting to the hue and CDH4 user groups? We don't promise compatibility across the several hadoop distributions out there. See https://issues.apache.org/jira/browse/NUTCH-1447 On Thu, Jun 13, 2013 at 7:39 AM, Byte Array byte.arra...@gmail.com wrote: Hello! I am trying to run a simple crawl with Nutch 1.6 on CDH4.2.1 on a CentOS 6.2 cluster. First I had problems with # hadoop jar apache-nutch-1.6.job org.apache.nutch.fetcher.Fetcher /nutch/1.6/crawl/segments/20130613095319 which was returning: java.lang.RuntimeException: problem advancing post rec#0 at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1183) at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:255) at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:251) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:40) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:506) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:447) Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:206) . . . Also, I noticed an inconsistency between the file system shown with hdfs dfs -ls and the one shown in the CDH4 Hue GUI. The former seems to simply create the folders/files locally and is not aware of the ones I create through the Hue GUI. Therefore, I suspected that the job is not properly running on the CDH4 cluster and used the Hue GUI to create a /user/admin/Nutch-1.6 folder and urls/seed.txt and to upload the Nutch 1.6 .job file (previously configured and built with ant in Eclipse). When I submit the job through Hue it logs a ClassNotFoundException, although I properly defined the path to the .job file on HDFS and the class name in that file: ... Failing Oozie Launcher, Main class [org.apache.nutch.crawl.Injector], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Injector not found java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Injector not found ... How should I define the Hue job so that it recognizes Nutch's .job jar file and/or make the CDH4 Hue consistent with the hadoop/hdfs shell commands? This thread looks related: http://www.mail-archive.com/user@nutch.apache.org/msg07603.html Thank you
Re: what is stored in the hbase after inject job
row-key = url column=f:fi : fetchInterval (the delay between re-fetches of a page) column=f:ts : fetchTime (indicates when the url will be eligible for fetching) column=mk:_injmrk_ : markers column=mk:dist column=mtdt:_csh_ : metadata column=s:s : status (is the url fetched, unfetched, newly injected, gone, redirected etc.) On Thu, Jun 13, 2013 at 6:40 AM, RS tinyshr...@163.com wrote: I do not know what is stored in hbase after injecting a website. When I use the hbase shell ($ scan 'webpage'), there is: hbase(main):028:0> scan '1_webpage' ROW COLUMN+CELL com.xinhuanet.www:http/ column=f:fi, timestamp=1371110099941, value=\x00'\x8D\x00 com.xinhuanet.www:http/ column=f:ts, timestamp=1371110099941, value=\x00\x00\x01?\x87\xBA\x0A com.xinhuanet.www:http/ column=mk:_injmrk_, timestamp=1371110099941, value=y com.xinhuanet.www:http/ column=mk:dist, timestamp=1371110099941, value=0 com.xinhuanet.www:http/ column=mtdt:_csh_, timestamp=1371110099941, value=?\x80\x00\x00 com.xinhuanet.www:http/ column=s:s, timestamp=1371110099941, value=?\x80\x00\x00 1 row(s) in 0.0300 seconds So, are only 6 columns set in hbase? And what is the real data stored in them? I find that in the source code there is a WebPage class. I could not understand it all, but I think there should be 24 fields in hbase for each website. public static final String[] _ALL_FIELDS = {"baseUrl", "status", "fetchTime", "prevFetchTime", "fetchInterval", "retriesSinceFetch", "modifiedTime", "prevModifiedTime", "protocolStatus", "content", "contentType", "prevSignature", "signature", "title", "text", "parseStatus", "score", "reprUrl", "headers", "outlinks", "inlinks", "markers", "metadata", "batchId",}; Thanks HeChuan
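To inspect a single cell instead of scanning the whole table, the hbase shell's get command takes the table, row key and column, e.g. with the row from the scan above:

   hbase(main):001:0> get '1_webpage', 'com.xinhuanet.www:http/', 'f:ts'

As for the 24 fields: HBase only materializes columns that have actually been written, and inject writes just a handful of the WebPage fields, which is why a freshly injected url shows 6 columns rather than one per field. The remaining columns appear later as fetch, parse and updatedb fill them in.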
Re: Installing Nutch1.6 on Windows7
On Thu, Jun 13, 2013 at 4:50 AM, Andrea Lanzoni a.lanz...@alice.it wrote: Hi Lewis and thanks for your reply. I will try to be as detailed as possible to let you understand the shortcomings I ran into during the installation of Nutch and Solr on my PC, which is a 64 bit machine running Windows 7. I canvassed the Apache wiki and found this link: http://wiki.apache.org/nutch/FabioGiavazzi/HowtoGettingNutchRunningonWindows Even though it explains Nutch 1.2 installation on Windows 7, I thought it might fit Nutch 1.6 as well. Everything went smoothly until step 4 of the presentation. Step 4 puzzled me and I decided to skip it. I went on and made the changes in XML as stated in the ensuing steps. The problem arose at step 7, where it reads: Go to nutch-1.2\conf\ and edit the file crawl-urlfilter.txt Those steps are old. Skip step #7 and proceed. I didn't find it in my Nutch directories and got into a cul de sac. PS: I just learned what cul de sac means !! thanks for adding to my vocab :) Presently on my PC I have installed the following: C:\Users\Andrea\Documents\apache-tomcat-7.0.40\apache-tomcat-7.0.40 C:\cygwin\home\apache-nutch-1.6-bin\apache-nutch-1.6 C:\cygwin\home\solr-4.2.0\solr-4.2.0 I am also asking your advice on this: is it only my laziness/stubbornness to continue to install them on Windows; in other terms, would you suggest dropping Windows, installing Ubuntu and restarting the procedure in the new operating system? I have come across Ubuntu only once; does it permit hosting and using both Windows and Ubuntu on the same PC? You can have Ubuntu + Windows installed on the same machine. See this: http://www.ubuntu.com/download/desktop/install-ubuntu-with-windows It's super easy to set up Ubuntu with that, and it's worth spending an hour or two. Thanks for your welcome opinion. Andrea On 13/06/2013 01:49, Lewis John Mcgibbney wrote: Hi Andrea, Please describe the problem here. There is an absence of any detail about what is wrong. Thanks On Wednesday, June 12, 2013, Andrea Lanzoni a.lanz...@alice.it wrote: Hi everyone, I am a newcomer to Nutch and Solr and, after studying the literature available on the web, I tried to install them on _Windows 7_. I have not been able to match the few instructions on the wiki, nor could I find a guide updated to Nutch 1.6, only ones for older versions. I tried following the older guides on the web but never succeeded with the installation, often because of differences between what the guide said and what I saw on screen. I followed the steps by installing: - Tomcat - Java jdk 7 - Cygwin, Nutch 1.6 and Solr 4 Everything went apparently smoothly and I copied Nutch and Solr into: C:\cygwin\home\apache-nutch-1.6-bin and C:\cygwin\home\solr-4.2.0\solr-4.2.0 whilst the two folders, jdk1.7.0_21 and jre7, are within the Java folder in the Programs directory. I apologize for my dumbness but I couldn't find how to manage it. If somebody has a clear and detailed step-by-step pattern to follow for installing Nutch 1.6 and Solr 4 I would be very grateful. Thanks in advance. Andrea Lanzoni
Re: Nutch 2.2 - Exception in thread 'main' [org.apache.gora.sql.store.SqlStore]
On Thu, Jun 13, 2013 at 12:41 PM, Weder Carlos Vieira weder.vie...@gmail.com wrote: Hello everyone! This is my first mail here. Welcome !! I want to know more and more about nutch and share what I find out by myself with you. Thanks if someone can help me too. I was trying to use nutch. First I set up and tested nutch 2.1 and it works fine, but many of the crawled urls were saved in MySQL with a null value, and just a few urls with status=2. I don't understand that but I went on... Next I tried to set up and use (test) Nutch 2.2; in this case, when I start the command to initiate the crawl I get the error below. Exception in thread main java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore at java.net.URLClassLoader$1.run(URLClassLoader.java:366) . Was the 'SqlStore' class removed from nutch 2.2? Because on Nutch 2.1 this error doesn't appear. Yes. Nutch 2.2 uses Apache Gora 0.3, which has dropped its support for MySQL as a data store. Follow https://wiki.apache.org/nutch/Nutch2Tutorial - On the other hand, I want to ask a second question: how can I improve the configuration of Nutch 2.1 (which works fine) to fetch more and more urls without 'null values'? What do you mean by w/o null values? Thanks a lot. Weder
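For anyone switching backends after this change: in Nutch 2.2 the datastore is selected in ivy/ivy.xml (uncomment the dependency of the Gora module you want) and conf/gora.properties (set the default store). A sketch for HBase, assuming the module name and revision shipped with the 2.2 release:

   <!-- ivy/ivy.xml -->
   <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

   # conf/gora.properties
   gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The same store class also goes into the storage.data.store.class property in nutch-site.xml, as described in the Nutch2Tutorial linked above.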
Re: Nutch 2.2 - Exception in thread 'main' [org.apache.gora.sql.store.SqlStore]
On Thu, Jun 13, 2013 at 1:04 PM, Weder Carlos Vieira weder.vie...@gmail.com wrote: So from Nutch 2.2 with gora 0.3 on, MySQL will no longer be supported? Nutch 2.2, which uses gora 0.3 by default, won't support MySQL. There might be a possibility of making that happen by tweaking dependencies but I have never tried it. See lines 102-112 in ivy/ivy.xml. I mean that the crawl isn't parsing many urls and I don't know why. Can you share a few of those urls? Weder
Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?
Here is the relevant wiki page: http://wiki.apache.org/nutch/WritingPluginExample Although it's old, I think that it will help. On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Tony, you have to register your plugin in src/plugin/build.xml Does your src/plugin/myplugin/plugin.xml properly propagate jar file, extension point and implementing class? And, finally, you have to add your plugin to the property plugin.includes in nutch-site.xml Cheers, Sebastian On 06/12/2013 07:48 PM, Tony Mullins wrote: Hi, I am trying a simple ParseFilter plugin in Nutch 2.2. I can build it, and also src/plugin/build.xml, successfully. But its .jar file is not being created in my runtime/local/plugins/myplugin directory. And on running bin/nutch parsechecker http://www.google.nl I get this error: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: com.xyz.nutch.selector.HtmlElementSelectorFilter If I go to MyNutch2.2Source/build/myplugin, I can see the plugin's jar with a test classes directory created there. If I copy the .jar from here and paste it to my runtime/local/plugins/myplugin directory with the plugin.xml file, then too I get the same class-not-found exception. I have not made any changes in src/plugin/build-plugin.xml. Could you please guide me on what I am doing wrong here? Thanks, Tony
Re: Nutch Compilation Error with Eclipse
If you want to find out the java class corresponding to any command, just peek inside the src/bin/nutch script; at the bottom you will find a switch case with a branch corresponding to each command. For 2.x, here are the important classes: inject - org.apache.nutch.crawl.InjectorJob generate - org.apache.nutch.crawl.GeneratorJob fetch - org.apache.nutch.fetcher.FetcherJob parse - org.apache.nutch.parse.ParserJob updatedb - org.apache.nutch.crawl.DbUpdaterJob Create a separate launcher for each of these. Running these without any input parameters will show you the usage of each command.
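The relevant part of the bin/nutch script is a chain of elif branches mapping each command to its class, roughly of this shape (a sketch, not the exact script contents):

   elif [ "$COMMAND" = "inject" ] ; then
     CLASS=org.apache.nutch.crawl.InjectorJob
   elif [ "$COMMAND" = "generate" ] ; then
     CLASS=org.apache.nutch.crawl.GeneratorJob
   elif [ "$COMMAND" = "fetch" ] ; then
     CLASS=org.apache.nutch.fetcher.FetcherJob

so an Eclipse launcher for, say, inject simply uses the class on the right, org.apache.nutch.crawl.InjectorJob, as its main class.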
Re: Issues on Compiling Nutch 2.x with Eclipse
Hi Tony, That tutorial is based on some earlier nutch version. Please follow http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse. There have been recent changes to that wiki page and those new steps will take care of getting automation.jar and the other dependencies in place. On Sun, Jun 9, 2013 at 11:58 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi, The last try I made was with this tutorial: https://sites.google.com/site/profilerajanimaski/webcrawlers/run-nutch-in-eclipse After following it word for word (which didn't work for me), I made some modifications to it: for step 11 I added 'bin', 'gora', 'java', 'test', 'testprocess', 'testresources'. And for step 14 I couldn't find 'src/plugin/url-filter-automation/lib/automation.jar' in my source. And when I try to run the main 'Crawler' project it says there are errors and gives me the option to proceed with errors, and when I proceed with errors I am getting this error: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class. InjectorJob: total number of urls rejected by filters: 0 InjectorJob: total number of urls injected after normalization and filtering: 0 Exception in thread main java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local_0002... . So please help me with what I am doing wrong here, or guide me to a tutorial which works. If the latest Nutch 2.2 source doesn't work with these tutorials, then which version of 2.x will work, and how? Thanks. Tony On Mon, Jun 10, 2013 at 7:20 AM, Tejas Patil tejas.patil...@gmail.com wrote: Could you try closing and re-opening eclipse and then letting eclipse rebuild the workspace. BTW: On which packages / classes do you see red dots? On Sun, Jun 9, 2013 at 9:23 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tony, This source has literally just been released. The tutorial on the Nutch wiki has also just been updated but you need to follow it closely and pay attention to each step. It sounds like the red dots problem you're having is explained in the 2nd to last bullet point below http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse Also, you've not actually said what went wrong! Lewis On Sunday, June 9, 2013, Tony Mullins tonymullins...@gmail.com wrote: Hi, I am new to Nutch. I am trying to use Nutch with Cassandra and have successfully built Nutch 2.x (http://svn.apache.org/repos/asf/nutch/branches/2.x/). But I get errors (different errors after following different tutorials) when I try to run it directly from Eclipse (I am on CentOS 6.4). I have tried to follow these tutorials to run the Nutch source from Eclipse but to no avail: http://wiki.apache.org/nutch/RunNutchInEclipse run nutch in eclipse | profilerajanimaski http://jarpit83.blogspot.com/2012/07/configuring-nutch-in-eclipse.html http://techvineyard.blogspot.com/2010/12/build-nutch-20.html Whatever I do, I get red * on my source and it doesn't get run by Eclipse, but it always builds successfully using Ant. Plaaase help me here, could anyone please guide me to a single web tutorial which could actually help me compile and run the latest Nutch 2.x with Eclipse (Juno) on CentOS. Thanksss. Tony. -- *Lewis*
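In terminal form, the checkout-and-import flow from that wiki section boils down to something like this (a sketch, assuming Ant is installed and the 2.x branch is wanted):

   svn co http://svn.apache.org/repos/asf/nutch/branches/2.x/ nutch-2.x
   cd nutch-2.x
   ant eclipse

after which the project is imported via File > Import > Existing Projects into Workspace.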
Re: Issues on Compiling Nutch 2.x with Eclipse
I have created a google doc [0] with several snapshots describing how to set up nutch 2.x + eclipse. This is different from the one on the wiki page and tailored for Nutch 2.x. Please try it out and let us know if you still have issues with it. Based on your comments, I will add the same to the nutch wiki. [0] : https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing On Mon, Jun 10, 2013 at 11:32 AM, Tejas Patil tejas.patil...@gmail.com wrote: yes. - Close the project in eclipse. Right click on the project, click on Properties and get the location of the project. - Go to that location in a terminal - Run 'ant eclipse'. (Note that you need to have Apache Ant (http://ant.apache.org/manual/index.html) installed and configured) After going command line, you might as well do this: Specify the GORA backend in nutch-site.xml, uncomment its dependency in ivy/ivy.xml and ensure that the store you selected is set as the default datastore in gora.properties On Mon, Jun 10, 2013 at 11:21 AM, Tony Mullins tonymullins...@gmail.com wrote: Hi, So the latest Nutch 2.x includes Tejas' patch (https://issues.apache.org/jira/browse/NUTCH-1577), meaning if I have the latest source then it already has that patch. Now can someone please help me with what is meant by the 2nd last step, 'Run 'ant eclipse'', on http://wiki.apache.org/nutch/RunNutchInEclipse. Do I need to go to the location where the source is and give the ant command 'ant -f build.xml', or is it something else? And after refreshing the source, will Eclipse then compile and run my code? Thanks, Tony On Mon, Jun 10, 2013 at 6:56 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi Lewis, I understand that there may be something wrong on my end. And as I said, I get different errors on running Nutch 2.x with Eclipse after following different tutorials. My background is in .NET and I might just move to Java, just because of this project (Nutch). But at the moment I am having a difficult time understanding the 'setup/configuration' required to run Nutch in Eclipse. When you say '...*you may find it convenient to patch your dist with Tejas' Eclipse ant target and simply run 'ant eclipse' from within your terminal prior to doing a file, import, existing projects in to workspace from within Eclipse..*.', which patch do I need to get and how do I apply it? And by running 'ant eclipse', do you mean dropping build.xml into the Ant window in Eclipse, OR building the Nutch source by using the ant -f build.xml command in a terminal? (By the way, I have done both and both successfully build the source, but eclipse doesn't run the source.) So could you please guide me here in more detail; I would be really grateful to you and the Nutch community. Thanks, Tony. On Mon, Jun 10, 2013 at 6:38 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tony, These issues stem from your environment not being correct. I, like many others, have been able to DEBUG and develop the Nutch 1.7 and 2.x series from within Eclipse. As you are working with 2.x source, you may find it convenient to patch your dist with Tejas' Eclipse ant target and simply run 'ant eclipse' from within your terminal prior to doing a file, import, existing projects into workspace from within Eclipse. I can guarantee you, the reason the tutorial is on the Nutch wiki is because at some stage, someone (many many people), somewhere has found it useful for developing Nutch in Eclipse. I don't want to sound like a balloon here, but your java security exceptions are not a problem with Nutch... it's your environment. hth On Monday, June 10, 2013, Tony Mullins tonymullins...@gmail.com wrote: Hi, Ok, now I have followed this tutorial word for word: http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse. After getting the new 2.2 source, I built it using Ant (which was successful), then set the configurations, commented out the 'hsqldb' dependency and uncommented the cassandra dependency (as I want to run it against cassandra). After doing all this, when I run the code from eclipse I get the error: Exception in thread main java.lang.SecurityException: Prohibited package name: java.org.apache.nutch.crawl at java.lang.ClassLoader.preDefineClass(ClassLoader.java:649) at java.lang.ClassLoader.defineClass(ClassLoader.java:785) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) and have red '*' all over my code. Please see the attached image. Now what do I do? Could anyone please tell me whether it is even possible to compile/run/debug the latest Nutch 2.x branch from Eclipse? I need help here... Tony !!! On Mon, Jun 10, 2013 at 12:15 PM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Tony
Re: Nutch Compilation Error with Eclipse
I have created a google doc [0] with several snapshots describing how to set up nutch 2.x + eclipse. This is different from the one on the wiki page and tailored for Nutch 2.x. Please try it out and let us know if you still have issues with it. Based on your comments, I will add the same to the nutch wiki. [0] : https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing On Mon, Jun 10, 2013 at 6:23 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, It is (IMHO) kind of fruitless running the crawl class (which is deprecated now; we highly suggest you use and amend the /src/bin/crawl script for your usecase) within Eclipse. You will learn far more setting breakpoints within individual classes and watching them execute on that basis. I notice you've not provided a URL directory to the crawl argument anyway, so you will need to sort this one out. Best Lewis On Monday, June 10, 2013, Jamshaid Ashraf jamshaid...@gmail.com wrote: I'm performing the following tasks: Commands in the Arguments tab: Program Arguments=urls -dir crawl -depth 3 -topN 50 VM Arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log And then just running the code. Regards, Jamshaid On Mon, Jun 10, 2013 at 4:54 PM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi Which task do you try to launch? Benjamin On Mon, Jun 10, 2013 at 1:57 PM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, I am new to Nutch. I am trying to use Nutch with Cassandra and have successfully built Nutch 2.x, but it shows the following error when I run it from the latest eclipse: java.lang.NullPointerException at org.apache.avro.util.Utf8.<init>(Utf8.java:37) at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260). I will be grateful if someone can provide any help. Thanks. -- *Lewis*
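For reference, the bin/crawl script that Lewis recommends over the deprecated crawl class is invoked roughly like this (a sketch; the exact arguments vary by version, and for 2.x the second argument is a crawl id in the datastore rather than a directory):

   bin/crawl urls testCrawl http://localhost:8983/solr/ 2

i.e. seed directory, crawl id, Solr URL and number of rounds; the crawl id testCrawl and the Solr URL here are hypothetical placeholders.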
Re: Nutch Compilation Error with Eclipse
Hi Jamshaid, The simplified steps with snapshots are now added to the Nutch wiki [0]. It would be helpful if you could try those out and let us know of any improvements or corrections that you can think of. PS: A few images look shrunken. I will be fixing that soon. [0] : https://wiki.apache.org/nutch/RunNutchInEclipse
Re: Issues on Compiling Nutch 2.x with Eclipse
Hi Tony, The simplified steps with snapshots are now added to the Nutch wiki [0]. It would be helpful if you could try those out and let us know of any improvements or corrections that you can think of. PS: A few images look shrunken. I will be fixing that soon. [0] : https://wiki.apache.org/nutch/RunNutchInEclipse
Re: Error in NutchHadoopTutorial
Thanks Wahaj for the correction. The wiki page has been updated with the same. On Sat, Jun 8, 2013 at 1:23 AM, Wahaj Ali wahaj...@gmail.com wrote: Hello, Just wanted to bring to your notice that there is a slight error in the NutchHadoopTutorial (http://wiki.apache.org/nutch/NutchHadoopTutorial). The command given under Performing a Nutch Crawl is: hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir urls -depth 3 -topN 5 It should be: hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5 This is also consistent with the immediately following line, which says: We are using the nutch crawl command. The urls dir is the urls directory that we added to the distributed filesystem. The -dir crawl is the output directory. Regards, Wahaj
Re: Unable to crawl google search results
Do you mean to turn off the robots processing? See the comment by Andrzej over [0]: The goal of Nutch is to implement a well-behaved crawler that obeys robot rules and netiquette. Your patch simply disables these control mechanisms. If it works for you and you can risk the wrath of webmasters, that's fine, you are free to use this patch - but Nutch as a project cannot encourage such practice. [0] : https://issues.apache.org/jira/browse/NUTCH-938 On Tue, Jun 4, 2013 at 2:58 PM, Yves S. Garret yoursurrogate...@gmail.com wrote: One more question: is it ever a good idea to set the property protocol.plugin.check.robots in nutch-site.xml to false? On Tue, Jun 4, 2013 at 5:30 PM, Yves S. Garret yoursurrogate...@gmail.com wrote: Got another issue. When I run my crawler over google search results, I see _nothing_ in my HBase table... why? This is what I'm trying to crawl: https://www.google.com/#output=search&sclient=psy-ab&q=xbox&oq=xbox&gs_l=hp.3..0l4.648.1180.0.1354.4.4.0.0.0.0.213.547.0j2j1.3.0...0.0...1c.1.15.psy-ab.jd107GllWZw&pbx=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47380653,d.eWU&fp=13d973d49a29d61d&biw=1280&bih=635 Here are my logs: http://bin.cakephp.org/view/1619245280 Here is my $NUTCH_HOME/conf/nutch-site.xml: http://bin.cakephp.org/view/1304119856 And the output that I see when I run the crawler: http://bin.cakephp.org/view/260103467 In nutch-site.xml, I have all of the needed plugin.includes, I believe...
Re: [REQUEST] (NUTCH-1569) Upgrade 2.x to Gora 0.3
CDH4 ?? Nope. We support Apache Hadoop only and give no guarantees for any of the commercial Hadoop distributions out there. On Mon, Jun 3, 2013 at 7:08 AM, adfel70 adfe...@gmail.com wrote: Hi, does this patch solve the issue with CDH4 and hbase? -- View this message in context: http://lucene.472066.n3.nabble.com/REQUEST-NUTCH-1569-Upgrade-2-x-to-Gora-0-3-tp4064544p4067815.html Sent from the Nutch - User mailing list archive at Nabble.com.