Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.
Just wanted to update everyone: the single fetch map task was occurring because Generator.java has logic around the MRv1 property *mapred.job.tracker*. Since I am running on YARN I had to change that logic, and now multiple fetch tasks operate on a single segment. I had also misunderstood that multiple segments would need to be generated to achieve parallelism; that does not seem to be the case. Parallelism at fetch time is achieved by having multiple fetch map tasks operate on a single segment. Thanks everyone for your help on resolving this issue.

On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan mera...@gmail.com wrote:

Folks,

As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN cluster. In order to scale, I need to fetch concurrently with multiple map tasks on multiple nodes. I thought the first step would be to generate multiple segments in the Generate phase so that multiple fetch map tasks can operate in parallel, and I have made the following changes to do that, but so far I have been unsuccessful.

I added the *maxNumSegments* and *numFetchers* parameters to the generate call in the *bin/crawl* script, as shown below (here $numFetchers has a value of 15):

    $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter

*generate.max.count*, *generate.count.mode* and *topN* are all left at their defaults; I am not providing any values for them.

The crawldb status before the Generate phase is shown below. It indicates that the number of unfetched URLs is more than *75 million*, so it is not the case that there are too few URLs for Generate to produce multiple segments.

CrawlDB status
    db_fetched=318708
    db_gone=4774
    db_notmodified=2274
    db_redir_perm=2253
    db_redir_temp=2527
    db_unfetched=7524

However, I consistently see this message in the logs during the generate phase:

    Generator: jobtracker is 'local', generating exactly one partition.

Is this one partition referring to the single segment that is going to be generated? If so, how do I address this? I feel like I have exhausted all the options, but I am unable to have the Generate phase generate more than one segment at a time. Can someone let me know if there is anything else I should be trying here?

*Thanks and any help is much appreciated!*
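For reference, the guard in question looks roughly like the following in the 1.x Generator.java (quoted from memory, so treat the exact wording as approximate). On Hadoop 2 / YARN the deprecated *mapred.job.tracker* property can still resolve to its default value "local", which would explain why the branch fires even on a real cluster:

    // Generator.java (Nutch 1.x), inside generate():
    // the number of fetch lists is forced to 1 whenever the old MRv1
    // job tracker address looks like the local job runner.
    if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) {
      // override
      LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
      numLists = 1;
    }

Removing or adapting this guard (or otherwise making sure the -numFetchers value is honoured) is what lets several fetch map tasks partition the work of a single segment, as described above.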
Solr Indexer Reduce Tasks fail to report status
Hello,

I have been running Nutch 1.9 on Hadoop 1.2.1 using the deploy/bin/crawl script for a little while with no problems. However, I just increased the scope of the crawl pretty significantly, and now *most* of my Indexer jobs are failing on the reduce task with the error "Task attempt_201409241419_0046_r_00_3 failed to report status for 600 seconds. Killing!". From the TaskTracker logs, the main issue seems to be "Caused by: java.io.IOException: Connection reset by peer".

I found suggestions that these errors could be caused by somaxconn being too low, so I increased it from 128 to 256 on the node running Solr and the JobTracker; it didn't help. I also bumped the memory for MR tasks up to 1024m from 700-something, which doesn't seem to have helped either (the timeout and heap properties involved are sketched after the log excerpt below).

Has anyone seen this before, or have any idea what could cause it? Here is the relevant excerpt from the TT logs:

2014-09-25 00:40:25,580 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201409241419_0033_m_18_0,0) failed :
org.mortbay.jetty.EofException
    at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)
    at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551)
    at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572)
    at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
    at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651)
    at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:4125)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:914)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
    at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:170)
    at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
    at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:725)
    ... 27 more
2014-09-25 00:40:25,580 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201409241419_0033_m_18_0,0) failed :
org.mortbay.jetty.EofException
    [the same EofException stack trace repeats]
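For what it is worth, the two knobs mentioned above correspond to the following Hadoop 1.x properties. This is only a sketch of where the 600-second limit and the child heap size come from; the values are illustrative assumptions, not a recommended fix for the underlying connection resets:

    <!-- mapred-site.xml (Hadoop 1.2.1), illustrative values only -->
    <property>
      <name>mapred.task.timeout</name>
      <!-- default is 600000 ms, i.e. the "600 seconds" after which tasks are killed -->
      <value>1800000</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <!-- heap given to each map/reduce child JVM -->
      <value>-Xmx1024m</value>
    </property>

Raising the timeout only hides the symptom if the reducer genuinely stops reporting progress while indexing to Solr.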
Crawled data not inserting in the tables
Hi Gora gurus,

I am trying to run a crawl starting from 12 seed URLs, using the Gora Cassandra mapping to store the crawled data. I can confirm that all 12 URLs pass the filters and are injected, but after running the generate, fetch and parse jobs there are only 3 entries in column family f. I am not sure what I am doing wrong, and the logs have not yielded anything relevant. What should I be looking at? Any advice would be gratefully appreciated.

Thanks,
Kartik
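If it helps as a first sanity check: assuming a Nutch 2.x / Gora setup (since the Cassandra mapping is mentioned), you can ask Nutch itself how many rows reached the web page store instead of querying Cassandra directly. This is only a sketch, and the dump directory name is a made-up placeholder:

    # row counts by fetch status in the Gora-backed store
    bin/nutch readdb -stats
    # dump the stored rows to a local directory for inspection
    bin/nutch readdb -dump webpage_dump

Comparing the injected, generated and fetched counts should show at which step the other nine URLs drop out.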
Re: Question about Nutch Wicket
Hi Nima,

I have never used the Nutch web admin myself. The web admin you are using is very old; maybe you can use our brand-new web admin development (https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841). It has just been committed to trunk and the 2.x branch.

As for your question: IMHO the start URLs are your seed list, which Nutch accepts as a folder or a text file. The limit URLs determine which new URLs the crawler will accept for the next steps once it starts from the seed list. You can actually write regex rules in Nutch: for example, if you crawl the home page of a news site but only want the sports pages, you can write a regex rule that accepts only the sports URLs (a sketch follows this message). For the reverse situation you can use the exclude URLs.

Talat.

Hello Everyone:

I am following the directions exactly, word for word, in this tutorial: https://github.com/101tec/nutch/wiki/admin-url-upload

My question is: what is the difference between the start and limit URLs? From the wiki, the limit URLs seem to be a flat list of URLs we want to fetch, but then why have a start URL to begin with? I also noticed that when you do not have a limit URL, you get the following exception (note: I am using nutch-gui-0.5-dev). When you start a crawl, shouldn't some sort of pop-up box appear saying that you need a limit URL, so you don't get this exception? I can help work on this.

14/09/25 20:34:37 INFO [Thread-3071] (Fetcher.java:970) - Fetcher: done
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:217) - bw update: starting
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:218) - bw update: db: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/crawldb
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:219) - bw update: bwdb: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:220) - bw update: segments: [/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/segments/20140925203433]
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:223) - bw update: wrapping started.
14/09/25 20:34:37 WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/09/25 20:34:38 INFO [Thread-3071] (BWUpdateDb.java:248) - bw update: filtering started.
14/09/25 20:34:38 WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/09/25 20:34:38 WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb/current
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.bw.BWUpdateDb.update(BWUpdateDb.java:263)
    at org.apache.nutch.crawl.CrawlTool.crawl(CrawlTool.java:135)
    at org.apache.nutch.admin.crawl.StartCrawlRunnable.run(StartCrawlRunnable.java:52)
    at java.lang.Thread.run(Thread.java:744)

--
Nima Falaki
Software Engineer
nfal...@popsugar.com
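To make the regex example above concrete, a rule set along the following lines would keep only the sports section of a news site. The host and path are hypothetical, and the +/- prefix syntax is the one used by Nutch's conf/regex-urlfilter.txt:

    # accept only the sports section of the (made-up) news site
    +^https?://www\.example-news\.com/sports/
    # reject everything else
    -.

Rules are applied top to bottom and the first match wins, so the catch-all reject must come last.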
Re: Question about Nutch Wicket
Thanks Talat. Just wondering, is there a set of instructions I can use to get the Nutch web admin tool up and running?

Nima

--
Nima Falaki
Software Engineer
nfal...@popsugar.com
Re: Question about Nutch Wicket
Sounds great, it will be good for the web UI. If you write it, I can review it and add it to the wiki ;)

Talat

On Sep 26, 2014 8:26 AM, Nima Falaki nfal...@popsugar.com wrote:

Never mind, I figured it out: I had to run the webapp command. There should be a wiki page to document this. I could volunteer to write one, if nobody else is going to?

--
Nima Falaki
Software Engineer
nfal...@popsugar.com
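In case it saves the next person a search, the steps alluded to above are roughly the following with a current trunk/2.x build; the ports are the defaults as far as I recall, so treat this as a sketch rather than official documentation:

    # from the runtime/local directory of a trunk (or 2.x) build
    bin/nutch startserver    # REST API, default port 8081
    bin/nutch webapp         # new web UI from NUTCH-841, default port 8080
    # then browse to http://localhost:8080/

A proper wiki page, as suggested above, would still be the better place for this.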