Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-09-25 Thread Meraj A. Khan
Just wanted to update and let everyone know: the single map task for fetch was
occurring because Generator.java has logic around the MRv1 property
*mapred.job.tracker*. Since I am running this on YARN, I changed that logic,
and now multiple fetch tasks operate on a single segment.
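
For reference, the check in question looks roughly like the snippet below
(paraphrased, not an exact copy of Generator.java): when *mapred.job.tracker*
is 'local' the Generator forces a single partition, which is misleading on
YARN because that MRv1 property is typically unset and therefore defaults to
'local'.

  // sketch of the MRv1-era check in Generator.java (paraphrased)
  if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) {
    // in local mode there is only one map task, so only one fetch list is kept
    LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
    numLists = 1;
  }

In my case I adjusted this check; setting *mapred.job.tracker* to a non-'local'
value might also avoid it, but changing the code is what I actually verified.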

Also, I had misunderstood that multiple segments would need to be generated to
achieve parallelism. That does not seem to be the case: parallelism at fetch
time is achieved by having multiple fetch tasks operate on a single segment.

Thanks everyone for your help on resolving this issue.



On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan mera...@gmail.com wrote:

 Folks,

 As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
 cluster.

 In order to scale, I need to fetch concurrently with multiple map tasks on
 multiple nodes. I think the first step is to generate multiple segments in
 the generate phase so that multiple fetch map tasks can operate in parallel.
 To generate multiple segments at Generate time I have made the following
 changes, but unfortunately I have been unsuccessful so far.

 I have tweaked the following parameters in bin/crawl: I added the
 *maxNumSegments* and *numFetchers* parameters to the generate call in the
 *bin/crawl* script, as can be seen below.


 $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
 -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter

 (Here $numFetchers has a value of 15)

 The *generate.max.count*, *generate.count.mode*, and *topN* parameters are
 all at their default values, meaning I am not providing any values for them.

 Also, the crawldb status before the Generate phase is shown below. It shows
 that the number of unfetched URLs is more than *75 million*, so it's not
 that there are too few URLs for Generate to produce multiple segments.

 CrawlDB status
 db_fetched=318708
 db_gone=4774
 db_notmodified=2274
 db_redir_perm=2253
 db_redir_temp=2527
 db_unfetched=7524

 However, I do see this message in the logs consistently during the generate
 phase:

  Generator: jobtracker is 'local', generating exactly one partition.

 Is this one partition referring to the single segment that is going to be
 generated? If so, how do I address this?


 I feel like I have exhausted all the options, but I am unable to get the
 Generate phase to produce more than one segment at a time.

 Can someone let me know if there is anything else that I should be trying
 here?

 Thanks, and any help is much appreciated!





Solr Indexer Reduce Tasks fail to report status

2014-09-25 Thread Jonathan Cooper-Ellis
Hello,

I have been running Nutch 1.9 on Hadoop 1.2.1 using the deploy/bin/crawl
script for a little while with no problems. However, I just increased the
scope of the crawl pretty significantly, and now *most* of my Indexer jobs
are failing on the reduce task with the error "Task
attempt_201409241419_0046_r_00_3 failed to report status for 600
seconds. Killing!". From the TT logs, the main issue seems to be "Caused by:
java.io.IOException: Connection reset by peer".

I found some suggestions that these errors could be caused by somaxconn
being too low, so I increased it from 128 to 256 on the node running Solr and
the JT, but it didn't help. I also bumped the memory for MR tasks up to
1024m from 700-something, which doesn't seem to have helped either.
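
For reference, the 600-second figure is Hadoop's default task timeout (the
mapred.task.timeout property, 600000 ms in Hadoop 1.x): a task that does not
report progress within that window gets killed. A minimal sketch of the
general mechanism, assuming a reduce that spends a long time on external I/O;
the class and the sendToSolr() call are illustrative, not Nutch's actual
indexer code:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Any reduce that performs slow external I/O (e.g. posting documents to Solr)
  // needs to call reporter.progress() periodically, otherwise the TaskTracker
  // kills it once mapred.task.timeout elapses.
  public class SlowIndexingReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      while (values.hasNext()) {
        Text doc = values.next();
        sendToSolr(key, doc);   // hypothetical slow call to the search backend
        reporter.progress();    // tell the TaskTracker the task is still alive
      }
    }

    private void sendToSolr(Text key, Text doc) {
      // placeholder for the real indexing call
    }
  }

Raising mapred.task.timeout (for example in mapred-site.xml) would presumably
buy more time, but it would not explain the "Connection reset by peer" during
getMapOutput, so it may only mask the underlying problem.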

Has anyone seen this before? Or have any idea what could cause this?

Here is the relevant excerpt from the TT logs:

2014-09-25 00:40:25,580 WARN org.apache.hadoop.mapred.TaskTracker:
getMapOutput(attempt_201409241419_0033_m_18_0,0) failed :
org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)
at
org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551)
at
org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572)
at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
at
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651)
at
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580)
at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:4125)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:914)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:170)
at
org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:725)
... 27 more

2014-09-25 00:40:25,580 WARN org.mortbay.log: Committed before 410
getMapOutput(attempt_201409241419_0033_m_18_0,0) failed :
org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)
at
org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551)
at
org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572)
at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
at
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651)
at
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580)
at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:4125)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:914)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 

Crawled data not inserting in the tables

2014-09-25 Thread Krishnanand, Kartik
Hi, Gora gurus,

I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora
Cassandra mapping to store the crawled data.

I can confirm that all 12 URLs pass the URL filters and are injected, but
after running the generate, fetch, and parse jobs there are only 3 entries in
column family 'f'.

I am not sure what I am doing wrong. The logs have not yielded anything 
relevant. What should I be looking at?

Any advice would be greatly appreciated.

Thanks,

Kartik



Re: Question about Nutch Wicket

2014-09-25 Thread Talat Uyarer
Hi Nima,

I have never used the Nutch web admin. The web admin you are using is very old.
Maybe you can use our brand-new web admin development
(https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841). It has
just been committed to the trunk and 2.x branches.

To answer your question: IMHO the start URLs are your seed list, which Nutch
accepts as a folder or a text file.

The limit URLs determine which newly discovered URLs will be accepted for the
next steps once the crawler starts from the seed list. You can actually write
regex rules in Nutch for this. For example, if you crawl the home page of a
news website but only want the sports URLs, you can write a regex rule that
accepts only the sports URLs (see the example below).

For reverse situations you can use Exclude URLs.
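
As an illustration, two lines of the kind you could put in
conf/regex-urlfilter.txt (the host name is made up; rules are matched top to
bottom, the first matching rule wins, '+' accepts and '-' rejects):

  # accept only the sports section of the (hypothetical) news site
  +^https?://www\.example-news\.com/sports/
  # reject everything else on that host
  -^https?://www\.example-news\.com/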

Talat.
 Hello Everyone:

I am following the directions exactly word for word in this tutorial

https://github.com/101tec/nutch/wiki/admin-url-upload

My question is: what is the difference between the start and limit URLs?
From the wiki I saw that the limit URL seems to be a flat list of URLs we
want to fetch, but then why have a start URL to begin with?

Also I noticed that when you do not have a limit URL, you get the following
exception (note: I am using nutch-gui-0.5-dev).

When you start a crawl, shouldn't there be some sort of pop-up box that
appears saying that you need a limit URL, so you don't get this
exception? I can help work on this.


14/09/25 20:34:37  INFO [Thread-3071] (Fetcher.java:970) - Fetcher: done

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:217) - bw update:
starting

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:218) - bw update:
db: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/crawldb

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:219) - bw update:
bwdb: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:220) - bw update:
segments:
[/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/segments/20140925203433]

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:223) - bw update:
wrapping started.

14/09/25 20:34:37  WARN [Thread-3071] (JobClient.java:547) - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.

14/09/25 20:34:38  INFO [Thread-3071] (BWUpdateDb.java:248) - bw update:
filtering started.

14/09/25 20:34:38  WARN [Thread-3071] (JobClient.java:547) - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.

14/09/25 20:34:38  WARN [Thread-3071] (StartCrawlRunnable.java:57) - can
not start crawl.14/09/25 20:34:38  WARN [Thread-3071]
(StartCrawlRunnable.java:57) - can not start crawl.

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb/current

at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)

at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)

at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)

at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)

at org.apache.nutch.crawl.bw.BWUpdateDb.update(BWUpdateDb.java:263)

at org.apache.nutch.crawl.CrawlTool.crawl(CrawlTool.java:135)

at
org.apache.nutch.admin.crawl.StartCrawlRunnable.run(StartCrawlRunnable.java:52)

at java.lang.Thread.run(Thread.java:744)


--



Nima Falaki
Software Engineer
nfal...@popsugar.com


Re: Question about Nutch Wicket

2014-09-25 Thread Nima Falaki
Thanks Talat. Just wondering, is there a set of instructions that I can use
to get the Nutch web admin tool up and running?

Nima



-- 



Nima Falaki
Software Engineer
nfal...@popsugar.com


Re: Question about Nutch Wicket

2014-09-25 Thread Talat Uyarer
Sounds great, it will be good for the web UI. If you write it, I can review it
and add it to the wiki ;)

Talat
On Sep 26, 2014 8:26 AM, Nima Falaki nfal...@popsugar.com wrote:

 Never mind, I figured it out: I had to run the webapp command. There should be
 a wiki page to document this. I could volunteer to write one, if nobody else
 is going to do it?


