Re: Nutch in production
Hi Sachin,

Just a suggestion here - you can use Apache Kafka to generate and catch events which are mapped to incoming crawl requests, crawl status and much more. I have created a prototype for a production queue [0] which runs on top of a supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler

P.S. - There can be many solutions to this. I am just giving one. :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju <sachi...@mstack.com> wrote:
> Hi,
> I was experimenting with some crawl cycles with Nutch and would like to set up
> a distributed crawl environment. But I wonder how I can trigger Nutch for
> incoming crawl requests in a production system. I read about the Nutch REST
> API. Is that the only real option that I have? Or can I run Nutch as a
> continuously running distributed server by any other option?
>
> My preferred Nutch version is 1.12.
>
> Regards,
> Sachin Shaju
> sachi...@mstack.com
> +919539887554
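For anyone looking for the concrete shape of this pattern, here is a minimal Python sketch of the event flow. This is an in-memory stand-in, not Kafka itself: in a real deployment the two queues would be Kafka topics, and the worker would shell out to bin/crawl or call the Nutch REST API. The event field names here are illustrative assumptions, not part of any Nutch or Kafka API.

```python
import json
import queue

# In-memory stand-ins for two Kafka topics; in a real deployment these
# would be Kafka topics read/written by a Kafka client library.
crawl_requests = queue.Queue()
crawl_status = queue.Queue()

def submit_crawl_request(seed_url, depth=1):
    """Producer side: e.g. an API endpoint publishes a crawl request event."""
    event = {"type": "CRAWL_REQUEST", "seed": seed_url, "depth": depth}
    crawl_requests.put(json.dumps(event))

def crawl_worker():
    """Consumer side: a long-running worker polls the request topic,
    runs a Nutch crawl cycle, and publishes a status event."""
    event = json.loads(crawl_requests.get())
    # Here one would invoke bin/crawl or the Nutch REST API for event["seed"].
    status = {"type": "CRAWL_STATUS", "seed": event["seed"], "state": "FETCHED"}
    crawl_status.put(json.dumps(status))
    return status

submit_crawl_request("http://example.com/")
result = crawl_worker()
print(result["state"])  # FETCHED
```

The point of the indirection is that crawl submission and crawl execution are decoupled: any number of workers can consume the request topic, which is what makes the setup scale on a cluster.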
Re: Error unknown protocol
Hi Nana,

Maybe you can use a URL regex filter to exclude these. The following regex-urlfilter rule will allow only http(s) links to be crawled:

+^https?://

Thanks & Regards,
Karanjeet Singh
USC

On Mon, Jun 6, 2016 at 7:13 PM, Nana Pandiawan <nana.pandia...@solusi247.com.invalid> wrote:
> Hi Furkan,
> Thanks for your response.
>
> The error occurred when Nutch found a data URI scheme like the one below.
>
> I just crawled a random page and got the error. How can I skip it so that
> the crawling process can be continued by Nutch?
>
> On 06/06/16 17:25, Furkan KAMACI wrote:
>> Hi Nana,
>>
>> It seems that your problem may be related to base64 data. Here is a link
>> about it:
>>
>> http://stackoverflow.com/questions/12458390/embed-java-applet-through-url-data
>>
>> Could you share the pages that you get the error for?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Mon, Jun 6, 2016 at 4:26 AM, Nana Pandiawan <
>> nana.pandia...@solusi247.com.invalid> wrote:
>>
>>> Hi All,
>>>
>>> I'm getting the following errors when running updatedb. Can someone tell
>>> me what's going wrong and how to solve it?
>>> Thanks.
>>>
>>> 16/06/04 00:58:42 INFO mapreduce.Job:  map 0% reduce 0%
>>> 16/06/04 00:59:27 INFO mapreduce.Job: Task Id : attempt_1464314319848_0309_m_00_0, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: t00
>>>     at java.net.URL.<init>(URL.java:603)
>>>     at java.net.URL.<init>(URL.java:493)
>>>     at java.net.URL.<init>(URL.java:442)
>>>     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>>>     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>>>     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>> 16/06/04 01:00:14 INFO mapreduce.Job: Task Id : attempt_1464314319848_0309_m_00_1, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: t00
>>>     (identical stack trace as above)
>>> 16/06/04 01:00:42 INFO mapreduce.Job: Task Id : attempt_1464314319848_0309_m_01_0, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: data
>>>
>>> I use Apache Nutch 2.3.1 and HBase as the backend.
>>>
>>> Regards,
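A regex URL filter as suggested earlier in this thread drops such data: URIs before they ever reach updatedb. Here is a minimal Python sketch of how ordered accept/reject rules behave; this is a simplified re-implementation for illustration, not Nutch's actual RegexURLFilter code.

```python
import re

# Rules are evaluated top to bottom, like conf/regex-urlfilter.txt:
# '+' accepts a URL on match, '-' rejects it; first match wins.
rules = [
    ("+", re.compile(r"^https?://")),  # allow only http(s) URLs
    ("-", re.compile(r".")),           # reject everything else (data:, ftp:, ...)
]

def filter_url(url):
    """Return the URL if accepted, None if filtered out."""
    for sign, pattern in rules:
        if pattern.search(url):
            return url if sign == "+" else None
    return None  # no rule matched: filtered out

# http(s) links pass through untouched:
accepted = filter_url("http://example.com/page")
# data: URIs like the one causing the MalformedURLException are dropped:
rejected = filter_url("data:application/x-java-jnlp-file;base64,PD94bWw=")
print(accepted, rejected)  # http://example.com/page None
```

Note that Nutch applies its filters with find() semantics (a match anywhere in the URL), which is why anchoring the accept rule with `^` matters.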
Re: [ANNOUNCE] New Nutch committer and PMC - Karanjeet Singh
Hi Sebastian,

Thanks for the invitation and the warm welcome.

Hello everyone,

I am glad to be on board and to have this opportunity to work with all of you. I am a graduate student at the University of Southern California (USC) pursuing my Master's in Computer Science. Prior to this, I was working as a web developer at Computer Sciences Corporation (CSC), India. At CSC, I developed applications for a global payments technology company adhering to PCI DSS standards. And now, I am starting my summer internship at NASA JPL.

Last year, in 2015, I took a course named Information Retrieval (IR) under Prof. Chris Mattmann, where I got the opportunity to learn and work on Nutch 1.x. This was the time when I started working on some of its bugs. The semester ended but the interest did not, so I moved ahead working on Nutch plugins, particularly HtmlUnit and Selenium.

During this summer, I plan to make more contributions and help the community grow. I also plan to port the Nutch backend to Spark for improved performance and better after-crawl analysis. I am also interested in working on real-time crawl analysis in Nutch through a clean and easy-to-understand visual interface.

I am excited to be a part of this community!

Regards,
Karanjeet Singh
USC

On Sun, May 22, 2016 at 12:51 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Dear all,
>
> On behalf of the Nutch PMC it is my pleasure to announce
> that Karanjeet Singh has joined the Nutch team as committer
> and PMC member. Karanjeet, would you mind introducing
> yourself and telling the Nutch community about your relation
> to Apache Nutch, what you have done or plan to do, etc.?
>
> Congratulations and welcome on board!
>
> Regards,
> Sebastian
Re: Nutch generating fewer URLs for the fetcher to fetch (running in Hadoop mode)
Thanks, Sebastian. This is solved now.

I looked through the code and found that Nutch places a per-reducer limit on the number of generated URLs, defined by topN / number of reduce tasks. Please refer here [0]. I was running 16 reduce tasks with topN 1000, and since all URLs of a single host are partitioned to the same reducer, I got only 62 URLs (1000 / 16). I am interested to know the reason for this. Is it due to politeness?

[0]: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141

Regards,
Karanjeet Singh
USC

On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Hi,
>
> I didn't see anything wrong. Did you check whether
> CrawlDb entries are marked as "generated"
> by "_ngt_="? With generate.update.crawldb=true
> it may happen that after having run generate
> multiple times, only 62 unfetched and not-generated
> entries remain.
>
> Sebastian
>
> On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
>> Hello,
>>
>> I am trying to crawl a website using Nutch on a Hadoop cluster. I have
>> modified the crawl script to restrict the sizeFetchList to 1000 (which is
>> the topN value for the nutch generate command).
>>
>> However, Nutch is only generating 62 URLs while the unfetched URL count
>> is about 5,000. I am using the command below:
>>
>> nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 \
>>   -D mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 \
>>   -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false \
>>   -D mapreduce.map.output.compress=true \
>>   crawl/crawldb crawl/segments -topN 1000 -numFetchers 1 -noFilter
>>
>> Can anyone please look into this and let me know if I am missing something?
>> Please find the crawl configuration here [0].
>>
>> [0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
>>
>> Thanks & Regards,
>> Karanjeet Singh
>> USC
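The arithmetic behind the observed 62 URLs, as a quick sketch (the integer division mirrors the limit computation in Generator.java for Nutch 1.x):

```python
# Per-reducer generate limit in Nutch 1.x: topN divided by the number of
# reduce tasks, using integer division.
top_n = 1000
num_reduce_tasks = 16
limit = top_n // num_reduce_tasks
print(limit)  # 62

# Because generate partitions URLs by host by default, every URL of a
# single-host crawl lands in one reducer, so only `limit` URLs are
# generated in total per cycle instead of topN.
```

So for a single-host crawl, either lower the number of reduce tasks for the generate job or raise topN accordingly.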
Nutch generating fewer URLs for the fetcher to fetch (running in Hadoop mode)
Hello,

I am trying to crawl a website using Nutch on a Hadoop cluster. I have modified the crawl script to restrict the sizeFetchList to 1000 (which is the topN value for the nutch generate command).

However, Nutch is only generating 62 URLs while the unfetched URL count is about 5,000. I am using the command below:

nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 \
  -D mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 \
  -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false \
  -D mapreduce.map.output.compress=true \
  crawl/crawldb crawl/segments -topN 1000 -numFetchers 1 -noFilter

Can anyone please look into this and let me know if I am missing something? Please find the crawl configuration here [0].

[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf

Thanks & Regards,
Karanjeet Singh
USC
Re: Crawl Script Don't Want To Use -topn
Hi Manish,

If you are referring to the links retrieved from a page, I recommend having a look at the Nutch configuration properties "db.max.outlinks.per.page" and "db.max.inlinks". Hope it helps.

Thanks & Regards,
Karanjeet Singh
CS Graduate Student
University of Southern California
karan...@usc.edu

On Sun, Dec 20, 2015 at 8:33 PM, Manish Verma <m_ve...@apple.com> wrote:
> Hi,
>
> I am using Nutch 1.10 with the crawl script, and I see from the logs that
> it uses -topn 5. I want to consider all pages equally and crawl everything.
>
> Thanks,
> MV
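For reference, a nutch-site.xml fragment overriding these properties might look like the sketch below. The property names are the ones shipped in nutch-default.xml; the values are illustrative, with -1 conventionally meaning "no limit" for the outlink cap.

```xml
<!-- nutch-site.xml: overrides for outlink/inlink limits.
     Values here are illustrative, not recommendations. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Max outlinks stored per page; -1 removes the limit
  (the default caps this at 100).</description>
</property>
<property>
  <name>db.max.inlinks</name>
  <value>10000</value>
  <description>Max inlinks kept per URL in the LinkDb.</description>
</property>
```

Note this is separate from the crawl script's -topn: the outlink cap controls how many links are extracted per page, while topN controls how many URLs each generate cycle selects.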
Re: How to deploy Selenium on Server?
Hi Byzen,

I hope you have installed all the required libraries (Firefox, Xvfb) for Selenium on your remote server. Can you please share your logs (${NUTCH_HOME}/logs/hadoop.log) so we can get insight into this issue?

Thanks & Regards,
Karanjeet Singh
CS Graduate Student
University of Southern California
karan...@usc.edu

On Mon, Dec 21, 2015 at 4:54 AM, Baizhang Ma <baizhang...@gmail.com> wrote:
> Hi, everyone.
>
> I want to use the Selenium plugins to crawl dynamic content of pages. I
> deployed them as https://github.com/momer/nutch-selenium says, and they
> run normally on my local computer. However, the plugins don't work after
> I deploy them on the remote server. At first I thought it might need a
> display or desktop as in the local setup, so I installed a desktop on the
> server, but unfortunately it still can't work. Does anyone have ideas
> about this? Thanks very much!
>
> Best Regards,
> Byzen Ma
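For a typical headless server, the setup usually looks like the sketch below. Assumptions: a Debian/Ubuntu host and the Xvfb approach described in the nutch-selenium README; the package names and display number are illustrative, so adjust for your distribution.

```shell
# Install Firefox and the X virtual framebuffer (no desktop needed)
sudo apt-get install -y firefox xvfb

# Start a virtual display; Selenium-launched Firefox renders into it
Xvfb :99 -screen 0 1024x768x24 &
export DISPLAY=:99

# Run the crawl from the Nutch runtime; the browser now has a display
# to attach to even though the server has no monitor
bin/crawl urls/ crawl/ 1
```

The usual failure mode without this is Firefox refusing to start because $DISPLAY is unset, which shows up in ${NUTCH_HOME}/logs/hadoop.log as a Selenium/WebDriver startup error.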
Re: Configuring rotating agent in Nutch
I am facing the same problem here. I tried rebuilding it, but in the logs I can only see the agent name mentioned in the http.agent.name property. By $NUTCH_HOME/conf do you mean the runtime/local/conf directory? Also, can you please brief me on how the rotation works? Does the agent rotate after crawling some X links, and if so, can we configure that X?

--
View this message in context: http://lucene.472066.n3.nabble.com/Configuring-rotating-agent-in-Nutch-tp4231459p4231609.html
Sent from the Nutch - User mailing list archive at Nabble.com.
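For reference, agent rotation in Nutch 1.x is usually enabled through configuration along these lines. Assumption: the http.agent.rotate and http.agent.rotate.file properties as shipped in nutch-default.xml of recent 1.x releases; check your version's nutch-default.xml for the exact names, and note that to my knowledge the agent is picked per request rather than after a configurable number of pages.

```xml
<!-- nutch-site.xml sketch for rotating user agents (hedged: verify
     these property names against your version's nutch-default.xml). -->
<property>
  <name>http.agent.rotate</name>
  <value>true</value>
  <description>Rotate through a list of agent strings when fetching.</description>
</property>
<property>
  <name>http.agent.rotate.file</name>
  <value>agents.txt</value>
  <description>File on the classpath (e.g. in conf/) with one agent
  string per line; falls back to http.agent.name if missing.</description>
</property>
```

After editing, rebuild (ant runtime) or edit runtime/local/conf directly so the running copy actually picks up the change; editing only the top-level conf/ without rebuilding is a common reason the logs keep showing the single http.agent.name value.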