Re: Nutch in production

2016-09-29 Thread Karanjeet Singh
Hi Sachin,

Just a suggestion here - you can use Apache Kafka to produce and consume
events that map to incoming crawl requests, crawl status updates, and much
more.

I have created a prototype of a production crawl queue [0], which runs on
top of a supercomputer (TACC Wrangler) and is integrated with Kafka. Please
have a look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
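
For a concrete picture, here is a minimal sketch of the producer side of
such a setup, assuming a hypothetical "crawl-requests" topic and a local
broker (the topic name, broker address, and message format are
illustrative, not taken from the prototype above):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CrawlRequestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one crawl request; a consumer subscribed to the same
        // topic would pick it up and kick off a Nutch crawl round.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("crawl-requests",
                    "seed", "https://example.com/"));
        }
    }
}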

P.S. - There can be many solutions to this. I am just giving one.  :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju <sachi...@mstack.com> wrote:

> Hi,
>    I was experimenting with some crawl cycles in Nutch and would like to
> set up a distributed crawl environment. But I wonder how I can trigger
> Nutch on incoming crawl requests in a production system. I read about the
> Nutch REST API. Is that the only real option I have? Or can I run Nutch
> as a continuously running distributed server in some other way?
>
>  My preferred Nutch version is 1.12.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
> +919539887554
>



Re: Error unknown protocol

2016-06-07 Thread Karanjeet Singh
Hi Nana,

Maybe you can use the URL regex filter (conf/regex-urlfilter.txt) to
exclude these. The following rule allows only http(s) links to be crawled:

+^https?://
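
For completeness, a fuller regex-urlfilter.txt sketch (illustrative; the
filter applies its rules in order and the first match wins):

# accept http and https URLs
+^https?://
# reject everything else, including data: and mailto: URIs
-.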

Thanks & Regards,
Karanjeet Singh
USC

On Mon, Jun 6, 2016 at 7:13 PM, Nana Pandiawan <
nana.pandia...@solusi247.com.invalid> wrote:

> Hi Furkan,
> thanks for your response
>
> Did the error occur when Nutch found a data URI scheme like the one below?
>
> 
>
> I just crawled a random page and got the error.
> How can I skip it so that the crawling process can be continued by Nutch?
>
> On 06/06/16 17:25, Furkan KAMACI wrote:
>
>> Hi Nana,
>>
>> It seems that your problem may be related to base64 data. Here is a link
>> about it:
>>
>> http://stackoverflow.com/questions/12458390/embed-java-applet-through-url-data
>>
>> Could you share the pages that you get the error for?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Mon, Jun 6, 2016 at 4:26 AM, Nana Pandiawan <
>> nana.pandia...@solusi247.com.invalid> wrote:
>>
>>> Hi All,
>>>
>>> I'm getting the following errors when running updatedb. Can someone
>>> tell me what's going wrong and how to solve it?
>>> Thanks.
>>>
>>> 16/06/04 00:58:42 INFO mapreduce.Job:  map 0% reduce 0%
>>> 16/06/04 00:59:27 INFO mapreduce.Job: Task Id :
>>> attempt_1464314319848_0309_m_00_0, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: t00
>>>  at java.net.URL.<init>(URL.java:603)
>>>  at java.net.URL.<init>(URL.java:493)
>>>  at java.net.URL.<init>(URL.java:442)
>>>  at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>>>  at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>>>  at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>>>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>  at java.security.AccessController.doPrivileged(Native Method)
>>>  at javax.security.auth.Subject.doAs(Subject.java:415)
>>>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>> 16/06/04 01:00:14 INFO mapreduce.Job: Task Id :
>>> attempt_1464314319848_0309_m_00_1, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: t00
>>>  at java.net.URL.<init>(URL.java:603)
>>>  at java.net.URL.<init>(URL.java:493)
>>>  at java.net.URL.<init>(URL.java:442)
>>>  at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>>>  at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>>>  at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>>>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>  at java.security.AccessController.doPrivileged(Native Method)
>>>  at javax.security.auth.Subject.doAs(Subject.java:415)
>>>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>> 16/06/04 01:00:42 INFO mapreduce.Job: Task Id :
>>> attempt_1464314319848_0309_m_01_0, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: data
>>>
>>> I use Apache Nutch 2.3.1 with HBase as the backend.
>>> Regards,
>>>
>>>
>
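
For reference, the failure in the logs above is easy to reproduce outside
Nutch: the JDK's java.net.URL constructor throws MalformedURLException for
any scheme it has no handler for, which is exactly what happens when an
extracted outlink starts with "data:" (or a garbled scheme like "t00"). A
minimal sketch:

import java.net.MalformedURLException;
import java.net.URL;

public class UrlProtocolDemo {
    public static void main(String[] args) {
        String[] candidates = {
            "https://example.com/",
            "data:text/plain;base64,SGVsbG8="
        };
        for (String s : candidates) {
            try {
                // Throws MalformedURLException if no protocol handler exists
                URL u = new URL(s);
                System.out.println("ok:      " + u);
            } catch (MalformedURLException e) {
                // e.g. "unknown protocol: data" -- such links should be
                // filtered out before they reach updatedb
                System.out.println("skipped: " + e.getMessage());
            }
        }
    }
}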


Re: [ANNOUNCE] New Nutch committer and PMC - Karanjeet Singh

2016-05-23 Thread Karanjeet Singh
Hi Sebastian,

Thanks for the invitation and warm welcome.


Hello Everyone,

I am glad to be on board and to have this opportunity to work with all of
you.

I am a graduate student at the University of Southern California (USC),
pursuing my Master's in Computer Science. Prior to this, I worked as a web
developer at Computer Sciences Corporation (CSC), India, where I developed
applications for a global payments technology company adhering to PCI DSS.

And now, I am starting my summer internship at NASA JPL.

In 2015, I took a course on Information Retrieval (IR) under Prof. Chris
Mattmann, where I got the opportunity to learn and work on Nutch 1.x. That
is when I started working on some of its bugs. The semester ended, but the
interest did not, so I moved on to working on Nutch plugins, particularly
HtmlUnit and Selenium.

Over the summer, I plan to make more contributions and help the community
grow. I also plan to port the Nutch backend to Spark for improved
performance and better post-crawl analysis. In addition, I am interested in
working on real-time crawl analysis in Nutch through a clean and
easy-to-understand visual interface.

I am excited to be a part of this community!!!

Regards,
Karanjeet Singh
USC


On Sun, May 22, 2016 at 12:51 PM, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:

> Dear all,
>
> on behalf of the Nutch PMC it is my pleasure to announce
> that Karanjeet Singh has joined the Nutch team as committer
> and PMC member. Karanjeet, would you mind introducing
> yourself and telling the Nutch community about your relation
> to Apache Nutch, what you have done or plan to do, etc.?
>
> Congratulations and welcome on board!
>
> Regards,
> Sebastian
>



Re: Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-14 Thread Karanjeet Singh
Thanks, Sebastian.

This is solved now. I looked through the code and found that Nutch places a
limit on the number of URLs generated per host, defined by *topN / number
of reducer tasks*. Please refer here [0].

So, I was running 16 reduce tasks with topN 1000, and hence got 62 URLs
(1000 / 16, rounded down).

I am curious to know the reason for this. Is it due to politeness?

[0]:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141
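
To make the arithmetic concrete, here is a minimal sketch of that cap
(simplified; the real logic in [0] reads topN and the reducer count from
the job configuration):

public class GeneratorLimitSketch {
    public static void main(String[] args) {
        long topN = 1000;
        int numReduceTasks = 16;
        // Each reducer admits at most topN / numReduceTasks entries
        long perReducerLimit = topN / numReduceTasks;  // integer division -> 62

        // Fetch lists are partitioned by host for politeness, so all URLs
        // of a single host share one reducer's quota.
        long unfetchedFromOneHost = 5000;
        long generated = Math.min(unfetchedFromOneHost, perReducerLimit);

        System.out.println("per-reducer limit: " + perReducerLimit);  // 62
        System.out.println("URLs generated:    " + generated);        // 62
    }
}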

Regards,
Karanjeet Singh
USC

On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel <wastl.na...@googlemail.com
> wrote:

> Hi,
>
> I didn't see anything wrong. Did you check whether CrawlDb entries are
> marked as "generated" by "_ngt_="? With generate.update.crawldb=true, it
> may happen that after having run generate multiple times, only 62
> unfetched and not-yet-generated entries remain.
>
> Sebastian
>
> On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
> > Hello,
> >
> > I am trying to crawl a website using Nutch on a Hadoop cluster. I have
> > modified the crawl script to restrict the sizeFetchList to 1000 (which
> > is the topN value for the nutch generate command).
> >
> > However, Nutch is only generating 62 URLs while the unfetched URL count
> > is approximately 5,000. I am using the command below:
> >
> > nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
> > mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN
> > 1000 -numFetchers 1 -noFilter
> >
> > Can anyone please look into this and let me know if I am missing
> > something?
> > Please find the crawl configuration here [0].
> >
> > [0]:
> https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
> >
> > Thanks & Regards,
> > Karanjeet Singh
> > USC
> >
>
>


Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-13 Thread Karanjeet Singh
Hello,

I am trying to crawl a website using Nutch on a Hadoop cluster. I have
modified the crawl script to restrict the sizeFetchList to 1000 (which is
the topN value for the nutch generate command).

However, Nutch is only generating 62 URLs while the unfetched URL count is
approximately 5,000. I am using the command below:

nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000
-numFetchers 1 -noFilter

Can anyone please look into this and let me know if I am missing something?
Please find the crawl configuration here [0].

[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf

Thanks & Regards,
Karanjeet Singh
USC


Re: Crawl Script Don't Want To Use -topn

2015-12-21 Thread Karanjeet Singh
Hi Manish,

If you are referring to the links retrieved from a page, I would recommend
having a look at the Nutch configuration properties
"db.max.outlinks.per.page" and "db.max.inlinks". Hope that helps.

Thanks & Regards,
Karanjeet Singh
CS Graduate Student
University of Southern California
karan...@usc.edu

On Sun, Dec 20, 2015 at 8:33 PM, Manish Verma <m_ve...@apple.com> wrote:

> Hi,
>
> I am using Nutch 1.10 with the crawl script, and I see from the logs that
> it uses -topn 5. I want to consider all pages equally and crawl
> everything.
>
> Thanks MV
>
>
>


Re: How to deploy Selenium on Server?

2015-12-21 Thread Karanjeet Singh
Hi Byzen,

I hope you have installed all the required libraries (Firefox, Xvfb) for
Selenium on your remote server. Could you please share your logs
(${NUTCH_HOME}/logs/hadoop.log) so we can get an insight into the issue?
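
In case it helps, here is a minimal sketch of running a crawl under a
virtual display (assuming Xvfb is installed and display :99 is free; the
crawl arguments are illustrative):

Xvfb :99 -screen 0 1280x1024x24 &   # start a virtual framebuffer
export DISPLAY=:99                  # point Firefox/Selenium at it
bin/crawl urls/ crawl/ 2            # run a short crawl to verify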

Thanks & Regards,
Karanjeet Singh
CS Graduate Student
University of Southern California
karan...@usc.edu


On Mon, Dec 21, 2015 at 4:54 AM, Baizhang Ma <baizhang...@gmail.com> wrote:

> Hi, everyone.
> I want to use the Selenium plugin to crawl dynamic content of pages. I
> deployed it as https://github.com/momer/nutch-selenium describes, and it
> runs normally on my local computer. However, the plugin doesn't work
> after I deploy it on the remote server. At first, I thought it might need
> a display or desktop just like in local mode, so I installed a desktop on
> the server, but unfortunately it still doesn't work. Does anyone have any
> ideas about this? Thanks very much!
>
> Best Regards,
> Byzen. Ma
>


Re: Configuring rotating agent in Nutch

2015-09-27 Thread Karanjeet Singh
I am facing the same problem here. I tried rebuilding, but in the logs I
can only see the agent name mentioned in the http.agent.name property.

By $NUTCH_HOME/conf, do you mean the runtime/local/conf directory?

Also, can you please brief me on how the rotation works? Does the agent
rotate after crawling some X links, and if so, can we configure that X?
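
For reference, a nutch-site.xml sketch of the rotation setup this thread
discusses (the property names below reflect my understanding of the
rotating-agent feature; please verify them against your version's
nutch-default.xml before relying on this):

<property>
  <name>http.agent.rotate</name>
  <value>true</value>
</property>
<property>
  <!-- file with one agent name per line, placed in the conf directory -->
  <name>http.agent.rotate.file</name>
  <value>agents.txt</value>
</property>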


