. Please have a
look and let me know if you have any questions.
[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
P.S. - There can be many solutions to this. I am just giving one. :)
Regards,
Karanjeet Singh
http://irds.usc.edu
On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju <sa
Hi Nana,
Maybe you can use the URL regex filter to exclude these. The following
regular expression will allow only http(s) links to be crawled.
+^http(s){0,1}://*
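To see what this pattern accepts, here is a minimal sketch in Python. It assumes the leading "+" is the include marker of the Nutch regex filter file and not part of the expression, and it uses Python's re module as a stand-in for the Java regex engine that Nutch runs on. The helper name is_allowed and the sample URLs are made up for illustration. It checks only the pattern itself, not how Nutch orders or combines rules.

```python
import re

# The pattern from the message, without the leading "+" (assumed to be
# the include marker used in the Nutch regex filter file).
ALLOW_HTTP = re.compile(r"^http(s){0,1}://*")


def is_allowed(url: str) -> bool:
    """Return True if the pattern matches at the start of the URL."""
    # search() scans the string, but the ^ anchor pins the match to the start.
    return ALLOW_HTTP.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.com/",
        "https://example.com/page",
        "ftp://example.com/file",
        "mailto:someone@example.com",
    ]
    for url in samples:
        print(url, is_allowed(url))
```

Note that the trailing "*" applies only to the last slash, so the pattern requires "http:/" or "https:/" followed by zero or more further slashes. A simpler form such as ^https?:// would be stricter about the double slash.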
Thanks & Regards,
Karanjeet Singh
USC
On Mon, Jun 6, 2016 at 7:13 PM, Nana Pandiawan <
nana.pandia...@solusi247.com.invalid
.
I am excited to be a part of this community!!!
Regards,
Karanjeet Singh
USC
On Sun, May 22, 2016 at 12:51 PM, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:
> Dear all,
>
> on behalf of the Nutch PMC it is my pleasure to announce
> that Karanjeet Singh has joi
interested to know the reason for this. Is it due to politeness?
[0]:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141
Regards,
Karanjeet Singh
USC
On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel <wastl.na...@googlemail.com
> wrote:
>
/segments -topN 1000
-numFetchers 1 -noFilter
Can anyone please look into this and let me know if I am missing something?
Please find the crawl configuration here [0].
[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
Thanks & Regards,
Karanjeet Singh
USC
Hi Manish,
If you are referring to the links retrieved from a page, I would recommend
that you have a look at the Nutch configuration properties
"db.max.outlinks.per.page" and "db.max.inlinks". Hope it helps.
Thanks & Regards,
Karanjeet Singh
CS Graduate Student
Universit
Hi Byzen,
I hope you have installed all the required libraries (Firefox, Xvfb) for
Selenium on your remote server. Can you please share your logs
(${NUTCH_HOME}/logs/hadoop.log) so we can get some insight into this issue?
Thanks & Regards,
Karanjeet Singh
CS Graduate Student
University of Southern Califo
I am facing the same problem here. I tried rebuilding it, but in the logs I
can see only the agent name set in the http.agent.name property.
By $NUTCH_HOME/conf do you mean the runtime/local/conf directory?
Also, can you please brief me on how the rotation works? Does the agent
rotate after crawling