Nutch on map reduce 1 vs Yarn

2015-05-02 Thread Ali Nazemian
Dear all, Hi, I was wondering whether there is any performance difference between running Nutch 1.x on MapReduce 1 and on YARN. I tested Nutch on both, and it seems Nutch on MapReduce 1 is faster than on YARN. Am I correct? Are there any performance-related parameters that should be considered in case of

Nutch config fetch related parameters

2015-04-15 Thread Ali Nazemian
Hi, I am going to work on tuning Nutch. I found that changing some of the Nutch parameters affects its performance. Could somebody please explain what the following parameters are and how exactly they relate to fetch speed? fetcher.threads.per.host

Re: nutch with Hadoop V2

2014-09-05 Thread Ali Nazemian
Dear Talat, Hi, What about Nutch 1.x? Is there any plan for migrating to YARN in the Nutch 1 roadmap? Regards. On Fri, Sep 5, 2014 at 8:53 AM, Talat Uyarer ta...@uyarer.com wrote: Hi Mike, As a Nutch 2.x user, I will work on this issue. I plan to finish it before the next release. Hence Gora

Re: Web forum crawling using nutch

2014-09-02 Thread Ali Nazemian
Dear Patrick, Sure, I was thinking of Selenium. It seems there is a Nutch plugin for this purpose that works with Selenium, but I have not tested it yet. Regards. On Mon, Sep 1, 2014 at 6:51 PM, Patrick Kirsch pkir...@zscho.de wrote: On 06.08.2014 at 10:24, Ali Nazemian wrote: Dear all

Re: Nutch Confusion

2014-08-29 Thread Ali Nazemian
Dear Iqbal, Hi, As far as I know, if you don't need the Gora mapper for using Nutch over HBase, MySQL, etc., it is better to use version 1.x, since some Nutch functionality is not implemented in version 2.x and Nutch 1.x provides better performance for crawling web pages. ES is not difficult

Re: Different regex-urlfilter for different file types in nutch

2014-08-25 Thread Ali Nazemian
this feature. Maybe you can add an interface like protected Reader[] getRulesReaders(Configuration conf) throws IOException to get multiple readers for all configured files in the RegexURLFilterBase class. On Tue, Aug 19, 2014 at 1:42 AM, Ali Nazemian alinazem...@gmail.com wrote: Dear all, Hi

Different regex-urlfilter for different file types in nutch

2014-08-18 Thread Ali Nazemian
Dear all, Hi, I use Nutch 1.8 to crawl some web sites. For this purpose I want to change Nutch so that a different regex-urlfilter file is loaded for each type of file, for example one for HTML files and another for image files (jpg/jpeg, ...). Does Nutch support such a situation? Or I

Re: Upgrading nutch 1.8 for having solrj 4.9

2014-07-20 Thread Ali Nazemian
I managed to handle the integration of Nutch with Solr 4.9. But how can I change it to work in distributed mode (SolrCloud)? Best regards. On Thu, Jul 17, 2014 at 5:05 PM, Ali Nazemian alinazem...@gmail.com wrote: Thank you very much. I will check the attached patch to see if it does solve

Re: Upgrading nutch 1.8 for having solrj 4.9

2014-07-17 Thread Ali Nazemian
unexpected issues. Talat 2014-07-15 15:14 GMT+03:00 Ali Nazemian alinazem...@gmail.com: Dear all, Hi, I need to make some changes in the solr-index plugin. For this purpose I have to upgrade solrj to 4.9. The problem is that to change solrj I need to change other parts of the Nutch source code

Re: Upgrading nutch 1.8 for having solrj 4.9

2014-07-17 Thread Ali Nazemian
. It is too old. Can you create an issue about that on JIRA? If nobody else is interested, I can. Thanks 2014-07-17 12:50 GMT+03:00 Ali Nazemian alinazem...@gmail.com: Dear Talat, What about solrj 4.8? Does it have some deprecated classes too? I need to add partial update to some parts

Upgrading nutch 1.8 for having solrj 4.9

2014-07-15 Thread Ali Nazemian
Dear all, Hi, I need to make some changes in the solr-index plugin. For this purpose I have to upgrade solrj to 4.9. The problem is that to change solrj I need to change other parts of the Nutch source code. It seems that some of the libraries used are out of date for solrj. Is there anybody who already

Re: Changing nutch for update documents instead of add new ones

2014-07-07 Thread Ali Nazemian
Dear Jorge, Hi, Could you please tell me more about this Solr plugin? Do you have that? Regards. On Wed, Jul 2, 2014 at 9:44 AM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Some time ago, for a very particular use case, we abstracted this responsibility into a custom Solr plugin

Re: Changing nutch for update documents instead of add new ones

2014-07-07 Thread Ali Nazemian
cluster, this could be a bottleneck. I don’t know your specific requirements, but I hope this helps. If you share more of your architecture I’m sure more people could help. Regards, On Jul 7, 2014, at 8:45 AM, Ali Nazemian alinazem...@gmail.com wrote: Dear Jorge, Hi, Could you please tell

Changing nutch for update documents instead of add new ones

2014-07-01 Thread Ali Nazemian
Dears, Hi, I am going to make some changes to Nutch's default behavior. I want to change the Nutch Solr index (indexWriter class) so that instead of adding new documents to Solr, old documents are updated. I saw an update method inside this class. Is it implemented for this purpose? If not, what is

Creating webgraph for one site

2014-06-30 Thread Ali Nazemian
Dear Nutch users, Hi, I want to create a webgraph for each website. I was wondering how that would be possible. I want to add a parent field to the crawl so that a specific document is a child of a specific web page. Is it possible with the Nutch webgraph? How? Regards. -- A.Nazemian

Sending parse data from one generate-fetch-update cycle to another one

2014-06-10 Thread Ali Nazemian
Hi everybody, I am going to crawl and parse some news websites as follows: There are some important locations in each website that hold news of higher importance. Therefore I am going to parse the page by XPath to find these news items. Then I am going to assign a specific score to these news items based

Re: Incremental crawling with nutch

2014-06-09 Thread Ali Nazemian
:02 PM, Ali Nazemian alinazem...@gmail.com wrote: Dear Bayu, Would you please also tell me what procedure you are going to use for testing recrawl? Maybe I am doing some steps wrong. Regards. On Fri, Jun 6, 2014 at 7:01 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Just

Re: Incremental crawling with nutch

2014-06-07 Thread Ali Nazemian
...@gmail.com wrote: Hi Ali, OK, I will share using my current script. I sometimes use the -adddays parameter in the nutch generate step to force recrawling. Thanks. On Fri, Jun 6, 2014 at 11:02 PM, Ali Nazemian alinazem...@gmail.com wrote: Dear Bayu, Would you please also tell me what

Re: Incremental crawling with nutch

2014-06-06 Thread Ali Nazemian
/2010/06/11/how-to-re-crawl-with-nutch/ On Thu, Jun 5, 2014 at 12:32 AM, Ali Nazemian alinazem...@gmail.com wrote: Thank you very much. But it is just a parameter for specifying the interval between re-crawls. The problem is that Nutch re-crawl does not work with the default crawl script

Re: Incremental crawling with nutch

2014-06-06 Thread Ali Nazemian
Thank you very much. On Fri, Jun 6, 2014 at 7:01 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Just curious, I will go back to the lab and prove it --- wassalam, [bayu] /sent from Android phone/ On Jun 6, 2014 5:37 PM, Ali Nazemian alinazem...@gmail.com wrote: Dear Bayu, Hi

Re: Incremental crawling with nutch

2014-06-06 Thread Ali Nazemian
, [bayu] /sent from Android phone/ On Jun 6, 2014 5:37 PM, Ali Nazemian alinazem...@gmail.com wrote: Dear Bayu, Hi, I already read that post about recrawling. My problem is that Nutch does not work the way that post describes. Regards. On Fri, Jun 6, 2014 at 2:14 PM, Bayu

re-crawling with nutch 1.8

2014-06-05 Thread Ali Nazemian
Hi, I recently got familiar with Nutch and I want to use it for whole-web crawling. The problem is I did not find any useful tutorial on how to re-crawl using Nutch. I know that there are some configuration parameters that should be changed for recrawling; I am aware of them. The thing

Re: Incremental crawling with nutch

2014-06-04 Thread Ali Nazemian
to db.fetch.interval.max. Sent from my HTC - Reply message - From: Ali Nazemian alinazem...@gmail.com To: user@nutch.apache.org Subject: Incremental crawling with nutch Date: Mon, Jun 2, 2014 4:52 AM Hi, Could you please explain more? What parameter? How can I do

Re: Incremental crawling with nutch

2014-06-02 Thread Ali Nazemian
Hi, Could you please explain more? What parameter? How can I do that? Regards. On Mon, Jun 2, 2014 at 3:42 AM, S.L simpleliving...@gmail.com wrote: Hi Ali, please see the nutch-site.xml parameters; one of them does that. Sent from my HTC - Reply message - From: Ali Nazemian

Incremental crawling with nutch

2014-06-01 Thread Ali Nazemian
Hi everybody, I am going to use Nutch for crawling some news websites. These websites are updated regularly, so I should recrawl them at least every 2 hours. But the problem is I want an incremental re-crawl, meaning Nutch should crawl only the URLs that are new and not fetched

Re: Re-crawl every 24 hours

2014-05-23 Thread Ali Nazemian
Dear Julien, Hi, Do you know of any step-by-step guide for this procedure? Is this the same for Nutch 1.8? Best regards. On Wed, May 21, 2014 at 6:43 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: <property> <name>db.fetch.interval.default</name> <value>1800</value> <description>The default
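
For reference, the db.fetch.interval.default setting quoted in this message would normally sit inside the <configuration> element of conf/nutch-site.xml. The fragment below is only a sketch under stated assumptions: the value 1800 is taken from the quoted message, the interval is assumed to be in seconds (as it is in Nutch 1.x's nutch-default.xml), and the comment is illustrative wording rather than the original description text, which is cut off in this listing.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: interval in seconds, so 1800 means a page becomes
       due for re-fetch 30 minutes after it was last fetched. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>1800</value>
  </property>
</configuration>
```

A page only gets re-fetched when a later generate/fetch/update cycle runs after its interval has elapsed, so the crawl cycle itself still has to be scheduled at least that often.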

Re: Re-crawl every 24 hours

2014-05-21 Thread Ali Nazemian
Hi Ali, It is the same problem that I faced recently. It is my concern too. I would appreciate it if somebody could answer this question. Best regards. On Wed, May 21, 2014 at 2:52 PM, Ali rahmani alira...@yahoo.com wrote: Dear Sir, I am customizing Nutch 2.2 to crawl my seed lists, which contain

Nutch 1.8 on hadoop

2014-05-19 Thread Ali Nazemian
Hi, I was wondering how I can run a Nutch 1.8 job on a Hadoop cluster. As far as I know, to run a Nutch 1.7 job on a Hadoop cluster we could use the org.apache.nutch.crawl.Crawl class. Since this class was deprecated and removed from Nutch 1.8, what class is responsible for the crawling job? Which class

Re: Nutch 1.8 on hadoop

2014-05-19 Thread Ali Nazemian
, see bin/crawl.sh. It is easier to modify than the all-in-one Crawl class and gives a good understanding of the underlying processing steps. Best Julien On 19 May 2014 11:55, Ali Nazemian alinazem...@gmail.com wrote: Hi, I was wondering how can I run nutch 1.8 job on hadoop cluster

Re: Nutch 1.8 on hadoop

2014-05-19 Thread Ali Nazemian
Thank you very much. On Mon, May 19, 2014 at 5:06 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: there is no crawl class any more. Re-crawl class meant regarding the crawl class. Again there is now a script for it On 19 May 2014 13:08, Ali Nazemian alinazem...@gmail.com wrote

Re: Nutch 1.8 on hadoop

2014-05-19 Thread Ali Nazemian
) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) On Mon, May 19, 2014 at 5:10 PM, Ali Nazemian alinazem...@gmail.com wrote: Thank you very much. On Mon, May 19, 2014 at 5