Dear all,
Hi,
I was wondering whether there is any performance difference between running
Nutch 1.x on MapReduce 1 and on YARN. I tested Nutch on both, and it seems
Nutch on MapReduce 1 is faster than on YARN. Am I correct? Are there any
performance-related parameters that should be considered in case of
Hi,
I am going to work on tuning Nutch. For this purpose I found out that
changing some of the Nutch parameters affects its performance. Could
somebody please explain what the following parameters are and how exactly
they relate to fetch speed?
fetcher.threads.per.host
Dear Talat,
Hi,
What about Nutch 1.x? Is there any plan to migrate to YARN on the Nutch 1.x
roadmap?
Regards.
On Fri, Sep 5, 2014 at 8:53 AM, Talat Uyarer ta...@uyarer.com wrote:
Hi Mike,
As a Nutch 2.x user, I will work on this issue. My plan is to finish it
by the next release. Hence Gora
Dear Patrick,
Sure, I was thinking of Selenium. It seems there is a Nutch plugin for
this purpose that works with Selenium, but I have not tested it yet.
Regards.
On Mon, Sep 1, 2014 at 6:51 PM, Patrick Kirsch pkir...@zscho.de wrote:
On 06.08.2014 at 10:24, Ali Nazemian wrote:
Dear all
Dear Iqbal,
Hi,
As far as I know, if you don't need the Gora mapper for using Nutch over
HBase, MySQL, etc., it is better to use version 1.x, since some Nutch
functionality is not implemented in version 2.x and Nutch 1.x provides
better performance for crawling web pages. ES is not difficult
this feature.
Maybe you can add an interface like
protected Reader[] getRulesReaders(Configuration conf) throws IOException
to the RegexURLFilterBase class to get multiple readers, one for each
configuration file.
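A rough sketch of what that multi-reader idea could look like, for
illustration only. This is not code from Nutch: the class name
MultiRuleFilterSketch, the Rule holder and the String-based getRulesReaders
are made up here (the suggested method would take a Hadoop Configuration
instead). It assumes the usual regex-urlfilter layout, where each rule is '+'
or '-' followed by a regex, '#' starts a comment, and the first matching rule
decides.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: rules from several rule files merged into one filter. */
final class MultiRuleFilterSketch {

    /** One rule: accept ('+') or reject ('-') URLs matching the pattern. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Stand-in for getRulesReaders(conf): each argument is one rule file's text. */
    static Reader[] getRulesReaders(String... ruleFileContents) {
        Reader[] readers = new Reader[ruleFileContents.length];
        for (int i = 0; i < ruleFileContents.length; i++) {
            readers[i] = new StringReader(ruleFileContents[i]);
        }
        return readers;
    }

    /** Reads every reader in order and appends its rules to a single list. */
    static List<Rule> readRules(Reader[] readers) throws IOException {
        List<Rule> rules = new ArrayList<>();
        for (Reader reader : readers) {
            try (BufferedReader in = new BufferedReader(reader)) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("#")) {
                        continue; // skip blank lines and comments
                    }
                    char sign = line.charAt(0);
                    if (sign != '+' && sign != '-') {
                        throw new IOException("Invalid rule: " + line);
                    }
                    rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
                }
            }
        }
        return rules;
    }

    /** Convenience wrapper for in-memory rule text; rethrows IOException unchecked. */
    static List<Rule> rulesFrom(String... ruleFileContents) {
        try {
            return readRules(getRulesReaders(ruleFileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** First matching rule decides; returns the URL if accepted, otherwise null. */
    static String filter(String url, List<Rule> rules) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched
    }
}
```

With one rule text for HTML pages and another for images, for example
rulesFrom("+\\.html$", "-\\.(jpg|jpeg)$"), filter(...) would accept an .html
URL and reject a .jpg one. Because the lists are simply concatenated, the
order of the readers matters.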
On Tue, Aug 19, 2014 at 1:42 AM, Ali Nazemian alinazem...@gmail.com
wrote:
Dear all,
Hi
Dear all,
Hi,
I use Nutch 1.8 to crawl some web sites. For this purpose I want to change
Nutch so that a different regex-urlfilter file is loaded for each type of
file, for example one for HTML files and another for image files
(jpg/jpeg, ...). Does Nutch support such a situation? Or I
I managed to handle the integration of Nutch with Solr 4.9. But how can I
change that to work in distributed mode (SolrCloud)?
Best regards.
On Thu, Jul 17, 2014 at 5:05 PM, Ali Nazemian alinazem...@gmail.com wrote:
Thank you very much. I will check the attached patch to see whether it
solves
unexpected issues.
Talat
2014-07-15 15:14 GMT+03:00 Ali Nazemian alinazem...@gmail.com:
Dear all,
Hi,
I need to make some changes to the solr-index plugin. For this purpose I
have to upgrade SolrJ to 4.9. The problem is that to change SolrJ I need to
change other parts of the Nutch source code
. It is too old. Can you create an issue about
that on JIRA? If nobody is interested, I can.
Thanks
2014-07-17 12:50 GMT+03:00 Ali Nazemian alinazem...@gmail.com:
Dear Talat,
What about SolrJ 4.8? Does it have some deprecated classes too? I need to
add partial updates to some parts
Dear all,
Hi,
I need to make some changes to the solr-index plugin. For this purpose I
have to upgrade SolrJ to 4.9. The problem is that to change SolrJ I need to
change other parts of the Nutch source code. It seems that some of the
libraries used are out of date for SolrJ. Is there anybody that already
Dear Jorge,
Hi,
Could you please tell me more about this solr plugin? Do you have that?
Regards.
On Wed, Jul 2, 2014 at 9:44 AM, Jorge Luis Betancourt Gonzalez
jlbetanco...@uci.cu wrote:
Some time ago, for a very particular use case, we abstracted this
responsibility into a custom Solr plugin
cluster, this could be a bottleneck.
I don’t know your specific requirements, but I hope this helps. If you share
more of your architecture I’m sure more people could help.
Regards,
On Jul 7, 2014, at 8:45 AM, Ali Nazemian alinazem...@gmail.com wrote:
Dear Jorge,
Hi,
Could you please tell
Dears,
Hi,
I am going to make some changes to Nutch's default behavior. I want to change
the Nutch Solr indexer (the indexWriter class) so that, instead of adding new
documents to Solr, old documents are updated. I saw an update method
inside this class. Is it implemented for this purpose? If not, what is
Dear Nutch users,
Hi,
I want to create a webgraph for each website. I was wondering how that
would be possible. I want to add a parent field to the crawl so that a
specific document is a child of a specific web page. Is this possible with
the Nutch webgraph? How?
Regards.
--
A.Nazemian
Hi everybody,
I am going to crawl and parse some news websites as follows:
There are some important locations in each website that hold news of
higher importance. Therefore I am going to parse each page by XPath to
find these news items. Then I am going to assign a specific score to these
news items based
:02 PM, Ali Nazemian alinazem...@gmail.com
wrote:
Dear Bayu,
Would you please also tell me what procedure you are going to use for
testing the recrawl? Maybe I am doing some steps wrong.
Regards.
On Fri, Jun 6, 2014 at 7:01 PM, Bayu Widyasanyata
bwidyasany...@gmail.com
wrote:
Just
...@gmail.com
wrote:
Hi Ali,
OK, I will share my current script.
I sometimes use the -adddays parameter in the nutch generate step to force
recrawling.
Thanks.
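To illustrate why -adddays forces a recrawl: as far as I understand it, the
generate step compares each page's next fetch time against the current time,
and -adddays shifts that "current time" forward by the given number of days.
The snippet below is only a sketch of that comparison under that assumption,
not Nutch's actual code; AddDaysSketch and isDue are made-up names.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of how shifting "now" makes more pages due for fetching. */
final class AddDaysSketch {

    /**
     * A page counts as due when its next fetch time is not after the
     * current time shifted forward by addDays (assumed -adddays behavior).
     */
    static boolean isDue(Instant nextFetchTime, Instant now, int addDays) {
        Instant effectiveNow = now.plus(Duration.ofDays(addDays));
        return !nextFetchTime.isAfter(effectiveNow);
    }
}
```

For a page whose next fetch time is 30 days away, isDue(...) is false with
addDays = 0 and true with addDays = 31, which is why passing a large enough
-adddays value makes already-fetched pages eligible again.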
On Fri, Jun 6, 2014 at 11:02 PM, Ali Nazemian alinazem...@gmail.com
wrote:
Dear Bayu,
Would you please also provide me what
/2010/06/11/how-to-re-crawl-with-nutch/
On Thu, Jun 5, 2014 at 12:32 AM, Ali Nazemian alinazem...@gmail.com
wrote:
Thank you very much. But it is just a parameter for specifying the interval
between re-crawls. The problem is that the Nutch re-crawl does not work with
the default crawl script
Thank you very much.
On Fri, Jun 6, 2014 at 7:01 PM, Bayu Widyasanyata bwidyasany...@gmail.com
wrote:
Just curious, I will go back to the lab and prove it
---
wassalam,
[bayu]
/sent from Android phone/
On Jun 6, 2014 5:37 PM, Ali Nazemian alinazem...@gmail.com wrote:
Dear Bayu,
Hi
,
[bayu]
/sent from Android phone/
On Jun 6, 2014 5:37 PM, Ali Nazemian alinazem...@gmail.com wrote:
Dear Bayu,
Hi,
I already read that post about recrawling. My problem is that Nutch does not
work in the way that this post describes.
Regards.
On Fri, Jun 6, 2014 at 2:14 PM, Bayu
Hi,
I recently got familiar with Nutch and I want to use it for whole-web
crawling. The problem is that I did not find any useful tutorial on how to
re-crawl using Nutch. I know that there are some configuration parameters
that should be changed for recrawling; I am aware of them. The thing
to
db.fetch.interval.max.
Sent from my HTC
- Reply message -
From: Ali Nazemian alinazem...@gmail.com
To: user@nutch.apache.org
Subject: Incremental crawling with nutch
Date: Mon, Jun 2, 2014 4:52 AM
Hi,
Could you please explain more?
What parameter? How can I do
Hi,
Could you please explain more?
What parameter? How can I do that?!
Regards.
On Mon, Jun 2, 2014 at 3:42 AM, S.L simpleliving...@gmail.com wrote:
Hi Ali
Please see the nutch-site.xml parameters; one of them does that.
Sent from my HTC
- Reply message -
From: Ali Nazemian
Hi everybody,
I am going to use Nutch for crawling some news websites. These websites
are updated regularly, so I should recrawl them at least every 2 hours.
But the problem is that I want an incremental re-crawl, meaning
Nutch should crawl only the URLs that are new and not fetched
Dear Julien,
Hi,
Do you know of any step-by-step guide for this procedure? Is it the same for
Nutch 1.8?
Best regards.
On Wed, May 21, 2014 at 6:43 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
<property>
<name>db.fetch.interval.default</name>
<value>1800</value>
<description>The default
Hi Ali,
It is the same problem that I faced recently. It is my concern too. I would
appreciate it if somebody answered this question.
Best regards.
On Wed, May 21, 2014 at 2:52 PM, Ali rahmani alira...@yahoo.com wrote:
Dear Sir,
I am customizing Nutch 2.2 to crawl my seed lists, which contain
Hi,
I was wondering how I can run a Nutch 1.8 job on a Hadoop cluster. As far as
I know, for running a Nutch 1.7 job on a Hadoop cluster we could use the
org.apache.nutch.crawl.Crawl class. Since this class was deprecated and
removed from Nutch 1.8, what class would be responsible for the crawling job?
Which class
, see bin/crawl.sh. It
is easier to modify than the all in one Crawl class and gives a good
understanding of the underlying processing steps.
Best
Julien
On 19 May 2014 11:55, Ali Nazemian alinazem...@gmail.com wrote:
Hi,
I was wondering how can I run nutch 1.8 job on hadoop cluster
Thank you very much.
On Mon, May 19, 2014 at 5:06 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
there is no crawl class any more. Re-crawl class meant regarding the crawl
class. Again there is now a script for it
On 19 May 2014 13:08, Ali Nazemian alinazem...@gmail.com wrote
)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
On Mon, May 19, 2014 at 5:10 PM, Ali Nazemian alinazem...@gmail.com wrote:
Thank you very much.
On Mon, May 19, 2014 at 5