Re: Nutch Hadoop question
Hi All,

Don't want to bother you guys too much... I've tried searching for this topic and doing some testing myself, but so far I've been quite unsuccessful. Basically, I wish to use some computers only for map-reduce processing and not for HDFS. Does anyone know how this can be done?

Thanks,
Eran

On Wed, Nov 11, 2009 at 12:19 PM, Eran Zinman <zze...@gmail.com> wrote:

> Hi All,
>
> I'm using Nutch with Hadoop with great pleasure - it works great and really increases crawling performance on multiple machines.
>
> I have two strong machines and two older machines which I would like to use. So far I've been using only the two strong machines with Hadoop. Now I would like to add the two less powerful machines to do some processing as well.
>
> My question is: right now HDFS is shared between the two powerful computers. I don't want the two other computers to store any content, as they have slow and unreliable hard disks. I just want them to do processing (i.e. map-reduce) and not store any content. Is that possible, or do I have to use HDFS on all machines that do processing? If it's possible to use a machine only for map-reduce, how is this done?
>
> Thank you for your help,
> Eran
Re: Nutch Hadoop question
Hi Eran,

MapReduce has to store its data on the HDFS filesystem. But if you want to separate the two groups of servers, you could build two separate HDFS filesystems. To separate the two setups, you will need to make sure there is no cross-communication between the two parts.

Cheers,
Alex

Eran Zinman wrote:
> Basically - I wish to use some computers only for map-reduce processing and not for HDFS, does anyone know how this can be done?
Re: Nutch Hadoop question
TuxRacer69 wrote:
> Hi Eran,
>
> MapReduce has to store its data on the HDFS filesystem.

More specifically, it needs read/write access to a shared filesystem. If you are brave enough you can use NFS too, or any other type of filesystem that can be mounted locally on each node (e.g. a NetApp).

> But if you want to separate the two groups of servers, you could build two separate HDFS filesystems. To separate the two setups, you will need to make sure there is no cross-communication between the two parts.

You can run two separate clusters even on the same set of machines; just configure them to use different ports AND different local paths.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
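To make the "different ports AND different local paths" concrete, here is a minimal sketch of what the second cluster's conf/hadoop-site.xml might contain (the hostname, port numbers and directories below are placeholders for illustration, not values from this thread; this assumes the Hadoop 0.19.x bundled with Nutch 1.0):

  <property>
    <name>fs.default.name</name>
    <!-- the first cluster might listen on hdfs://master:9000 -->
    <value>hdfs://master:9100</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- the first cluster might use master:9001 -->
    <value>master:9101</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <!-- must not overlap the first cluster's directories -->
    <value>/data/cluster2/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/cluster2/data</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/cluster2/mapred</value>
  </property>

The HTTP and IPC ports (dfs.http.address, the datanode addresses, the tasktracker and jobtracker web UI ports, etc.) would likewise have to be moved off their defaults so the two sets of daemons don't collide.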
Re: Synonym Filter with Nutch
Dharan Althuru wrote:
> Hi,
>
> We are trying to incorporate a synonym filter during indexing using Nutch. As per my understanding, Nutch doesn't have a synonym indexing plug-in by default. Can we extend IndexingFilter in Nutch to incorporate the synonym filter plug-in available in Lucene using WordNet, or a custom synonym plug-in, without any negative impact on existing Nutch indexing (i.e., considering bigrams etc.)?

Synonym expansion should be done when the text is analyzed (using Analyzers), so you can reuse Lucene's synonym filter. Unfortunately, this happens at different stages depending on whether you use the built-in Lucene indexer or the Solr indexer.

If you use the Lucene indexer, the analysis happens in LuceneWriter, and the only way to affect it is to implement an analysis plugin, so that it's returned from AnalyzerFactory, and use your analysis plugin instead of the default one. See e.g. analysis-fr for an example of how to implement such a plugin.

However, when you index to Solr, you need to configure Solr's analysis chain, i.e. in your schema.xml you need to define for your fieldType that it has the synonym filter in its indexing analysis chain.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
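For the Solr route, a fieldType along these lines in schema.xml would put a synonym filter into the index-time analysis chain (a sketch against Solr 1.3/1.4-era syntax; the type name "text_syn" and the synonyms.txt file are placeholders to adapt):

  <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- expand="true" indexes every synonym in a group, not just the first -->
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

with synonyms.txt containing comma-separated groups, e.g. "crawler, spider, bot". Expanding at index time only (and not again at query time) avoids applying the mapping twice.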
How to configure nutch to crawl parallelly
Hi, All

I'm using Nutch-1.0 on a 12-node cluster, and have configured conf/hadoop-site.xml as follows:

  ...
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>20</value>
  </property>
  ...

but the Running Jobs section on the page http://cluster0:50030/jobtracker.jsp never shows more than one item.

Thanks!
Xiao
can't deploy nutch-1.0.war ???
I'm stuck and not able to deploy nutch-1.0.war. I get the following error in catalina.log:

  Exception when processing TLD indicated by the resource path /WEB-INF/taglibs-i18n.tld in the context /nutch-1.0

What could it be? The taglib is there, and the *.properties files are there. ANY HELP on where to look is very welcome.

--
-MilleBii-
Re: How to configure nutch to crawl parallelly
I don't recall off the top of my head what jobtracker.jsp shows, but judging by the name, it shows your jobs. Each job is composed of multiple map and reduce tasks. Drill into your job and you should see multiple tasks running.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message -----
From: xiao yang <yangxiao9...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Fri, November 13, 2009 12:16:55 PM
Subject: How to configure nutch to crawl parallelly

> but the Running Jobs section on the page http://cluster0:50030/jobtracker.jsp never shows more than one item.
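One detail worth spelling out (my reading, assuming the Hadoop 0.19.x that ships with Nutch 1.0): mapred.tasktracker.map/reduce.tasks.maximum only sets how many task *slots* each node offers; it doesn't split a job into more tasks, and Nutch runs its crawl phases (generate, fetch, parse, updatedb, index) as sequential jobs, so Running Jobs will normally show a single job at a time even when that job is running many tasks in parallel. To get more tasks per job, you can raise the per-job settings, e.g.:

  <property>
    <name>mapred.map.tasks</name>
    <!-- only a hint: the actual number is driven by the input splits -->
    <value>40</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <!-- honored as given; often ~0.95-1.75 x the total reduce slots -->
    <value>24</value>
  </property>

The values 40 and 24 are illustrative, not recommendations from this thread.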
Re: Nutch Hadoop question
Thanks for the help guys.

On Fri, Nov 13, 2009 at 5:20 PM, Andrzej Bialecki <a...@getopt.org> wrote:
> You can run two separate clusters even on the same set of machines; just configure them to use different ports AND different local paths.