Re: Problem in config nutch-default.xml
Related issue? http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06135.html [EMAIL PROTECTED] wrote: Hi all. I have a problem configuring nutch-default.xml. As I am in China, most FTP sites that I want to crawl are encoded in Chinese, but when nutch crawls these FTP sites it cannot detect the correct charset, and the parse results are incomprehensible and useless. So I changed the property

<property>
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
</property>

to <value>gb2312</value> and got a very interesting result: nutch can now crawl the files and directories of the root directory of Chinese FTP sites without any garbled characters, but it can NOT crawl any files in SUBdirectories; it just gets a "404 not found" result. I know there must be something wrong in the config files, but how and where can I configure nutch to crawl a Chinese FTP site? I've been working on this problem for half a month and have found no way to solve it. Could anyone help me??? Thanks
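Editor's note: the override the poster describes is cleaner in conf/nutch-site.xml, which takes precedence over nutch-default.xml, than edited into the defaults file itself. A minimal sketch (gb2312 is the poster's chosen value, not a general recommendation):

<property>
  <name>parser.character.encoding.default</name>
  <value>gb2312</value>
  <description>Fallback character encoding used when the server declares none.</description>
</property>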
Nutch slow, how to speed up?
I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory); searching with queries like 'China Nuclear Forces' takes 20–25 s.

My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m

My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727

Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564
The filesystem under path '/' is HEALTHY
Re: Nutch slow, how to speed up?
DistributedSearch, 2x datanodes, 2x task trackers. Sami Siren wrote: You are using DistributedSearch? and local filesystem to store index and related data? -- Sami Siren Håvard W. Kongsgård wrote: I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory); searching with queries like 'China Nuclear Forces' takes 20–25 s. [config and crawldb/DFS stats as in the original post above]
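Editor's note: the thread ends here, but Sami's question points at the usual cause. Searching an index stored in DFS is typically much slower than serving a local-disk copy, which may account for the 20–25 s queries. A sketch of the standard 0.8 DistributedSearch layout, from memory (host names, port, and paths are illustrative, not from the thread):

# on each search node, over a local-disk copy of the crawl (not DFS):
bin/nutch server 9999 /data/local/crawl

# on the web front end, point searcher.dir at a directory containing
# a search-servers.txt with one "host port" pair per line:
#   node1 9999
#   node2 9999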
Re: problem parsing documents : word, rtf, excel, etc...
Post your conf/nutch-site.xml. Aïcha wrote: Hi, I have a lot of parsing problems when I try to index my directory; only about 50% of the files were indexed. I asked on the nutch-dev group but am trying nutch-user; perhaps somebody has had these problems and solved them. Here is a list of the main problems the parser encountered:

- Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be handled as micrsosoft document. org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance, the following exception occured: null
- Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't be handled as micrsosoft document. java.util.NoSuchElementException
- Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't be handled as micrsosoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
- 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: dsp
at java.net.URL.<init>(URL.java:574)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84)
at org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)

In the last error, the string after "unknown protocol:" is not always dsp; it seems to be different in each case and I don't understand what this string means. Thanks in advance. Best regards, Aïcha
Re: why I use site:com to query, but no results return??
The site: filter works this way:

China site:www.ndu.edu = only results from http://www.ndu.edu
China site:ndu.edu = only results from http://ndu.edu/
China site:*.ndu.edu = results from http://ndu.edu/ and http://www.ndu.edu

I think you can also use grouping in nutch: http://lucene.apache.org/java/docs/queryparsersyntax.html xu nutch wrote: nutch-user, I downloaded nutch 0.7.2 and have crawled some webpages. I can find results by keyword, and also by the query url:com, but the query site:sample.com returns no results. Why? Who can help me?
nonzero status of 134
During a fetch I got this error on one of my nodes:

java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)
Re: Problem in Distributed crawling using nutch 0.8
see: http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Before you start tomcat, remember to change the path of your search directory in the file nutch-site.xml in the webapps/ROOT/WEB-INF/classes directory. This is an example of my configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>LSearchDev01:9000</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/user/root/crawld</value>
  </property>
</configuration>

Mohan Lal wrote: Hi, thanks for your valuable information, I have solved that problem. After that I am facing another problem: I have 2 slaves, 1) MAC1 and 2) MAC2, but the job runs on MAC1 itself, and it takes a long time to finish the crawling process. How can I assign the job to the distributed machines I specified in the slaves file? My crawling process does complete successfully. Also, how can I specify the searcher dir in the nutch-site.xml file?

<property>
  <name>searcher.dir</name>
  <value> ? </value>
</property>

Please help me. I have done the following setup:

[EMAIL PROTECTED] ~]# cd /home/lucene/nutch-0.8.1/
[EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop namenode -format
Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
Formatted /tmp/hadoop/dfs/name
[EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
[EMAIL PROTECTED] nutch-0.8.1]# bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
sonu: no tasktracker to stop
stopping namenode
sonu: no datanode to stop
localhost: stopping datanode
[EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
sonu: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-sonu.qburst.local.out
localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
sonu: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-sonu.qburst.local.out
[EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop dfs -put urls urls
[EMAIL PROTECTED] nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2 -topN 10
crawl started in: crawl.1
rootUrlDir = urls
threads = 100
depth = 2
topN = 10
Injector: starting
Injector: crawlDb: crawl.1/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120038
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120038
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120038
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120235
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120235
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120235
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.1/linkdb
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl.1/linkdb
Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl.1/indexes
Dedup: done
Adding /user/root/crawl.1/indexes/part-0
Adding
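Editor's note: no answer to the one-node question survives in this thread, but two things are worth checking (a sketch, not from the thread). The master's conf/slaves file must list every worker, one hostname per line, and the same conf must be present on each node; the "fpo: ssh: Name or service not known" lines above suggest a stale entry there. With the slaves file right, raising mapred.map.tasks and mapred.reduce.tasks in hadoop-site.xml gives the JobTracker enough tasks to schedule work on both machines.

# conf/slaves on the master (hostnames are the poster's machines):
MAC1
MAC2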
Re: Tomcat 5 / Nutch web gui timeout blank page
I solved the problem by giving tomcat 5 more memory: export JAVA_OPTS="-Xmx528m -Xms128m" Håvard W. Kongsgård wrote: I have a problem with my Nutch web gui sometimes returning empty pages when I do a search. In Nutch 0.7 this was fixed by giving ipc.client.timeout a higher value in my webapps/ROOT/WEB-INF/classes/hadoop-site.xml, but this has no effect in nutch 0.8.1; the nutch web gui still times out after about 30 s.
Re: Problem in Distributed crawling using nutch 0.8
Does /user/root/urls exist? Have you uploaded the urls folder to your DFS?

bin/hadoop dfs -mkdir urls
bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt

or

bin/hadoop dfs -put <localsrc> <dst>

Mohan Lal wrote: Hi all, while I am trying to crawl using distributed machines it throws an error:

bin/nutch crawl urls -dir crawl -depth 10 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Input directory /user/root/urls in localhost:9000 is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

What's wrong with my configuration? Please help me. Regards, Mohan Lal
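Editor's note: a quick sanity check before injecting is to list what the DFS actually contains (standard commands for the Hadoop bundled in this era):

bin/hadoop dfs -ls
bin/hadoop dfs -ls urls

The inject step expects /user/root/urls to be a directory holding at least one plain-text file of urls, one per line.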
Indexing in nutch 0.8 / hadoop
What is the best way to create a master index on a nutch 0.8 / hadoop system? Is it to merge all of the segments together and then create an index? Or, like Roberto Navoni in his tutorial, to first index all the segments separately and then merge the indexes into one master index?

-.-.-.-.-.-.-
# Create a new indexe0
bin/nutch index /user/root/crawld/indexe0 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722153133
# Create a new indexe1
bin/nutch index /user/root/crawld/indexe1 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722182213
# Dedup the new indexe0
bin/nutch dedup /user/root/crawld/indexe0
# Dedup the new indexe1
bin/nutch dedup /user/root/crawld/indexe1
# Delete the old index
# Merge the new indexes into the merge directory
bin/nutch merge /user/root/crawld/index /user/root/crawld/indexe0 /user/root/crawld/indexe1 ...
# (and the other indexes created for the fetched segments)
-.-.-.-.-.-.-
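Editor's note: the merge-segments-first alternative would look roughly like this in 0.8 (a sketch from memory; check bin/nutch for the exact mergesegs syntax, and the "2006xxxxxxxxxx" segment name stands in for whatever timestamp the merger generates):

-.-.-.-.-.-.-
# merge all segments into one
bin/nutch mergesegs /user/root/crawld/segments_merged -dir /user/root/crawld/segments
# index the single merged segment, then dedup
bin/nutch index /user/root/crawld/index /user/root/crawld/crawldb /user/root/crawld/linkdb /user/root/crawld/segments_merged/2006xxxxxxxxxx
bin/nutch dedup /user/root/crawld/index
-.-.-.-.-.-.-

Either route ends with one Lucene index; the trade-off is one big MapReduce job up front versus many small index jobs plus an index merge.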
Tomcat 5 / Nutch web gui timeout blank page
I have a problem with my Nutch web gui sometimes returning empty pages when I do a search. In Nutch 0.7 this was fixed by giving ipc.client.timeout a higher value in my webapps/ROOT/WEB-INF/classes/hadoop-site.xml, but this has no effect in nutch 0.8.1; the nutch web gui still times out after about 30 s.
How to search for multiple site:
In Google the user can search in more than one specific site using OR: admission site:www.stanford.edu OR site:cmu.edu OR site:mit.edu OR site:berkeley.edu. Is this possible in the nutch web gui?
Generate linkDb | hadoop/nutch 0.8
When I run “bin/nutch invertlinks linkdb segments” I get this error:

Exception in thread "main" java.io.IOException: Input directory /user/nutch/segments/parse_data in linux3:9000 is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:212)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:316)

I have tried to create the directory segments/parse_data, but still get the same error.
Indexing segment | nutch 0.8/hadoop
When I try to index my second segment with “bin/nutch index issep crawldb linkdb segments/x” I get this error:

Exception in thread "main" java.io.IOException: Output directory /user/nutch/issep already exists.
at org.apache.hadoop.mapred.OutputFormatBase.checkOutputSpecs(OutputFormatBase.java:39)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:279)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:313)
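Editor's note: no fix appears in the archive, but the trace (checkOutputSpecs) shows the Indexer simply refuses to write into an existing output directory. The usual workaround is to index each segment into a fresh directory and merge afterwards (directory names here are illustrative; the merge syntax matches the one used in the indexing thread above):

bin/nutch index issep2 crawldb linkdb segments/x
bin/nutch merge index.merged issep issep2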
Re: Generate linkDb | hadoop/nutch 0.8
Sami Siren wrote: try “bin/nutch invertlinks linkdb -dir segments” -- Sami Siren Håvard W. Kongsgård wrote: When I run “bin/nutch invertlinks linkdb segments” I get this error: [the "Input directory /user/nutch/segments/parse_data ... is invalid" trace quoted in the original post above] I have tried to create the directory segments/parse_data, but still the same error. Thanks, it worked.
Re: Best performance approach for single MP machine?
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02394.html Teruhiko Kurosaka wrote: Can I use MapReduce to run Nutch on a multi-CPU system? Yes. I want to run the index job on two (or four) CPUs on a single system; I'm not trying to distribute the job over multiple systems. If MapReduce is the way to go, do I just specify config parameters like these:

mapred.tasktracker.tasks.maximum=2
mapred.job.tracker=localhost:9001
mapred.reduce.tasks=2 (or 1?)

and bin/start-all.sh? That should work. You'd probably want to set the default number of map tasks to be a multiple of the number of CPUs, and the number of reduce tasks to be exactly the number of CPUs. Don't use start-all.sh, but rather just:

bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker

Must I use NDFS for MapReduce? No. Doug

Doug Cook wrote: Hi, I've recently switched to 0.8 from 0.7, and after some initial fits and starts, I'm past the "get it working at all" stage and into the "get reasonable performance" stage. I've got a single machine with 4 CPUs and a lot of memory. URL fetching works great because it's (mostly) multithreaded. But as soon as I hit the reduce phase of fetch, it's dog slow. I'm down to running on one CPU, and the phase can take days, leaving me vulnerable to losing everything should a process fail. "Wait!" you say. "That's just what Hadoop is for!" I'm all ears; I'd love some help getting my configuration right. I've seen examples/tutorials of configurations for multiple machines; am I just faking multiple machines on my single node (will that work?), or is there a cleaner, simpler approach? Alternatively, I was all excited to get an easy improvement with -numFetchers, running 4 fetchers simultaneously to use all my CPUs, but it looks like -numFetchers has gone away, and though there was an 0.8 version patch, at a quick glance it didn't seem to have made it into the mainline source, and I don't see the value of trying to merge this in if there's a cleaner Hadoop-based approach. Many thanks for any help. Doug
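Editor's note: Doug's suggestions, consolidated into hadoop-site.xml form for a 4-CPU box (a sketch; the property names are the pre-split Hadoop ones of this era, and the values just follow the rule of thumb above):

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>4</value> <!-- concurrent tasks per tasktracker -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>8</value> <!-- a small multiple of the CPU count -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value> <!-- roughly the number of CPUs -->
</property>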
Nutch 0.8 java 1.4/1.5
I am trying to get nutch/hadoop to run on 3 servers with SUSE Linux. I have followed the Nutch Hadoop Tutorial and everything works fine (I can run bin/hadoop dfs -ls), but when I run “bin/nutch inject crawldb urls” I get this error:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/commons/cli/ParseException (Unsupported major.minor version 49.0)
at java.lang.ClassLoader.defineClass0(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:539)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Injector.main(Injector.java:164)

I have set the JAVA_HOME variable in hadoop-env.sh to /usr/java/jdk1.5.0_07/, but nutch still tells me that I use version 48.0 (java 1.4). I have also tried to set the JAVA_HOME variable in bin/nutch, but with the same result.
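Editor's note: no resolution survives in the thread, but the error itself is unambiguous: class version 49.0 means the jar was compiled for Java 5, so the JVM actually being launched is still 1.4. Things worth checking (paths are the poster's own):

# in conf/hadoop-env.sh, on every node, not just the one you launch from:
export JAVA_HOME=/usr/java/jdk1.5.0_07

# then confirm which java really runs:
$JAVA_HOME/bin/java -version
which java

A 1.4 java earlier in PATH, or an old JAVA_HOME exported in the shell profile of the user the daemons run as, may shadow the edit.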
Re: Nutch on Windows
Kerry Wilson wrote: Trying to use nutch on Windows, but the executables are shell scripts; how do you use nutch on Windows? See http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
Re: favicon?
For Internet Explorer: http://www.favicon.com/ie.html For Firefox it works for me in nutch 0.7.2. Is it the right size? http://www.photoshopsupport.com/tutorials/jennifer/favicon.html Bill Goffe wrote: At http://ese.rfe.org I've had Nutch running for some time, but I have a minor question: how do I put in my own favicon? In 0.7.1, I put my favicon.ico in src/site/src/documentation/resources/images/ and docs/img/ (wasn't sure which mattered), did an "ant war", and redeployed the resulting war file. The correct favicon is in webapps/ROOT/img/ and http://ese.rfe.org/favicon.ico shows the correct icon. But it shows inconsistently in Firefox and Internet Explorer on search results and on http://ese.rfe.org, in spite of clearing the cache and history in both (in fact, after clearing them, it now doesn't show!). Also, in Firefox, when I drag the blank icon from the address bar to my list of shortcuts (term?) at the top of the browser, the correct icon shows up there, but still not on the address bar. Ugh! Thanks, Bill
Re: Nutch shows same results multiple times.
Like this:

+http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
-.*

see: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00479.html Dima Mazmanov wrote: I'm not adding urls into the urlfilter files. Besides, I still don't understand how to allow only one zone in the urlfilter. Let's say I want to index only the .ge zone. Which one of the following filters is correct?

+^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/
+^http://([a-z0-9\-\.]*\.)*.ge/
+^http://([a-z0-9\-\.])*.ge/
+^http://www\..*\.ge/
+^http://www\..*\.*\.ge/

By the way, if the site you are indexing is dynamic, you may just disallow indexing www.bbc.co.uk and index only the second one. So what filter settings do you use? Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/ Then you will get bbc.co.uk and www.bbc.co.uk http://www.bbc.co.uk/ and, since this site is dynamic, content might be different. Have the same problem myself :-( --- Well, my script already contains this command: Run bin/nutch dedup segments dedup.tmp Dima Mazmanov wrote: Hi all!! I'm running nutch-0.7.1. Here is the result of my search:

ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ... http://www.argosoft.org/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ... http://www.argosoft.com/rootpages/Default.aspx (Cached)
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ... http://www.argosoft.com/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ... http://www.argosoft.org/rootpages/Default.aspx (Cached)

As you can see, one result is shown multiple times. Why so? What is the difference between these links? I don't see any. So, how can I avoid this problem? Thanks, Regards, Dima
Re: Nutch shows same results multiple times.
Don't know, but you can try upgrading to 0.7.2. See the Nutch Change Log: http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158 Dima Mazmanov wrote: Hi, Håvard. Thank you again for your help. Hmm, there is one more thing I'm curious about. The search result for several sites displays content like the following:

Cool-Warez [html] - 19.1 k - 11/3/2006 ... Avatars გართობა კონტაქტი როგორ მოვხსნათ www.sendspace.com Многие из Вас ... вопрос: Как качать сhttp://www http://www.cool.caucasus.net/index_moxsna_2.htm (Cached) (More from www.cool.caucasus.net)

As you can see, there are a lot of spaces between the words. Is this a bug, or what? Maybe it's because of different borders in the web page and nutch inserts the spaces on its own? Is there any way to avoid this problem?
Re: Nutch shows same results multiple times.
So what filter settings do you use? Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/ Then you will get bbc.co.uk and www.bbc.co.uk http://www.bbc.co.uk/ and, since this site is dynamic, content might be different. Have the same problem myself :-( --- Well, my script already contains this command: Run bin/nutch dedup segments dedup.tmp Dima Mazmanov wrote: Hi all!! I'm running nutch-0.7.1. Here is the result of my search: [the same four ArGo Software Design results quoted above] As you can see, one result is shown multiple times. Why so? What is the difference between these links? I don't see any. So, how can I avoid this problem? Thanks, Regards, Dima
Re: Nutch shows same results multiple times.
Run bin/nutch dedup segments dedup.tmp Dima Mazmanov wrote: Hi all!! I'm running nutch-0.7.1. Here is the result of my search: [the same four ArGo Software Design results quoted above] As you can see, one result is shown multiple times. Why so? What is the difference between these links? I don't see any. So, how can I avoid this problem? Thanks, Regards, Dima
How to run bin/nutch dedup when running multiple servers
Hi, I am running nutch 0.7.2 on 3 servers (1 tomcat/db, 2 segment servers on port 8081). Is it possible to run bin/nutch dedup across multiple servers so that nutch removes all duplicated pages?
Re: Nutch 0.7.2 release | upgrading from 0.7.1?
What about upgrading from 0.7.1? Can I use my existing db and segments? Piotr Kosiorowski wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for the 0.7 branch. See CHANGES.txt (http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) for details. The release is available on http://lucene.apache.org/nutch/release/. Regards, Piotr
Re: nutch 0.7.1 where is the tutorial? crawldb not found?
http://wiki.media-style.com/display/nutchDocu/Home Roeland Weve wrote: Hi, I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried to follow the tutorial at http://lucene.apache.org/nutch/tutorial.html. But this tutorial seems to be written for another version of Nutch, because, first of all, the DmozParser is not available (I couldn't find it in the nutch-0.7.1.jar file, not under 'crawl', 'tools' or anywhere else):

java.lang.NoClassDefFoundError: org/apache/nutch/crawl/DmozParser
java.lang.NoClassDefFoundError: org/apache/nutch/tools/DmozParser

Since I'm not really interested in Dmoz data, I continued with injecting URLs of my own (in the dmoz dir, in a file called 'urls', with a url on each line) into the database. Unfortunately, I got stuck again. I tried to execute:

bin/nutch inject crawl/crawldb dmoz

The error is:

060225 212634 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-default.xml
060225 212635 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-site.xml
Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]

So I tried to adjust the parameters, with something like:

bin/nutch inject crawl/crawldb -urlfile dmoz/urls

But this leads to an exception:

Exception in thread "main" java.io.FileNotFoundException: crawl\crawldb\webdb\pagesByURL\data

There are some files in the crawldb dir, but not the webdb dir. Is there a possibility to create an empty or default database? Or do I need Nutch 0.8? If yes, where can I download it? Hopefully this can be done with Nutch 0.7.1, because I'm not a hero at compiling stuff on Cygwin. The only thing I want is to inject URLs from a plain text file, with one URL per row, and then crawl those URLs. The URLs are all different, so I am not interested in the intranet option of Nutch. Hopefully someone can help me out with this problem. Roeland
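Editor's note: the 0.7.x command set differs from the 0.8-era tutorial the poster was reading: there is no crawl/crawldb in 0.7; you create a WebDB first and inject into it. As far as I recall, the 0.7 flow is (the db path is illustrative):

bin/nutch admin db -create
bin/nutch inject db -urlfile dmoz/urls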
Re: Pdf document title in nutch search
Take a look at the Google search result for this RAND publication: http://www.google.com/search?hs=z0nhl=enlr=client=firefox-arls=org.mozilla%3Aen-US%3Aofficialq=Implementing+Security+Improvement+Options+at+Los+Angeles+International+Airport+btnG=Search The pdf document (RAND_DB468-1.sum.pdf) has no pdf title, and Google doesn't use the first 2 pages of the document for a title! Jérôme Charron wrote: It'd be nice if this was changed so that if a PDF has no title then the first xx words become the new title. I agree with that. Please create a JIRA issue for this point. (but it seems that the Google title process is more advanced than this) Really? Take a look at this: http://www.google.com/search?num=100hl=frsafe=offc2coff=1as_qdr=allq=http%3A%2F%2Fwww.trellix.com%2Fproducts%2Fdownloads%2Fsearchengines_siteopt.pdfbtnG=Rechercherlr= In fact Google always takes the first characters of the document as the title; Google never uses the Title property of the document. So, when there are some shaded characters at the start of the pdf document, you get a TTTiiitttllleee llliiikkkeee ttthhhaaattt ... is that really advanced title processing? Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Pdf document title in nutch search
When searching with nutch, the title of a pdf document is a url to the file, like http://www.ists.dartmouth.edu/library/wse0901.pdf. I have noticed that Google and Ultraseek create a normal title, like "WebALPS: A Survey of E-Commerce Privacy and Security Applications". Is it possible to make nutch do the same?
Re: Pdf document title in nutch search
Must I have index-more enabled to get the pdf titles to work? I did a test with some pdf files and all pdf titles were ignored (nutch 0.7.1). Håvard W. Kongsgård wrote: It'd be nice if this was changed so that if a PDF has no title then the first xx words become the new title. (but it seems that the Google title process is more advanced than this) Jérôme Charron wrote: When searching with nutch the title of pdf documents is a url to the file like: http://www.ists.dartmouth.edu/library/wse0901.pdf In Nutch, the title of a PDF file is displayed if a title is available; otherwise the URL of the document is displayed. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Problem/bug setting java_home in hadoop nightly 16.02.06
Thanks, it worked. Is there any other path I need to set?

# The java implementation to use.
export JAVA_HOME=/usr/lib/java

Doug Cutting wrote: Have you edited conf/hadoop-env.sh, and defined JAVA_HOME there? Doug Håvard W. Kongsgård wrote: I am unable to set java_home in bin/hadoop, is there a bug? I have used nutch 0.7.1 with the same java path.

localhost: Error: JAVA_HOME is not set.

if [ -f "$HADOOP_HOME/conf/hadoop-env.sh" ]; then
  source ${HADOOP_HOME}/conf/hadoop-env.sh
fi

# some Java parameters
if [ "$JAVA_HOME" != "/usr/lib/java" ]; then
  #echo "run java in $JAVA_HOME"
  JAVA_HOME=$JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

System: SUSE 10 64-bit | Java 1.4.2
Problem/bug setting java_home in hadoop nightly 16.02.06
I am unable to set java_home in bin/hadoop, is there a bug? I have used nutch 0.7.1 with the same java path.

localhost: Error: JAVA_HOME is not set.

if [ -f "$HADOOP_HOME/conf/hadoop-env.sh" ]; then
  source ${HADOOP_HOME}/conf/hadoop-env.sh
fi

# some Java parameters
if [ "$JAVA_HOME" != "/usr/lib/java" ]; then
  #echo "run java in $JAVA_HOME"
  JAVA_HOME=$JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

System: SUSE 10 64-bit | Java 1.4.2
Re: Nutch inject problem with hadoop - Missing /tmp/hadoop/mapred/system
I get the same error (15.02 nightly build). Gal Nitzan wrote: I am getting this error all the time; can't start inject.

060215 183808 parsing file:/home/nutchuser/nutch/conf/hadoop-site.xml
Exception in thread "main" java.io.IOException: Cannot open filename /tmp/hadoop/mapred/system/submit_p4w14i/job.jar
at org.apache.hadoop.ipc.Client.call(Client.java:301)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:261)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:290)
at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
at org.apache.nutch.crawl.Injector.main(Injector.java:138)

I noticed the system folder doesn't exist, and created it manually. Now everything works, but I get strange behavior, like all task trackers fetching from the same site? Any idea?
Hung threads
Hi, I have a problem with last Friday's nightly build. When I try to fetch my segment, the fetch process freezes: "Aborting with 10 hung threads". After failing, Nutch tries to run the same urls on another tasktracker but fails again. I have tried turning fetcher.parse off, protocol-httpclient, protocol-http. nutch-site.xml:

<property>
  <name>fs.default.name</name>
  <value>linux3:5</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>linux3:50020</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content.</description>
</property>
Re: The parsing is part of the Map or part of the Reduce?
So you have been following the quick tutorial for nutch 0.8 and later at media-style… The author has left out the parse and updatedb parts. After the fetch, simply run bin/nutch parse segments/2006xxx and then bin/nutch updatedb crawldb segments/2006xxx. Rafit Izhak_Ratzin wrote: Hi, In which part of mapred is the parsing done, the Map part or the Reduce part? Thanks, Rafit
Re: The parsing is part of the Map or part of the Reduce?
"otherwise how does it get the next level of URLs?" With bin/nutch updatedb crawldb segments/2006xxx. Rafit Izhak_Ratzin wrote: I thought that by running the fetch command (bin/nutch fetch ...) it already does some kind of parsing; otherwise how does it get the next level of URLs? And in this case, in what part is the parsing done, in the mapping or in the reducing of the fetch process? Thanks again, Rafit From: Håvard W. Kongsgård [EMAIL PROTECTED] Reply-To: nutch-user@lucene.apache.org To: nutch-user@lucene.apache.org Subject: Re: The parsing is part of the Map or part of the Reduce? Date: Sat, 28 Jan 2006 23:05:05 +0100 So you have been following the quick tutorial for nutch 0.8 and later at media-style… The author has left out the parse and updatedb parts. After the fetch, simply run bin/nutch parse segments/2006xxx and then bin/nutch updatedb crawldb segments/2006xxx. Rafit Izhak_Ratzin wrote: Hi, In which part of mapred is the parsing done, the Map part or the Reduce part? Thanks, Rafit
Re: Parsing PDF Nutch Achilles heel?
Could you create a new version from the latest xpdf release? I know that the older versions of pdftotext (before October 2005) had some issues with PDF 1.6 (Acrobat 7). Sorry, my mistake! I have now tested pdftotext and it's faster than PDFBox, but it doesn't prevent the nutch freezes. Håvard W. Kongsgård wrote: Could you create a new version from the latest xpdf release? I know that the older versions of pdftotext (before October 2005) had some issues with PDF 1.6 (Acrobat 7). Doug Cutting wrote: Steve Betts wrote: I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster, but it does allow it to complete. I find xpdf much faster than PDFBox. http://www.mail-archive.com/nutch-dev@incubator.apache.org/msg00161.html Does this work any better for you? Doug
Parsing PDF Nutch Achilles heel?
I have been doing some testing on different nutch configurations to see what slows down the fetching process on my servers (nutch 0.7.1). My general experience is that the PDF parse process is nutch's Achilles heel. Nutch works fine on older computers, but with the combination of parse-(text|html|pdf) and http.content.limit = -1 (needed to get PDF parsing to work), nutch sometimes freezes completely. Is any improvement to the parsing of PDF files planned for the next version of nutch (0.8)?
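Editor's note: for reference, the combination the poster describes corresponds roughly to this in nutch-site.xml (a sketch of the poster's setup; the rest of the plugin list is assumed from the 0.7 defaults, not quoted from the thread):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value> <!-- no truncation, so complete PDFs reach the parser -->
</property>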
Re: Parsing PDF Nutch Achilles heel?
PDFBox-0.7.2, or one of the nightly builds, PDFBox-0.7.3-dev... Steve Betts wrote: I should have included the link, but I used PDFBox. Thanks, Steve Betts [EMAIL PROTECTED] 937-477-1797 -----Original Message----- From: Håvard W. Kongsgård [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 25, 2006 10:34 AM To: nutch-user@lucene.apache.org Subject: Re: Parsing PDF Nutch Achilles heel? Where do I get the new version, http://www.pdfbox.org/ or http://svn.apache.org/viewcvs.cgi/lucene/nutch/? Steve Betts wrote: There is a bug in the PDF parser tool used with 0.7. You can get a newer version to replace the jars in the parse-pdf plugin, and the freeze will go away. Thanks, Steve Betts [EMAIL PROTECTED] 937-477-1797 -----Original Message----- From: Håvard W. Kongsgård [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 25, 2006 10:10 AM To: nutch-user@lucene.apache.org Subject: Parsing PDF Nutch Achilles heel? [the original question, quoted above]
Re: Injecting new url
If your old urls have not expired (30 days by default), then bin/nutch generate will select only the new urls. Ennio Tosi wrote: Hi, I created an index from an injected url. My problem is that if I now inject another url into the webdb, the fetcher reprocesses the starting url too... Is there a way to tell nutch to only process the latest injected resource? Thanks, Ennio
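Editor's note: in 0.7 terms the loop would be roughly this (syntax from memory; paths are illustrative, and "2006xxxxxxxxxx" stands in for whatever timestamped segment the generator creates):

bin/nutch inject db -urlfile new-urls.txt
bin/nutch generate db segments      # selects only pages due for fetch
bin/nutch fetch segments/2006xxxxxxxxxx
bin/nutch updatedb db segments/2006xxxxxxxxxx

Pages fetched less than db.default.fetch.interval days ago are not yet due, so generate skips them.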
Nutch system running on multiple servers | fetcher
Hi, I have set up a nutch (0.7.1) system running on multiple servers following Stefan Groschupf's tutorial (http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever). I already had a nutch index and a set of segments, so I copied some segments to different servers. Now I want to add some new sites to my engine. This time, however, I don't want to use the main box as a fetcher, but a faster one with no local web db. Can I simply create a new local web db on the new box and then store the locally generated segment in localsegements/segments/?
Re: Access password protected sites?
No, the current version of nutch doesn't support password-protected sites; sites that are password protected show up as http error 404 in the nutch log. Andy Morris wrote: Can nutch access password protected sites? If so, how? Thanks, Andy
Search result is an empty site
Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something, the browser displays an empty site. Is this a memory problem? How do I fix it? System: 2,6 | Memory 1 GB | SUSE 9.2
Re: Search result is an empty site
No, I use 0.7.1. I have tested nutch/tomcat with 20 000 docs, so I know it works. Searching using site, like "china site:www.fas.org", also works. Dominik Friedrich wrote: If you use the mapred version from svn trunk, you might have run into the same problem as I have. In the mapred version the searcher.dir property in nutch-default.xml is set to "crawl" and not "." anymore. If you use this version, you have either to put the index and the segments dirs into a folder called "crawl" and start tomcat from above that folder, or change that value in the nutch-site.xml in webapps/ROOT/WEB-INF/classes of your tomcat nutch deployment. regards Dominik Håvard W. Kongsgård wrote: Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something, the browser displays an empty site. Is this a memory problem? How do I fix it? System: 2,6 | Memory 1 GB | SUSE 9.2
fetcher.threads.per.host bug in 0.7.1?
Is there a bug in 0.7.1 that causes the fetcher.threads.per.host setting to be ignored? Nutch-site.xml:

<property>
  <name>fetcher.server.delay</name>
  <value>15.0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>3</value>
  <description>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>This number is the maximum number of threads that should be allowed to access a host at one time.</description>
</property>

Fetch log:

060109 202235 fetching http://www.fas.org/irp/news/1998/06/prs_rel21.html
060109 202250 fetch of http://www.fas.org/irp/news/1998/04/t04141998_t0414asd-3.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202250 fetch of http://www.fas.org/asmp/campaigns/smallarms/sawgconf.PDF failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202250 fetching http://www.fas.org/irp/commission/testhaas.htm
060109 202250 fetching http://www.fas.org/asmp/profiles/bahrain.htm
060109 202250 fetching http://www.fas.org/irp/cia/product/dci_speech_03082001.html
060109 202306 fetching http://www.fas.org/irp/news/1998/06/980609-drug10.htm
060109 202321 fetch of http://www.fas.org/irp/commission/testhaas.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202321 fetch of http://www.fas.org/asmp/profiles/bahrain.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202321 fetching http://www.fas.org/irp/news/1998/04/980422-terror2.htm
060109 202321 fetching http://www.fas.org/irp//congress/2004_cr/index.html
060109 202321 fetching http://www.fas.org/irp//congress/2001_rpt/index.html
060109 202338 fetching http://www.fas.org/irp/budget/fy98_navy/0601152n.htm
060109 202354 fetching http://www.fas.org/irp/dia/product/cent21strat.htm
060109 202408 fetch of http://www.fas.org/irp/news/1998/04/980422-terror2.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202408 fetch of http://www.fas.org/irp//congress/2004_cr/index.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202408 fetching http://www.fas.org/faspir/2001/v54n2/qna.htm
060109 202408 fetching http://www.fas.org/graphics/predator/index.htm
060109 202409 fetching http://www.fas.org/irp/doddir/dod/5200-1r/chapter_6.htm
060109 202425 fetching http://www.fas.org/irp//congress/1995_hr/140.htm
Re: Search result is an empty site
Never mind, solved it. For tomcat 5, run: export JAVA_OPTS="-Xmx128m -Xms128m" Håvard W. Kongsgård wrote: [the 0.7.1 / 20 000 docs reply and Dominik Friedrich's searcher.dir explanation, quoted above]
No cluster results
"No cluster results" is displayed next to the search results. Is this because I turned clustering on after running the fetch and the indexing? nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|clustering-carrot2</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>
Re: Out of memory exception while updating
<property>
  <name>indexer.max.tokens</name>
  <value>10000</value>
  <description>The maximum number of tokens that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index tokens that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError.</description>
</property>

http://wiki.media-style.com/display/nutchDocu/Hardware K.A.Hussain Ali wrote: Hi all, I am using Nutch to crawl some sites, but I get an Out Of Memory error when I try updating the webdb with a good amount of URLs. I tried to find a solution on the mailing list but found nothing. Could anyone put forward a suggestion? How much RAM does Nutch require for proper updating and indexing with a large number of URLs? Any help would be greatly appreciated. regards -Hussain.
Re: is nutch recrawl possible?
About this blocking: you can try to use the urlfilters, changing the filter between each generate/fetch:

+^http://www.abc.com
-^http://www.bbc.co.uk

Pushpesh Kr. Rajwanshi wrote: Oh, this is pretty good and quite helpful material, just what I wanted. Thanks, Håvard. It seems this will help me write the code for the stuff I need :-) Thanks and Regards, Pushpesh On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote: Try using the whole-web fetching method instead of the crawl method. http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling http://wiki.media-style.com/display/nutchDocu/quick+tutorial Pushpesh Kr. Rajwanshi wrote: Hi Stefan, Thanks for the lightning-fast reply; I was amazed to see such a quick response and really appreciate it. What I am really looking for is this: suppose I run a crawl for some sites, say 5, and for some depth, say 2. Then the next time I run a crawl, it should reuse the webdb contents it populated the first time. (Assuming a successful crawl. Yes, you are right, a suddenly broken crawl won't work, as it has lost its data integrity.) As you said, we can run the tools provided by nutch to do the step-by-step commands needed to crawl, but isn't there some way I can reuse the existing crawl data? Maybe it involves changing code, but that's OK. Just one more quick question: why does every crawl need a new directory, and isn't there an option to at least reuse the webdb? Maybe I am asking something silly, but I am clueless :-( Or, as you said, maybe what I can do is explore the steps you mentioned and get what I need. Thanks again, Pushpesh On 12/19/05, Stefan Groschupf [EMAIL PROTECTED] wrote: It is difficult to answer your question since the vocabulary used may be wrong. You can refetch pages, no problem, but you cannot continue a crashed fetch process. Nutch provides tools that run a set of steps: segment generation, fetching, db updating, etc. So first try to run these steps manually instead of using the crawl command; then you may already get an idea of where you can jump in to grab the data you need. Stefan On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote: Hi, I am crawling some sites using nutch. My requirement is that when I run a nutch crawl, it should somehow be able to reuse the data in the webdb populated by a previous crawl. In other words, my question is: suppose my crawl is running and I cancel it somewhere in the middle, is there some way I can resume the crawl? I don't know if I can do this at all, but if there is some way, please throw some light on this. TIA Regards, Pushpesh
Re: Problem with fetching segment
Sorry, I misunderstood the way whole-web crawling works. One more question: how do I re-fetch the failed urls (failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.)? Is this controlled by:

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.</description>
</property>

Stefan Groschupf wrote: Sorry, I still do not understand what your problem is; maybe it is time for the weekend... :-) From your very first mail, there is exactly the same in the log:

060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null

Isn't that the same as:

060109 154712 fetching http://www.niap.no/magasinet/layout/set/print

In any case, those are just logging statements; what makes you guess that something crashed? Stefan On 09.12.2005 at 17:44, Håvard W. Kongsgård wrote: But when I fetch the other domains www.sf.net http://www.sf.net/, the output is only:

060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page

There is no output like:

060109 154712 fetching http://www.niap.no/magasinet/layout/set/print
060109 154712 fetching http://www.niap.no/magasinet/kontakt_oss
060109 154712 fetching http://www.niap.no/magasinet/ezinfo/about
060109 154712 fetching http://www.niap.no/index.php/magasinet/nyheter/midt_sten

Stefan Groschupf wrote: What is java.net.SocketTimeoutException? It cannot connect to the server. In general, you hammer your webserver and it may block the IP of your server. You can set how many threads per host load from one host server. For an intranet crawl it is a good idea to have fewer threads (maybe just as many as you plan to use at the same time for the host), e.g.

fetcherThreads = 2
maxThreadsPerHost = 2

If you have more threads, you should increase the retry/delay configuration, since when a host is busy with the maximal threads per host, the thread is delayed. If a thread is delayed too often, then you get an "Exceeded http.max.delays: retry later". Sometimes I ask myself whether queue-based fetching would not be better than the actual implementation; however, this is difficult to change. HTH Stefan
Re: Problem with fetching segment
When I feed my domain into the database the segment fetch output was like this: -.-.-.-.-.-.-.-.-.-.-.-.- 060109 154622 fetching http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte 060109 154622 fetching http://www.niap.no/magasinet/nyheter/afrika 060109 154622 fetching http://www.niap.no/magasinet/nyheter/asia_australia 060109 154622 fetching http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya 060109 154622 fetching http://www.niap.no/magasinet/rss/feed/magasinet_rss1 060109 154622 fetching http://www.niap.no/magasinet/content/search 060109 154622 fetching http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap 060109 154622 fetching http://www.niap.no/magasinet/nyheter/europa/russland/stalin_vender_tilbake 060109 154622 fetching http://www.niap.no/magasinet/nyheter/nord_amerika 060109 154626 fetch okay, but can't parse http://www.niap.no/magasinet/rss/feed/magasinet_rss1, reason: failed(2,203): Content-Type not text/html: text/xml 060109 154626 fetching http://www.niap.no/magasinet/nyheter/midtoesten/irak/al_queida 060109 154633 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer 060109 154633 fetching http://www.niap.no/magasinet/niap/test 060109 154639 fetching http://www.niap.no/magasinet/nyheter/europa/italia/pave_benedict_xvi 060109 154642 fetch of http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 060109 154642 fetch of http://www.niap.no/magasinet/nyheter/asia_australia failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 060109 154642 fetch of http://www.niap.no/magasinet/nyheter/afrika failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 060109 154642 fetching http://www.niap.no/magasinet/nyheter/soer_amerika 060109 154642 fetch of http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 060109 154642 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/palestina_israel/israel_bekymret_for_landets_internasjonale_image failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 060109 154642 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 060109 154642 fetch of http://www.niap.no/magasinet/content/search failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 060109 154642 fetching http://www.niap.no/index.php/magasinet/nyheter/s_r_amerika -.-.-.-.-.-.- But then -.-.-.-.-.- 060109 154714 fetch of http://phpadsnew.niap.no/adx.js failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. 
060109 154714 fetching http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria
060109 154722 fetch of http://www.niap.org/ failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
060109 154724 fetch of http://www.niap.no/index.php/magasinet/nyheter/nord_amerika failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/kontakt_oss failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/magasinet/om_magasinet failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/layout/set/print failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154729 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154730 status: segment 20060109154516, 12 pages, 31 errors, 181559 bytes, 68511 ms
060109 154730 status: 0.17515436 pages/s, 20.703678 kb/s, 15129.917 bytes/page
-.-.-.-.-.-
What is java.net.SocketTimeoutException?
Håvard W. Kongsgård wrote: Is the fetcher not supposed to fetch all the docs?
Problem with fetching segment
I have followed the media-style.com quick tutorial, but when I try to fetch my segment the fetch is killed! I have tried setting the system clock forward 30 days; no anti-virus is running on the systems. Systems: SUSE 9.2 and SUSE 10.
# bin/nutch fetch segments/20060109014654/
060109 014714 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-default.xml
060109 014715 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-site.xml
060109 014715 No FS indicated, using default:local
060109 014715 Plugins: looking in: /home/hkongsgaard/nutch-0.7.1/plugins
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/query-more
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-site/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-html/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-text/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-ext
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-pdf
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-rss
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/index-more
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-js
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060109 014715 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-ftp
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-msword
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/creativecommons
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/ontology
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-file
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-http/plugin.xml
060109 014715 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/clustering-carrot2
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/language-identifier
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-prefix
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-url/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/index-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-httpclient
060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null
060109 014715 http.proxy.port = 8080
060109 014715 http.timeout = 1
060109 014715 http.content.limit = -1
060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page
Re: Crawl auto updated in nutch?
So how do I update a crawl? The updating section of the FAQ is empty :-( http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
Doug Cutting wrote: Håvard W. Kongsgård wrote:
- I want to index about 50 – 100 sites with lots of documents; is it best to use the Intranet Crawling or the Whole-web Crawling method?
The intranet style is simpler and hence a good place to start. If it doesn't work well for you then you might try the whole-web style.
- Is the crawl auto-updated in nutch, or must I run a cron task?
It is not auto-updated. Doug
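Since nothing re-runs automatically, the usual answer is a cron job that repeats the whole-web cycle. A minimal sketch, assuming the 0.7-style layout from the tutorial (a db/ web database and a segments/ directory) and the standard generate/fetch/updatedb tools; the final index step is the 0.7 segment indexer and is assumed here:

# one re-crawl cycle, run from the nutch install directory (e.g. via cron)
bin/nutch generate db segments        # select pages due for (re-)fetching into a new segment
s=`ls -d segments/2* | tail -1`       # pick the segment just generated
bin/nutch fetch $s                    # fetch it
bin/nutch updatedb db $s              # merge fetch results back into the web db
bin/nutch index $s                    # index the new segment (assumed 0.7 tool)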
Crawl auto updated in nutch?
Hello, I still have some questions about nutch:
- I want to index about 50 – 100 sites with lots of documents; is it best to use the Intranet Crawling or the Whole-web Crawling method?
- Is the crawl auto-updated in nutch, or must I run a cron task?
Intranet crawl folder
Hi, I am still testing nutch 0.7.1, but now I have another problem. When I do a normal intranet crawl on some web folders with 2000 PDFs, nutch only fetches 47 PDFs from each folder.
Re: Intranet crawl folder
Do you mean http.content.limit? I have set it to -1 already. There are no "Content truncated at 65536 bytes. Parser can't handle incomplete" errors in the log. Stefan Groschupf wrote: Check the maximal content limit in nutch-default.xml. On 22.11.2005 at 16:38, Håvard W. Kongsgård wrote: Hi, I am still testing nutch 0.7.1, but now I have another problem. When I do a normal intranet crawl on some web folders with 2000 PDFs, nutch only fetches 47 PDFs from each folder.
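For reference, the override Stefan refers to goes in conf/nutch-site.xml; a minimal sketch using http.content.limit exactly as its description (quoted later in this digest) defines it, where any negative value disables truncation:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Negative value: never truncate downloaded content.</description>
</property>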
Re: Images
If you want an out-of-the-box solution with another search engine, try this link: http://www.searchtools.com/info/multimedia-search.html But I don't know if any of them is open source :-(
Aled Jones wrote: Hi, it's not very clear from the nutch site what nutch can do with images. Currently you can set the crawler to not ignore images, but it will only parse text data. Can it do an image search like Google? Kind regards, Aled
Re: PDF indexing support?
Thanks, it worked. Jérôme Charron wrote: The value you specified is bigger than the maximum int value, so it throws an exception, and then the default value is used. As mentioned in the property's description, use a negative value (-1) for no truncation at all (or a value less than java.lang.Integer.MAX_VALUE). Regards, Jérôme
On 11/16/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote: I have now added conf/nutch-site.xml but still the same problem. Related to the problem? http://sourceforge.net/forum/message.php?msg_id=3391668 http://sourceforge.net/forum/message.php?msg_id=3398773
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<nutch-conf>
<property>
  <name>http.content.limit</name>
  <value>45451515565536</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
</nutch-conf>
Håvard W. Kongsgård wrote: HTTP
Sébastien LE CALLONNEC wrote: Hej Håvard, that's because you have to create one yourself. The values you set in there will override the default values. Here are a few more questions to try to solve your problem: where is your PDF located? What protocol is used to fetch it (HTTP, FTP, etc.)? Regards, /sebastien
--- Håvard W. Kongsgård [EMAIL PROTECTED] wrote: Don't have a conf/nutch-site.xml
Jérôme Charron wrote: conf/nutch-default; check that they are not overridden in conf/nutch-site. If not, sorry, no more ideas for now :-( Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
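Jérôme's explanation can be seen directly in Java: 45451515565536 exceeds Integer.MAX_VALUE (2147483647), so it cannot be parsed as an int, and a config reader that falls back on NumberFormatException quietly keeps the default. A minimal sketch with a hypothetical getInt helper (a stand-in, not Nutch's actual code):

import java.util.Properties;

public class ContentLimitDemo {
    // Hypothetical stand-in for how a config reader gets an int property
    // with a default value.
    static int getInt(Properties conf, String name, int defaultValue) {
        String raw = conf.getProperty(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // 45451515565536 exceeds Integer.MAX_VALUE, so parsing fails
            // and the default (65536) silently wins.
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("http.content.limit", "45451515565536");
        System.out.println(getInt(conf, "http.content.limit", 65536)); // 65536: default kept
        conf.setProperty("http.content.limit", "-1");
        System.out.println(getInt(conf, "http.content.limit", 65536)); // -1: truncation disabled
    }
}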
Re: PDF indexing support?
conf/nutch-default Jérôme Charron wrote: http.content.limit=542256565536 and file.content.limit=4541165536, still the same error: where do you specify these values? In nutch-default or nutch-site? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: PDF indexing support?
Don't have a conf/nutch-site.xml. Jérôme Charron wrote: conf/nutch-default; check that they are not overridden in conf/nutch-site. If not, sorry, no more ideas for now :-( Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
PDF indexing support?
Hello, I am new to nutch; how do I enable PDF indexing support?
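PDF parsing lives in the parse-pdf plugin, which the fetch log earlier in this digest shows as "not including" by default. The usual fix is to override plugin.includes in conf/nutch-site.xml; a minimal sketch, assuming the stock 0.7 plugin list with parse-pdf added (check nutch-default.xml for your exact default value):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Stock plugin list with parse-pdf added so fetched PDFs are parsed and indexed.</description>
</property>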