Re: Help with parse-mp3?
Sir: you need to enable it in the build.xml under src/plugin/; it will then be built with ant. On 18/01/2008, Rick Francis [EMAIL PROTECTED] wrote: I'm trying to get the parse-mp3 plugin working in my nutch installation. I've found and downloaded jid3lib-0.5.4.jar but I can't find parse-mp3.jar. I see the source for it in the nutch distribution, but not the jar file. I'm a Java newbie so I'm not sure exactly what I need to do to build the jar file from the source. Any help or pointers would be appreciated. Rick -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: problem with mp3 parser
On 12/12/2007, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: It did not help. Also, I checked that the search.dir value does not change in C:\Tomcat\webapps\ROOT\WEB-INF\classes\nutch-default.xml although I changed it in nutch/conf/nutch-default.xml. Should the size of the nutch*.war file change depending on how many sites are fetched? Also, if I put all the nutch commands in a file and execute it, nutch gives errors saying some directory is not found, although the dir is there.
No, the data is stored outside the web archive. Is your machine externally accessible? If so, please email me offlist and I'd love to take a (brief) look and let you know if I see anything. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: problem with mp3 parser
I think you may need the jar file in plugin/mp3/lib.
On 12/11/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi All, I have the following property in nutch/conf/nutch-default.xml: <property> <name>plugin.includes</name> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|mp3)|index-(basic|more)|query-(basic|more|site|url)|summary-basic|scoring-opic</value> ... However, C:\Tomcat\webapps\ROOT\WEB-INF\classes\nutch-default.xml has: <property> <name>plugin.includes</name> <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value> </property> As you can see, mp3 is missing. The mp3 plugin is also missing from Tomcat's plugin dir. Any ideas why this happened? Thanks. Alex.
-Original Message- From: [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Fri, 7 Dec 2007 3:08 pm Subject: problem with mp3 parser Hello, I have built the mp3 parser and put it in C:\nutch\plugins. However, nutch does not find mp3s. I checked the C:\Tomcat\webapps\ROOT\WEB-INF\classes\plugins dir. There is no parser-mp3 folder. Any idea how to fix this? Thanks. Alex.
-- Cheers, Hasan Diwan [EMAIL PROTECTED]
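Editor's sketch: the mismatch Alex describes (one plugin.includes value in nutch/conf, another in the deployed webapp) can be checked mechanically. The Python below is only an illustration and not part of Nutch. It assumes each config file is XML with property elements holding name and value children; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def plugin_includes(xml_text):
    """Return the plugin.includes value from a config document, or None."""
    root = ET.fromstring(xml_text)
    # Assumption: properties are <property><name>..</name><value>..</value></property>
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "plugin.includes":
            return (prop.findtext("value") or "").strip()
    return None


def top_level_alternatives(pattern):
    """Split the value on '|' characters that sit outside parentheses."""
    parts, depth, current = [], 0, []
    for ch in pattern:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return set(parts)


def compare_includes(conf_xml, deployed_xml):
    """Return (only in conf, only in deployed) as two sets of alternatives."""
    conf = top_level_alternatives(plugin_includes(conf_xml) or "")
    deployed = top_level_alternatives(plugin_includes(deployed_xml) or "")
    return conf - deployed, deployed - conf
```

Reading the two nutch-default.xml files as text and passing them to compare_includes would list entries such as parse-(text|html|js|mp3) that exist on one side only.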
Re: Newbie questions about followed links
Sir: On 08/03/07, Jeroen Verhagen [EMAIL PROTECTED] wrote: Surely these links look ordinary enough to be seen and followed by nutch? Could someone please tell me what could be causing these links not to be followed?
conf/urlfilter.txt.template contains the line: [EMAIL PROTECTED] Remove the '?' and the links will be followed. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: How can I setup an mp3 search engine?
On 28/10/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Can the parse-mp3 plugin parse the information in mp3 files, such as author, song name, artist and so on?
The parse-mp3 plugin can obtain any information in the ID3 tags contained in the file. If this information is not part of the file, the plugin (as written) cannot pluck information about the file from thin air. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
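Editor's sketch: to make "information in the ID3 tags" concrete, here is a minimal reader for the older ID3v1 layout, which is a fixed 128-byte block at the end of the file beginning with the bytes TAG. This is not the plugin's code, and the function name is hypothetical. ID3v2 tags, which sit at the start of the file, are not handled.

```python
def read_id3v1(data):
    """Parse an ID3v1 tag from the raw bytes of an MP3 file.

    Returns a dict of text fields, or None when no ID3v1 tag is present.
    """
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None

    def text(field):
        # Fields are fixed width and padded with NUL bytes or spaces.
        return field.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
    }
```

A file without the tag returns None, which matches the point above: if the metadata is not in the file, there is nothing to index.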
Re: crawling a certain site
On 01/08/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Anyway, I think you would need to write some code (be it directly for nutch or for the web in question). If you have perl available, you might want to take advantage of the code at http://prolificprogrammer.com/~hdiwan/getTitle.pl -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: Please Help.. recrawl script.. will send out to the list when finished for 0.8.0
Mr Holt: On 7/20/06, Matthew Holt [EMAIL PROTECTED] wrote: there is a resource online that describes manually recrawling, that'd be great as well. Thanks. http://wiki.apache.org/nutch/NutchTutorial -- you're welcome. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: Eclipse IDE
Mr Holt: On 7/11/06, Matthew Holt [EMAIL PROTECTED] wrote: Can someone who has Nutch development configured for Eclipse please paste their .project and .classpath files? Thanks.
Do the following in your project properties: the source directories should be all the src and test subdirectories under plugin/*, and the libraries should contain all the jar files. If you want to keep things simple, just use the build file from Eclipse. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: PluginRuntimeException
, but the slowest it is. </description> </property> </nutch-conf>
I believe they are mutually exclusive. Just use one or the other. I think someone posted here recently saying they had problems with httpclient. I'm using protocol-http myself.
So noted, the change has been made. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: NullPointerException
On 06/03/06, Howie Wang [EMAIL PROTECTED] wrote: Is query-basic or query-more included in your nutch-default.xml?
It is indeed included in my nutch-site.xml: <property> <name>plugin.includes</name> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-more|query-(more|site|url)</value> </property> Thanks for the help! -- Cheers, Hasan Diwan [EMAIL PROTECTED]
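Editor's sketch: one way to see which plugins a plugin.includes value selects is to test plugin directory names against it. The Python below assumes that Nutch treats the value as a regular expression that must match the whole plugin id; that is an assumption about Nutch's behaviour, though it is consistent with the "not including: .../protocol-http" lines in the logs of this thread. The function name is hypothetical.

```python
import re


def included_plugins(plugin_includes, plugin_ids):
    """Return the plugin ids that the plugin.includes expression matches in full."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


if __name__ == "__main__":
    value = ("protocol-httpclient|urlfilter-regex|parse-(text|html|js)"
             "|index-more|query-(more|site|url)")
    candidates = ["nutch-extensionpoints", "protocol-http", "protocol-httpclient",
                  "query-basic", "query-more", "index-basic", "index-more"]
    print(included_plugins(value, candidates))
```

Under that assumption, query-more is selected by the value quoted above while query-basic and nutch-extensionpoints are not.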
Re: NullPointerException
... 060305 182204 Sorting updates by segment... 060305 182204 Updating segments... 060305 182204 updating /home/hdiwan/SpectraSearch/crawl/segments/20060305182200 060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments from /home/hdiwan/SpectraSearch/crawl/db 060305 182204 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182200 060305 182205 * Opening segment 20060305182200 060305 182205 * Indexing segment 20060305182200 060305 182205 * Optimizing index... 060305 182205 * Moving index to NFS if needed... 060305 182205 DONE indexing segment 20060305182200: total 15 records in 0.031 s (Infinity rec/s). 060305 182205 done indexing 060305 182205 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182203 060305 182205 * Opening segment 20060305182203 060305 182205 * Indexing segment 20060305182203 060305 182205 * Optimizing index... 060305 182205 * Moving index to NFS if needed... 060305 182205 DONE indexing segment 20060305182203: total 0 records in 0.075 s (NaN rec/s). 060305 182205 done indexing 060305 182205 Reading url hashes... 060305 182205 Sorting url hashes... 060305 182205 Deleting url duplicates... 060305 182205 Deleted 0 url duplicates. 060305 182205 Reading content hashes... 060305 182205 Sorting content hashes... 060305 182205 Deleting content duplicates... 060305 182205 Deleted 0 content duplicates. 060305 182205 Duplicate deletion complete locally. Now returning to NFS... 060305 182205 DeleteDuplicates complete 060305 182205 Merging segment indexes... 060305 182205 crawl finished: crawl That's the entire log. Hope it helps! My crawl-urlfilter.txt: # The url filter file used by the crawl command. # Better for intranet crawling. # Be sure to change MY.DOMAIN.NAME to your domain name. # Each non-comment, non-blank line contains a regular expression # prefixed by '+' or '-'. The first matching pattern in the file # determines whether a URL is included or ignored. If no pattern # matches, the URL is ignored. 
# skip file:, ftp:, mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# accept hosts in any domain
+^http://([a-z0-9]*\.)*/
# skip everything else
-.
So, why isn't it fetching anything, if that is indeed the case? -- Cheers, Hasan Diwan [EMAIL PROTECTED]
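Editor's sketch: the first-match rule described in the filter file's comments can be simulated to see how the posted rules treat a URL. The Python below is an illustration only. It assumes each pattern is searched for anywhere in the URL (the unanchored suffix rule suggests this), it leaves out the rule that is redacted in this archive, and example.com is a placeholder domain.

```python
import re

# Rules copied from the posted crawl-urlfilter.txt, minus the redacted line.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm"
          r"|tgz|mov|MOV|exe|png|PNG)$"),
    ("+", r"^http://([a-z0-9]*\.)*/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Apply the first matching rule; a URL that matches no rule is ignored."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    print(accepts("http://www.example.com/"))
    # Same rules, but with a domain name kept in the accept pattern.
    with_domain = RULES[:2] + [("+", r"^http://([a-z0-9]*\.)*example\.com/")] + RULES[3:]
    print(accepts("http://www.example.com/", with_domain))
```

In this simulation the accept pattern as posted requires a "." immediately before the first "/", so an ordinary host name falls through to the final "-." rule. Keeping a domain (or a pattern for the top-level part of the host) in that line changes the outcome, which may be worth checking before looking elsewhere.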
Re: NullPointerException
Mr Tang: Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean [your-query-string] in shell/cmd? server: 7:20pm % ./bin/nutch org.apache.nutch.searcher.NutchBean hasan 060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml 060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml 060305 192042 10 opening merged index in /home/hdiwan/SpectraSearch/crawl/index 060305 192042 10 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins 060305 192042 10 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml 060305 192042 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file 060305 192042 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp 060305 192042 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http 060305 192042 10 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml 060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http 060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http 060305 192042 10 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml 060305 192042 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutc che.nutch.searcher.more.TypeQueryFilter 060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter 060305 192043 10 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml 060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter 060305 192043 10 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml 060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter 060305 192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex 060305 
192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix 060305 192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons 060305 192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier 060305 192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2 060305 192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology Total hits: 0 -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: NullPointerException
Mr Tang: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: Weird! Are you running nutch on a local file system or a distributed file system?
Local file system.
And can you find the same query hasan via luke?
Nope. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: NullPointerException
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: I am not sure what's wrong in nutch-0.7.1 indexing, but is it possible for you to upgrade to nutch 0.8 (svn version)?
It is possible, but I was under the assumption that 0.8 required NDFS? -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: NullPointerException
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: You can still build it on local file system:) Build, yes, but what of deployment? Can I use it in the same way? At present, I don't have enough resources to run a distributed crawl. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: NullPointerException
Right then.. compiled the svn version of nutch. Tried running the crawl with it and this is the log: server: 11:32pm % ./bin/nutch crawl ../SpectraSearch/urls -dir ../SpectraSearch/crawl -depth 2 -threads 20 060305 233255 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml 060305 233255 parsing file:/home/hdiwan/nutch/conf/nutch-default.xml 060305 233255 parsing file:/home/hdiwan/nutch/conf/crawl-tool.xml 060305 233255 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060305 233255 parsing file:/home/hdiwan/nutch/conf/nutch-site.xml 060305 233255 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml 060305 233256 crawl started in: ../SpectraSearch/crawl 060305 233256 rootUrlDir = ../SpectraSearch/urls 060305 233256 threads = 20 060305 233256 depth = 2 060305 233256 Injector: starting 060305 233256 Injector: crawlDb: ../SpectraSearch/crawl/crawldb 060305 233256 Injector: urlDir: ../SpectraSearch/urls 060305 233256 Injector: Converting injected urls to crawl db entries. 
060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-default.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/crawl-tool.xml 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-site.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-default.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/crawl-tool.xml 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-site.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml 060305 233256 Running job: job_7n6bsm 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml 060305 233256 parsing jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060305 233256 parsing /tmp/hadoop/mapred/local/localRunner/job_7n6bsm.xml 060305 233256 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml , mapred-default.xml , /tmp/hadoop/mapred/local/localRunner/job_7n6bsm.xmlfinal: hadoop-site.xml at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.java:84) at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:94) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:70) 060305 233257 map 
0% reduce 0% Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310) at org.apache.nutch.crawl.Injector.inject(Injector.java:114) at org.apache.nutch.crawl.Crawl.main(Crawl.java:104) I need to sleep now, so I'll check back tomorrow. Thanks for all the help! -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: nutch-extensionpoints 0.71
Mr. Braman (or anyone else): On 27/02/06, Richard Braman [EMAIL PROTECTED] wrote: bin/nutch fetch segments/latest_segment How would I determine which is the latest segment? I don't really know what your other question was. I know there are duplicate URLs in urls.txt. Why would I be getting the line below? 060227 150626 Deleted 0 content duplicates. Thanks again for the kind assistance. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
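Editor's sketch: the segment directories shown in the logs of this thread are named with a 14-digit timestamp (for example 20060305182200), so the latest one is the name that sorts last. The Python below assumes that naming holds for every segment in the directory; the function name is hypothetical.

```python
import os


def latest_segment(segments_dir):
    """Return the newest timestamp-named subdirectory name, or None if there is none."""
    names = [
        name for name in os.listdir(segments_dir)
        if len(name) == 14 and name.isdigit()
        and os.path.isdir(os.path.join(segments_dir, name))
    ]
    # Fixed-width digit strings sort chronologically when compared as text.
    return max(names) if names else None
```

The returned name can then be joined onto the segments path and passed to the fetch command quoted above.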
Re: Duplicate urls in urls file
Elwin: On 13/02/06, Elwin [EMAIL PROTECTED] wrote: Do you use a fixed set of RSS feeds for the crawl, or do you discover RSS feeds dynamically?
Before I broke the script, it would take the URL, grab the feeds specified in the link tags, then parse them. I suspect this is similar to what the parse-rss plugin does, but I have not had the chance to look at it yet. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Duplicate urls in urls file
I've written a perl script to build up a urls file to crawl from RSS feeds. Will nutch handle duplicate URLs in the crawl file or would that logic need to be in my perl script? -- Cheers, Hasan Diwan [EMAIL PROTECTED]
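Editor's sketch: whether nutch drops duplicate entries on its own is not answered in this message, but removing them on the script side is cheap. The Python below (rather than the Perl of the original script) keeps the first occurrence of each URL and preserves order; the function name is hypothetical.

```python
def dedupe_urls(lines):
    """Return the URLs from an iterable of lines, without blanks or repeats."""
    seen = set()
    unique = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


if __name__ == "__main__":
    sample = ["http://example.com/a\n", "http://example.com/b\n",
              "http://example.com/a\n", "\n"]
    print("\n".join(dedupe_urls(sample)))
```

This compares URLs as exact strings, so two spellings of the same address (a trailing slash, a different case in the host) still count as different entries.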
extension point... does not exist
I placed the URLs for a crawl in urls per the tutorial [1]. Then: % ./bin/nutch crawl urls -dir crawl.test -depth 2 ... gives me the following log: 060213 131631 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml 060213 131631 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml 060213 131631 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml 060213 131631 No FS indicated, using default:local 060213 131631 crawl started in: crawl.test 060213 131631 rootUrlFile = urls 060213 131631 threads = 10 060213 131631 depth = 2 060213 131632 Created webdb at LocalFS,/home/hdiwan/nutch-0.7.1/crawl.test/db 060213 131632 Starting URL processing 060213 131632 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp 060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http/plugin.xml 060213 131632 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient 060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml 060213 131632 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js 060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml 060213 131632 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext 060213 131632 parsing: 
/home/hdiwan/nutch-0.7.1/build/plugins/index-basic/plugin.xml 060213 131632 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-more 060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml 060213 131632 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter 060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml 060213 131632 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter 060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex/plugin.xml 060213 131632 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2 060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology 060213 131632 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.protocol.Protocol does not exist. Exception in thread main java.lang.ExceptionInInitializerError at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437) at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378) at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535) at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134) Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.protocol.Protocol does not exist. 
at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
... 4 more
Caused by: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.protocol.Protocol does not exist.
at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)
... 5 more
...
org/apache/nutch/protocol/Protocol.java does exist, as does org/apache/nutch/protocol/Protocol.class; jar tvf nutch-0.7.1.jar shows the class file. I could do further investigation, but would like some pointers as to where I should be looking first. Thanks! -- Cheers, Hasan Diwan [EMAIL PROTECTED]
1. http://lucene.apache.org/nutch/tutorial.html
Re: PDF indexing support?
On Nov 15, 2005, at 2:46 PM, Håvard W. Kongsgård wrote: I don't have a conf/nutch-site.xml.
Create it and put the overrides in there, per the nutch tutorial. Cheers, Hasan Diwan [EMAIL PROTECTED]