Re: Search problem in Nutch on Eclipse (Win XP)
I've installed Nutch 1.0 in Eclipse (Windows XP). I performed crawling (on the local filesystem, mostly HTML files in a directory) and it worked fine, but when I run the search program with a query it always gives the result "Total hits: 0", no matter what the query is. Can anyone guess, or does anyone know, what the problem could be?
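One common cause of "Total hits: 0" in Nutch 1.0 is the searcher not finding the index at all. As a sketch (the path below is an example, not from the original message), the searcher.dir property in nutch-site.xml must point at the directory produced by the crawl, i.e. the one containing the index and segments:

```xml
<!-- nutch-site.xml: sketch only; replace the value with your actual crawl output directory -->
<property>
  <name>searcher.dir</name>
  <value>C:/nutch/crawl</value>
  <description>Path to the crawl directory containing the index the searcher should use.</description>
</property>
```

If searcher.dir points somewhere without an index, Nutch silently returns zero hits rather than raising an error, which matches the symptom described here.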
Re: nutch crawl issue
Hi Matthew,

There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify *parse-(html|pdf)* in order to get both HTML and PDF files. Could you please open an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH) and attach the file you are trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 29 April 2010 17:02, matthew a. grisius wrote:
> In nutch-site.xml I modified plugin.includes:
> parse-(html) works; parse-(tika) does not.
>
> I also need to parse PDFs, so I need both features. I tried parse-(html|tika) to see whether html would be selected before tika, but that did not work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> > Using Nutch nightly build nutch-2010-04-27_04-00-28:
> >
> > I am trying to bin/nutch crawl a single HTML file generated by javadoc, and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed.
> >
> > I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.
> >
> > I then downloaded maven-2.2.1 so that I could mvn install Tika and produce tika-app-0.7.jar to extract information about the javadoc HTML file from the command line. I am not familiar with Tika, but the command-line version doesn't return any metadata, e.g. no 'src=' links from the HTML 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how Nutch uses Tika, so maybe it's not related . . .
> >
> > Has anyone crawled javadoc files, or does anyone have any suggestions? Thanks.
> >
> > -m.
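For reference, a plugin.includes value along the lines Julien suggests would go in nutch-site.xml. This is only a sketch: the plugins surrounding the parse-(html|pdf) entry are typical defaults, not taken from Matthew's configuration, and should be adapted to the actual setup:

```xml
<!-- nutch-site.xml: sketch; only the parse-(html|pdf) part comes from the thread -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regex matching the plugin IDs to activate; parse-(html|pdf) enables both the HTML and PDF parsers.</description>
</property>
```

The value is a single regular expression matched against plugin IDs, which is why alternation such as parse-(html|pdf) selects several parsers at once.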
Parsing .ppt, .xls, .rtf and .doc
Hello everyone,

I'm using Nutch v0.9. I'm able to crawl, fetch and parse HTML and .pdf files. When I try .ppt, .xls, .rtf and .doc files the crawl reports no problem, but when I use SegmentReader to get the information for each URL, I don't find any parse text for these formats. I configured the plugins and enabled them. This is the result I get when I try with an .xls file: http://n3.nabble.com/forum/FileDownload.jtp?type=n&id=765912&name=untitled2.bmp

Any suggestion about what I'm doing wrong? How can I check whether the plugins are actually parsing?

Thank you in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-ppt-xls-rtf-and-doc-tp765912p765912.html
Sent from the Nutch - User mailing list archive at Nabble.com.
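In Nutch 0.9 each Office format has its own parser plugin, and all of them must be listed in plugin.includes for parse text to appear in the segments. A sketch (plugin names are the ones shipped with 0.9; the rest of the value is a typical default, not taken from the poster's configuration):

```xml
<!-- nutch-site.xml: sketch for Nutch 0.9; adjust the non-parse plugins to your setup -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword|mspowerpoint|msexcel|rtf)|index-basic|query-(basic|site|url)</value>
</property>
```

To check whether parsing actually produced text, dumping a segment with something like `bin/nutch readseg -dump <segment-dir> <output-dir>` and looking for ParseText entries is one way to verify (the exact readseg options may differ slightly between versions).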
Re: why does nutch interpret directory as URL
I'm also facing the same problem. I thought of developing a plugin that returns null when such a URL is encountered, so that the URL won't be indexed. But I was wondering what criteria I should use to decide which URLs to discard. I hope my approach is correct.

On Thu, Apr 29, 2010 at 9:59 AM, xiao yang wrote:
> Because it's a URL indeed.
> You can either filter this kind of URL by configuring crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the regular expression) or filter the search result (you would need to develop a Nutch plugin).
> Thanks!
>
> Xiao
>
> On Thu, Apr 29, 2010 at 4:33 AM, BK wrote:
>> While indexing files on the local filesystem, why does Nutch interpret the directory as a URL, fetching file:/C:/temp/html/? This causes the index page of this directory to show up in search results. Any solutions for this issue?
>>
>> Bharteesh Kulkarni

--
Regards,
Arpit Khurdiya
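As a sketch of the filter approach Xiao describes (untested, and assuming the default regex-urlfilter syntax, where `-` rejects and `+` accepts the first matching rule), crawl-urlfilter.txt could exclude directory URLs like this:

```
# crawl-urlfilter.txt sketch: reject URLs ending in a slash (directory listings)
-^.*/$

# accept everything else
+.
```

Rule order matters: the reject rule must come before the catch-all accept, since the first matching pattern decides the URL's fate. Note this also rejects any page whose URL legitimately ends in a slash, so the pattern may need narrowing (e.g. anchoring on file:/) depending on the crawl.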
Re: nutch crawl issue
If you are using the nightly build, did you also change the corresponding mapping in parse-plugins.xml? Uncomment the relevant entry there. Hopefully this helps.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius wrote:
> In nutch-site.xml I modified plugin.includes:
> parse-(html) works; parse-(tika) does not.
>
> I also need to parse PDFs, so I need both features. I tried parse-(html|tika) to see whether html would be selected before tika, but that did not work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
>> Using Nutch nightly build nutch-2010-04-27_04-00-28:
>>
>> I am trying to bin/nutch crawl a single HTML file generated by javadoc, and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed.
>>
>> I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.
>>
>> I then downloaded maven-2.2.1 so that I could mvn install Tika and produce tika-app-0.7.jar to extract information about the javadoc HTML file from the command line. I am not familiar with Tika, but the command-line version doesn't return any metadata, e.g. no 'src=' links from the HTML 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how Nutch uses Tika, so maybe it's not related . . .
>>
>> Has anyone crawled javadoc files, or does anyone have any suggestions? Thanks.
>>
>> -m.

--
Regards,
Arpit Khurdiya
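The XML snippet Arpit told Matthew to uncomment did not survive the mailing-list formatting. A parse-plugins.xml mapping of the kind he likely means would look something like the following (a sketch: parse-plugins.xml maps MIME types to parser plugin IDs, and the plugin IDs used must also appear in plugin.includes):

```xml
<!-- parse-plugins.xml sketch: route HTML to parse-html and PDF to parse-tika -->
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
```

With a mapping like this, plugin.includes selects which parsers are available, while parse-plugins.xml decides which parser handles each content type, so the two files have to agree.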
Re: nutch crawl issue
In nutch-site.xml I modified plugin.includes:
parse-(html) works; parse-(tika) does not.

I also need to parse PDFs, so I need both features. I tried parse-(html|tika) to see whether html would be selected before tika, but that did not work.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> Using Nutch nightly build nutch-2010-04-27_04-00-28:
>
> I am trying to bin/nutch crawl a single HTML file generated by javadoc, and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed.
>
> I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.
>
> I then downloaded maven-2.2.1 so that I could mvn install Tika and produce tika-app-0.7.jar to extract information about the javadoc HTML file from the command line. I am not familiar with Tika, but the command-line version doesn't return any metadata, e.g. no 'src=' links from the HTML 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how Nutch uses Tika, so maybe it's not related . . .
>
> Has anyone crawled javadoc files, or does anyone have any suggestions? Thanks.
>
> -m.