Re: Search problem in Nutch on Eclipse (Win XP)

2010-04-29 Thread Harish Kumar
I've installed Nutch 1.0 on Eclipse (Windows XP). I performed a crawl (on
the local filesystem, mostly HTML files in a directory) and it worked
fine, but when I run the search program with a query, it always returns
"Total hits: 0", no matter what the query is.

Can anyone guess what the problem might be?
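One common cause of this symptom (an assumption, not confirmed from this message) is that the searcher does not know where the crawl index lives. In Nutch 1.0 this is controlled by the searcher.dir property; a minimal sketch for nutch-site.xml, assuming the crawl output is in a directory named crawl relative to the working directory:

```xml
<!-- nutch-site.xml: point the searcher at the crawl directory.
     "crawl" is a placeholder for your actual crawl output dir,
     which must contain the index/ and segments/ subdirectories. -->
<property>
  <name>searcher.dir</name>
  <value>crawl</value>
</property>
```

When running from Eclipse, note that a relative searcher.dir is resolved against the launch configuration's working directory, which may not be what you expect.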


Re: nutch crawl issue

2010-04-29 Thread Julien Nioche
Hi Matthew,

There is an open issue with Tika (e.g.
https://issues.apache.org/jira/browse/TIKA-379) that could explain the
differences between parse-html and parse-tika. Note that you can specify
*parse-(html|pdf)* in order to get both HTML and PDF files.

Could you please open an issue in JIRA
(https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 29 April 2010 17:02, matthew a. grisius  wrote:

> in nutch-site.xml I modified plugin.includes
>
> parse-(html) works
> parse-(tika) does not
>
> I also need to parse PDFs, so I need both features. I tried parse-(html|
> tika) to see if html would be selected before tika, but that did not
> work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> > using Nutch nightly build nutch-2010-04-27_04-00-28:
> >
> > I am trying to bin/nutch crawl a single html file generated by javadoc
> > and no links are followed. I verified this with bin/nutch readdb and
> > bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
> > seed doc specified is processed.
> >
> > I searched and reviewed the nutch-user archive and tried several
> > different settings but none of the settings appear to have any effect.
> >
> > I then downloaded maven-2.2.1 so that I could mvn install tika and
> > produce tika-app-0.7.jar to command line extract information about the
> > html javadoc file. I am not familiar w/ tika but the command line
> > version doesn't return any metadata, e.g. no 'src=' links from the html
> > 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
> > nutch uses tika and maybe it's not related . . .
> >
> > Has anyone crawled javadoc files or have any suggestions? Thanks.
> >
> > -m.
> >
>
>
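Julien's parse-(html|pdf) suggestion goes into the plugin.includes property of nutch-site.xml. A minimal sketch; the surrounding plugin names shown here are typical Nutch 1.x defaults, not taken from this thread, and would need adjusting for a local-filesystem crawl (e.g. protocol-file):

```xml
<!-- nutch-site.xml: the regex value selects which plugins are loaded.
     parse-(html|pdf) enables both the HTML and the PDF parser. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|pdf)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```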


Parsing .ppt, .xls, .rtf and .doc

2010-04-29 Thread nachonieto3

Hello everyone,

I'm using Nutch v0.9. I'm able to crawl, fetch and parse HTML and .pdf,
and crawling .ppt, .xls, .rtf and .doc files reports no errors, but when
I use SegmentReader to get the information for each URL I don't find any
parse text for these formats. I configured the plugins and enabled them.
This is the result that I get when I try with an .xls file:
http://n3.nabble.com/forum/FileDownload.jtp?type=n&id=765912&name=untitled2.bmp 

Any suggestions about what I'm doing wrong? How can I check whether the
plugins are actually parsing?

Thank you in advance
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Parsing-ppt-xls-rtf-and-doc-tp765912p765912.html
Sent from the Nutch - User mailing list archive at Nabble.com.
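One way to check whether parsing happened is to dump the segment and look for the ParseText sections. A sketch, assuming a segment directory at crawl/segments/20100429123456 (the path is a placeholder for one of your actual segments):

```shell
# Dump the segment contents (ParseData, ParseText, etc.) to a directory
bin/nutch readseg -dump crawl/segments/20100429123456 segdump

# Inspect the dump file: a successfully parsed document should have a
# non-empty ParseText:: section under its URL entry
grep -A 5 "ParseText::" segdump/dump
```

If the .xls and .doc URLs have empty ParseText sections while HTML and PDF do not, the crawl itself succeeded but the parser plugins for those formats never ran or failed silently.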


Re: why does nutch interpret directory as URL

2010-04-29 Thread arpit khurdiya
I'm also facing the same problem.

I thought of developing a plugin that returns null when such a URL is
encountered, so that the URL won't be indexed.

But I was wondering what criteria I should use to decide which URLs to
discard.

I hope my approach is correct.

On Thu, Apr 29, 2010 at 9:59 AM, xiao yang  wrote:
> Because it's a URL indeed.
> You can either filter this kind of URL by configuring
> crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
> regular expression) or filter the search result (you need to develop a
> nutch plugin).
> Thanks!
>
> Xiao
>
> On Thu, Apr 29, 2010 at 4:33 AM, BK  wrote:
>> While indexing files on local file system, why does NUTCH interpret the
>> directory as a URL - fetching file:/C:/temp/html/
>> This causes the index page of this directory to show up in search results.
>> Any solutions for this issue?
>>
>>
>> Bharteesh Kulkarni
>>
>



-- 
Regards,
Arpit Khurdiya
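The -^.*/$ rule suggested above is a regex filter entry: the leading - means "exclude", and ^.*/$ matches any URL that ends with a slash, i.e. a directory index page. A quick illustration of what that pattern matches (illustrative only; Nutch's urlfilter-regex uses Java regexes, but this pattern behaves the same way in both engines):

```python
import re

# ^.*/$ -- matches URLs whose path ends with "/" (directory index pages),
# which is what the -^.*/$ entry in crawl-urlfilter.txt would exclude
directory_pattern = re.compile(r"^.*/$")

urls = [
    "file:/C:/temp/html/",            # trailing slash: the directory itself
    "file:/C:/temp/html/index.html",  # a regular file inside it
]

for url in urls:
    excluded = bool(directory_pattern.match(url))
    print(url, "excluded" if excluded else "kept")
```

With this rule in place, the directory URL is dropped before fetching, so its auto-generated index page never reaches the search results.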


Re: nutch crawl issue

2010-04-29 Thread arpit khurdiya
If you are using the nightly build, did you also change the corresponding
mappings in parse-plugins.xml? Uncomment this:
 



Hopefully this helps.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius
 wrote:
> in nutch-site.xml I modified plugin.includes
>
> parse-(html) works
> parse-(tika) does not
>
> I also need to parse PDFs, so I need both features. I tried parse-(html|
> tika) to see if html would be selected before tika, but that did not
> work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
>> using Nutch nightly build nutch-2010-04-27_04-00-28:
>>
>> I am trying to bin/nutch crawl a single html file generated by javadoc
>> and no links are followed. I verified this with bin/nutch readdb and
>> bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
>> seed doc specified is processed.
>>
>> I searched and reviewed the nutch-user archive and tried several
>> different settings but none of the settings appear to have any effect.
>>
>> I then downloaded maven-2.2.1 so that I could mvn install tika and
>> produce tika-app-0.7.jar to command line extract information about the
>> html javadoc file. I am not familiar w/ tika but the command line
>> version doesn't return any metadata, e.g. no 'src=' links from the html
>> 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
>> nutch uses tika and maybe it's not related . . .
>>
>> Has anyone crawled javadoc files or have any suggestions? Thanks.
>>
>> -m.
>>
>
>



-- 
Regards,
Arpit Khurdiya
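The XML snippet Arpit refers to did not survive the archive; it presumably came from conf/parse-plugins.xml, which maps mime types to parser plugins. The entries in the nightly builds of that era look roughly like this (a sketch of the typical defaults, not recovered from this message):

```xml
<!-- conf/parse-plugins.xml: map each mime type to the plugin
     that should parse it -->
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
```

The point of the suggestion is that enabling a plugin in plugin.includes is not enough on its own; the mime type also has to be routed to that plugin here.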


Re: nutch crawl issue

2010-04-29 Thread matthew a. grisius
in nutch-site.xml I modified plugin.includes

parse-(html) works
parse-(tika) does not

I also need to parse PDFs, so I need both features. I tried parse-(html|
tika) to see if html would be selected before tika, but that did not
work.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> using Nutch nightly build nutch-2010-04-27_04-00-28:
> 
> I am trying to bin/nutch crawl a single html file generated by javadoc
> and no links are followed. I verified this with bin/nutch readdb and
> bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
> seed doc specified is processed.
> 
> I searched and reviewed the nutch-user archive and tried several
> different settings but none of the settings appear to have any effect.
> 
> I then downloaded maven-2.2.1 so that I could mvn install tika and
> produce tika-app-0.7.jar to command line extract information about the
> html javadoc file. I am not familiar w/ tika but the command line
> version doesn't return any metadata, e.g. no 'src=' links from the html
> 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
> nutch uses tika and maybe it's not related . . .
> 
> Has anyone crawled javadoc files or have any suggestions? Thanks.
> 
> -m.
>