I have a question: how are the final results of Nutch stored? I mean, in which
format is the information from the analyzed links stored?
I understand that Nutch needs the information in plain text to parse it, but
in which format is it finally stored? I know it is stored in "segments", but how can
I
I have a question related to this topic (I guess): how are the final results
of Nutch stored? I mean, in which format is the information from the analyzed
links stored?
I understand that Nutch needs the information in plain text to parse it, but
in which format is it finally stored? I know it is s
Thanks a lot! I'm working on that now, but I have a few more questions. I'm not
able to run the readseg command. I've been consulting some help forums, and
the basic syntax is
readseg
I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments
The file named crawl-20100420112025
Thank you very much! I've tried the command as you told me, but I still
have some problems. As far as I understand, it is something to do with JAVA_HOME,
which I've already defined, and I have checked the integrity of the file. I'm
attaching a screenshot of the problem; maybe someone knows what I'm doing wrong.
Th
Hello everyone,
I'm using Nutch v0.9. I'm able to crawl, fetch and parse HTML and .pdf. With
.ppt, .xls, .rtf and .doc the crawl runs without any error, but when
I use SegmentReader to get the information for each URL I don't find any
ParseText for these formats. I configured the plugins and I
I finally solved it. The problem was with the URLs I was trying to analyze:
I was trying to crawl and parse links with spaces in them, i.e. links
like this: http://nutch user/nutch.doc.
I solved the problem by changing some settings in the URL filter.
Thanks anyway.
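
For anyone hitting the same issue, here is a minimal sketch of a different approach from the URL filter change mentioned above: percent-encoding the spaces before the URLs go into the seed list. This is standalone Python, not part of Nutch; the function name encode_spaces and the example.com URL are only illustrative assumptions.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_spaces(url: str) -> str:
    """Percent-encode spaces and other unsafe characters in the path and query.

    Hypothetical helper for illustration; it is not a Nutch API.
    """
    parts = urlsplit(url)
    # "%" is kept as safe so that sequences already encoded are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    # Assumed example URL with spaces in the path.
    print(encode_spaces("http://example.com/my docs/nutch user.doc"))
```

The sketch only touches the path and query, so a space in the host part of a URL would still need to be fixed by hand.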
Good afternoon,
Now that I've solved my problem with the other formats, I'm trying to figure
out how to solve another one.
I'm able to parse HTML, but I get the ParseText on a single line. I would
like to preserve at least the paragraphs of the original document. Does anyone
know how to do it?
Thank you
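
To illustrate the kind of output meant here, below is a minimal sketch that extracts text from HTML and keeps one paragraph per block element. It is standalone Python using only the standard library, not Nutch's HTML parser and not a Nutch plugin; the names ParagraphText, html_to_paragraphs and BLOCK_TAGS are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags assumed to start or end a paragraph; adjust the set as needed.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class ParagraphText(HTMLParser):
    """Collects visible text, starting a new paragraph at each block tag."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def flush(self):
        # Collapse runs of whitespace inside the paragraph to single spaces.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def html_to_paragraphs(html: str) -> str:
    """Return the text of an HTML document with a blank line between paragraphs."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return "\n\n".join(parser.paragraphs)


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>First  paragraph\n here.</p><p>Second <b>one</b>.</p>"
    print(html_to_paragraphs(sample))
```

Inline tags such as b or a do not break the paragraph; only the tags listed in BLOCK_TAGS do.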