Format of the Nutch Results

2010-04-20 Thread nachonieto3
I have a question: how are the final results of Nutch stored? I mean, in which format is the information contained in the analyzed links stored? I understood that Nutch needs the information in plain text to parse it, but in which format is it finally stored? I know it is stored in "segments" but how can I

Re: how to parse html files while crawling

2010-04-20 Thread nachonieto3
I have a question related to this topic (I guess): how are the final results of Nutch stored? I mean, in which format is the information contained in the analyzed links stored? I understood that Nutch needs the information in plain text to parse it, but in which format is it finally stored? I know it is s

Re: Format of the Nutch Results

2010-04-21 Thread nachonieto3
Thanks a lot! I'm working on that now, and I have some more questions. I'm not able to run the readseg command. I've been consulting some help forums, and the basic syntax is readseg I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments The file named crawl-20100420112025

Re: how to parse html files while crawling

2010-04-21 Thread nachonieto3
Thanks a lot! I'm working on that now, and I have some more questions. I'm not able to run the readseg command. I've been consulting some help forums, and the basic syntax is readseg I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments The file named crawl-20100420112025

Re: Format of the Nutch Results

2010-04-22 Thread nachonieto3
Thank you very much! I've tried the command as you told me, but I still have some problems. As far as I understand, it is something about JAVA_HOME, which I've already defined, and I have checked the integrity of the file. I'm attaching a screenshot of the problem; maybe someone knows what I'm doing wrong. Th

Parsing .ppt, .xls, .rtf and .doc

2010-04-29 Thread nachonieto3
Hello everyone, I'm using Nutch v0.9. I'm able to crawl, fetch and parse .html and .pdf. When I try with .ppt, .xls, .rtf and .doc I don't get any errors, but when I use SegmentReader to get the information for each URL I don't find any ParseText for these formats. I configured the plugins and I

Re: Parsing .ppt, .xls, .rtf and .doc

2010-05-04 Thread nachonieto3
I finally solved it. The problem was with the URLs I was trying to analyze: I was trying to crawl and parse links with spaces in them, that is, links like this: http://nutch user/nutch.doc. I solved the problem by changing a few things in the URL filter. Thanks, by the way.
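
For readers who hit the same issue with spaces in links, here is a minimal sketch of percent-encoding the path of a URL before handing it to a crawler. It is not the fix described in the message (which changed the URL filter) and it does not use any Nutch API; the function name `encode_spaces` and the example.com URL are hypothetical, chosen only for illustration.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_spaces(url: str) -> str:
    """Percent-encode unsafe characters (such as spaces) in the URL path.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    parts = urlsplit(url)
    # Keep "/" as the path separator and "%" so that sequences that are
    # already encoded (e.g. "%20") are not encoded a second time.
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(encode_spaces("http://example.com/my docs/nutch user.doc"))
```

Only the path is touched here; a space in the host part of a URL cannot be repaired this way.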

Parsing html

2010-05-04 Thread nachonieto3
Good afternoon. Now that I have solved my problem with the other formats, I'm trying to figure out how to solve another one. I'm able to parse the .html format, but I get the ParseText on one line. I would like to preserve at least the paragraphs of the original document. Does anyone know how to do it? Thank you
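
To illustrate the general idea behind the question, the sketch below extracts text from HTML while starting a new paragraph at block-level tags. It is a standalone example built on Python's standard `html.parser`, not Nutch's HTML parser and not a Nutch plugin; the class name `ParagraphText`, the helper `html_to_paragraphs`, and the `BLOCK_TAGS` set are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an assumed, non-exhaustive set).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphText(HTMLParser):
    """Collect text, closing the current paragraph at block-level tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = []

    def flush(self):
        # Join buffered text, collapse runs of whitespace, skip empty results.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def html_to_paragraphs(html: str) -> str:
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside block tags
    return "\n\n".join(parser.paragraphs)


if __name__ == "__main__":
    print(html_to_paragraphs("<p>First  paragraph.</p><p>Second <b>one</b>.</p>"))
```

The sketch does not skip the contents of `script` or `style` elements, so real pages would need extra handling.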