Re: how to parse html files while crawling

2010-04-22 Thread cefurkan0 cefurkan0
Well, my main question is this:

I need files with the HTML elements removed.

That is the important part, not the other things.

Is this possible?
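
For what it is worth, Nutch already keeps the extracted text of each page in the parse_text part of a segment. If the goal is simply HTML files with the markup removed, a minimal standalone sketch in Python could look like the one below. It is not part of Nutch; the names TextExtractor and strip_html are made up for illustration, and it assumes the pages are available as strings.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        # A word split by inline tags (wor<b>ld</b>) would gain a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling strip_html on the content of each saved page and writing the result to a .txt file would give one markup-free file per page.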

On 21 April 2010 16:38, nachonieto3 wrote:

>
> Thank you very much! I'm working on that now, but I have a few more
> doubts. I'm not able to run the readseg command. I've been consulting
> some help forums, and the basic syntax is
> readseg 
> I have the segments in this path:
> D:\nutch-0.9\crawl-20100420112025\segments
> The folder named crawl-20100420112025 is the one where the segments are
> stored. So I'm trying to execute the command in these ways, but none of
> them works:
> readseg d/nutch-0.9/crawl-20100420112025/segments
> readseg crawl-20100420112025/segments
> readseg crawl-20100420112025
>
> What am I doing wrong? When I try to execute it I get
> "bash: readseg: command not found".
> Any ideas? Thank you in advance.
>


Re: how to parse html files while crawling

2010-04-21 Thread nachonieto3

Thank you very much! I'm working on that now, but I have a few more doubts.
I'm not able to run the readseg command. I've been consulting some help
forums, and the basic syntax is
readseg 
I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments
The folder named crawl-20100420112025 is the one where the segments are
stored. So I'm trying to execute the command in these ways, but none of them
works:
readseg d/nutch-0.9/crawl-20100420112025/segments
readseg crawl-20100420112025/segments
readseg crawl-20100420112025

What am I doing wrong? When I try to execute it I get
"bash: readseg: command not found".
Any ideas? Thank you in advance.
-- 
View this message in context: 
http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p739953.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: how to parse html files while crawling

2010-04-21 Thread Ankit Dangi
To convert Nutch's crawled data, which is stored in segments, into a
human-readable form, you will have to look at the 'readseg' command (named
'segread' in earlier releases). It reads and exports the segment data.

Details at Nutch Wiki:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_segread
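
Note that the segment reader is a subcommand of the bin/nutch launcher script, not a program of its own, so typing the subcommand name alone at a bash prompt gives "command not found". As far as I recall, in Nutch 0.9 the subcommand is spelled readseg, and it expects a single segment directory (one of the timestamped folders inside segments), not the segments folder itself. Running bin/nutch without arguments should list the subcommand names your release accepts. The Python sketch below only builds the command lines and does not run them. The helper name readseg_dump_commands and the output folder name are assumptions for illustration, and the -dump option is what I remember from the SegmentReader usage text, so please verify it against your installation.

```python
import os


def readseg_dump_commands(segments_dir, output_root, nutch="bin/nutch"):
    """Return one argument list per segment directory under segments_dir.

    Hypothetical helper: it assumes 'readseg -dump <segment> <output>' is
    the syntax accepted by your Nutch release.
    """
    commands = []
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        if not os.path.isdir(segment):
            continue  # ignore stray files next to the segment folders
        output = os.path.join(output_root, name)
        commands.append([nutch, "readseg", "-dump", segment, output])
    return commands


if __name__ == "__main__":
    # Path taken from the question; run this from the Nutch install folder.
    for cmd in readseg_dump_commands("crawl-20100420112025/segments", "dump"):
        print(" ".join(cmd))
```

Each printed line can be pasted into the shell, or passed to subprocess.run(cmd, check=True). On Windows under Cygwin, the D: drive is normally reachable as /cygdrive/d rather than d/.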

- Ankit Dangi


On Mon, Apr 19, 2010 at 9:15 PM, nachonieto3 wrote:

>
> I have a question related to this topic (I guess): how are the final
> results of Nutch stored? I mean, in which format is the information from
> the analyzed links stored?
>
> I understand that Nutch needs the information as plain text in order to
> parse it, but in which format is it finally stored? I know it is stored
> in "segments", but how can I access this information in order to convert
> it to plain text? Is it possible?
>
> Thank you in advance
>





Re: how to parse html files while crawling

2010-04-20 Thread nachonieto3

I have a question related to this topic (I guess): how are the final results
of Nutch stored? I mean, in which format is the information from the
analyzed links stored?

I understand that Nutch needs the information as plain text in order to parse
it, but in which format is it finally stored? I know it is stored in
"segments", but how can I access this information in order to convert it to
plain text? Is it possible?

Thank you in advance
-- 
View this message in context: 
http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p729943.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: how to parse html files while crawling

2010-04-18 Thread Alexander Aristov
Hi,

Your task is clear, but the solution is not simple. That is why so many
companies compete for users and try to show them relevant results.

Nutch is not clever enough to tell whether a page is an opinion or just an
advertisement, so you MUST teach it yourself.

First of all, you need to find out what distinguishes opinions from other
pages. It might be a special structure, a special tag, or a keyword.

Next, you must rewrite the HTML parser (indexer) and add the necessary logic
to filter opinions from other pages.

So that is basically what you should do.
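
As a rough illustration of the first step (finding features that distinguish opinion pages), here is a minimal Python sketch that scores a page by simple signals. It is not Nutch code: the keyword list, the class-name pattern, the threshold, and the function names are all hypothetical and would have to be tuned against real pages from your seed sites. Inside Nutch, equivalent logic would have to live in a parsing or indexing plugin.

```python
import re

# Hypothetical markers; replace them with what your review sites really use.
OPINION_KEYWORDS = ("review", "rating", "stars", "recommend", "pros", "cons")
OPINION_CLASS_RE = re.compile(
    r'class\s*=\s*["\'][^"\']*(review|rating|comment)[^"\']*["\']',
    re.IGNORECASE,
)


def opinion_score(html):
    """Count class attributes that look review-related plus distinct keywords.

    The raw HTML is scanned, so keywords inside markup are counted as well.
    """
    lowered = html.lower()
    structure_hits = len(OPINION_CLASS_RE.findall(html))
    keyword_hits = sum(
        1 for word in OPINION_KEYWORDS
        if re.search(r"\b" + word + r"\b", lowered)
    )
    return structure_hits + keyword_hits


def looks_like_opinion(html, threshold=3):
    """Treat a page as an opinion page once enough signals are present."""
    return opinion_score(html) >= threshold
```

A keyword heuristic like this will misclassify many pages; it is only meant to show where the filtering decision would be made, not how well it can be made.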

Best Regards
Alexander Aristov


On 12 April 2010 11:23, NareshG wrote:

>
> Hi,
>
> I am also a newbie with Nutch.
> Our requirement is to do opinion crawling,
> i.e. we want to crawl only certain HTML pages that contain user
> opinions about products, items, or movies; mainly, the page should
> contain opinions. Our seed list contains some review web sites such as
> Amazon, MouthShut, and some other sites.
> So, while fetching HTML pages, how should I verify whether a page
> is an opinion or not?
> Nutch experts (Nutch researchers), please share your comments and
> suggestions.
> I hope the requirement is clear.
>
> Thanks and regards,
> Naresh
>


Re: how to parse html files while crawling

2010-04-14 Thread xiao yang
The parsed HTML files are saved in "segments".
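
To be a bit more concrete: each segment is a timestamped directory, and a fully fetched and parsed one normally contains the parts content, crawl_fetch, crawl_generate, crawl_parse, parse_data and parse_text, with the extracted text in parse_text. A small Python sketch to check which parts a given segment has (the helper name segment_parts is made up for illustration):

```python
import os

# Subdirectories a fully fetched and parsed segment is expected to contain.
SEGMENT_PARTS = (
    "content",
    "crawl_fetch",
    "crawl_generate",
    "crawl_parse",
    "parse_data",
    "parse_text",
)


def segment_parts(segment_dir):
    """Map each expected part name to True if that subdirectory exists."""
    return {
        part: os.path.isdir(os.path.join(segment_dir, part))
        for part in SEGMENT_PARTS
    }
```

If parse_text and parse_data are reported as missing, the segment was probably fetched but not parsed yet.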

On Fri, Apr 9, 2010 at 3:40 AM, cefurkan0 cefurkan0 wrote:
> I can successfully crawl web sites with the
>
> bin/nutch crawl command,
>
> but I also want to save the parsed HTML files.
>
> How can I do that?
>
> Thanks
>


Re: how to parse html files while crawling

2010-04-13 Thread NareshG

Hi,

I am also a newbie with Nutch.
Our requirement is to do opinion crawling,
i.e. we want to crawl only certain HTML pages that contain user
opinions about products, items, or movies; mainly, the page should contain
opinions. Our seed list contains some review web sites such as Amazon,
MouthShut, and some other sites.
So, while fetching HTML pages, how should I verify whether a page
is an opinion or not?
Nutch experts (Nutch researchers), please share your comments and suggestions.
I hope the requirement is clear.

Thanks and regards,
Naresh
-- 
View this message in context: 
http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p712846.html
Sent from the Nutch - User mailing list archive at Nabble.com.