Hi All,
I am new to Apache Nutch. I am using it to crawl some sites, and I want to
filter and extract content based on words in the page, not based on the URL.
For example:
1. I want to crawl only those sites whose content (text) contains words like
'shop' or 'product'. If these words do not appear on a page, Nutch should not
follow the links on that page and should skip the page instead of parsing it
further.
2. Apache Nutch interacts directly with HBase and dumps the whole HTML source
of each web page, but I want structured data (JSON format, e.g. text, url,
metadata, etc.) instead of unstructured data (the whole page source).
3. Apache Nutch then sends this data to Solr, where it is indexed and
structured. However, I want to show this data on my own web page instead of
the Solr web page. How can I get this data in a structured format, categorized
by the words I provide to Nutch?
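
To make the behaviour I am after more concrete, here is a rough sketch in
plain Python. It is not Nutch code and does not use any Nutch, HBase, or Solr
API. The function names (matched_keywords, to_record) and the JSON field names
(url, text, metadata, categories) are only my own illustration, and I am
assuming a whole-word, case-insensitive match is what I need.

```python
import json
import re

# Example words from point 1; in practice I would supply my own list.
KEYWORDS = ["shop", "product"]


def matched_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur as whole words in text, ignoring case."""
    found = []
    for word in keywords:
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(word)
    return found


def to_record(url, text, metadata, keywords=KEYWORDS):
    """Build the structured record from point 2, or None to drop the page."""
    categories = matched_keywords(text, keywords)
    if not categories:
        # Point 1: no keyword found, so skip the page and do not follow its links.
        return None
    # Point 3: the matched words double as the page's categories.
    return {
        "url": url,
        "text": text,
        "metadata": metadata,
        "categories": categories,
    }


if __name__ == "__main__":
    record = to_record(
        "http://example.com/",
        "Visit our shop today",
        {"title": "Example"},
    )
    print(json.dumps(record, indent=2))
```

What I do not know is where in Nutch this kind of check and this kind of
output should be plugged in.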
That's what I want to achieve; any help would be appreciated.
Regards,
Muhammad Umer