Hi All,

I am new to Apache Nutch. I want to use it to crawl some sites, then filter and
extract content based on words in the page text rather than on the URL. For
example:


  1.  I want to crawl only those sites whose content (text) contains words like
'shop' or 'product'. If these words do not appear on a page, Nutch should not
follow the links on that page and should skip parsing it further.
  2.  Apache Nutch interacts directly with HBase and dumps the whole HTML source
of each web page, but I want structured data (JSON format, e.g. text, url,
metadata, etc.) instead of unstructured data (the whole page source).
  3.  Apache Nutch then sends this data to Solr, where it is indexed and
structured, but I want to show this data on my own web page instead of the Solr
web page. How can I get this data in a structured format, categorized by the
words I provide to Nutch?
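
To make the desired behavior concrete, here is a minimal sketch in plain Python.
It is not Nutch code and uses no Nutch, HBase, or Solr API; the names
matched_keywords and process_page are hypothetical and only illustrate the
logic I am after. It assumes whole-word, case-insensitive matching, which is my
guess at a reasonable rule.

```python
import json
import re

# Words I would supply to the crawler (taken from the example above).
KEYWORDS = ("shop", "product")


def matched_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur as whole words in text.

    Assumption: matching is case-insensitive and on whole words only,
    so 'products' or 'workshop' would NOT count as a match.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [k for k in keywords if k.lower() in words]


def process_page(url, text, outlinks, metadata=None):
    """Decide what to do with one fetched page (hypothetical helper).

    Returns (record, links_to_follow). If no keyword matches, the page
    is dropped: record is None and no outlinks are followed.
    """
    hits = matched_keywords(text)
    if not hits:
        return None, []
    record = {
        "url": url,
        "text": text,
        "metadata": metadata or {},
        "categories": hits,  # the page is categorized by the matched words
    }
    return record, list(outlinks)


if __name__ == "__main__":
    record, follow = process_page(
        "http://example.com/",
        "Welcome to our shop",
        ["http://example.com/items"],
        {"title": "Example"},
    )
    # Structured JSON instead of the raw page source.
    print(json.dumps(record, indent=2))
```

In other words, a page without any of the words yields no record and no links
to follow, while a matching page yields one JSON record tagged with the words
it matched.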

That's what I want to achieve; any help would be appreciated.

Regards
Muhammad umer
