Hi Azhar,

If I understand correctly, you want to build a minimal crawler. Nutch is designed for whole-web crawling, so some fields that you don't use yourself may still be used by some plugins. If you want to change the code, you can do that. But if your goal is for HBase to take up as little disk space as possible, you can enable HBase compression on the column families, which covers all fields.
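
As a rough sketch of the compression option: I am assuming here that your Gora version accepts a compression attribute on the family elements in conf/gora-hbase-mapping.xml, and that GZ is an accepted value, so please check both against your versions of Gora and HBase. Only the families named in your mail (p, h, mtdt) are shown; the other families and the class/field mappings stay as they are.

```xml
<!-- Fragment of conf/gora-hbase-mapping.xml (sketch, not a complete file). -->
<!-- Assumption: the compression attribute and the GZ value are supported by your Gora/HBase versions. -->
<table name="webpage">
    <family name="p" maxVersions="1" compression="GZ"/>
    <family name="h" maxVersions="1" compression="GZ"/>
    <family name="mtdt" maxVersions="1" compression="GZ"/>
    <!-- remaining families unchanged -->
</table>
```

As far as I know, these family settings are only applied when Gora creates the table, so a webpage table that already exists would probably need its column families altered in HBase directly.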

I hope this helps.

Talat

On 17 Apr 2014 at 13:24, "Azhar Jassal" <[email protected]> wrote:

> Hi
>
> I'm using Nutch 2.2.1 with HBase.
>
> How can I restrict the fields persisted in HBase? For example, I don't need
> the "p:c" column (parser text field). Its actual content will never be used
> by my search implementation (I am not using a default text field). I can see
> the "p:c" mapping is listed in conf/gora-hbase-mapping.xml, but omitting it
> from the file causes a Gora writer exception.
>
> I'm using my own set of plugins to extract the specific content I need and
> adding it to metadata so it's saved in column mtdt.
>
> Now I want to restrict the storage of additional data to the minimum
> required for Nutch to function (mostly to minimise hard disk usage). For
> example, I don't want to store headers (column h). How can I keep them
> from making it to HBase?
>
> Also, I'm using "fetcher.parse" = true, so I don't require data persisted
> for post-parsing.
>
> Thanks
>
> Az

