> Well in o.a.n.metadata.Nutch some brief Javadoc's for the caching
> fields mention the following
>
> static String CACHING_FORBIDDEN_ALL
>     Don't show either original forbidden content or summaries.
> static String CACHING_FORBIDDEN_CONTENT
>     Don't show original forbidden content, but show summaries.
> static String CACHING_FORBIDDEN_KEY
>     Sites may request that search engines don't provide access
>     to cached documents.
> static org.apache.avro.util.Utf8 CACHING_FORBIDDEN_KEY_UTF8
>
> static String CACHING_FORBIDDEN_NONE
>     Show both original forbidden content and summaries (default).
>
> I understand that caching data is held within and concerns metadata
> (in trunk it is parse.getData().getMeta())

It does not concern metadata as such. We store as metadata the caching
policies that are specified in the HTML pages
(http://www.i18nguy.com/markup/metatags.html), then store the policy in
the cache field:

    // add cached content/summary display policy, if available
    String caching = parse.getData().getMeta(Nutch.CACHING_FORBIDDEN_KEY);
    if (caching != null && !caching.equals(Nutch.CACHING_FORBIDDEN_NONE)) {
      doc.add("cache", caching);
    }

I expect that this was then used by our search web app to determine
whether we could display the cached content or not.

> but I still have no idea the
> characteristics of the cache data, why this would be valuable for an
> index. I personally have never queried for it before in my index.

We do not store the cached content as a field, just the policy. Caching
can be useful for an index, e.g. when the target server is down and you
want to have a peek at the content of the page.

Indexing the policy instead of the actual cached content is probably not
so relevant now that we've delegated the indexing + search to SOLR & ES.
We could of course add a binary field with the content so that web apps
querying the search backends could provide the cache if needed. We'd need
to enforce the caching policy at the indexing level and put some
restrictions on length etc.

Makes sense?

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
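
PS: to make the "enforce the caching policy at the indexing level" idea a
bit more concrete, here is a rough sketch of how the content of such a
binary field could be gated and capped. This is not existing Nutch code.
The class name, the method cacheableContent and the maxBytes parameter are
made up for illustration, and the constants are local stand-ins for the
Nutch.CACHING_FORBIDDEN_* fields (their literal values here are assumed).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Optional;

public class CachePolicySketch {
    // Stand-ins for the Nutch.CACHING_FORBIDDEN_* constants; values are assumed.
    static final String CACHING_FORBIDDEN_NONE = "none";
    static final String CACHING_FORBIDDEN_CONTENT = "content";
    static final String CACHING_FORBIDDEN_ALL = "all";

    /**
     * Returns the bytes that may be indexed as cached content, or empty
     * when the page's policy forbids showing the original content.
     * A missing policy (null) is treated like CACHING_FORBIDDEN_NONE.
     */
    static Optional<byte[]> cacheableContent(String policy, byte[] content, int maxBytes) {
        boolean allowed = policy == null || policy.equals(CACHING_FORBIDDEN_NONE);
        if (!allowed || content == null) {
            return Optional.empty();
        }
        // Cap the stored size; Arrays.copyOf truncates when the new length is shorter.
        int length = Math.max(0, Math.min(content.length, maxBytes));
        return Optional.of(Arrays.copyOf(content, length));
    }

    public static void main(String[] args) {
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        // No policy on the page: content is kept, truncated to 10 bytes.
        System.out.println(cacheableContent(null, page, 10).map(b -> b.length));
        // Policy forbids cached content: nothing is stored.
        System.out.println(cacheableContent(CACHING_FORBIDDEN_ALL, page, 10).isPresent());
    }
}
```

An indexing filter would then add the binary field only when the returned
Optional is present. Both the "content" and "all" policies block the field
in this sketch, since both forbid showing the original content.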

