>
>
> Well, in o.a.n.metadata.Nutch some brief Javadocs for the caching
> fields mention the following:
>
> static String  CACHING_FORBIDDEN_ALL
>           Don't show either original forbidden content or summaries.
> static String  CACHING_FORBIDDEN_CONTENT
>           Don't show original forbidden content, but show summaries.
> static String  CACHING_FORBIDDEN_KEY
>           Sites may request that search engines don't provide access
>           to cached documents.
> static org.apache.avro.util.Utf8  CACHING_FORBIDDEN_KEY_UTF8
>
> static String  CACHING_FORBIDDEN_NONE
>           Show both original forbidden content and summaries (default).
>
> I understand that caching data is held within and concerns metadata
> (in trunk it is parse.getData().getMeta())


It does not concern metadata as such. We store as metadata the caching
policies that are specified in the HTML pages
(http://www.i18nguy.com/markup/metatags.html), then store the policy in the
cache field:

    // add cached content/summary display policy, if available
    String caching = parse.getData().getMeta(Nutch.CACHING_FORBIDDEN_KEY);
    if (caching != null && !caching.equals(Nutch.CACHING_FORBIDDEN_NONE)) {
      doc.add("cache", caching);
    }

I expect that this was then used by our search web app to determine whether
we could display the cached content or not.
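
As a rough sketch of how a front end could interpret that field (this is not
the actual web app code): the class and method names below are made up for
illustration, and the literal values "none", "content" and "all" are my
assumption of what the Nutch.CACHING_FORBIDDEN_* constants hold, so check
them against the Nutch class before relying on them.

```java
// Hypothetical helper, not part of Nutch: decides what a search UI may show
// based on the value read back from the "cache" index field.
final class CachePolicy {
    // Assumed to mirror Nutch.CACHING_FORBIDDEN_NONE / _CONTENT / _ALL.
    static final String FORBIDDEN_NONE = "none";
    static final String FORBIDDEN_CONTENT = "content";
    static final String FORBIDDEN_ALL = "all";

    private CachePolicy() {}

    // The indexing code only adds the field when a restriction exists,
    // so a missing (null) value is treated as "no restriction".
    static boolean mayShowCachedContent(String cacheField) {
        return cacheField == null || FORBIDDEN_NONE.equals(cacheField);
    }

    // Summaries are only forbidden when the policy is "all".
    static boolean mayShowSummary(String cacheField) {
        return !FORBIDDEN_ALL.equals(cacheField);
    }
}
```

With this, a hit whose cache field is "content" would still get a summary
but no link to the cached copy, and a hit with "all" would get neither.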


> but I still have no idea the
> characteristics of the cache data, why this would be valuable for an
> index. I personally have never queried for it before in my index.
>

We do not store the cached content as a field, just the policy. Caching can
be useful for an index, e.g. when the target server is down and you want to
have a peek at the content of the page.

Indexing the policy instead of the actual cached content is probably not so
relevant now that we've delegated indexing and search to Solr and
Elasticsearch. We could of course add a binary field with the content so
that web apps querying the search backends could serve the cached copy if
needed. We'd need to enforce the caching policy at the indexing level and
put some restrictions on length, etc.
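
To make that last point concrete, here is a minimal sketch of what the check
at indexing time could look like. Nothing like this exists in Nutch as far
as this mail is concerned: the class name, the method and the byte limit are
all hypothetical, and "none" is again my assumption for the value of
Nutch.CACHING_FORBIDDEN_NONE.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper, not part of Nutch: decides whether raw page content
// may go into a binary "cached content" field, and caps its length.
final class CachedContentFilter {
    // Assumed to mirror Nutch.CACHING_FORBIDDEN_NONE.
    static final String FORBIDDEN_NONE = "none";

    private CachedContentFilter() {}

    // Returns the bytes to index, or empty if the policy forbids caching
    // or there is nothing usable to store.
    static Optional<byte[]> contentToIndex(String policy, byte[] raw, int maxBytes) {
        // Any policy other than "none" (or no policy at all) forbids
        // exposing the original content.
        boolean allowed = policy == null || FORBIDDEN_NONE.equals(policy);
        if (!allowed || raw == null || raw.length == 0 || maxBytes <= 0) {
            return Optional.empty();
        }
        // Truncate rather than skip oversized pages; skipping them would
        // be an equally valid choice.
        int length = Math.min(raw.length, maxBytes);
        return Optional.of(Arrays.copyOf(raw, length));
    }
}
```

An indexing filter would call this with the value read from the parse
metadata and only add the binary field when a value comes back.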

Makes sense?

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
