Re: Crawling Question

Lewis John Mcgibbney Sat, 19 Nov 2011 02:40:11 -0800

Hi,

Well, for starters, parse-tika in Nutch trunk will parse your metadata and
send it to Solr for the following formats:

http://tika.apache.org/0.10/formats.html

If there are additional formats you wish to get metadata from, then I
suggest that you look at writing a parser implementation that extends
this.
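
Also bear in mind that the suffix rule you quoted rejects those URLs before
any parser sees them, so whatever you want indexed has to come out of that
rule first. Below is a rough Python sketch (not Nutch code) of how I
understand the regex URL filter to evaluate its rules: top to bottom, first
matching rule decides, and a URL matching nothing is rejected. The trailing
"+." accept-all rule, the helper name accepts(), the example.com URLs and the
choice to drop ppt, xls and zip from the list are all assumptions made for
illustration only.

```python
import re

# Rules are (sign, pattern) pairs, mirroring "+regex" / "-regex" lines.
# DEFAULT_RULES uses the suffix rule quoted in this thread, followed by an
# assumed accept-everything rule.
DEFAULT_RULES = [
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|"
          r"xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("+", r"."),
]

# Hypothetical relaxed variant: ppt, xls and zip removed from the skip list
# so that those URLs can reach the parser.
RELAXED_RULES = [
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|"
          r"gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("+", r"."),
]


def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


if __name__ == "__main__":
    for url in ("http://example.com/slides.ppt",
                "http://example.com/logo.gif",
                "http://example.com/index.html"):
        print(url, accepts(DEFAULT_RULES, url), accepts(RELAXED_RULES, url))
```

Whether a given suffix is worth letting through depends on whether Tika
actually lists that format on the page above.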

hth

On Fri, Nov 18, 2011 at 6:19 PM, Michael Kelleher <[email protected]> wrote:

> How do people handle binary documents and images?  The "default" regex
> filter has:
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
>
> but some of this content, I would want to pass along to Solr for indexing.
>
> Is anyone else doing this kind of thing?
>
>
>

-- 
*Lewis*
