Andrew Hatton wrote:

> We are running Swish-e to index our test site.
> If you are familiar with Swish-e you will know that, like most spiders, it can
> index over the filesystem (i.e. internally) or over HTTP (externally).
>
> It is as happy as Larry doing both, but at the moment it only indexes PDFs via
> the filesystem method, for some reason.
> But this is fine, as all the PDF content we have is held on this one box, so
> the HTTP method of indexing isn't required.
> The rub comes when we allow users to attach PDFs to documents
> using Nadmin's GUI tool, as this means that these PDFs get saved to the
> blobs directory under a GUID.
> The Swish-e filesystem index method can pick these up but will
> give them GUID filenames, making the search results very strange and
> meaningless.
> I notice that the blobs table stores the name and title of the
> attachment!
> Is there any way of not saving the file under a GUID, so that my spider will
> pick this information up? And if so, what are the consequences of doing
> this?

The consequence is that duplicate names can legally exist, i.e. the files
'info.pdf', 'info.pdf', 'info.pdf' and 'info.pdf', attached to article 5,
page 3764, style 27 and topic 648 respectively, can be different files.
For the search to be meaningful, I suppose you'd have to know which URL
will deliver the document to your client in any case.

If the URL isn't relevant information for your indexing, it'd be quite
simple to create a Perl script that reads the blobs table and builds a
directory full of symlinks, carrying the real names, that point into the
blobs directory; the spider can then index that directory. If you do need
the URL, that's pretty hard to resolve externally. A blob can be served by
several URLs if you so choose.
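
A rough sketch of that symlink idea, written in Python rather than Perl.
Everything named here is an assumption for illustration: the function
build_symlink_dir is hypothetical, the (guid, name) pairs stand in for
whatever query you run against the blobs table (the real column names are
not shown in this thread), and it assumes each blob is stored in the blobs
directory as a file named after its GUID. Because names can repeat, later
duplicates get a numeric suffix.

```python
import os
from pathlib import Path


def build_symlink_dir(rows, blobs_dir, link_dir):
    """Create symlinks with real names pointing at GUID-named blob files.

    rows      -- iterable of (guid, name) pairs; hypothetical stand-in for
                 the result of a query against the blobs table
    blobs_dir -- directory holding the GUID-named files (assumed layout)
    link_dir  -- directory to fill with symlinks for the spider to index
    Returns a dict mapping each linked guid to the link name it was given.
    """
    blobs_dir = Path(blobs_dir).resolve()
    link_dir = Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)

    used = set()     # link names already handed out in this run
    created = {}
    for guid, name in rows:
        target = blobs_dir / guid
        if not target.is_file():
            continue  # row with no file on disk: skip it

        # Keep only the final path component so a name cannot escape link_dir.
        safe = os.path.basename(name)
        if safe in ("", ".", ".."):
            safe = guid

        # Duplicate names are legal, so disambiguate: info.pdf, info-2.pdf, ...
        stem, ext = os.path.splitext(safe)
        candidate, n = safe, 1
        while candidate in used:
            n += 1
            candidate = f"{stem}-{n}{ext}"
        used.add(candidate)

        link = link_dir / candidate
        if link.is_symlink():
            link.unlink()  # replace a stale link left by an earlier run
        link.symlink_to(target)
        created[guid] = candidate
    return created
```

You would call it with the rows from your own query, for example
build_symlink_dir(rows, "/path/to/blobs", "/path/to/blob-links"), and point
the filesystem indexer at the second directory. Both paths are placeholders,
and the indexer may need to be told to follow symlinks.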

Haven't had time to look at OI yet. Sorry.

Emile


