That is a good first step, which you can adjust later based on real stats.
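
On the multiple-extensions question, one option is to accept any extension known for the detected type and only append the preferred one when the current extension does not fit. Below is a minimal sketch in plain Java. The class name ExtensionFixer, the method fixName, and the small type-to-extension table are all made up for illustration and are not part of Tika; you would populate the table from your own list of supported types.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Tika class): keeps the uploaded name and appends
// the preferred extension when the current one does not fit the detected type.
final class ExtensionFixer {

    // Illustrative table only; the first entry is treated as the preferred extension.
    private static final Map<String, List<String>> EXTENSIONS = Map.of(
            "application/pdf", List.of("pdf"),
            "text/plain", List.of("txt", "text"),
            "text/html", List.of("html", "htm"),
            "application/msword", List.of("doc"));

    static String fixName(String fileName, String detectedType) {
        List<String> known = EXTENSIONS.get(detectedType);
        if (known == null) {
            return fileName; // unsupported or unknown type: leave the name alone
        }
        int dot = fileName.lastIndexOf('.');
        String current = dot < 0
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (known.contains(current)) {
            return fileName; // any extension listed for the type is accepted
        }
        // Mismatch: attach the right extension at the end of the wrong one.
        return fileName + "." + known.get(0);
    }

    public static void main(String[] args) {
        System.out.println(fixName("report.doc", "application/pdf")); // report.doc.pdf
        System.out.println(fixName("index.htm", "text/html"));        // index.htm
    }
}
```

Keeping the original name in front means the user's extension is still visible in logs, which helps when you later look at real stats on how often detection and extension disagree.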
On Jul 24, 2011 6:58 PM, "Jakub Liska" <[email protected]> wrote:
> So that relies on the assumption that Tika is always right and that a
> document of a particular mime type should have the corresponding
> extension? But what about multiple extensions per mime type? Should I
> always take the first one?
>
> On Mon, Jul 25, 2011 at 1:36 AM, Mark Kerzner <[email protected]> wrote:
>> Attach the right extension at the end of the wrong one
>>
>> On Jul 24, 2011 6:33 PM, "Jakub Liska" <[email protected]> wrote:
>>> Hey, I have a decision I can't make: what should one do when a user
>>> uploads a document with file extension A, but the file's detected mime
>>> type corresponds to extension B? In most cases that could cause
>>> problems, right? I can't decide how to deal with this.
>>>
>>> Warn the user? No...
>>> Change the file name extension to the one for the mime type? Probably yes.
>>> Use no extension at all? Can't, the documents will be accessible to
>>> different users right away.
>>>
>>> What are the proper steps to verify the integrity of these document
>>> types anyway: html, doc, docx, odt, txt, rtf, srt, sub, pdf, odf, odp,
>>> xls, ppt? Or at least for some of them?
>>>
>>> I guess the InputStream is read properly from the multipart request
>>> 99.99% of the time; otherwise an exception would be thrown and action
>>> taken. But a user can upload an already corrupted file (MS docs, PDF or
>>> OpenDocument). Do I use third-party libraries to check for that? I
>>> didn't see anything like that in odftoolkit, itextpdf or pdfbox.
>>>
>>> I just get the media type:
>>>
>>> protected MediaType getContentType(InputStream is, String httpReqContentType)
>>>         throws SystemException {
>>>     MediaType httpReqMediaType = MediaType.parse(httpReqContentType);
>>>     MediaType mediaType;
>>>     try {
>>>         mediaType = MediaType.parse(tika.detect(is));
>>>     } catch (IOException ioe) {
>>>         throw new SystemException(ioe.getMessage(), ioe);
>>>     }
>>>     if (mediaType.equals(MediaType.OCTET_STREAM) && httpReqMediaType != null
>>>             && !httpReqMediaType.equals(MediaType.OCTET_STREAM))
>>>         return httpReqMediaType;
>>>     else
>>>         return mediaType;
>>> }
>>>
>>> Then I check whether it matches one of my supported mime types. After
>>> that the file is meant to be delivered to a third-party customer, which
>>> is practically mission critical here.
>>>
>>> What do you guys do, in addition to what I just described, to make
>>> everything rock solid? Could this produce a lot of emails from customers
>>> about not getting what they expected?
>>