So should I work on the assumption that Tika is always right and that a document of a particular MIME type should have the corresponding extension? But what about MIME types that have multiple extensions? Should I always take the first one?
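To make the question concrete, here is a minimal sketch of one possible rule: keep the uploaded name if its extension is any of the extensions accepted for the detected type, and otherwise append the first (preferred) one, as Mark suggested. The class name ExtensionFixer, the method fixFileName and the EXTENSIONS table are hypothetical and hand-written for illustration; they are not part of Tika, and a real table would come from whatever MIME registry the application trusts.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ExtensionFixer {

    // Hypothetical table: MIME type -> accepted extensions, preferred one first.
    // Hand-written for this sketch, not taken from Tika.
    static final Map<String, List<String>> EXTENSIONS = new LinkedHashMap<>();
    static {
        EXTENSIONS.put("text/html", Arrays.asList("html", "htm"));
        EXTENSIONS.put("text/plain", Arrays.asList("txt"));
        EXTENSIONS.put("application/pdf", Arrays.asList("pdf"));
        EXTENSIONS.put("application/rtf", Arrays.asList("rtf"));
    }

    // Returns a file name whose last extension agrees with the detected type.
    static String fixFileName(String fileName, String detectedMimeType) {
        List<String> accepted = EXTENSIONS.get(detectedMimeType);
        if (accepted == null) {
            return fileName; // type not in the table: leave the name alone
        }
        int dot = fileName.lastIndexOf('.');
        String current = dot >= 0
                ? fileName.substring(dot + 1).toLowerCase(Locale.ROOT)
                : "";
        if (accepted.contains(current)) {
            return fileName; // any accepted extension is fine, not only the first
        }
        // Mismatch: attach the preferred extension after the wrong one.
        return fileName + "." + accepted.get(0);
    }

    public static void main(String[] args) {
        System.out.println(fixFileName("index.htm", "text/html"));
        System.out.println(fixFileName("report.doc", "application/pdf"));
        System.out.println(fixFileName("notes", "text/plain"));
    }
}
```

With this rule "index.htm" is left untouched even though "html" is listed first, while "report.doc" detected as application/pdf becomes "report.doc.pdf". The first extension is only used when none of the accepted ones matches.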

On Mon, Jul 25, 2011 at 1:36 AM, Mark Kerzner <[email protected]> wrote:
> Attach the right extension at the end of the wrong one
>
> On Jul 24, 2011 6:33 PM, "Jakub Liska" <[email protected]> wrote:
>> Hey, I have a decision I can't make: what should one do when a user
>> uploads a document with file extension A but the file's detected MIME
>> type corresponds to extension B? In most cases it could cause
>> problems, right? I can't decide how to deal with this.
>>
>> Warn the user? No...
>> Change the file name extension to the MIME type's extension? Probably yes
>> Do not use any extension? Can't, documents will be accessible to
>> different users right away
>>
>> What are the proper steps to verify the integrity of these documents
>> anyway: html, doc, docx, odt, txt, rtf, srt, sub, pdf, odf, odp, xls,
>> ppt? Or at least for some types?
>>
>> I guess that the inputStream is read properly from the MultiPart
>> request 99.99% of the time, otherwise an exception would be thrown
>> and action taken. But a user can upload an already corrupted file (MS
>> docs, PDF or open document). Do I use third party libraries for
>> checking that? I didn't see anything like that in odftoolkit,
>> itextpdf or pdfbox.
>>
>> I just get the Media Type:
>>
>> protected MediaType getContentType(InputStream is, String httpReqContentType)
>>         throws SystemException {
>>     MediaType httpReqMediaType = MediaType.parse(httpReqContentType);
>>     MediaType mediaType;
>>     try {
>>         mediaType = MediaType.parse(tika.detect(is));
>>     } catch (IOException ioe) {
>>         throw new SystemException(ioe.getMessage(), ioe);
>>     }
>>     if (mediaType.equals(MediaType.OCTET_STREAM) && httpReqMediaType != null
>>             && !httpReqMediaType.equals(MediaType.OCTET_STREAM))
>>         return httpReqMediaType;
>>     else
>>         return mediaType;
>> }
>>
>> Then I check whether it matches one of my supported MIME types, and
>> then the file is meant to be delivered to a third party customer,
>> which is practically mission critical here.
>>
>> What do you guys do in addition to what I just said for everything to
>> be rock solid? Can it produce a lot of emails from customers about
>> not getting what they expected?
>
