Unfortunately, in my case it's not a content repository but a brokerage service, where delivering corrupted files to a customer is "my" fault... I was looking for statistics and the experience of others ;-)
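For the corruption concern, one inexpensive check covers the ZIP-based formats in the list quoted below (docx, odt, odp): read every archive entry to the end with the JDK's `ZipInputStream`, which inflates the data and compares each entry's CRC-32. This is only a rough sketch. The class and method names are made up for illustration, and passing the check does not prove that an office suite will open the document, only that the container is not truncated or bit-damaged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: container-level sanity check for ZIP-based documents. */
final class ZipSanityCheck {

    private ZipSanityCheck() {}

    /**
     * Returns true if the stream is a ZIP archive with at least one entry
     * and every entry can be read to its end without an I/O or CRC error.
     */
    static boolean looksIntact(InputStream in) {
        byte[] buffer = new byte[8192];
        int entries = 0;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                // Draining the entry forces decompression and the CRC-32 comparison.
                while (zip.read(buffer) != -1) {
                    // discard the bytes, only the absence of errors matters here
                }
                entries++;
            }
        } catch (IOException e) {
            // Truncated data, bad deflate stream, or CRC mismatch.
            return false;
        }
        // Non-ZIP input yields no entries at all.
        return entries > 0;
    }
}
```

PDF and the older binary MS formats (doc, xls, ppt) are not ZIP containers, so they would need a different check, for example attempting a full parse with a format-specific library and treating any exception as a failed upload.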
On Mon, Jul 25, 2011 at 2:00 AM, Mark Kerzner <[email protected]> wrote:
> That is a good first step which you can adjust later based on real stats
>
> On Jul 24, 2011 6:58 PM, "Jakub Liska" <[email protected]> wrote:
>> So that use assumption that tika is always right and that document of
>> particular mime type should have corresponding extension ? But what
>> about existence of multiple extensions per mime type ? should I always
>> get the first one ?
>>
>> On Mon, Jul 25, 2011 at 1:36 AM, Mark Kerzner <[email protected]>
>> wrote:
>>> Attach the right extension at the end of the wrong one
>>>
>>> On Jul 24, 2011 6:33 PM, "Jakub Liska" <[email protected]> wrote:
>>>> Hey, I have this decision I can't make, what should one do when user
>>>> uploads a document with file extension A but the file's detected mime
>>>> type corresponds to extension B ? In most cases it could yield
>>>> problems right ? I can't decide on the way of dealing with this.
>>>>
>>>> Warn user ? No...
>>>> Change file name ext to Mime type extension ? Probably yes
>>>> Do not use any extension ? Can't, documents will be accessible to
>>>> different users right away
>>>>
>>>> What is the proper steps to verify integrity of these documents anyway
>>>> html,doc,docx,odt,txt,rtf,srt,sub,pdf,odf,odp,xls,ppt ? Or at least
>>>> for some types
>>>>
>>>> I guess that inputStream is always 99,99% read properly from MultiPart
>>>> request otherwise exception would be thrown and action taken.
>>>> But user can upload already corrupted file, MS docs, PDF or open
>>>> document - do I use third party libraries for checking that ? Didn't
>>>> see anything like that in odftoolkit, itextpdf or pdfbox
>>>>
>>>> I just get Media Type
>>>>
>>>> protected MediaType getContentType(InputStream is, String
>>>> httpReqContentType) throws SystemException {
>>>>     MediaType httpReqMediaType = MediaType.parse(httpReqContentType);
>>>>     MediaType mediaType;
>>>>     try {
>>>>         mediaType = MediaType.parse(tika.detect(is));
>>>>     } catch (IOException ioe) {
>>>>         throw new SystemException(ioe.getMessage(), ioe);
>>>>     }
>>>>     if (mediaType.equals(MediaType.OCTET_STREAM) && httpReqMediaType !=
>>>> null && !httpReqMediaType.equals(MediaType.OCTET_STREAM))
>>>>         return httpReqMediaType;
>>>>     else
>>>>         return mediaType;
>>>> }
>>>>
>>>> Then I check whether it matches one of my supported mime types and
>>>> then the file is meant to be deliver to a third party customer - which
>>>> is practically mission critical here.
>>>>
>>>> What do you guys do in addition to what I just said for everything to
>>>> be rock solid ? Can it produce a lot of emails from customers about
>>>> not getting what they expected ?
>>>
>
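The "attach the right extension at the end of the wrong one" suggestion, combined with the question about several extensions per MIME type, could be sketched roughly as follows. The assumptions are mine: the detected type arrives as a plain string (for instance the result of `tika.detect(is)` from the quoted method), the extension table is a hypothetical hand-maintained map whose first entry per type is the preferred one, and `ExtensionFixer` is an illustrative name, not part of Tika.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: reconcile an uploaded file name with the detected MIME type. */
final class ExtensionFixer {

    // Illustrative table only; the first extension in each list is the preferred one.
    private static final Map<String, List<String>> EXTENSIONS = Map.of(
            "application/pdf", List.of("pdf"),
            "application/msword", List.of("doc", "dot"),
            "text/html", List.of("html", "htm"),
            "text/plain", List.of("txt", "srt", "sub"));

    private ExtensionFixer() {}

    /**
     * Keeps the name if its extension is acceptable for the detected type,
     * otherwise appends the preferred extension after the existing one.
     */
    static String fix(String fileName, String detectedMimeType) {
        List<String> accepted = EXTENSIONS.get(detectedMimeType);
        if (accepted == null) {
            // Unsupported type: leave the name alone and let the caller reject the upload.
            return fileName;
        }
        int dot = fileName.lastIndexOf('.');
        String current = dot < 0
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (accepted.contains(current)) {
            // Any of the known extensions is fine, not only the first one.
            return fileName;
        }
        // Avoid a double dot when the name already ends with one.
        String separator = fileName.endsWith(".") ? "" : ".";
        return fileName + separator + accepted.get(0);
    }
}
```

With this table, `fix("report.doc", "application/pdf")` returns `report.doc.pdf`, while `fix("page.htm", "text/html")` leaves the name unchanged. Appending rather than replacing keeps the original name visible, which makes mismatches easy to count later when adjusting the rule based on real stats.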
