My practice comes from legal discovery. If you keep both extensions, you
preserve the original name and also flag that there is a problem. Document
sets are different, and customers must have some intelligence.
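
A minimal sketch of the "append the right extension" idea in Java, offered
as an illustration rather than a definitive implementation. The class name
ExtensionFixer, the method fix, and the small MIME-to-extension table are
all hypothetical; a real service would fill the table from its own list of
supported types. Each MIME type lists its accepted extensions with the
preferred one first, so a name that already ends in any accepted extension
is left untouched.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ExtensionFixer {
    // Hypothetical table: MIME type -> accepted extensions, preferred one first.
    private static final Map<String, List<String>> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("application/pdf", Arrays.asList("pdf"));
        EXTENSIONS.put("text/html", Arrays.asList("html", "htm"));
        EXTENSIONS.put("text/plain", Arrays.asList("txt"));
        EXTENSIONS.put("application/rtf", Arrays.asList("rtf"));
    }

    // Returns the file name unchanged when its extension is accepted for the
    // detected type (or the type is unknown); otherwise appends the preferred
    // extension after the original one, so the original name is preserved.
    static String fix(String fileName, String detectedMimeType) {
        List<String> accepted = EXTENSIONS.get(detectedMimeType);
        if (accepted == null) {
            return fileName; // unknown type: leave the name alone
        }
        int dot = fileName.lastIndexOf('.');
        String current = dot < 0
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (accepted.contains(current)) {
            return fileName; // extension already matches the detected type
        }
        return fileName + "." + accepted.get(0);
    }
}
```

With this table, a file uploaded as report.doc but detected as
application/pdf would be stored as report.doc.pdf, while page.htm detected
as text/html keeps its name.
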
On Jul 24, 2011 7:05 PM, "Jakub Liska" <[email protected]> wrote:
> Unfortunately, in my case it's not a content repository but a brokerage
> service, where delivering corrupted files to a customer is "my" fault...
> I was looking for statistics and the experience of others ;-)
>
>
>
> On Mon, Jul 25, 2011 at 2:00 AM, Mark Kerzner <[email protected]> wrote:
>> That is a good first step which you can adjust later based on real stats
>>
>> On Jul 24, 2011 6:58 PM, "Jakub Liska" <[email protected]> wrote:
>>> So that relies on the assumption that Tika is always right and that a
>>> document of a particular MIME type should have the corresponding
>>> extension? But what about MIME types with multiple extensions? Should
>>> I always take the first one?
>>>
>>> On Mon, Jul 25, 2011 at 1:36 AM, Mark Kerzner <[email protected]>
>>> wrote:
>>>> Attach the right extension at the end of the wrong one
>>>>
>>>> On Jul 24, 2011 6:33 PM, "Jakub Liska" <[email protected]> wrote:
>>>>> Hey, I have a decision I can't make: what should one do when a user
>>>>> uploads a document with file extension A but the file's detected MIME
>>>>> type corresponds to extension B? In most cases that could cause
>>>>> problems, right? I can't decide how to deal with this.
>>>>>
>>>>> Warn the user? No...
>>>>> Change the file name extension to the MIME type's extension? Probably yes
>>>>> Not use any extension? Can't, the documents will be accessible to
>>>>> different users right away
>>>>>
>>>>> What are the proper steps to verify the integrity of these documents
>>>>> anyway: html,doc,docx,odt,txt,rtf,srt,sub,pdf,odf,odp,xls,ppt ? Or at
>>>>> least for some of these types
>>>>>
>>>>> I guess that the InputStream is read properly from the multipart
>>>>> request 99.99% of the time; otherwise an exception would be thrown and
>>>>> action taken. But a user can upload an already corrupted file (MS docs,
>>>>> PDF or OpenDocument). Do I use third-party libraries to check for that?
>>>>> I didn't see anything like that in odftoolkit, itextpdf or pdfbox
>>>>>
>>>>> I just get Media Type
>>>>>
>>>>> protected MediaType getContentType(InputStream is, String httpReqContentType)
>>>>>         throws SystemException {
>>>>>     // Type claimed by the client; parse() returns null if parsing fails.
>>>>>     MediaType httpReqMediaType = MediaType.parse(httpReqContentType);
>>>>>     MediaType mediaType;
>>>>>     try {
>>>>>         // Type detected by Tika from the stream content.
>>>>>         mediaType = MediaType.parse(tika.detect(is));
>>>>>     } catch (IOException ioe) {
>>>>>         throw new SystemException(ioe.getMessage(), ioe);
>>>>>     }
>>>>>     // Fall back to the client's type only when detection is inconclusive.
>>>>>     if (mediaType.equals(MediaType.OCTET_STREAM) && httpReqMediaType != null
>>>>>             && !httpReqMediaType.equals(MediaType.OCTET_STREAM)) {
>>>>>         return httpReqMediaType;
>>>>>     }
>>>>>     return mediaType;
>>>>> }
>>>>>
>>>>> Then I check whether it matches one of my supported MIME types, and
>>>>> then the file is meant to be delivered to a third-party customer, which
>>>>> is practically mission critical here.
>>>>>
>>>>> What do you guys do, in addition to what I just described, for
>>>>> everything to be rock solid? Can this produce a lot of emails from
>>>>> customers about not getting what they expected?
>>>>
>>