Re: [dspace-tech] bitstreams and character encodings

2017-08-11 Thread Chris Gray

Thanks for the reminder, Terry.

I forgot that the text extraction doesn't use ImageMagick.  (I got confused 
because I am using ImageMagick and Tesseract to repair some bad text 
extractions in our DSpace.  We've found that the default mechanism has 
trouble in certain cases, including PDFs made from TeX and DVI.)

It also occurred to me, after posting, that Tomcat 7 is probably defaulting 
to ISO-8859-1 on our server.  We're using Ubuntu with the standard Ubuntu 
package for Tomcat, and it probably needs to be configured to serve files as 
UTF-8, since everything else in DSpace is UTF-8.
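
To illustrate why the declared character set matters, here is a minimal 
sketch. It assumes the extracted text is stored as UTF-8 while the response 
header declares ISO-8859-1. The class and method names are hypothetical and 
are not part of DSpace or Tomcat; the sketch only shows what a client sees 
when it trusts the wrong label.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    /** Encodes text with one charset, then decodes the bytes with another. */
    static String reinterpret(String text, Charset storedAs, Charset declaredAs) {
        byte[] bytes = text.getBytes(storedAs);
        return new String(bytes, declaredAs);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Stored as UTF-8 but declared as ISO-8859-1: the two-byte sequence
        // for U+00E9 is read as two separate characters, U+00C3 and U+00A9.
        String garbled = reinterpret(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // Stored and declared as UTF-8: the text survives unchanged.
        String intact = reinterpret(original,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println(original.length()); // 4
        System.out.println(garbled.length());  // 5
        System.out.println(intact.equals(original)); // true
    }
}
```

Plain ASCII text is unaffected by this mismatch, which may be why the 
problem only shows up in some documents.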

On Friday, August 11, 2017 at 1:46:00 PM UTC-4, Terry Brady wrote:
>
> Chris,
>
> The ImageMagick filter uses Ghostscript to generate an image of the first 
> page of a document in order to create a thumbnail.  The full text 
> extraction is handled by a different filter.
>
>
> https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/config/dspace.cfg#L346
>
> I have encountered a similar issue to the one that you have described, but I 
> have not found a comprehensive solution.
>
> I suspect that some of our issues are related to the source PDFs rather 
> than the DSpace code base.
>
> On a related note, we used to host HTML finding aids in our repository, 
> and we encountered a number of character set issues when displaying those 
> files.  I made the following modification to this file
>
>
> https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace-xmlui/src/main/java/org/dspace/app/xmlui/cocoon/BitstreamReader.java#L403
>
> if (bitstreamMimeType.equals("text/html")) {
>     bitstreamMimeType = "text/html; charset=UTF-8";
> }
>
>
> On Fri, Aug 11, 2017 at 6:45 AM, Chris Gray wrote:
>
>> You can fetch bitstreams from DSpace with URL paths like this (our xmlui 
>> context is implicit):
>>
>> /bitstream/id/{bitstream_id}/{bitstream_filename}
>>
>> I've been noticing that in our case txt bitstreams and pdf bitstreams are 
>> always delivered by the server with the character set in the response 
>> header set to ISO-8859-1 and not UTF-8.
>>
>> Is this a setting somewhere?  Is it possible to make it more flexible and 
>> adapt to actual content?
>>
>> In particular, I'm looking at the .pdf.txt files extracted by ImageMagick 
>> for full text indexing purposes.
>>
>> Is it possible to set a character encoding for individual PDFs and have 
>> ImageMagick take that into consideration in extracting full text?
>>
>> We are using DSpace 5.5 with security patches for 5.6 and 5.7 and XMLUI 
>> with Mirage2.
>>
>
>
>
> -- 
> Terry Brady
> Applications Programmer Analyst
> Georgetown University Library Information Technology
> http://georgetown-university-libraries.github.io/
> 425-298-5498 (Seattle, WA)
>

-- 
You received this message because you are subscribed to the Google Groups 
"DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to dspace-tech+unsubscr...@googlegroups.com.
To post to this group, send email to dspace-tech@googlegroups.com.
Visit this group at https://groups.google.com/group/dspace-tech.
For more options, visit https://groups.google.com/d/optout.

Re: [dspace-tech] bitstreams and character encodings

2017-08-11 Thread Terry Brady

Chris,

The ImageMagick filter uses Ghostscript to generate an image of the first
page of a document in order to create a thumbnail.  The full text
extraction is handled by a different filter.

https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/config/dspace.cfg#L346

I have encountered a similar issue to the one that you have described, but I
have not found a comprehensive solution.

I suspect that some of our issues are related to the source PDFs rather
than the DSpace code base.

On a related note, we used to host HTML finding aids in our repository, and
we encountered a number of character set issues when displaying those
files.  I made the following modification to this file

https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace-xmlui/src/main/java/org/dspace/app/xmlui/cocoon/BitstreamReader.java#L403

if (bitstreamMimeType.equals("text/html")) {
    bitstreamMimeType = "text/html; charset=UTF-8";
}
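
The same idea could in principle be extended from text/html to other text 
types, such as the text/plain bitstreams mentioned in the original question. 
The following is only a sketch under stated assumptions: the helper class 
and method are hypothetical (not part of DSpace), and it assumes the stored 
text really is UTF-8. If a bitstream is in some other encoding, labelling it 
UTF-8 would simply move the problem.

```java
import java.util.Locale;

final class TextCharsetHelper {

    /**
     * Hypothetical helper, not a DSpace API: appends "; charset=UTF-8" to
     * text/* MIME types that do not already declare a charset. Other types
     * (for example application/pdf) are returned unchanged.
     */
    static String withUtf8Charset(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        String lower = mimeType.toLowerCase(Locale.ROOT);
        if (lower.startsWith("text/") && !lower.contains("charset=")) {
            return mimeType + "; charset=UTF-8";
        }
        return mimeType;
    }

    public static void main(String[] args) {
        System.out.println(withUtf8Charset("text/plain"));      // text/plain; charset=UTF-8
        System.out.println(withUtf8Charset("application/pdf")); // application/pdf
    }
}
```

Binary types are left alone because a charset parameter is generally not 
meaningful for formats such as PDF.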


On Fri, Aug 11, 2017 at 6:45 AM, Chris Gray wrote:

> You can fetch bitstreams from DSpace with URL paths like this (our xmlui
> context is implicit):
>
> /bitstream/id/{bitstream_id}/{bitstream_filename}
>
> I've been noticing that in our case txt bitstreams and pdf bitstreams are
> always delivered by the server with the character set in the response
> header set to ISO-8859-1 and not UTF-8.
>
> Is this a setting somewhere?  Is it possible to make it more flexible and
> adapt to actual content?
>
> In particular, I'm looking at the .pdf.txt files extracted by ImageMagick
> for full text indexing purposes.
>
> Is it possible to set a character encoding for individual PDFs and have
> ImageMagick take that into consideration in extracting full text?
>
> We are using DSpace 5.5 with security patches for 5.6 and 5.7 and XMLUI
> with Mirage2.
>



-- 
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
http://georgetown-university-libraries.github.io/
425-298-5498 (Seattle, WA)
