[CODE4LIB] A harvesting question
Hi dear list,

Can anyone give me an example of harvesting PubMed publications from a specific institution into DSpace? Could you help me configure the harvesting settings under Collection-Harvesting-Content Source in DSpace?

Content source: This collection harvests its content from an external source.
OAI Provider: __?? (PubMed)
OAI Set id: Specific sets __?? (for a specific institution)
Metadata Format: Simple Dublin Core [or] DSpace Intermediate Metadata
Content being harvested: Harvest metadata and bitstreams (requires ORE support)

By the way, we have been downloading XML data directly from the PubMed website and transforming it to DCXML with a local VBScript. We then export the DCXML file to Excel and transform the Excel file into SIP packages using BloomaMohan's program. We add several additional fields to the data set and do quite a bit of editing in the Excel file. I have been wondering whether DSpace's built-in harvesting would be a much better option.

Thank you for any idea or help!

Sophie
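The first step of the local workflow described above (PubMed XML to simple Dublin Core) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the sample record is made up, the `pubmed_to_dc` function name is hypothetical, the choice of `dc:title`, `dc:creator` and `dc:identifier` is an assumed mapping, and it does not reproduce the VBScript, the Excel editing, or the SIP packaging step.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Made-up record that follows the general shape of a PubMed XML export.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def pubmed_to_dc(pubmed_xml):
    """Return one simple Dublin Core element tree per PubmedArticle."""
    ET.register_namespace("dc", DC_NS)
    records = []
    for article in ET.fromstring(pubmed_xml).iter("PubmedArticle"):
        record = ET.Element("record")
        # Assumed mapping: ArticleTitle -> dc:title
        title = article.findtext(".//ArticleTitle", default="")
        ET.SubElement(record, f"{{{DC_NS}}}title").text = title
        # Assumed mapping: each Author -> dc:creator as "Last, Fore"
        for author in article.iter("Author"):
            last = author.findtext("LastName", default="")
            fore = author.findtext("ForeName", default="")
            name = ", ".join(part for part in (last, fore) if part)
            ET.SubElement(record, f"{{{DC_NS}}}creator").text = name
        # Assumed mapping: PMID -> dc:identifier with a local "pmid:" prefix
        pmid = article.findtext(".//PMID", default="")
        ET.SubElement(record, f"{{{DC_NS}}}identifier").text = "pmid:" + pmid
        records.append(record)
    return records


if __name__ == "__main__":
    for rec in pubmed_to_dc(SAMPLE):
        print(ET.tostring(rec, encoding="unicode"))
```

Any locally added fields would still need their own mapping, which is why the sketch keeps the mapping in one place inside the function.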
Re: [CODE4LIB] more on MARC char encoding
If a canned cleaner could be added to MarcEdit to deal with smart quotes/values, that would be great! Besides the smart quotes, please consider other special characters, including chemistry and mathematics symbols (these are different types of special characters, right?).

To better understand the character encoding issue, can anybody point me to some resources, or to a list of characters that are valid in UTF8 but are not in the MARC8 character set?

Thanks a lot.

Sophie

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Thursday, April 19, 2012 2:14 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding

Ah, thanks Terry. That canned cleaner in MarcEdit sounds potentially useful -- I'm in a continuing battle to keep the character encoding in our local MARC corpus clean.

(The real blame here is on cataloger interfaces that let catalogers save data containing bytes that are illegal for the character set it's being saved as. And/or display the data back to the cataloger using a translation that lets them show up as expected even though they are _wrong_ for the character set being saved as. Connexion is theoretically the Rolls Royce of cataloger interfaces, does it do this? Gosh I hope not.)

On 4/19/2012 2:20 PM, Reese, Terry wrote:

Actually -- the issue isn't one of MARC8 versus UTF8 (since this data is being harvested from DSpace and is UTF8 encoded). It's actually an issue with user-entered data -- specifically, smart quotes and the like. These values obviously are not in the MARC8 character set and cause problems for many who transform user-entered data (smart quotes tend to be used by default on Windows) from XML to MARC. If you are sticking with a strictly UTF8-based system, there generally are not issues because these are valid characters. If you move them into a system where the data needs to be represented in MARC -- then you have more problems.

We do a lot of harvesting, and because of that, we run into these types of issues moving data that is in UTF8, but has characters not represented in MARC8, into Connexion and having some of that data flattened. Given the wide range of data not in the MARC8 set that can show up in UTF8, it's not a surprise that this would happen.

My guess is that you could add a template to your XSLT translation that attempted to filter the most common forms of these smart quotes/values and replace them with the more standard values. Likewise, if there was a great enough need, I could provide a canned cleaner in MarcEdit that could fix many of the most common varieties of these smart quotes/values.

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Thursday, April 19, 2012 11:13 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding

If your records are really in MARC8 not UTF8, your best bet is to use a tool to convert them to UTF8 before hitting your XSLT. The open source 'yaz' command line tools can do it for Marc21. The Marc4J package can do it in Java, and will probably work for any MARC variant, not just Marc21.

Char encoding issues are tricky. You might want to first figure out if your records are really in Marc8, thus the problems, or if instead they illegally contain bad data or data in some other encoding (Latin1). Char encoding is a tricky topic, you might want to do some reading on it in general. The Unicode docs are pretty decent.

On 4/19/2012 11:06 AM, Deng, Sai wrote:

Hi list, I am a Metadata librarian but not a programmer, sorry if my question seems naïve. We use an XSLT stylesheet to transform some harvested DC records from DSpace to MARC in MarcEdit, and then export them to OCLC.
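Terry's suggestion of filtering the most common smart quotes before the data reaches MARC could also be prototyped outside XSLT. The following Python sketch is an illustration only and is not MarcEdit's cleaner: the `clean_smart_punctuation` name and the replacement table are hypothetical, the table covers only punctuation that appears in this thread's examples, and the choice of ASCII targets is an assumption that a cataloger might want to adjust.

```python
# Hypothetical replacement table limited to punctuation seen in this thread.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201F": "'",    # double high-reversed-9 mark, used as an apostrophe in the examples
    "\u201C": '"',    # left double quotation mark
    "\u201D": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.translate accepts a table built from single-character keys.
_TABLE = str.maketrans(REPLACEMENTS)


def clean_smart_punctuation(text):
    """Replace the listed typographic punctuation with plain ASCII."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(clean_smart_punctuation("Bayes\u2019 theorem"))
    print(clean_smart_punctuation("\u201cGeneration Y\u201d"))
```

Characters that are legitimate content, such as the accented letters in "naïve" or "Lévy", are deliberately left alone; they need an encoding fix rather than a substitution.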
Re: [CODE4LIB] more on MARC char encoding
Hi list,

I am a Metadata librarian but not a programmer, sorry if my question seems naïve. We use an XSLT stylesheet to transform some harvested DC records from DSpace to MARC in MarcEdit, and then export them to OCLC. Some characters do not display correctly and need manual editing, for example:

In MarcEditor | Transferred to OCLC | Edit in OCLC
Bayes’ theorem | Bayes⁰́₉ theorem | Bayes' theorem
―it won‘t happen here‖ attitude | ⁰́₅it won⁰́₈t happen here⁰́₆ attitude | it won't happen here attitude
“Generation Y” | ⁰́₋Generation Y⁰́₊ | Generation Y
listeners‟ evaluations | listeners⁰́ evaluations | listeners' evaluations
high school – from | high school ⁰́₃ from | high school – from
Co₀․₅Zn₀․₅Fe₂O₄ | Co²́⁰⁰́Þ²́⁵Zn²́⁰⁰́Þ²́⁵Fe²́²O²́⁴ | Co0.5Zn0.5Fe2O4?
μ | Îơ | μ
Nafion® | Nafion℗ʼ | Nafion®
Lévy | L©♭vy | Lévy
43±13.20 years | 43℗ł13.20 years | 43±13.20 years
12.6 ± 7.05 ft∙lbs | 12.6 ℗ł 7.05 ft⁸́₉lbs | 12.6 ± 7.05 ft•lbs
‘Pouring on the Pounds' | ⁰́₈Pouring on the Pounds' | 'Pouring on the Pounds'
k-ε turbulence | k-Îæ turbulence | k-ε turbulence
student—neither parents | student⁰́₄neither parents | student-neither parents
Λ = M – {p1, p2,…,pκ} | Î₎ = M ⁰́₃ {p1, p2,⁰́Œ,pÎð} | ? (won’t save)
M = (0, δ)x × Y | M = (0, Îþ)x ©₇ Y | ?
100° | 100℗ð | 100⁰
(α ≥16º) | (Îł ⁹́Æ16℗ð) | (α=16⁰)
naïve | na©¯ve | naïve

To deal with this, we normally replace a limited number of characters in MarcEditor first and then do the compiling and transfer. For example: replace ’ with ', “ with ", ” with " and ‟ with '. I am not sure about the right and efficient way to solve this problem. I see that the XSLT stylesheet specifies encoding=UTF-8. Is there a systematic way to make the characters transform and display correctly?

Thank you for your suggestion and feedback!

Sophie

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tod Olson
Sent: Tuesday, April 17, 2012 10:13 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

In practice it seems to mean UTF-8. At least I've only seen UTF-8, and I can't imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All of the offsets are byte-oriented, and there's too much legacy code that makes assumptions about null-terminated strings.

-Tod

On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:

Okay, forget XML for a moment, let's just look at marc 'binary'.

First, for Anglophone-centric MARC21. The LC docs don't actually say quite what I thought about leader byte 09, used to advertise encoding:

a - UCS/Unicode
Character coding in the record makes use of characters from the Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.

That doesn't say UTF-8. It says UCS or Unicode. What does that actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be called UCS I think?). Whatever it actually means, do people violate it in the wild?

Now we get to non-Anglophone centric marc. I think all of which is ISO_2709? A standard which of course is not open access, so I can't get it to see what it says. But leader 09 being used for
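The leader position discussed above can be inspected directly in a binary MARC 21 record, since the leader is the first 24 bytes of the record. The sketch below is a minimal illustration with a hypothetical `declared_encoding` helper and made-up leaders. It relies on the MARC 21 convention that a blank in position 09 means MARC-8 and `a` means UCS/Unicode, and it reports only what the record claims, which may not match the bytes actually present.

```python
LEADER_LENGTH = 24  # a MARC 21 leader is always 24 bytes


def declared_encoding(record):
    """Report the character coding scheme a MARC 21 record declares.

    `record` is the raw bytes of one record. Only leader position 09 is
    examined; the field data itself is not validated.
    """
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than a MARC leader")
    flag = record[9:10]
    if flag == b"a":
        return "UCS/Unicode"
    if flag == b" ":
        return "MARC-8"
    return "unknown (%s)" % flag.decode("ascii", "replace")


if __name__ == "__main__":
    # Made-up leaders used only to exercise the function.
    print(declared_encoding(b"00714cam a2200205 a 4500"))
    print(declared_encoding(b"00714cam  2200205 a 4500"))
```

A record whose leader says `a` but whose fields contain MARC-8 bytes (or the reverse) would pass this check unnoticed, so it is a first diagnostic rather than a validation.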
[CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
Hello, list,

Do you know of any digital library systems that can search within documents (e.g. PDFs) and handle access restrictions (e.g. DRM)? Have any of you compared these DL systems?

Thanks for any information!

Sophie
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
Maybe my question was not clear. We are looking for a system that can search the full text of the deposited documents; these are licensed materials, so we'll also need access restriction. We use DSpace, but I don't think DSpace does full text search, e.g. it doesn't search content in bitstreams (PDFs, PPTs...). Any suggestion?

Thanks!

Sophie

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Han, Yan [h...@u.library.arizona.edu]
Sent: Wednesday, October 20, 2010 3:25 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

I would think DSpace, Fedora, and Eprint. DSpace is fairly easy to implement and has embargo support in 1.6 (https://wiki.duraspace.org/display/DSTEST/Embargo ). I have an article comparing DSpace and Fedora, but it was written 6 years ago. DSpace has not changed much, but Fedora is a different story.

Yan
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
For access restriction, I mean we would like to have certain documents open only to certain communities (UpLib cannot do that, right?). I don't know how DRM affects file indexing.

On second thought, I searched for DSpace full text search and found this:
https://wiki.duraspace.org/display/DSPACE/Configure+full+text+indexing
However, I haven't seen any instance which shows the full text search results as I would see them in vendor databases. Any idea what system might be good/best for search within documents and DRM?

Thank you for the reply!

Sophie

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bill Janssen [jans...@parc.com]
Sent: Wednesday, October 20, 2010 4:01 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

Deng, Sai sai.d...@wichita.edu wrote:

Do you know the Digital Library systems which can search within the documents (e.g. PDFs) and handle access restrictions (e.g. DRM)?

Not sure what you mean by handle access restrictions. Do you mean it can index the documents put into it even if they have DRM encumbrances?

UpLib has search within the documents -- if you search for a word or phrase, it shows you all the documents which match, but also all the pages in each document which match. Supports a wide variety of document formats, from JPEG2000 to PDF to Powerpoint. But as far as I know it doesn't deal with DRM restrictions.

Bill
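The behavior Bill describes, where a search returns the matching documents and also the matching pages within each one, can be illustrated with a small inverted index. This is a toy sketch and not how UpLib or DSpace is implemented: the `PageIndex` class, its methods and the sample file names are all hypothetical, and text extraction from the PDFs or slides is assumed to have happened already.

```python
import re
from collections import defaultdict


class PageIndex:
    """Toy inverted index that remembers which pages contain each word."""

    def __init__(self):
        # word -> {document id -> set of page numbers}
        self._postings = defaultdict(lambda: defaultdict(set))

    def add_page(self, doc_id, page_number, text):
        """Index the already-extracted text of one page."""
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word][doc_id].add(page_number)

    def search(self, word):
        """Return {document id: sorted page numbers} for a single word."""
        hits = self._postings.get(word.lower(), {})
        return {doc_id: sorted(pages) for doc_id, pages in hits.items()}


if __name__ == "__main__":
    index = PageIndex()
    index.add_page("thesis.pdf", 1, "Turbulence models in pipe flow")
    index.add_page("thesis.pdf", 3, "Results for the turbulence model")
    index.add_page("slides.ppt", 2, "Access control overview")
    print(index.search("turbulence"))
```

Keeping page numbers in the postings is what lets a results screen show hits inside a document rather than only the document title.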
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
How can people tell that it searches content in bitstreams (PDFs, Word docs)? It looks like it only searches metadata.

Thanks.

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Han, Yan [h...@u.library.arizona.edu]
Sent: Wednesday, October 20, 2010 4:43 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

DSpace does full-text search; you need to turn it on in the configuration file. See UAL http://arizona.openrepository.com/arizona/

Yan
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
Thanks for the information! Greenstone has full text search, but I heard that its access control is much weaker than DSpace's. Will it be able to make certain documents open only to certain people or certain departments?

Thanks.

Sophie

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bill Janssen [jans...@parc.com]
Sent: Wednesday, October 20, 2010 4:31 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

Deng, Sai sai.d...@wichita.edu wrote:

For access restriction, I mean we would like to have certain documents open only to certain communities (UpLib cannot do that, right?).

OK, that's not what I typically think of when I hear DRM. Access control is (I think) the way it's usually put.

No, UpLib has no built-in access control system, though the hooks are there, and I know that some have used them to do access control. I know of one UpLib application which requires incoming connections to provide a client certificate, which it uses to give different clients different access rights. Probably overkill for most uses. You'd probably want to do an application-specific Web UI, though -- you could put the access restrictions there. I recently saw a Tomcat app which uses the UpLib Java client-side library to search for documents, then provided a completely custom UI.

On second thought, I searched for DSpace full text search and found this:
https://wiki.duraspace.org/display/DSPACE/Configure+full+text+indexing
However, I haven't seen any instance which shows the full text search results as I would see from vendor databases. Any idea on what system might be good/best for search within documents and DRM?

How about Greenstone?

Bill
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
Thanks for the questions! We don't have a clear idea yet and we are looking for a system now. The basic idea is that we'll deposit some licensed materials for a department and open them only to that group. I guess a local account would be OK; of course, if a campus account can be recognized, that's better. They'll need to log in to see a document if it's not IP-restricted, right? IP restriction might not be the best way, since faculty members will not always be in their departments.

Sophie

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Mark Jordan [mjor...@sfu.ca]
Sent: Wednesday, October 20, 2010 5:08 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

Sophie,

It might help some of us on the list to understand what types of access control you need if you can describe some of the ways that the allowed users (people and/or departments, to use your examples) will identify themselves. Will they have already logged into the system with a local (to the system) account, or with a campus account that knows that they are part of a specific department? Will they need to log into the system when they request to see a specific document? Will where they are sitting matter (i.e., restricted by IP address)?

Mark

Mark Jordan
Head of Library Systems
W.A.C. Bennett Library, Simon Fraser University
Burnaby, British Columbia, V5A 1S6, Canada
Voice: 778.782.5753 / Fax: 778.782.3023 / Skype: mark.jordan50
mjor...@sfu.ca
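Mark's questions separate two ways a user can be identified: by login (a local or campus account that carries group membership) and by network location. A minimal sketch of combining the two might look like the following. The `can_view` function, the group names and the address ranges are hypothetical, and a real repository would take this information from its own authorization system rather than from function arguments.

```python
import ipaddress


def can_view(item_groups, user_groups, client_ip=None, allowed_networks=()):
    """Decide whether a request may see a restricted item.

    Access is granted if the logged-in user shares a group with the item,
    or, failing that, if the request comes from an allowed network.
    """
    if set(item_groups) & set(user_groups):
        return True
    if client_ip is not None:
        address = ipaddress.ip_address(client_ip)
        return any(
            address in ipaddress.ip_network(network)
            for network in allowed_networks
        )
    return False


if __name__ == "__main__":
    # Faculty member logged in from off campus: the group membership is enough.
    print(can_view({"biology"}, {"biology", "faculty"}, "198.51.100.5"))
    # Anonymous visitor inside the department's address range.
    print(can_view({"biology"}, set(), "192.0.2.10", ["192.0.2.0/24"]))
```

Checking group membership first reflects the concern raised in the thread that faculty members will not always be sitting inside their department's address range.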
[CODE4LIB] Digital imaging questions
Hi, list,

A while ago, I read an interesting discussion on this list about how to use a camera to produce archival-quality images. Now I have some imaging questions, and I think this might be a good list to turn to. Thank you in advance!

We are trying to add some herbarium images to our DSpace. The specimen pictures will be taken at the Biology department, and the library is responsible for depositing the images and transferring/mapping/adding metadata. At the testing stage, they use a Fujifilm FinePix S8000fd digital camera (http://www.fujifilmusa.com/support/ServiceSupportSoftwareContent.do?dbid=874716prodcat=871639sscucatid=664260). It produces 8 megapixel images, and it doesn't have raw/tiff support. It seems that it cannot produce archival-quality images. Before we persuade the Biology department to switch cameras, I want to make sure it is absolutely necessary. The pictures they took look fine to the human eye; see an example at:
http://library.wichita.edu/techserv/test/herbarium/Astenophylla1-02710.jpg

To make master images, should a camera be capable of producing raw or tiff images at 12 megapixels or more?

A related archiving question: the biology field standard is DarwinCore; however, DSpace doesn't support it. The Biology Dept. already has some data in a spreadsheet. In this case, when it is impossible to map all the elements to Dublin Core, is it good practice for us to set up several local elements mapped from DarwinCore?

Thanks a million,
Sai

Sai Deng
Metadata Catalog Librarian
Wichita State University Libraries
1845 Fairmount
Wichita, KS 67260-0068
Phone: (316) 978-5138
Fax: (316) 978-3048
Email: sai.d...@wichita.edu said...@gmail.com
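One way to frame the megapixel question is in pixels per inch across the specimen sheet. The arithmetic below assumes an 8 megapixel frame of 3264 x 2448 pixels and a herbarium sheet of 11.5 x 16.5 inches that fills the frame. Neither figure comes from this thread, the `pixels_per_inch` helper is hypothetical, and any target resolution should be taken from a published imaging guideline rather than from this sketch.

```python
def pixels_per_inch(pixel_width, pixel_height, sheet_width_in, sheet_height_in):
    """Effective resolution when the sheet fills the frame.

    The long side of the image is matched to the long side of the sheet,
    and the smaller of the two ratios is returned as the limiting value.
    """
    long_px, short_px = max(pixel_width, pixel_height), min(pixel_width, pixel_height)
    long_in, short_in = max(sheet_width_in, sheet_height_in), min(sheet_width_in, sheet_height_in)
    return min(long_px / long_in, short_px / short_in)


if __name__ == "__main__":
    # Assumed figures: 3264 x 2448 pixel frame, 11.5 x 16.5 inch sheet.
    megapixels = 3264 * 2448 / 1_000_000
    ppi = pixels_per_inch(3264, 2448, 11.5, 16.5)
    print(f"{megapixels:.2f} MP gives about {ppi:.0f} ppi on the sheet")
```

Under those assumptions the limiting side works out to 3264 / 16.5, a little under 200 pixels per inch, which gives a concrete number to compare against whatever guideline the project adopts.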
Re: [CODE4LIB] Digital imaging questions
Andrew and Yan,

Thanks for the reply and the information! About the DSpace metadata registry: we can add a new schema or new elements to it, but the elements won’t be searchable, right? (We can change input-forms.xml to make them display in the submission workflow if we do item-by-item submission.)

In our case, we already have the herbarium metadata in an Excel sheet created by the Biology Dept. It is now in loose Darwin Core and kind of free style. If I would like to do data transformation (possibly transforming it to a mixture of DC and Darwin Core) and batch import the XML into DSpace, how should I proceed? Where should I add the Darwin Core metadata (in dublin_core.xml as well)? It seems that it only has the dcvalue element.

Sai

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Andrew Hankinson
Sent: Thursday, June 18, 2009 11:03 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Digital imaging questions

Hi Sai,

Archival Quality Images has some meaning, but it might be helpful to look up a standard and start your investigation for a new camera based on the recommendations of that standard. You might find this page from the Library of Congress helpful:
http://www.digitalpreservation.gov/formats/content/still.shtml

I think your indication that RAW/TIFF is needed is a pretty safe bet, but being able to point to an actual standard might make your case for a new camera a bit more convincing. Other factors to take into account (other than megapixels and format) are color reproduction, image 'noise' specifications, DPI, lighting (and probably many other things).

For DSpace you don't even need to map the elements of Dublin Core to DarwinCore. DSpace has the ability to input different schemas in its metadata registry. You can then modify the input-forms.xml file in the DSpace config directory to add the appropriate fields for the additional metadata fields.

Hope this helps!

-Andrew
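For the batch import question, the item-level dublin_core.xml file is a flat list of dcvalue elements, so a spreadsheet row can be converted mechanically once a mapping has been chosen. The sketch below assumes a hypothetical mapping from a few Darwin Core column names to Dublin Core element and qualifier pairs. The column names, the mapping, the sample values and the `row_to_dublin_core` function are illustrative only, and the question of where unmapped Darwin Core fields should live in DSpace is left open, as it is in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from spreadsheet columns to (element, qualifier).
MAPPING = {
    "scientificName": ("title", "none"),
    "recordedBy": ("contributor", "author"),
    "eventDate": ("date", "created"),
    "locality": ("coverage", "spatial"),
}


def row_to_dublin_core(row):
    """Build the text of a dublin_core.xml file from one spreadsheet row.

    `row` is a dict such as csv.DictReader would yield. Empty cells and
    columns that are not in MAPPING are skipped.
    """
    root = ET.Element("dublin_core")
    for column, (element, qualifier) in MAPPING.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue
        node = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        node.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up row used only to exercise the function.
    sample = {"scientificName": "Genus species", "recordedBy": "A. Collector", "locality": ""}
    print(row_to_dublin_core(sample))
```

Keeping the mapping in a single dictionary makes it easy to review with the Biology Dept. before any records are imported.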