There are reasons for editing and there are also reasons for not editing.

One big reason *for* editing arises whenever data from new sources are
being imported.

We in Librarianship/Information Science make decisions on how the data
will be made available to our users. Take an author name, for example:
there are many ways to write the same name for the same individual. A
person can adopt dozens of pen names over a lifetime, change their last
name upon marriage, and so on. The rule chosen by a particular library may
be shared with other libraries, or may be an entirely different one (based
on how that library's local community of users searches for and wants the
data), or there may be no rule at all, with the data recorded "as is",
exactly as it appears in the publication. Some libraries keep additional
records specially devoted to the synonymies of the same name; some do not.
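To illustrate the kind of reconciliation this implies, here is a minimal
sketch in Python. The names and the normalization rule are invented for
illustration; real authority control (VIAF, name authority files) is far
more involved than sorting accent-stripped name parts.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Reduce a name to a crude comparison key: strip accents,
    lowercase, drop commas, and sort the name parts."""
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode())
    parts = ascii_name.replace(",", " ").lower().split()
    return " ".join(sorted(parts))

# Three ways the same individual might appear in imported records.
variants = ["Machado de Assis",
            "Assis, Machado de",
            "ASSIS, Machado de"]

keys = {normalize_name(v) for v in variants}
print(len(keys))  # all three variants collapse to a single key
```

A rule this crude also collapses names that should stay distinct, which is
exactly why libraries maintain human-curated synonymy records instead.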

Google Book Search simply imported data from many libraries without making
any attempt to standardize it, resulting in the large number of duplicates
and junk records found in some searches (especially those where the
imprints had not standardized the data themselves).

Some special kinds of data about the same work can also be stored in
different sets of "fields" and "subfields" of MARC21 records across
different libraries, again because users' information needs about the
works vary from place to place (i.e., you get data duplication within the
same record if you simply merge records from different libraries).
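A toy sketch of that failure mode, with invented tags and values (590 is
used here as a stand-in for a local note field):

```python
# Two libraries store the same note under different MARC-like tags.
rec_a = {"245": ["Dom Casmurro"], "500": ["First edition note"]}
rec_b = {"245": ["Dom Casmurro"], "590": ["First edition note"]}

merged = {}
for rec in (rec_a, rec_b):
    for tag, values in rec.items():
        merged.setdefault(tag, [])
        for v in values:
            if v not in merged[tag]:  # dedup only within the same tag
                merged[tag].append(v)

# The note now appears twice, under 500 and 590, because tag-level
# dedup cannot see semantic duplicates held in different fields.
print(sorted(merged))  # ['245', '500', '590']
```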

The MARC21 specification also has an overall design that, IMHO, is
impossible to reflect in the current MediaWiki schema, even with Semantic
MediaWiki.

And sometimes libraries say that their data is stored in MARC21 fields
when it is actually in USMARC ones (yep, there are as many flavours of
MARC as there are flavours of Ubuntu). Or it is *based* on MARC21 fields,
with dozens of local adaptations.

I've just finished an internship at a library with 45k records that was
migrating data from REPIDISCA-*based* fields (let's call it a FreeBSD
flavour) to MARC21-*based* fields (in this comparison, an Ubuntu flavour;
and yep, *based*, with local adaptations, since we needed those changes).
The data is migrated in an automated fashion, but it still needs to be
validated record by record if the library wants those records in the
MARC21 fields as-is.
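That kind of migration is, in spirit, a mapping table plus a review queue.
A hypothetical sketch (the tags and the mapping are invented; a real
migration uses the library's own field map):

```python
# Invented source-tag -> MARC21-like-tag mapping for illustration only.
FIELD_MAP = {"30": "245", "16": "100"}

def migrate(record: dict) -> tuple[dict, list[str]]:
    """Map known tags automatically; queue unknown tags for human review."""
    out, review = {}, []
    for tag, value in record.items():
        if tag in FIELD_MAP:
            out[FIELD_MAP[tag]] = value
        else:
            review.append(tag)  # a person must decide, record by record
    return out, review

migrated, needs_review = migrate(
    {"30": "Title", "16": "Author", "99": "Local note"})
print(needs_review)  # ['99']
```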

What I'm saying is:

1) You can't simply import data from many sources without validation and
expect a good-quality end product. You will get "search engine"-quality
data (tons of random information that will make sense only with a
continuously developed set of algorithms, perhaps more time- and
resource-consuming than standardizing the data);

2) Data standardization is an epic work, dozens of times more epic than
writing a comprehensive encyclopedia about all subjects in all languages.
Institutional support will be needed, and in more comprehensive ways than
just releasing data for others to play around with (i.e., with additional
hands for the standardization work).

[[Paul Otlet]] (1868-1944) tried it, in efforts that lead some to argue he
was the concept designer of the Internet and hypertext. He had no success,
which is very unfortunate. Will the Wikimedians achieve any level of
success at it?

[[:m:User:555]]

On Fri, Dec 6, 2013 at 9:59 PM, Denny Vrandečić <[email protected]> wrote:

> Thanks for reviving this thread, Luiz. I also wanted to ask whether we
> should be updating parts of DNB and similar data. Maybe not create new
> entries, but for those that we already have, add some of the available data
> and point to the DNB dataset?
>
>
> On Fri, Dec 6, 2013 at 3:24 PM, Luiz Augusto <[email protected]> wrote:
>
>> Just found this thread while browsing my email archives (I've been
>> inactive on Wikimedia for at least two years)
>>
>> IMHO it would be very helpful if a central place hosting metadata from
>> digitized works were created.
>>
>> In my past experience, I've found lots of PD-old books in languages
>> like French, Spanish, and English in repositories from Brazil and
>> Portugal, with UIs mostly in Portuguese (i.e., with a very low
>> probability of being found by volunteers from those languages'
>> subdomains), for example.
>>
>> I particularly love validating metadata more than proofreading books.
>> Perhaps a tool/place like this would create new ways to contribute to
>> Wikisource and help with user retention (based on some wikipedians who
>> have fun making good articles but also sometimes love to simply make
>> trivial changes in their spare time)?
>>
>> I know that the thread was focused on general metadata about books of
>> all kinds and ages, but I had this idea while reading it
>>
>> [[:m:User:555]]
>>
>>
>> On Mon, Aug 26, 2013 at 10:42 AM, Thomas Douillard <
>> [email protected]> wrote:
>>
>>> I know; I started a discussion about porting the bot to Wikidata in
>>> the scientific journal WikiProject. One answer I got: the bot owner had
>>> other things to do in his life than running the bot and was not around
>>> very often any more. Having everything in Wikidata already will be a
>>> lot more reliable and lazier: no tool that works one day but not the
>>> next, no effort to tell the newbies that they should go to another
>>> website, no significant problem.
>>>
>>> Maybe one objection would be that the data could be vandalised easily,
>>> but maybe we should find a way to deal with imported sourced data that
>>> has no real reason to be modified, only to be marked deprecated or
>>> updated by another import from the same source.
>>>
>>>
>>> 2013/8/26 David Cuenca <[email protected]>
>>>
>>>> If the problem is automating bibliographic data importing, one
>>>> solution is what you propose: to import everything. Another is to have
>>>> an import tool that automatically imports the data for the item that
>>>> needs it. On WP they do that; there is a tool to import book/journal
>>>> info by ISBN/DOI. The same can be done on WD.
>>>>
>>>> Micru
>>>>
>>>>
>>>> On Mon, Aug 26, 2013 at 9:23 AM, Thomas Douillard <
>>>> [email protected]> wrote:
>>>>
>>>>> If Wikidata has the ambition to be a really reliable database, we
>>>>> should do everything we can to make it easy for users to use any
>>>>> source they want. In this perspective, if we get data with guaranteed
>>>>> high quality, it becomes easy for Wikidatians to find and use these
>>>>> references. Entering a reference into the database seems to me a
>>>>> highly tedious, boring, and easily automated task.
>>>>>
>>>>> With that in mind, any reference that the user does not have to
>>>>> enter by hand is a good thing, and importing high-quality source data
>>>>> should pass every Wikidata community barrier easily. If there is no
>>>>> problem for the software to handle that much information, I say we
>>>>> really have no reason not to do the imports.
>>>>>
>>>>> Tom
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Wikidata-l mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Etiamsi omnes, ego non
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
_______________________________________________
Wikisource-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
