On 25 November 2014 at 11:33, Andrea Zanni <[email protected]> wrote:
>> How would I do that now? Wikisource pages are not structured data
>> (though Wikimedia Commons image metadata will soon be!), so there is
>> not a clear way to use the Wikisource API to extract just the relevant
>> transcribed text on the page as a field. And on top of that, any text
>> you do extract this way will be full of templates and other code that
>> has no meaning outside of the context of Wikisource. I don't see a way
>> to easily extract just the plain text that is meaningful and relevant
>> (along with other fielded data, like what page or text it belongs to).
>
> Wikisource as a "structured" repository is what we have been asking for
> since the dawn of time :-)
> The problem, as usual, is that when things are left to volunteer
> developers, they go slooooowly.
> I do think this is fundamental: an ideal Wikisource would ingest and
> understand many metadata standards, and would give them back as well.
>
> As for the Wikimedia API, I did this awful script:
> https://github.com/Aubreymcfato/ws_scraper
> Please come and make it better :-D

Awesome! I'll definitely give it a whirl.

> It just scrapes the data from the HTML (it is localized to it.source,
> but a quick glance at the HTML source of your own ws could help you,
> especially if you use microformats) and puts them in a CSV.
> If you take the HTML you can also get the formatted text.
> (I also wonder about a Wikisource which understands Markdown, but that's
> too far :-)

You have a good point, though. One of the differences between Wikisource
and most other platforms is that it is actually richly formatted. It's
kind of a shame to strip all that formatting information out when
extracting the transcriptions. (Though many destinations wouldn't know
what to do with formatted text anyway.)
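
To make the API point above concrete: this is roughly all you can get
today. A minimal Python sketch, assuming the English Wikisource endpoint
and a made-up page title (it just pulls raw wikitext with the standard
action=query / prop=revisions call):

    import requests

    API = "https://en.wikisource.org/w/api.php"

    def fetch_wikitext(title):
        """Fetch raw wikitext for one page via the standard MediaWiki API."""
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]

    # "Page:Example.djvu/7" is a placeholder title. What comes back is raw
    # wikitext -- templates and proofreading markup included -- with
    # nothing marking off which part is the transcription itself.
    print(fetch_wikitext("Page:Example.djvu/7"))

Which is exactly the problem: the text arrives, but everything
interesting about it is locked up in per-wiki conventions rather than in
structure.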
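
And for anyone curious what the HTML route looks like, here is the
skeleton of a scraper along the lines of Andrea's (only a sketch:
firstHeading and mw-content-text are the stock MediaWiki ids, but the
per-wiki classes and microformats he mentions would have to be added for
each project, the way ws_scraper does for it.source):

    import csv
    import requests
    from bs4 import BeautifulSoup

    def scrape_page(url):
        """Pull title and plain text out of a rendered Wikisource page."""
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        title = soup.find("h1", id="firstHeading").get_text(strip=True)
        # get_text() keeps the words but drops all the formatting the
        # rendered HTML carries.
        body = soup.find("div", id="mw-content-text").get_text(" ", strip=True)
        return title, body

    with open("pages.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "text"])
        # Placeholder URL -- point this at whatever pages you harvest.
        writer.writerow(scrape_page("https://en.wikisource.org/wiki/Example"))

Note that get_text() is also where the rich formatting dies, which is the
shame I mentioned above.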
