Greg, I've spent a while editing this data and have my progress posted here:
http://crosswire.org/~scribe/hesychius.tar.gz

-Troy.

Greg Hellings wrote:
> On further inspection it appears that the only HTML formatting that
> appears in the above document is a <div....> .... </div> that
> corresponds with every <text> ... </text> element in the exported XML.
> Thus all of the angle brackets that appear around anything other than
> text/title/page are brackets that are somehow significantly placed
> around Greek words.
>
> Perhaps this is the limit of where pure XSLT can take us? It seems
> that it would be better at this point to process the remaining text
> with something like Python or Perl and have that generate the desired
> OSIS text, since the OSIS has nothing to do with the XML structure of
> the current document but rather with its textual content?
>
> This really is my last e-mail tonight...
>
> --Greg
>
> On 11/10/06, Greg Hellings <[EMAIL PROTECTED]> wrote:
>> And I forgot to mention that I had posted it to the wxSword download
>> site on SourceForge:
>> https://sourceforge.net/project/showfiles.php?group_id=142229
>>
>> Sorry!
>>
>> --Greg
>>
>> On 11/10/06, Greg Hellings <[EMAIL PROTECTED]> wrote:
>>> Getting the output from their included wiki export page was the
>>> trivial portion of the task (read: I had to guess completely judging
>>> from the directions that were on Wikipedia's site and extrapolate
>>> those to figure out what name WikiSource actually wanted for each
>>> page). Writing the XSLT is proving to be far more cumbersome. I just
>>> spent over an hour trying to figure out why my XSLT was not producing
>>> any output, only to realize that the exported file had a default
>>> namespace.
>>>
>>> It will be incredibly difficult to extract any structural information
>>> from the files in an automated system.
>>> For one, I am not familiar
>>> with what Hesychius is, and while I took extensive Greek in my
>>> undergrad course of study, reading through that massive document would
>>> be unwieldy for me at this point, since I could not dedicate a huge
>>> amount of time to the work.
>>>
>>> For now I have posted an XML file that is the filtered XML that comes
>>> from the export, with everything except for the page, title and text
>>> fields removed (since the rest of the information simply pertains to
>>> who performed the latest modification to the page and when it happened
>>> and their change log entry). I have also modified all of the &gt; and
>>> &lt; to be > and < in an effort to return the data to its display
>>> format.
>>>
>>> Someone will need to figure out how to differentiate when the < or >
>>> is pertinent to the HTML/XML or when it is pertinent to the more
>>> specific data within. The WikiSource document seems to make very poor
>>> use of the < and > characters to both denote a keyword and to
>>> emphasize certain words or phrases, thus making the data even more
>>> difficult to parse. I don't know that a fully automated solution will
>>> be possible with this data or with the original data... but it's all
>>> just a starting point.
>>>
>>> If you want other files, let me know.
>>>
>>> --Greg
>>>
>>> On 11/9/06, Troy A. Griffitts <[EMAIL PROTECTED]> wrote:
>>>> Greg,
>>>> You're amazing!!! I must have played with stuff for hours today
>>>> trying to make sense of the wikimedia export docs. I even downloaded
>>>> some PyWikipediaBot python thingy but couldn't get it to run either
>>>> (I am inept at python, so I wasn't surprised, though quite frustrated,
>>>> nonetheless). Thank you!!! If this might make any difference, my
>>>> personal interest in the lexicon, after it is usable by SWORD, is to
>>>> build a synonyms database from the data.
>>>> If there is any indication in
>>>> the data that a synonym for an entry is being listed, I would most
>>>> appreciate a unique <seg type="x-synonym">, or some such. Thank you
>>>> again, so much, for your work. I am very excited!
>>>>
>>>> -Troy.
>>>>
>>>> Greg Hellings wrote:
>>>>> So yeah... I managed to grab the XML file from the Export (it's fun
>>>>> trying to do that on a webpage written in modern Greek when you're
>>>>> used to ancient Greek and you can't remember what the Koine word for
>>>>> "hyperlink" or "webpage" is :P).
>>>>>
>>>>> It comes to a mere 4.2 MB file, so now the trick will be parsing the
>>>>> text that is wanted out of that and creating an OSIS from it. The
>>>>> main problem with that is that the text from the file is placed inside
>>>>> of a tag with an xml:space="preserve" attribute, and all of the HTML is
>>>>> encoded as entities underneath that. Therefore all of the
>>>>> structure of the actual data (other than the large groupings under
>>>>> alpha, beta, gamma, etc.) is lost to an XML/XSL parsing combination.
>>>>>
>>>>> Wish me luck... ::dives into a pile of libxml2::
>>>>>
>>>>> --Greg Hellings
>>>>>
>>>>> On 11/9/06, Troy A. Griffitts <[EMAIL PROTECTED]> wrote:
>>>>>> We had a contributor on IRC, today, post this link:
>>>>>>
>>>>>> http://el.wikisource.org/wiki/%CE%93%CE%BB%E1%BF%B6%CF%83%CF%83%CE%B1%CE%B9
>>>>>>
>>>>>> It looks promising.
>>>>>>
>>>>>> I know there is a way to download content in XML of a mediawiki site,
>>>>>> but have no experience doing so.
>>>>>>
>>>>>> Anyone want to take a shot at producing a SWORD Hesychius Lexicon (or
>>>>>> even just a text file) from this link?
>>>>>>
>>>>>> Thanks for everyone's input and help.
>>>>>>
>>>>>> -Troy.
>>>>>>
>>>>>> Peter von Kaehne wrote:
>>>>>>> I spoke yesterday both to Prof Hansen and to Prof Ian Cunningham (who
>>>>>>> is a collaborator of Hansen)
>>>>>>>
>>>>>>> http://www.csad.ox.ac.uk/CSAD/Hesychius/Hansen.html
>>>>>>>
>>>>>>> Prof Hansen mentioned the TLG and Prof Cunningham confirmed this + said
>>>>>>> further there is no electronic version of Hansen's work available. I
>>>>>>> understand that Hansen's work is published in de Gruyters' Sammlung
>>>>>>> Griechischer and Lateinischer Altertuemer
>>>>>>>
>>>>>>> http://www.degruyter.com/rs/174_AT_E_ED_ENU_h.cfm?rc=19992&id=SER-M1-WDG-HESYCH-B-19992&fg=AT
>>>>>>>
>>>>>>> - a copy of which I found here to buy:
>>>>>>>
>>>>>>> http://www.basis-buch.de/main-173503.html
>>>>>>>
>>>>>>> WRT the TLG. I read the licence in detail and, bluntly put, they have
>>>>>>> no leg to stand upon to deny us using the texts:
>>>>>>>
>>>>>>> They already allowed us to do what we want to do on the basis of the
>>>>>>> licence - even if they now get cold feet on direct questioning. That
>>>>>>> said, at least Schmidt's edition is now public domain anyway and unless
>>>>>>> there are DMCA restrictions everyone can copy it out of there anyway.
>>>>>>> And outside of DMCA-alike legislation only the public domain-ness
>>>>>>> would apply anyway. But IANAL etc.
>>>>>>>
>>>>>>> WRT Latte/Hansen - I am not sure how far Latte's work would constitute
>>>>>>> an original work in its own right - I presume it does - but again the
>>>>>>> TLG licence does allow text extraction for scholarly work which is
>>>>>>> non-commercial.
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>> Date: Fri, 03 Nov 2006 17:23:03 -0700
>>>>>>> From: "Troy A. Griffitts" <[EMAIL PROTECTED]>
>>>>>>> To: SWORD Developers' Collaboration Forum <[email protected]>
>>>>>>> Subject: Re: [sword-devel] Hesychius
>>>>>>>
>>>>>>>> Peter,
>>>>>>>> Thank you for your time and info.
>>>>>>>> We have an ongoing dialog with UCI
>>>>>>>> regarding the use of the data from TLG. They have denied our request
>>>>>>>> twice, but I am hoping a detailed third plea might solicit sympathy.
>>>>>>>>
>>>>>>>> -Troy.
>>>>>>>>
>>>>>>>> Peter von Kaehne wrote:
>>>>>>>>> The TLG, though, also has the older edition by Schmidt, which should
>>>>>>>>> by now be public domain as it is from 1861
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> -------- Original Message --------
>>>>>>>>> Date: Fri, 03 Nov 2006 15:59:02 +0100
>>>>>>>>> From: "Peter von Kaehne" <[EMAIL PROTECTED]>
>>>>>>>>> To: SWORD Developers' Collaboration Forum <[email protected]>
>>>>>>>>> Subject: Re: [sword-devel] Hesychius
>>>>>>>>>
>>>>>>>>>> The TLG indeed contains parts of the Hesychius - Latte's work only.
>>>>>>>>>>
>>>>>>>>>> Hansen's work is published on paper only in Germany. Electronic
>>>>>>>>>> copies are not available.
>>>>>>>>>>
>>>>>>>>>> The TLG licence for the text is such that the work might be possible
>>>>>>>>>> to integrate - i.e. commercial scholarly tools making use of the
>>>>>>>>>> whole text are forbidden, but CrossWire might be possible.
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> -------- Original Message --------
>>>>>>>>>> Date: Thu, 02 Nov 2006 16:38:36 -0700
>>>>>>>>>> From: "Troy A. Griffitts" <[EMAIL PROTECTED]>
>>>>>>>>>> To: [email protected]
>>>>>>>>>> Subject: [sword-devel] Hesychius
>>>>>>>>>>
>>>>>>>>>>> If anyone has the time to research where we can find an electronic
>>>>>>>>>>> copy of Hesychius' Greek Lexicon, your efforts would be extremely
>>>>>>>>>>> valuable to me right now. I believe the TLG has a copy of it, but
>>>>>>>>>>> I currently don't have easy access to the TLG. Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> -Troy.
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> sword-devel mailing list: [email protected]
>>>>>>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>>>>>> Instructions to unsubscribe/change your settings at above page
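
For anyone picking up the "Python or Perl" route suggested in this thread, here is a rough sketch of what a first pass might look like. It is only an illustration under stated assumptions, not the script actually used on this data: the function names, the sample document, its namespace URI and the Greek words in it are all made up for the example, and treating only <div> as real markup is taken from Greg's observation above rather than checked against the full export. It reads the namespace from the root element (the default-namespace problem that made the XSLT produce no output), pulls out title and text, and sorts bracketed spans into HTML-like markup and bracketed words.

```python
import re
import xml.etree.ElementTree as ET

# Tag names treated as real markup. Assumption taken from the thread,
# which reports <div> as the only HTML that appears in the export.
MARKUP_TAGS = {"div"}

# Anything of the form <...> or </...> with no nested angle brackets.
BRACKETED = re.compile(r"<(/?)([^<>]+)>")

# Invented sample; the namespace URI and the entry are placeholders.
SAMPLE = """<mediawiki xmlns="http://example.org/export/">
  <page>
    <title>Α</title>
    <revision>
      <text xml:space="preserve">&lt;div class="entry"&gt;&lt;λόγος&gt;: μῦθος&lt;/div&gt;</text>
    </revision>
  </page>
</mediawiki>"""


def namespace_of(element):
    """Return the '{uri}' prefix of an element's tag, or '' if it has none."""
    if element.tag.startswith("{"):
        return element.tag[: element.tag.index("}") + 1]
    return ""


def pages(xml_source):
    """Yield (title, text) per <page>, qualifying names with the root namespace."""
    root = ET.fromstring(xml_source)
    ns = namespace_of(root)
    for page in root.iter(f"{ns}page"):
        title = page.findtext(f"{ns}title", default="")
        # The parser has already decoded &lt; and &gt; back to < and >.
        text = page.findtext(f".//{ns}text", default="")
        yield title, text


def classify_brackets(text):
    """Split bracketed spans into HTML-like markup and bracketed words."""
    markup, words = [], []
    for match in BRACKETED.finditer(text):
        content = match.group(2).strip()
        if not content:
            continue
        name = content.split()[0].lower()
        if name in MARKUP_TAGS:
            markup.append(match.group(0))
        else:
            words.append(content)
    return markup, words


if __name__ == "__main__":
    for title, text in pages(SAMPLE):
        markup, words = classify_brackets(text)
        print(title, markup, words)
```

Note that an unqualified lookup such as root.findall("page") finds nothing on a document like this, which is the same symptom as the silent XSLT. Whether a bracketed word marks a keyword or an emphasis still has to be decided by someone who knows the text; the sketch only separates the two kinds of bracket.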
