Re: scientific publishing process (was Re: Cost and access)
On 2014-10-07 15:44, Peter F. Patel-Schneider wrote:

Well, I remain totally unconvinced that any current HTML solution is as good as the current PDF setup. Certainly htlatex is not suitable. There may be some way to get tex4ht to do better, but no one has provided a solution. Sarven Capadisli sent me some HTML that looks much better, but even on a math-light paper I could see a number of glitches. I haven't seen anything better than that.

Would you mind creating an issue for the glitches that you are experiencing? https://github.com/csarven/linked-research/issues Please mention your environment and the documents you've looked at. Also keep in mind the LNCS and ACM SIG authoring guidelines. The purpose of the LNCS and ACM CSS is to adhere to the authoring guidelines so that the generated PDF file or print output looks as expected (within reason). Much appreciated!

-Sarven http://csarven.ca/#i
Re: scientific publishing process (was Re: Cost and access)
Done.

The goal of a new paper-preparation and display system should, however, be to be better than what is currently available. Most HTML-based solutions do not exploit the benefits of HTML, strangely enough. Consider, for example, citation links. They generally jump you to the references section. They should instead pop up the reference, as is done in Wikipedia. Similarly for links to figures. Instead of blindly jumping to the figure, they should do something better, perhaps popping up the figure or, if the figure is already visible, just highlighting it. I have put in both of these as issues.

peter

On 10/08/2014 03:18 AM, Sarven Capadisli wrote:

On 2014-10-07 15:44, Peter F. Patel-Schneider wrote: Well, I remain totally unconvinced that any current HTML solution is as good as the current PDF setup. Certainly htlatex is not suitable. There may be some way to get tex4ht to do better, but no one has provided a solution. Sarven Capadisli sent me some HTML that looks much better, but even on a math-light paper I could see a number of glitches. I haven't seen anything better than that.

Would you mind creating an issue for the glitches that you are experiencing? https://github.com/csarven/linked-research/issues Please mention your environment and the documents you've looked at. Also keep in mind the LNCS and ACM SIG authoring guidelines. The purpose of the LNCS and ACM CSS is to adhere to the authoring guidelines so that the generated PDF file or print output looks as expected (within reason). Much appreciated!

-Sarven http://csarven.ca/#i
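[For readers following along: a minimal sketch of the pop-up behaviour Peter describes, assuming references sit in an ordinary list with ids and citations are plain fragment links. The class names (cite, cite-popup) are invented for illustration; clicking still jumps, so it degrades gracefully without JavaScript.]

<p>As argued by Smith <a class="cite" href="#ref-smith">[1]</a>.</p>

<ol id="references">
  <li id="ref-smith">Smith, J. (2013). An Example Reference.</li>
</ol>

<style>
  a.cite { position: relative; }
  a.cite .cite-popup { display: block; position: absolute; left: 0; top: 1.5em;
    background: #fff; border: 1px solid #999; padding: 0.5em;
    width: 20em; z-index: 1; }
</style>

<script>
// On hover, copy the referenced entry's text into a floating tip
// attached to the citation link; remove it again on mouse-out.
Array.prototype.forEach.call(document.querySelectorAll('a.cite'), function (link) {
  link.addEventListener('mouseover', function () {
    var ref = document.querySelector(link.getAttribute('href'));
    if (!ref || link.querySelector('.cite-popup')) { return; }
    var tip = document.createElement('span');
    tip.className = 'cite-popup';
    tip.textContent = ref.textContent;
    link.appendChild(tip);
  });
  link.addEventListener('mouseout', function () {
    var tip = link.querySelector('.cite-popup');
    if (tip) { link.removeChild(tip); }
  });
});
</script>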
Re: scientific publishing process (was Re: Cost and access)
On 2014-10-08 14:10, Peter F. Patel-Schneider wrote:

Done. The goal of a new paper-preparation and display system should, however, be to be better than what is currently available. Most HTML-based solutions do not exploit the benefits of HTML, strangely enough. Consider, for example, citation links. They generally jump you to the references section. They should instead pop up the reference, as is done in Wikipedia. Similarly for links to figures. Instead of blindly jumping to the figure, they should do something better, perhaps popping up the figure or, if the figure is already visible, just highlighting it. I have put in both of these as issues.

Thanks a lot for the issues! Really great to have this feedback. I have resolved and commented on some of those already, and will look at the rest very shortly.

I am all for improving the interaction as well. I'd like to state again that the development has so far focused on adhering to the LNCS/ACM guidelines, and improving the final PDF/print product. That is to get on reasonable grounds with the state of the art. Moving on: I plan to bring in the interaction and framework to easily semantically enrich the document, as well as the overall UX. I have some preliminary code in my dev branch, and will bring it forward, and would like feedback as well.

Thanks again, and please continue to bring forward any issues or feature requests. Contributors are most welcome!

-Sarven http://csarven.ca/#i
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes:

PLOS is an interesting case. The HTML for PLOS articles is relatively readable. However, the HTML that the PLOS setup produces is failing at math, even for articles from August 2014. As well, sometimes when I zoom in or out (so that I can see the math better) Firefox stops displaying the paper, and I have to reload the whole page.

Interesting bug, that. Worth reporting to PLoS.

Strangely, PLOS accepts low-resolution figures, which in one paper I looked at are quite difficult to read.

Yep. Although it often provides several links to download higher-res images, including in the original file format. Quite handy.

However, maybe the PLOS method can be improved to the point where the HTML is competitive with PDF.

Indeed. For the moment, HTML views are about 1/5 of PDF. Partly this is because scientists are used to viewing in print format, I suspect, but partly not. I'm hoping that, eventually, PLoS will stop using image-based maths. I'd like to be able to zoom maths independently, and copy and paste it in either MathML or TeX. MathJax does this now already.

Phil
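[To make the MathJax point concrete, a minimal page along these lines renders TeX input as scalable text that zooms with the page and can be copied back out as TeX or MathML via MathJax's context menu. The CDN URL below is the one MathJax documented at the time and is an assumption on my part.]

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>MathJax instead of image-based maths</title>
  <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</head>
<body>
  <p>Inline maths such as \(e^{i\pi} + 1 = 0\) reflows with the text,
  and display maths scales when the reader zooms:</p>
  \[ \hat{H}\psi = E\psi \]
</body>
</html>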
Re: scientific publishing process (was Re: Cost and access)
On 10/08/2014 05:31 AM, Phillip Lord wrote:

Peter F. Patel-Schneider pfpschnei...@gmail.com writes: PLOS is an interesting case. The HTML for PLOS articles is relatively readable. However, the HTML that the PLOS setup produces is failing at math, even for articles from August 2014. As well, sometimes when I zoom in or out (so that I can see the math better) Firefox stops displaying the paper, and I have to reload the whole page.

Interesting bug, that. Worth reporting to PLoS.

PLoS doesn't appear to have a bug reporting system in place. Even their general assistance email is obfuscated. I sent them a message anyway.

Strangely, PLOS accepts low-resolution figures, which in one paper I looked at are quite difficult to read.

Yep. Although it often provides several links to download higher-res images, including in the original file format. Quite handy.

In this case, even the original was low resolution.

However, maybe the PLOS method can be improved to the point where the HTML is competitive with PDF.

Indeed. For the moment, HTML views are about 1/5 of PDF. Partly this is because scientists are used to viewing in print format, I suspect, but partly not. I'm hoping that, eventually, PLoS will stop using image-based maths. I'd like to be able to zoom maths independently, and copy and paste it in either MathML or TeX. MathJax does this now already.

I would suggest that this should have been one of their highest priorities.

Phil

peter
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes:

The goal of a new paper-preparation and display system should, however, be to be better than what is currently available. Most HTML-based solutions do not exploit the benefits of HTML, strangely enough. Consider, for example, citation links. They generally jump you to the references section. They should instead pop up the reference, as is done in Wikipedia.

Yes, I agree. I do this on my blog, or rather provide it as an option. The reference list is also automatically generated here, so, for example, there is no metadata associated with the two references in this post: http://www.russet.org.uk/blog/3015 In both cases, the reference list is formed from the metadata on the other end of the link, gathered either from the HTML, or in the case of arXiv from their XML-RPC interface.

Similarly for links to figures. Instead of blindly jumping to the figure, they should do something better, perhaps popping up the figure or, if the figure is already visible, just highlighting it.

Or better still, providing access to the code and data from which the figure is derived.

Phil
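[A sketch of the metadata-gathering step Phil describes, under two assumptions: the cited page emits the common citation_* meta tags (as many publishers do), and it is fetchable from script (same origin, CORS, or a proxy). The function name is hypothetical.]

<script>
// Build a reference entry from the <meta> tags on the landing page
// at the other end of a citation link.
function referenceFromUrl(url, done) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.onload = function () {
    // Parse the fetched markup without rendering it.
    var doc = new DOMParser().parseFromString(xhr.responseText, 'text/html');
    function meta(name) {
      var el = doc.querySelector('meta[name="' + name + '"]');
      return el ? el.getAttribute('content') : '';
    }
    done({ title:  meta('citation_title'),
           author: meta('citation_author'),
           date:   meta('citation_date') });
  };
  xhr.send();
}

// e.g. referenceFromUrl('/papers/example.html', function (ref) {
//   console.log(ref.author + ' (' + ref.date + '). ' + ref.title);
// });
</script>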
Re: scientific publishing process (was Re: Cost and access)
Dear Sarven,

I really appreciate the work that you're doing with trying to style an HTML page to look similar to the Latex templates. But there are so many typesetting details that are not available in browsers, which means you're going to do a lot of DOM hacking to be able to produce the same quality typography that Latex is capable of. Latex will justify text, automatically hyphenate, provide proper spacing, and other typesetting features. Not to mention kerning. Kerning is a *huge* thing in typography, and with HTML you're stuck with creating a DOM element for every single letter - yup, you heard me right.

I think it would be super cool to create some sort of JavaScript framework that would enable the same level of typography that Latex is capable of, but you'll eventually hit some hard limitations and you'll probably be stuck drawing on a canvas. What are your ideas regarding these problems?

On Wed, Oct 8, 2014 at 2:26 PM, Sarven Capadisli i...@csarven.ca wrote:

On 2014-10-08 14:10, Peter F. Patel-Schneider wrote: Done. The goal of a new paper-preparation and display system should, however, be to be better than what is currently available. Most HTML-based solutions do not exploit the benefits of HTML, strangely enough. Consider, for example, citation links. They generally jump you to the references section. They should instead pop up the reference, as is done in Wikipedia. Similarly for links to figures. Instead of blindly jumping to the figure, they should do something better, perhaps popping up the figure or, if the figure is already visible, just highlighting it. I have put in both of these as issues.

Thanks a lot for the issues! Really great to have this feedback. I have resolved and commented on some of those already, and will look at the rest very shortly. I am all for improving the interaction as well. I'd like to state again that the development has so far focused on adhering to the LNCS/ACM guidelines, and improving the final PDF/print product. That is to get on reasonable grounds with the state of the art. Moving on: I plan to bring in the interaction and framework to easily semantically enrich the document, as well as the overall UX. I have some preliminary code in my dev branch, and will bring it forward, and would like feedback as well. Thanks again, and please continue to bring forward any issues or feature requests. Contributors are most welcome!

-Sarven http://csarven.ca/#i
Re: scientific publishing process (was Re: Cost and access)
I'm always at a bit of a loss when I read this sort of thing. Kerning, seriously? We can't share scientific content in HTML because of kerning? In practice, web browsers do a perfectly reasonable job of text layout, in real time, and do it in a way that allows easy reflowing. The thing I like most about Sarven's LNCS style sheets, for instance, is that I can turn them off; I don't like the LNCS format.

Having said all of that, 5 minutes of googling suggests that kerning support is in Candidate Recommendation form from W3C, and that there are at least three different JS libraries that support it.

Phil

Luca Matteis lmatt...@gmail.com writes:

I really appreciate the work that you're doing with trying to style an HTML page to look similar to the Latex templates. But there are so many typesetting details that are not available in browsers, which means you're going to do a lot of DOM hacking to be able to produce the same quality typography that Latex is capable of. Latex will justify text, automatically hyphenate, provide proper spacing, and other typesetting features. Not to mention kerning. Kerning is a *huge* thing in typography, and with HTML you're stuck with creating a DOM element for every single letter - yup, you heard me right. I think it would be super cool to create some sort of JavaScript framework that would enable the same level of typography that Latex is capable of, but you'll eventually hit some hard limitations and you'll probably be stuck drawing on a canvas. What are your ideas regarding these problems?

On Wed, Oct 8, 2014 at 2:26 PM, Sarven Capadisli i...@csarven.ca wrote: On 2014-10-08 14:10, Peter F. Patel-Schneider wrote: Done. The goal of a new paper-preparation and display system should, however, be to be better than what is currently available. Most HTML-based solutions do not exploit the benefits of HTML, strangely enough. Consider, for example, citation links. They generally jump you to the references section. They should instead pop up the reference, as is done in Wikipedia. Similarly for links to figures. Instead of blindly jumping to the figure, they should do something better, perhaps popping up the figure or, if the figure is already visible, just highlighting it. I have put in both of these as issues.

Thanks a lot for the issues! Really great to have this feedback. I have resolved and commented on some of those already, and will look at the rest very shortly. I am all for improving the interaction as well. I'd like to state again that the development has so far focused on adhering to the LNCS/ACM guidelines, and improving the final PDF/print product. That is to get on reasonable grounds with the state of the art. Moving on: I plan to bring in the interaction and framework to easily semantically enrich the document, as well as the overall UX. I have some preliminary code in my dev branch, and will bring it forward, and would like feedback as well. Thanks again, and please continue to bring forward any issues or feature requests. Contributors are most welcome!

-Sarven http://csarven.ca/#i

--
Phillip Lord,                           Phone: +44 (0) 191 222 7827
Lecturer in Bioinformatics,             Email: phillip.l...@newcastle.ac.uk
School of Computing Science,            http://homepages.cs.ncl.ac.uk/phillip.lord
Room 914 Claremont Tower,               skype: russet_apples
Newcastle University,                   twitter: phillord
NE1 7RU
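[For what it's worth, much of what Luca lists already maps onto CSS properties, with no per-letter DOM hacking. A sketch; support and vendor prefixes varied across 2014 engines, so treat it as indicative rather than guaranteed:]

article {
  text-align: justify;
  -webkit-hyphens: auto;
  -moz-hyphens: auto;
  hyphens: auto;                    /* automatic hyphenation; needs a lang attribute */
  font-kerning: normal;             /* kerning control, CSS Fonts Module Level 3 */
  font-feature-settings: "kern" 1, "liga" 1;  /* OpenType kerning and ligatures */
  text-rendering: optimizeLegibility;
}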
Re: scientific publishing process (was Re: Cost and access)
Hi Sarven,

Congratulations on kicking off a thread that has received over 150 replies across two W3 lists in a week. That is impressive! This isn't the first time (nor the last) that it has been discussed. The active discussion reaffirms the need to drive a closer dialog between Web technologists and publishers for scientific publishing. One gets the sense that there is serious depth of expertise on the publishing workflow on these lists. People have taken considerable time to reply and be constructive with ideas to advance the effort. Thanks.

Can anyone advise on whether the publishers in 2014 are in fact on the 'front lines' of defining these standards that affect their core business, i.e., Web standards that are the foundation for layout and typography?

Is this an opportunity for W3C members to take this up as a topic for discussion at the upcoming TPAC? Perhaps this is already scheduled? W3C staffers, any guidance on this?

I still contend there is a great business opportunity for an entrepreneurial, Web publishing-savvy team to build something really useful, immediately have 1000+ researchers provide feedback, and drive use.

Cheers,
Bernadette Hyland
CEO, 3 Round Stones, Inc.
http://3roundstones.com
http://about.me/bernadettehyland

PS. It's also clear your PhD dissertation topic is of keen interest, Sarven!! We'd like to read it when you're done (no pressure ;-)

On Oct 8, 2014, at 10:09 AM, Gray, Alasdair a.j.g.g...@hw.ac.uk wrote:

On 8 Oct 2014, at 13:31, Phillip Lord phillip.l...@newcastle.ac.uk wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: [snip] However, maybe the PLOS method can be improved to the point where the HTML is competitive with PDF. Indeed. For the moment, HTML views are about 1/5 of PDF. Partly this is because scientists are used to viewing in print format, I suspect, but partly not.

Or is that because they want to import it into their own reference management system, e.g. Mendeley, which does not support the HTML version?

Alasdair

[snip]

Phil

Alasdair J G Gray
Lecturer in Computer Science, Heriot-Watt University, UK.
Email: a.j.g.g...@hw.ac.uk
Web: http://www.alasdairjggray.co.uk
ORCID: http://orcid.org/0000-0002-5711-4872
Telephone: +44 131 451 3429
Twitter: @gray_alasdair
Re: scientific publishing process (was Re: Cost and access)
On 2014-10-08 15:14, Luca Matteis wrote:

Dear Sarven, I really appreciate the work that you're doing with trying to style an HTML page to look similar to the Latex templates. But there are so many typesetting details that are not available in browsers, which means you're going to do a lot of DOM hacking to be able to produce the same quality typography that Latex is capable of. Latex will justify text, automatically hyphenate, provide proper spacing, and other typesetting features. Not to mention kerning. Kerning is a *huge* thing in typography, and with HTML you're stuck with creating a DOM element for every single letter - yup, you heard me right. I think it would be super cool to create some sort of JavaScript framework that would enable the same level of typography that Latex is capable of, but you'll eventually hit some hard limitations and you'll probably be stuck drawing on a canvas. What are your ideas regarding these problems?

We do not have to have everything pixel perfect and comprehensive all up front. That is a common pitfall. Applying the Pareto principle is preferable.

LaTeX is great for what it is intended for! This was never in question. We are however looking at a bigger picture for Web Science communication and access. There will be far more concerns than the presentation layer alone.

As for your technical questions: we need to create issues or features, and more importantly, open discussions like in these threads, to better understand what the SW research community's needs are. So, please create an issue, because what you raise is important to be looked into further. I do not have all the technical answers, even though I am very close to the world of typeface, typography, and book design :) In any case, if it was possible in LaTeX, I hope it is not naive of me to say that it can be achieved (if not already) in HTML+CSS+JavaScript.

-Sarven http://csarven.ca/#i
Re: scientific publishing process (was Re: Cost and access)
On 10/8/14 10:18 AM, Sarven Capadisli wrote:

On 2014-10-08 15:14, Luca Matteis wrote: Dear Sarven, I really appreciate the work that you're doing with trying to style an HTML page to look similar to the Latex templates. But there are so many typesetting details that are not available in browsers, which means you're going to do a lot of DOM hacking to be able to produce the same quality typography that Latex is capable of. Latex will justify text, automatically hyphenate, provide proper spacing, and other typesetting features. Not to mention kerning. Kerning is a *huge* thing in typography, and with HTML you're stuck with creating a DOM element for every single letter - yup, you heard me right. I think it would be super cool to create some sort of JavaScript framework that would enable the same level of typography that Latex is capable of, but you'll eventually hit some hard limitations and you'll probably be stuck drawing on a canvas. What are your ideas regarding these problems?

We do not have to have everything pixel perfect and comprehensive all up front. That is a common pitfall. Applying the Pareto principle is preferable. LaTeX is great for what it is intended for! This was never in question. We are however looking at a bigger picture for Web Science communication and access. There will be far more concerns than the presentation layer alone. As for your technical questions: we need to create issues or features, and more importantly, open discussions like in these threads, to better understand what the SW research community's needs are. So, please create an issue, because what you raise is important to be looked into further. I do not have all the technical answers, even though I am very close to the world of typeface, typography, and book design :) In any case, if it was possible in LaTeX, I hope it is not naive of me to say that it can be achieved (if not already) in HTML+CSS+JavaScript.

-Sarven http://csarven.ca/#i

Sarven,

Linked Open Data dogfooding, re. issue tracking: i.e., a 5-Star Linked Open Data URI that identifies a Github issue tracker entry for Linked Research:

[1] http://linkeddata.uriburner.com/about/id/entity/https/github.com/csarven/linked-research/issues/4 -- Linked Open Data URI (basic entity description page)
[2] http://linkeddata.uriburner.com/c/8FDBH7 -- deeper follow-your-nose over relations and facets oriented entity description page
[3] http://bit.ly/vapor-report-on-linked-data-uri-that-identifies-a-github-issue-re-linked-research-data -- Vapor Report (re. Linked Open Data principles adherence).

--
Regards,
Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
Reference management (was: Re: scientific publishing process (was Re: Cost and access))
On Oct 8, 2014 10:15 AM, Gray, Alasdair a.j.g.g...@hw.ac.uk wrote:

Or is that because they want to import it into their own reference management system, e.g. Mendeley, which does not support the HTML version?

1. It is quite easy to embed metadata in HTML pages in forms designed for accurate importing into reference managers (Hellman 2009). Mendeley has been known to have problems with imports in cases where a proxy server is involved. COinS does have the slight problem of being kind of based on top of OpenURL, which is made of lose (Hellman 2010), but is the current least bad solution.

2. There is ongoing work to create a decent ontology for better embedding. The BibEx work for schema.org is going in the right direction (BibEx 2014). The Library of Congress BIBFRAME effort (LC 2014) is going in the right direction iff the right direction is defined as straight off a cliff - see e.g. Spero (2013).

3. There is a good comparison of Docear, Mendeley, and Zotero available in Beel (2014), which is remarkably balanced given that he is the PI for Docear. He includes a link to an earlier post mocking several completely unbalanced comparison charts prepared by different vendors (he finishes by making a similar chart showing Docear is the only possible choice. Table snark FTW.) My personal favorite tool is BibDesk (2014), which is Mac and bibtex specific, but justifies this by using many Mac-specific capabilities. There is some support for integration into Word (Don't mention the Word. I mentioned it once but I think I got away with it.)

4. All of these tools could benefit from even simple subsumption reasoning (although vocabularies like the LCSH have errors that lead to amusing and frustrating results - everything about doorbells is also about mammals, eschatology, the soul, and psychotherapy (Spero 2008)). It is important to recognize the difference between a knowledge organization system, for describing intentional concepts, and a knowledge representation system, for describing a view of reality. Leonard Cohen via Elaine Svenonius authorizes laughing at people who confuse the two. http://ibiblio.org/ses/anyqs.jpg

5. Extended rants on misunderstandings of plausible Ontologies and ontologies of the Bibliographic Universe omitted (cough SKOS cough).

Simon

References

Beel, Joeran (2014). Comprehensive Comparison of Reference Managers: Mendeley vs. Zotero vs. Docear. Available at http://www.docear.org/2014/01/15/comprehensive-comparison-of-reference-managers-mendeley-vs-zotero-vs-docear/
BibDesk (2014). BibDesk wiki: Main Page. Available at http://sourceforge.net/p/bibdesk/wiki/Main_Page/
BibEx (2014). Schema Bib Extend Community Group Wiki: Main Page. Available at http://www.w3.org/community/schemabibex/wiki/index.php?title=Main_Page
Hellman, Eric (2009). OpenURL COinS: A convention to embed bibliographic metadata in HTML. Available at http://ocoins.info
Hellman, Eric (2010). It's cool to hate on OpenURL (was Re: Twitter Annotations). Available at https://listserv.nd.edu/cgi-bin/wa?A2=CODE4LIB;axd%2FoQ;201004291208400400 or https://www.mail-archive.com/code4lib@listserv.nd.edu/msg07857.html
LC (2014). BIBFRAME: Bibliographic Framework Initiative. Available at http://www.loc.gov/bibframe/
Spero, Simon (2008). LCSH is to Thesaurus as Doorbell is to Mammal: visualizing structural problems in the Library of Congress subject headings. In Proceedings of the 2008 International Conference on Dublin Core and Metadata Applications. DCMI. Available at http://iBiblio.org/ses/poster.pdf
Spero, Simon (2013). Prolegomena to any future metadata. Available at http://www.ibiblio.org/fred2.0/wordpress/?p=269
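[To make Simon's point 1 concrete: a COinS object is just an empty span whose title attribute carries an OpenURL ContextObject, which reference managers such as Zotero scan a page for. The bibliographic values below are invented for illustration.]

<span class="Z3988"
      title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.atitle=An%20Example%20Article&amp;rft.jtitle=Example%20Journal&amp;rft.au=Smith%2C%20Jane&amp;rft.date=2014"></span>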
Re: scientific publishing process (was Re: Cost and access)
On 2014-10-08 18:38, Kingsley Idehen wrote:

Sarven, Linked Open Data dogfooding, re. issue tracking: i.e., a 5-Star Linked Open Data URI that identifies a Github issue tracker entry for Linked Research: [1] http://linkeddata.uriburner.com/about/id/entity/https/github.com/csarven/linked-research/issues/4 -- Linked Open Data URI (basic entity description page) [2] http://linkeddata.uriburner.com/c/8FDBH7 -- deeper follow-your-nose over relations and facets oriented entity description page [3] http://bit.ly/vapor-report-on-linked-data-uri-that-identifies-a-github-issue-re-linked-research-data -- Vapor Report (re. Linked Open Data principles adherence).

It's pretty cool that you can grab stuff out of GitHub issues, even comments! Papers link to code and then to commits and issues. See also [1]. Even comments, e.g., [2]. Or even in the direction of paper comments, which can be integrated and picked right up from the page, e.g., [3]. Just need to add +/-1 buttons and triplify the review ;) With WebID+ACL, we have the rest.

Do I have write access (via WebID?) to something like [4]? E.g., deleting an older label or triple :)

[1] http://git2prov.org/
[2] https://linkeddata.uriburner.com/about/html/http/csarven.ca/call-for-linked-research
[3] https://linkeddata.uriburner.com/about/html/http/csarven.ca/sense-of-lsd-analysis%01comment_20140808164434
[4] http://linkeddata.uriburner.com/about/html/http://linkeddata.uriburner.com/about/id/entity/https/github.com/csarven/linked-research/issues/4

-Sarven
Re: scientific publishing process (was Re: Cost and access)
On 10/8/14 3:13 PM, Sarven Capadisli wrote:

On 2014-10-08 18:38, Kingsley Idehen wrote: Sarven, Linked Open Data dogfooding, re. issue tracking: i.e., a 5-Star Linked Open Data URI that identifies a Github issue tracker entry for Linked Research: [1] http://linkeddata.uriburner.com/about/id/entity/https/github.com/csarven/linked-research/issues/4 -- Linked Open Data URI (basic entity description page) [2] http://linkeddata.uriburner.com/c/8FDBH7 -- deeper follow-your-nose over relations and facets oriented entity description page [3] http://bit.ly/vapor-report-on-linked-data-uri-that-identifies-a-github-issue-re-linked-research-data -- Vapor Report (re. Linked Open Data principles adherence).

It's pretty cool that you can grab stuff out of GitHub issues, even comments! Papers link to code and then to commits and issues. See also [1]. Even comments, e.g., [2]. Or even in the direction of paper comments, which can be integrated and picked right up from the page, e.g., [3]. Just need to add +/-1 buttons and triplify the review ;) With WebID+ACL, we have the rest. Do I have write access (via WebID?) to something like [4]? E.g., deleting an older label or triple :)

[1] http://git2prov.org/
[2] https://linkeddata.uriburner.com/about/html/http/csarven.ca/call-for-linked-research
[3] https://linkeddata.uriburner.com/about/html/http/csarven.ca/sense-of-lsd-analysis%01comment_20140808164434
[4] http://linkeddata.uriburner.com/about/html/http://linkeddata.uriburner.com/about/id/entity/https/github.com/csarven/linked-research/issues/4

-Sarven

Yes, there are WebID+TLS and/or NetID+TLS based ACLs [1][2][3] in place. In addition, you can always make a full TURTLE doc in some data space, or embed your TURTLE in any text slot (e.g., comments or description fields) provided by a Web app/service using Nanotation [4], and you are set re. payload for upload into URIBurner.

Basically, you have the following RWW options:

1. Append RDF statements to the existing RDF document (named graph) identified by IRI http://csarven.ca/sense-of-lsd-analysis -- all you do is refresh the URIBurner URI as data changes in github (?sponger:get=add at the end of a URIBurner URI has this effect).

2. Overwrite statements in the existing RDF document (named graph) -- simply add ?@Lookup@=refresh=clean to the end of the URIBurner URI, for this effect.

Of course there's lots more, but I'll let this flow one step at a time :-)

Links:

[1] http://bit.ly/enterprise-identity-management-and-attribute-based-access-controls
[2] http://www.slideshare.net/kidehen/how-virtuoso-enables-attributed-based-access-controls/34 -- WebID-TLS (authenticates WebIDs)
[3] http://www.slideshare.net/kidehen/how-virtuoso-enables-attributed-based-access-controls/40 -- NetID-TLS (authenticates LinkedIn, Facebook, Twitter, G+, Amazon, Dropbox, and many other identities)
[4] http://bit.ly/blog-post-about-nanotation -- Nanotation (this SHOULD work wherever you're able to input plain text).

--
Regards,
Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
Re: scientific publishing process (was Re: Cost and access)
* Luca Matteis lmatt...@gmail.com [2014-10-07 00:41+0200]

Sorry to jump into this once again, but when it comes to typesetting nothing really comes close to Latex/PDF: http://tex.stackexchange.com/questions/120271/alternatives-to-latex - not even HTML/CSS/JavaScript

Making a floating model look like Latex/PDF at all resolutions seems impossible. Perhaps targeting a fixed (A4 or 8½×11 @300dpi) resolution is quite doable. Doing so allows one to use fixed position for all CSS directives. But Eric, that sucks!! Well, sort of, because we can't conveniently read it on a phone and it doesn't fill large displays, but that may be a small price to pay to be able to use all of the rich markup that we wax poetic about on this list. If it does work, then we can figure out ways to script it so it has a simply-controlled, predictable behavior at a certain resolution but is reasonable at arbitrary resolutions.

On Tue, Oct 7, 2014 at 12:18 AM, Norman Gray nor...@astro.gla.ac.uk wrote:

Greetings.

On 2014 Oct 6, at 19:19, Alexander Garcia Castro alexgarc...@gmail.com wrote:

querying PDFs is NOT simple and requires a lot of work -and usually produces lots of errors. just querying metadata is not enough. As I said before, I understand the PDF as something that gives me a uniform layout. that is ok and necessary, but not enough or sufficient within the context of the web of data and scientific publications. I would like to have the content readily available for mining purposes. if I pay for the publication I should get access to the publication in every format it is available. the content should be presented in a way so that it makes sense within the web of data. if it is the full content of the paper represented in RDF or XML fine. also, I would like to have well annotated content, this is simple and something that could quite easily be part of existing publication workflows. it may also be part of the guidelines for authors -for instance, identify and annotate rhetorical structures.

The following might add something to this conversation. It illustrates getting the metadata from a LaTeX file, putting it into an XMP packet in a PDF, and getting it out of the PDF as RDF. Pace Peter's mention of /Author, /Title, etc, this just focuses on the XMP packet. This has the document metadata, the abstract, and an illustrative bit of argumentation. Adding details about the document structure, and (RDF) pointers to any figures, would be feasible, as would, I suspect, incorporating CSV files directly into the PDF. Incorporating \begin{tabular} tables would be rather tricky, but not impossible. I can't help feeling that the XHTML+RDFa equivalent would be longer and need more documentation to instruct the author where to put the RDFa magic.

It's not very fancy, and still has rough edges, but it only took me 100 minutes, from a standing start. Generating and querying this PDF seems pretty simple to me.

$ cat test-xmp.tex
\documentclass{article}
\usepackage{xmp-management}

\title{This is a test file}
\author{Norman Gray}
\date{2014 October 6}

\begin{document}
\maketitle

\abstract{It's easy to include metadata in \LaTeX\ files.
That's because there's plenty of metadata in there already.}

There is text and metatext within files.

\section{Further details}

In this section we could potentially discuss moving information around.
I think we can assert that \claim{it is easy to move information
around}, and, further, that \claim{making metadata readily available is
a Good Thing}.

I hope that clears that up.
\end{document}

$ cat xmp-management.sty
\ProvidesPackage{xmp-management}[2014/10/06]
\newwrite\xmp@ttlfile
\def\xmp@open{\immediate\openout\xmp@ttlfile \jobname.ttl \let\xmp@open\relax}
\long\def\xmp@stmt#1#2{%
  \xmp@open
  \write\xmp@ttlfile{ #1 #2.}}
\let\xmp@origtitle\title
\def\title#1{\xmp@stmt{dc:title}{#1}\xmp@origtitle{#1}}
\let\xmp@origauthor\author
\def\author#1{\xmp@stmt{dc:creator}{#1}\xmp@origauthor{#1}}
\let\xmp@origdate\date
\def\date#1{\xmp@stmt{dc:created}{#1}\xmp@origdate{#1}}
\long\def\abstract#1{
  \xmp@stmt{dc:abstract}{#1}
  \begin{quotation}\textbf{Abstract:} #1\end{quotation}}
\def\claim#1{
  \xmp@stmt{xmpinfo:claim}{#1}
  \emph{#1}}
\let\xmp@origsection\section
\def\section#1{\xmp@stmt{xmpinfo:has_section}{#1} \xmp@origsection{#1}}
\usepackage{xmpincl}
\AtBeginDocument{\includexmp{info}}

$ pdflatex test-xmp
This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012)
 restricted \write18 enabled.
entering extended mode
(./test-xmp.tex
LaTeX2e <2011/06/27>
[...BLAH...]
Output written on test-xmp.pdf (1 page, 75667 bytes).
Transcript written on test-xmp.log.

$ cat test-xmp.ttl
 dc:title This is a test file.
 dc:creator Norman Gray.
 dc:created 2014 October 6.
Re: scientific publishing process (was Re: Cost and access)
+1. This is precisely one of the main ideas we pursued in Wf4Ever. The paper in whatever format is not enough; you also need to preserve the methods and their implementation, including the workflows and the datasets, not only for validation and reproducibility purposes in the face of publication but ultimately for incremental reuse and scientific development. Publications indeed shouldn't be seen as a static piece of paper but rather as a (linked) piece of knowledge which can be revised and evolve in time. So, tooling is required that supports the management of the lifecycle of such knowledge, from creation of specific research objects to reuse, including ways to deal with decay, and exploration and inspection capabilities.

In this direction, we took incremental steps through actual deployments of project outcomes in the previously mentioned platforms. Furthermore, we also integrated almost the whole set of functionalities into the ROHub.org platform, which was demonstrated in the last Semantic Publishing Challenge in ESWC [1,2] as a step forward in the direction you mention. To me it would make absolute sense to see further community pull of this kind of tooling, starting with their utilization in the conferences and journals of our own field (ESWC, ISWC, etc.) in order to incubate, gain traction, and draw conclusions that we could generalize to other domains. If this sounds appealing to the folks in this list, please let me know.

Cheers,
Jose

[1] http://2014.eswc-conferences.org/sites/default/files/eswc2014-challenges_spc_submission_3.pdf
[2] http://2014.eswc-conferences.org/program/semwebeval

On 04/10/2014 13:14, Hugh Glaser wrote:

(c) Workflows and Datasets. I have mentioned http://www.myexperiment.org before, but can't remember if I have mentioned http://www.wf4ever-project.org Again, these are Linked Data platforms for publishing; in this case workflows and datasets etc. They are seriously mature, certainly compared with what we might build - see, for example https://github.com/wf4ever/ro And exactly the same as the Repositories. What would be wrong with bringing up such a repository for SemWeb/Web conferences, one for all, or for each or series? …ditto… Who knows, maybe the Crawl, as well as the Challenge entries, might be able to usefully describe what they did using these ontologies etc.? Please, please, let's not build anything ourselves - if we are to do anything, then let's choose and join suitable existing activity and make it better for everyone.

--
Dr. Jose Manuel Gomez-Perez
Director R&D
jmgo...@isoco.com
T +34913349797 M +34609077103
Avda. del Partenón 10, Planta 1, Oficina 1.3A, Campo de las Naciones, 28042 Madrid, Spain
iSOCO enabling the networked economy
www.isoco.com
Re: scientific publishing process (was Re: Cost and access)
Kingsley and all, hello.

On 2014 Oct 7, at 02:18, Kingsley Idehen kide...@openlinksw.com wrote:

On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote: On 10/06/2014 11:03 AM, Kingsley Idehen wrote: On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the Huh? Every single PDF reader that I use can extract the PDF metadata and display it. Again, this isn't about metadata.

With all respect to the larger goal of having fully semanticked-up documents, I think the question _is_ all about metadata.

The original spark to the thread was a lament that SW and LD conferences don't mandate something XMLish for submissions, because X(HT)ML is clearly better for... well... dammit, it's Better. _One_ thing it would be better for is supporting the sort of full-scale RDF-everything view that you've described so eloquently. But if that's your goal, then lexing the source text is really going to be the least of your problems.

A more modest goal, which is still valuable and _much_ more achievable, is to get at least some RDF out of submitted articles. That practically means metadata, plus perhaps some document structure, plus, if you're keen and can get the authors to invest their effort, some argumentation. That's available for free (and right now) from LaTeX authors, and available from XHTML authors depending on how hard it would be to get them to put the @profile attribute in the right places.

So no, not just about 'metadata' in the narrow sense, but I think this thread is about what RDF you can in practice extract from the materials that authors can in practice be induced or obliged to submit to conference proceedings.

That original lament has overlapped with a parallel lament that PDF is a dead-end format -- it's not 'webby'. I believe that the demo in my earlier message undermines that claim as far as RDF goes.

1. The extractors are platform specific -- AWWW is about platform agnosticism (I don't want to mandate an OS for experiencing the power of Linked Open Data transformers / rdfizers)

Well, the extractors would be specific to PDF, but that's hardly surprising, I think. [I've lost track of whose comment this is...] The extractor I demoed wasn't PDF-specific.

We want to leverage the productivity and simplicity that AWWW brings to data representation, access, interaction, and integration.

Sure, but the additional costs, if any, on paper authors, reviewers, and readers have to be considered. If these costs are eliminated or at least minimized then this good is much more likely to be realized. With some help from Adobe we can have the best of all worlds here. I am going to take a look at their latest cloud offerings and associated APIs.

I forgot to attach the extractor I wrote -- done. The demo didn't use any Adobe API, neither to put the XMP into the PDF nor to extract the RDF from it.

All the best,

Norman

--
Norman Gray : http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
Re: scientific publishing process (was Re: Cost and access)
Eric, hello.

This is a bit of a side-issue, but...

On 2014 Oct 7, at 07:13, Eric Prud'hommeaux e...@w3.org wrote:

* Luca Matteis lmatt...@gmail.com [2014-10-07 00:41+0200] Sorry to jump into this once again, but when it comes to typesetting nothing really comes close to Latex/PDF: http://tex.stackexchange.com/questions/120271/alternatives-to-latex - not even HTML/CSS/JavaScript

Making a floating model look like Latex/PDF at all resolutions seems impossible. Perhaps targeting a fixed (A4 or 8½×11 @300dpi) resolution is quite doable.

This isn't as hard as you might think (if I'm understanding you correctly). At http://purl.org/nxg/text/general-relativity I have some lecture notes. The downloads there include:

http://www.astro.gla.ac.uk/users/norman/lectures/GR/part2.pdf
http://www.astro.gla.ac.uk/users/norman/lectures/GR/part2-usletter.pdf
http://www.astro.gla.ac.uk/users/norman/lectures/GR/part2-screen.pdf

Those come from the _same_ source file with different \documentclass options (I keep meaning to do something about the marginal notes in the screen version, but have never got around to it). There's no resolution/DPI problem, because these are all vector fonts, not bitmaps. There should be no 'missing font' problem because the fonts are automatically embedded properly (the maths font in those documents is a commercial one, so it's unlikely to be on your computer).

This won't dynamically reflow, it's true (and that's a pity), but if I ever get a tablet computer, I doubt I'll be able to resist producing versions in a layout which is targeted at that size of screen.

All the best,

Norman

--
Norman Gray : http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
Re: scientific publishing process (was Re: Cost and access)
On 10/7/14 5:39 AM, Norman Gray wrote:

Kingsley and all, hello. On 2014 Oct 7, at 02:18, Kingsley Idehen kide...@openlinksw.com wrote: On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote: On 10/06/2014 11:03 AM, Kingsley Idehen wrote: On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the Huh? Every single PDF reader that I use can extract the PDF metadata and display it. Again, this isn't about metadata.

With all respect to the larger goal of having fully semanticked-up documents, I think the question _is_ all about metadata.

It can't be. The metadata focus is a subtle misconception. We need access to all of the data in the document.

The original spark to the thread was a lament that SW and LD conferences don't mandate something XMLish for submissions because X(HT)ML is clearly better for... well... dammit, it's Better.

The initial gripe (as I've always seen it) is that we are trying to tell the world about Linked Open Data virtues while rarely putting them to use (instinctively) ourselves. It just so happens that conferences provide an example that most have experienced in some capacity.

_One_ thing it would be better for is supporting the sort of full-scale RDF-everything view that you've described so eloquently. But if that's your goal, then lexing the source text is really going to be the least of your problems. A more modest goal, which is still valuable and _much_ more achievable, is to get at least some RDF out of submitted articles.

Yes, or just make references to RDF sources relevant to the paper, but on the basis that those references (to the degree possible) resolve. This is also about the data represented in tabular form (as tables) and the data behind the tables, so to speak.

That practically means metadata, plus perhaps some document structure, plus, if you're keen and can get the authors to invest their effort, some argumentation. That's available for free (and right now) from LaTeX authors, and available from XHTML authors depending on how hard it would be to get them to put the @profile attribute in the right places. So no, not just about 'metadata' in the narrow sense, but I think this thread is about what RDF you can in practice extract from the materials that authors can in practice be induced or obliged to submit to conference proceedings.

For those conferences associated with themes such as Linked Open Data and the Semantic Web, RDF should be the norm for structured data representation. If that isn't possible, then what are we saying to the world about RDF, in regards to structured data representation and data de-silo-fication?

That original lament has overlapped with a parallel lament that PDF is a dead-end format -- it's not 'webby'.

They are linked :-)

I believe that the demo in my earlier message undermines that claim as far as RDF goes. 1. The extractors are platform specific -- AWWW is about platform agnosticism (I don't want to mandate an OS for experiencing the power of Linked Open Data transformers / rdfizers) Well, the extractors would be specific to PDF, but that's hardly surprising, I think. [I've lost track of whose comment this is...] The extractor I demoed wasn't PDF-specific.

Platform in the context of my comments really relates to operating systems, i.e., most PDF extractors are operating system specific. That's why I mentioned the massive opportunity for Adobe (and 3rd parties too, as Mike Bergman added) in regards to providing Web Services for accessing and indexing PDF document content.

We want to leverage the productivity and simplicity that AWWW brings to data representation, access, interaction, and integration. Sure, but the additional costs, if any, on paper authors, reviewers, and readers have to be considered. If these costs are eliminated or at least minimized then this good is much more likely to be realized. With some help from Adobe we can have the best of all worlds here. I am going to take a look at their latest cloud offerings and associated APIs. I forgot to attach the extractor I wrote -- done. The demo didn't use any Adobe API, neither to put the XMP into the PDF nor to extract the RDF from it.

You forgot the extractor demo link :)

All the best, Norman

--
Regards,
Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
Re: scientific publishing process (was Re: Cost and access)
The stack exchange discussion mostly talks about the user side of things. Go back (quite) a few years, and using PDF from tex was a pain, pretty much up until pdflatex became the norm.

For those who think that latex is still the best, I do not see that an HTML-centric publishing framework should be a barrier. If the majority of papers were being produced from Word, then it might be more of an issue.

Phil

Luca Matteis lmatt...@gmail.com writes:

Sorry to jump into this once again, but when it comes to typesetting nothing really comes close to Latex/PDF: http://tex.stackexchange.com/questions/120271/alternatives-to-latex - not even HTML/CSS/JavaScript
Re: scientific publishing process (was Re: Cost and access)
Norman Gray nor...@astro.gla.ac.uk writes:

This won't dynamically reflow, it's true (and that's a pity), but if I ever get a tablet computer, I doubt I'll be able to resist producing versions in a layout which is targeted at that size of screen.

Sure, that's fine. But why not have a version which behaves reasonably at all screen sizes? This should be achievable.

Phil
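[A sketch of what 'reasonable at all screen sizes' can look like in practice, assuming a single-column article; the breakpoint and sizes are arbitrary illustrations, not a recommendation:]

body { max-width: 40em; margin: 0 auto; padding: 0 1em; }  /* readable measure on large screens */
img  { max-width: 100%; height: auto; }                    /* figures shrink with the viewport */

@media (max-width: 30em) {
  body { font-size: 95%; }                                 /* slightly tighter on phones */
}

@media print {
  body { max-width: none; font-size: 10pt; }               /* hand layout back to the printer */
}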
Re: scientific publishing process (was Re: Cost and access)
On 2014-10-07 11:39, Norman Gray wrote:

The original spark to the thread was a lament that SW and LD conferences don't mandate something XMLish for submissions because X(HT)ML is clearly better for... well... dammit, it's Better.

Straw man argument. Please stop that now! I will spell out the main proposal and purpose for you, because it sounds like you are completely oblivious to them. Let me know if anything is unclear.

* Conferences on SW/LD research should encourage and allow submissions using the Web-native technology stack (e.g., starting from HTML and friends, for instance) alongside the existing requirements. As the required submission in PDF can be generated via HTML+CSS, those that wish to arrive at the PDF by their own means can still do so, meanwhile without asking or forcing the existing authorship or review process to change. It is backwards compatible. The underlying idea is to use our own technologies, not only for the sake of using them, but also to identify the pains as a precursor to raising the quality of the (Semantic) Web stack for scientific research publishing, discovery, and reuse. This is plain and simple dogfooding, and it is important.

* There is an opportunity for granular data discovery, reuse, and machines to aid in reproducibility of scientific research. This goes completely beyond off-the-shelf metadata, e.g., author, title, subject, or what you can stuff into LaTeX+Whatever, not to mention mangling around what's primarily intended for desktop and print to squeeze in some Web in there. We are talking about making reasonable strides towards having scientific knowledge that is universally accessible on the Web. PDF and friends do not fit into that equation that well; however, no one is blocked from doing what they already do. Some of us would like to do a bit more than that, to test things out so that we can collectively have more wins.

* There is also an opportunity to attract more funding and interest groups, if we can better assess the state of Web Science. This is simply due to the fact that we would be able to mine more useful information from existing research. Moreover, we can identify research areas of potential value better. It is to elevate the support that we can get from machines to excel and to do our work better. This is in contrast to what we can currently achieve with the existing workflow, i.e., the current process is only concerned about making it easy for the author, reviewer, and publisher, and not about gleaning high-fidelity information.

A more modest goal, which is still valuable and _much_ more achievable, is to get at least some RDF out of submitted articles. That practically means metadata, plus perhaps some document structure, plus, if you're keen and can get the authors to invest their effort, some argumentation. That's available for free (and right now) from LaTeX authors, and available from XHTML authors depending on how hard it would be to get them to put the @profile attribute in the right places. That original lament has overlapped with a parallel lament that PDF is a dead-end format -- it's not 'webby'. I believe that the demo in my earlier message undermines that claim as far as RDF goes.

Let me get this right: you are advocating that LaTeX + RDF/XML + whatever processes one has to go through is a more sensible approach than HTML? If so, we have a different view on what creates a good UX.

It may come as news to you, but the SW/LD community is not in favour of authors using RDF/XML unless it is completely within some tool-chain left for machines to deal with. There are alternative RDF notations which are preferable. You should look it up. The problem with your proposal is that the author has to boggle their mind with two completely different syntaxes (LaTeX and RDF/XML), whereas the original proposal was to deal with one, i.e., HTML.

Styling is no more of an issue: the templates, as in the case of LaTeX, are provided, and for HTML, I've made a modest PoC with: https://github.com/csarven/linked-research However, you are somehow completely oblivious to that, even though it was mentioned several times now on this mailing list. No, it is not perfect, and yes, it can be better. There are alternative solutions to achieve something along those lines with the same vision in mind, which are all okay too.

If this is not about coding, but rather using WYSIWYG editors or authoring/publication tools, have a look and try a few here or from a service near you:

* http://en.wikipedia.org/wiki/Comparison_of_HTML_editors
* http://en.wikipedia.org/wiki/List_of_content_management_systems

Or, you know, take 30 seconds to create a WordPress account and another 30 seconds to publish. Let me know if you still think that's insufficient or completely unreasonable / difficult for Web Science people to handle.

So, *do as you like, but do not prevent me* from encouraging the SW/LD
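[To make the 'one syntax' point concrete: the same HTML the author writes anyway can carry the RDF, with no separate RDF/XML step. A hedged sketch using RDFa Lite attributes and Dublin Core terms, reusing the content of Norman's earlier demo; the vocabulary choices are illustrative, not a prescription:]

<article vocab="http://purl.org/dc/terms/">
  <h1 property="title">This is a test file</h1>
  <p>By <span property="creator">Norman Gray</span>,
     <time property="created">2014-10-06</time></p>
  <section property="abstract">
    <p>It's easy to include metadata in HTML files. That's because
    there's plenty of structure in there already.</p>
  </section>
</article>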
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes:

tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which do default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil

So someone has to package this up so that it can be easily used. Before then, how can it be required for conferences?

http://svn.gnu.org.ua/sources/tex4ht/trunk/bin/ht/unix/xhmlatex

I have tex4ht installed, but there is no xhmlatex file to be found. I managed to find what appears to be a good command line

I don't know why that would be. It is installed with the Debian package, although, as I said, it is not in the system path. I found it with dpkg -S. Am afraid it's a long time since I used an RPM-based system, so I can't remember how to do this on Fedora.

htlatex schema-org-analysis.tex xhtml,mathml -cunihtf -cvalidate

This looks better when viewed, but the resultant HTML is unintelligible. There is definitely more work needed here before this can be considered as a potential solution.

Yes, I agree. So, the question is how to enable this. One way would, for example, be for ISWC and ESWC to accept HTML and have a prize for the best semantic paper submitted. Then people with the inclination would do the work. Again, I suspect it's not that much, but we will not know until we try.

Phil
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes:

On 10/06/2014 11:00 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they?

For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs.

Really? So, for example, you think that a reviewer with impaired vision should, for example, be forced to review a paper using the authors' rendering, regardless of whether they can read it or not?

No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up.

It *is* exactly what you are talking about.

Well, maybe I was not being clear, but I thought that I was talking about rendering changes interfering with comprehension of the authors' intent.

And if only you had a definition of rendering changes that interfere with authors' intent, as opposed to just rendering changes. I can guarantee that rendering a paper to speech WILL change at least some of the authors' intent because, for example, figures will not reproduce. You state that this should be avoided at all costs. I think this is wrong. There are many reasons to change rendering. That should be the reader's choice.

Phil
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: So, you believe that there is an excellent set of tools for preparing, reviewing, and reading scientific publishing. Package them up and make them widely available. If they are good, people will use them. Convince those who run conferences. If these people are convinced, then they will allow their use in conferences or maybe even require their use. Is that not the point of the discussion? Unfortunately, we do not know why ISWC and ESWC insist on PDF. I'm not convinced by what I'm seeing right now, however. Sure, but at least the discussion has meant that you have looked at some of the tools again. That's no bad thing. My question would be, are you more convinced than you were the last time you looked, or less? Phil
Re: scientific publishing process (was Re: Cost and access)
What I'd suggest for conference organisers is something like the following: 1. Keep the PDF as the main thing, as it's not going anywhere soon. 2. Also allow submission in some alternative form, including semantic content, and have the conference run a competition for alternative publishing forms - including voting by delegates on what they like and what they want. This could promote such alternative forms and offer a migration route over time. Robert. On 07/10/2014 13:27, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: So, you believe that there is an excellent set of tools for preparing, reviewing, and reading scientific publishing. Package them up and make them widely available. If they are good, people will use them. Convince those who run conferences. If these people are convinced, then they will allow their use in conferences or maybe even require their use. Is that not the point of the discussion? Unfortunately, we do not know why ISWC and ESWC insist on PDF. I'm not convinced by what I'm seeing right now, however. Sure, but at least the discussion has meant that you have looked at some of the tools again. That's no bad thing. My question would be, are you more convinced than you were the last time you looked, or less? Phil -- Professor Robert Stevens Bio-health Informatics Group School of Computer Science University of Manchester Oxford Road Manchester United Kingdom M13 9PL robert.stev...@manchester.ac.uk Tel: +44 (0) 161 275 6251 Blog: http://robertdavidstevens.wordpress.com Web: http://staff.cs.manchester.ac.uk/~stevensr/ KBO
Re: scientific publishing process (was Re: Cost and access)
If you mean that published papers have to be in PDF, but that they can optionally have a second format, then I have no problem with this proposal. I also have no problem with encouraging use of other formats. However, this is an added burden on conference organizers. Someone would have to volunteer to handle the extra work, particularly the work involved in checking that papers using the second format abide by the publishing requirements. peter On 10/07/2014 05:52 AM, Robert Stevens wrote: What I'd suggest for conference organisers is something like the following: 1. Keep the PDF as the main thing, as it's not going anywhere soon. 2. Also allow submission in some alternative form, including semantic content, and have the conference run a competition for alternative publishing forms - including voting by delegates on what they like and what they want. This could promote such alternative forms and offer a migration route over time. Robert. On 07/10/2014 13:27, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: So, you believe that there is an excellent set of tools for preparing, reviewing, and reading scientific publishing. Package them up and make them widely available. If they are good, people will use them. Convince those who run conferences. If these people are convinced, then they will allow their use in conferences or maybe even require their use. Is that not the point of the discussion? Unfortunately, we do not know why ISWC and ESWC insist on PDF. I'm not convinced by what I'm seeing right now, however. Sure, but at least the discussion has meant that you have looked at some of the tools again. That's no bad thing. My question would be, are you more convinced than you were the last time you looked, or less? Phil
Re: scientific publishing process (was Re: Cost and access)
On 10/07/2014 05:27 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: So, you believe that there is an excellent set of tools for preparing, reviewing, and reading scientific publishing. Package them up and make them widely available. If they are good, people will use them. Convince those who run conferences. If these people are convinced, then they will allow their use in conferences or maybe even require their use. Is that not the point of the discussion? Not at all. Where was the proposal to put together something that met the requirements of preparing, reviewing, and publishing scientific papers? To me, the initial discussion was about how much better HTML was for carrying data. Other aspects of paper preparation, review, and publishing were not being considered. Now, maybe, aspects of presentation and review and ease of use are part of the discussion. A change in the paper submission process needs to take into account what the paper submission process is about, not just some aspect of what might be included in submitted papers. Unfortunately, we do not know why ISWC and ESWC insist on PDF. As far as I am concerned, ISWC and ESWC insist on PDF for submissions because the reviewing process is so much better with PDF than with anything else. I'm not convinced by what I'm seeing right now, however. Sure, but at least the discussion has meant that you have looked at some of the tools again. That's no bad thing. My question would be, are you more convinced than you were the last time you looked, or less? Well, I remain totally unconvinced that any current HTML solution is as good as the current PDF setup. Certainly htlatex is not suitable. There may be some way to get tex4ht to do better, but no one has provided a solution. Sarven Capadisli sent me some HTML that looks much better, but even on a math-light paper I could see a number of glitches. I haven't seen anything better than that. It's not as if the basics (MathML, CSS, etc.) are unavailable to put together most, or maybe even all, of an HTML-based solution. These basics have been around for some time now. However, I haven't seen a setup that is as good as LaTeX and PDF for preparation, review, and publishing of scientific papers. Yes, it took a lot of effort to get to the current state with respect to LaTeX and PDF. In the past, I experienced quite a number of problems with using LaTeX and PDF for writing, reviewing, and publishing scientific papers, but most of these are in the past. Yes, there are still some problems with using LaTeX and PDF. Produce something better and people will use it, eventually. Phil peter
Re: scientific publishing process (was Re: Cost and access)
On 10/07/2014 05:23 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 11:00 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having a different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So you think that a reviewer with impaired vision should, for example, be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. It *is* exactly what you are talking about. Well, maybe I was not being clear, but I thought that I was talking about rendering changes interfering with comprehension of the authors' intent. And if only you had a definition of rendering changes that interfere with the authors' intent, as opposed to just rendering changes. I can guarantee that rendering a paper to speech WILL change at least some of the authors' intent because, for example, figures will not reproduce. You state that this should be avoided at all costs. I think this is wrong. There are many reasons to change rendering. That should be the reader's choice. Phil I think that for reviewing the authors should be able to dictate how their submission looks, within the bounds of the submission requirements. If the reviewer wants, or needs, to change the way a submission is presented then it is up to the reviewer to ensure that their review is not coloured by this change. When I review papers I routinely point out presentation problems. Sometimes I take into account presentation problems when I evaluate papers. However, I try very hard to evaluate the submission based on what the authors submitted, not on any changes that I made to the submission. For example, I will point out problems with using colours in graphs, but I will evaluate the paper based on the coloured version of the graphs, not a black and white version. However, if the authors submitted low-resolution figures and something is missing because of this, then I feel free to take this into account in my evaluation. In a situation where I do not know what presentation the authors wanted, for example if explicit line breaks and indentation are sometimes preserved, but not always, the evaluation of submissions can become very much harder. peter
Re: scientific publishing process (was Re: Cost and access)
On 10/07/2014 05:20 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which apply default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil So someone has to package this up so that it can be easily used. Before then, how can it be required for conferences? http://svn.gnu.org.ua/sources/tex4ht/trunk/bin/ht/unix/xhmlatex Somehow this is not in my tex4ht package. In any case, the HTML output it produces is dreadful. Text characters, even outside math, are replaced by numeric XML character entity references. peter
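Illustratively (this is a mock-up, not captured tex4ht output), "numeric XML character entity references" means source of this shape, where even ordinary text arrives encoded:

  <p>&#x54;&#x68;&#x65; &#x71;&#x75;&#x69;&#x63;&#x6B; brown fox</p>

A browser renders that as "The quick brown fox", but the source is needlessly hostile to anyone reading, diffing, or post-processing it.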
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which apply default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil So someone has to package this up so that it can be easily used. Before then, how can it be required for conferences? http://svn.gnu.org.ua/sources/tex4ht/trunk/bin/ht/unix/xhmlatex Somehow this is not in my tex4ht package. In any case, the HTML output it produces is dreadful. Text characters, even outside math, are replaced by numeric XML character entity references. So, I am willing to spend some time getting this to work. I would like to plug some ESWC papers into tex4ht, to get some HTML which works plain and also with Sarven's templates so that it *looks* like a PDF. Would you be willing to a) try it and b) give worked and short test cases for things that do not work? Phil
Re: scientific publishing process (was Re: Cost and access)
Hi John, Kingsley, et al, On Mon, Oct 6, 2014 at 8:39 AM, John Erickson olyerick...@gmail.com wrote: This is an incredibly rich and interesting conversation. I think there are two separate themes: 1. What is required and/or asked-for by the conference organizers... a. ...that is needed for the review process b. ...that is needed to implement value-added services for the conference c. ...that contributes to the body of work 2. What is required and/or asked for by the publisher? All of (1) is about the meat of the contributions, including establishing a long-term legacy. (2) is about (presumably) prestigious output. What added services could esp. Easychair provide that would go beyond 1.a. and contribute to 1.b. and 1.c., etc.? Are there any Easychair committers watching this thread? ;) John -- John S. Erickson, Ph.D. Deputy Director, Web Science Research Center Tetherless World Constellation (RPI) http://tw.rpi.edu olyerick...@gmail.com Twitter / Skype: olyerickson This makes me think of PLoS. For example, PLoS has published format guidelines using Word and LaTeX (http://www.plosone.org/static/guidelines), a workflow for semantically structuring the resulting output, and their final output is well structured and available in XML based on a known standard (http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd), PDF, and the published HTML on their website (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011233). This results in semantically meaningful XML that is transformed to HTML: http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0011233&representation=XML Interestingly as well, they have provided this framework in an open source form: http://www.ambraproject.org/ Clearly the publication process can support a semantic solution when it's in the best interest of the publisher: they will adopt and drive their own markup processes to meet external demand. Providing tools that both the publisher and the author may use independently could simplify such an effort, but is not a main driver in achieving that final result you see in PLoS. This is especially the case given even the debate concerning file formats here. For PLoS, the solution that is currently successful is the one that worked to solve today's immediate local need with today's tools. Cheers, Mark p.s. Finally, on the reference of moving repositories such as EPrints and DSpace towards supporting semantic markup of their contents. Being somewhat of a participant in LoD on the DSpace side, I note that these efforts are inherently just Repository Centric, describing the structure of the repository (i.e., Collections of Items), not the semantic structure contained within the Item contents (articles, citations, formulas, data tables, figures, ideas). In both platforms, these capabilities are in their infancy; lacking any rendering other than offering the original file for download, they ultimately suffer from the absence of semantic structure in the content going into them. -- Mark R. Diggory
Re: scientific publishing process (was Re: Cost and access)
Sure, I have lots of papers (none for ESWC, though) that could serve as test cases. peter On 10/07/2014 07:49 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which apply default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil So someone has to package this up so that it can be easily used. Before then, how can it be required for conferences? http://svn.gnu.org.ua/sources/tex4ht/trunk/bin/ht/unix/xhmlatex Somehow this is not in my tex4ht package. In any case, the HTML output it produces is dreadful. Text characters, even outside math, are replaced by numeric XML character entity references. So, I am willing to spend some time getting this to work. I would like to plug some ESWC papers into tex4ht, to get some HTML which works plain and also with Sarven's templates so that it *looks* like a PDF. Would you be willing to a) try it and b) give worked and short test cases for things that do not work? Phil
Re: scientific publishing process (was Re: Cost and access)
BLUF: This is where information science comes in. Technology must meet the needs of real users. It may be better to generate better Tagged PDFs, and to experiment, using some existing methodology annotation ontologies, with generating auxiliary files of triples. This might require new/changed latex packages, new div/span classes, etc. \huge But what is really needed is actually working with SMEs to discover the cultural practices within the field and subfield, and developing systems that support their work styles. This is why Information Science is important. If there are changes in practices that would be beneficial, and these benefits can be demonstrated to the appropriate audiences, then these can be suggested. If existing programs, libraries, and operating systems can be modified to provide these wins transparently, then it is easier to get the changes adopted. If the benefits require additional work, then the additional work must give proportionate benefits to those doing the work, or be both of great benefit to funding agencies or other gatekeepers, *and* be easily verifiable. An example might be a proof (or justified belief) that a paper and its supplemental materials do, or do not, contain everything required to attempt to replicate the results. This might be feasible in many fields through a combination of annotation with a sufficiently powerful KR language and reasoning system. Similarly, relatively simple meta-statistical analysis can note common errors (like multiple comparisons that do not correct for False Discovery Rate). This can be easy if the analysis code is embedded in the paper (e.g. Sweave), or if the adjustment method is part of the annotation, and the decision process need not be total. This kind of validation can be useful to researchers (less embarrassment), and useful to gatekeepers (less to manually review). Convincing communities working with large datasets to use RDF as a native data format is unlikely to work. The primary problem is that it isn't a very good one. It's great for combining data from multiple sources - as long as every datum is true. If you want to be less credulous, KMAC YOYO. Convincing people to add metadata describing values in structures as owl/rdfs datatypes or classes is much easier - for example, as HDF5 attributes. If the benefits require major changes to the cultural practices within a given knowledge community, then they must be extremely important *to that community*, and will still be resisted, especially by those most acculturated into that knowledge community. An example of this kind of change might be inclusion in supplemental materials of analyses and data that did not give positive results. This reduces the file drawer effect, and may improve the justified level of belief in the significance of published results (p < 1.0). This level of change may require a blood upgrade (https://www.goodreads.com/quotes/4079-a-new-scientific-truth-does-not-triumph-by-convincing-its). It might also be imposable from above by extreme measures (if more than 10% of your claimed significant results can't be replicated, and you can't provide a reasonable explanation in a court of law, you may be held liable for consequential damages incurred by others reasonably relying on your work, and reasonable costs and possible punitive damages for costs incurred attempting to replicate. Repeat offenders will be fed to a ravenous mob of psychology undergraduates, or forced to teach introductory creative writing). Simon P. S.
[dvips was much easier if you had access to Distiller] It is possible to add mathematical content to html pages, but it is not easy. MathML is not something that browser developers want, which means that the only viable approach is MathJax (http://mathjax.org). MathJax is impressive, and supports a nice subset of LaTeX (including some AMS). However, it adds a noticeable delay to page rendering, as it is heavy-duty ECMAScript, and is computing layout on the fly. It does not require server side support, so is usable from static sites like github pages (see e.g. the tests at the bottom of http://who-wg.github.io). However the common deployment pattern, using their CDN, adds archival dependencies. From a processing perspective, this does not make semantic processing of the text much easier, as it may require ECMAScript code to be executed. On Oct 7, 2014 8:14 AM, Phillip Lord phillip.l...@newcastle.ac.uk wrote: On 10/07/2014 05:20 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which apply default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil So someone has to package this up so that it can be easily used. Before then, how can it be required for
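For reference, the CDN deployment pattern described above is a one-line include plus ordinary TeX in the page (a minimal sketch; TeX-AMS-MML_HTMLorMML is one of MathJax's stock combined configurations):

  <script type="text/javascript"
    src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
  </script>
  <p>When \(a \ne 0\), \(x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\).</p>

Self-hosting the MathJax tree instead of pointing at cdn.mathjax.org removes the archival dependency, at the cost of a larger artifact.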
Re: scientific publishing process (was Re: Cost and access)
On 10/7/14 1:14 PM, Norman Gray wrote: Sarven, hello. On 2014 Oct 7, at 13:13, Sarven Capadisli i...@csarven.ca wrote: On 2014-10-07 11:39, Norman Gray wrote: The original spark to the thread was a lament that SW and LD conferences don't mandate something XMLish for submissions because X(HT)ML is clearly better for... well ... dammit, it's Better. Straw man argument. Please stop that now! I will spell out the main proposal and purpose for you because it sounds like you are completely oblivious to them. Let me know if anything is unclear. My remark was intended as facetious rather than fractious, but if you feel I misjudged the balance, I apologise. I want to clarify what I meant, because on reflection it explains (at least to me) why I'm participating in this thread at such length. My intention was to indicate that I don't feel that HTML is as central as you, amongst others, seem to assert it is. I characterise the web as: 1. URIs for addressing things, 2. HTTP for retrieving things (other protocols exist, but...), 3. a downloadable format which clients can parse to obtain more URIs, with a 'follow this' semantic. How about: 1. HTTP URIs for naming (or identifying) things -- basically, the combined effects of denotation (signification) and connotation (perceptible description) 2. RDF abstract language for describing things -- systematic use of signs, syntax, and role semantics for communication 3. Notations for inscribing RDF language based descriptions to documents -- where notations serve the medium-specific purpose of representing the words of a language. Once you have the base RDF Document in place, using a preferred notation, and subject to viewer preferences, you transform the RDF document into other document types (HTML, PDF, etc.), in line with viewer preferences. Now, the obvious candidate for (3) is of course HTML; but on the web, and _especially_ on the Semantic Web, it can be anything: RDF in one or other format, XML+GRDDL, some discipline-specific format which has a link semantic in it, or even a PDF file with a standardised lump of RDF/XMP inside it. The trouble with the paragraph above is that RDF isn't a format. That presumption is the root of mass confusion. That RDF may be immediately present, or it may require some sort of heuristic or deterministic extraction (as Kingsley has discussed). All of these are web-native technologies, and I'd go as far as to say that the _least_ interesting thing you can find at the end of a URI is an HTML file. For sure! The big deal, for me, in the idea of the Semantic Web, and the RDF world, is the realisation that the RDF model is sufficiently general that you can turn almost any structured data into RDF, put it into a big bucket, and start inferencing, querying, linking, and so on. That generation/extraction of RDF is probably easier if the stuff is already pointy-bracketed for you, but that's only a detail. Yes, which is why we have to think of RDF (accurately) as a Language, and never a format. The format issue is something that should have been attended to years ago in W3C literature, i.e., the notion of abstract and concrete syntaxes leads to the misconception that RDF is about document content formats. The loose coupling of language (signs, syntax, and semantics) and notations (representation of the words of a language) isn't visible, and as a result is lost or overlooked (on a good day). JSON-LD and TURTLE are both accurately pitched (across all related collateral) as Notations.
Funnily enough, each is also associated with significant RDF uptake initiatives: TURTLE re. the LOD Cloud, and JSON-LD re. Google, Bing, Yandex, and possibly Yahoo!, as major RDF supporters and adopters that are driving mass production of HTML documents that include RDF-language based structured data (inline, or via structured data islands using <script> elements). The interesting thing, for me, is just how the web as a whole can go about collectively managing or facilitating this generation/extraction in a way which balances faithfulness to the original with interoperable meaning (Dublin Core and FOAF are truly wonderful things). That is why I do feel that -- especially in this SW/LD community -- HTML is a bit of a sideshow. Yes, it is, but I think Sarven uses it as a simple starting point, i.e., a point of least distraction, so to speak. HTML is a splendid thing for all the reasons that you know and I know, but if it's seen as central, if all questions turn into "what does that look like in HTML?", if it's so in-our-face that we can't see round it, then we miss the interesting questions. Yes! So it's not that I've a particular downer on HTML, or a particular enthusiasm for PDF, but I think that "what does that look like in PDF?" and "what does that look like in FITS?" (the format of choice in my area) are more interesting. Yes. (or put another way, I don't think that
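To make the structured-data-island point concrete, here is a minimal sketch (the values are illustrative): the same RDF language, inscribed in JSON-LD notation, carried inline in an HTML script element:

  <script type="application/ld+json">
  {
    "@context": "http://schema.org",
    "@type": "ScholarlyArticle",
    "name": "An Example Paper",
    "author": { "@type": "Person", "name": "Jane Doe" }
  }
  </script>

The surrounding HTML is then just one possible rendering of the description, which fits the language-versus-notation distinction above.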
Re: scientific publishing process (was Re: Cost and access)
PLOS is an interesting case. The HTML for PLOS articles is relatively readable. However, the HTML that the PLOS setup produces is failing at math, even for articles from August 2014. As well, sometimes when I zoom in or out (so that I can see the math better) Firefox stops displaying the paper, and I have to reload the whole page. Strangely, PLOS accepts low-resolution figures, which in one paper I looked at are quite difficult to read. However, maybe the PLOS method can be improved to the point where the HTML is competitive with PDF. peter This makes me think of PLoS. For example, PLoS has published format guidelines using Word and LaTeX (http://www.plosone.org/static/guidelines), a workflow for semantically structuring the resulting output, and their final output is well structured and available in XML based on a known standard (http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd), PDF, and the published HTML on their website (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011233). This results in semantically meaningful XML that is transformed to HTML: http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0011233&representation=XML Interestingly as well, they have provided this framework in an open source form: http://www.ambraproject.org/ Clearly the publication process can support a semantic solution when it's in the best interest of the publisher: they will adopt and drive their own markup processes to meet external demand. Providing tools that both the publisher and the author may use independently could simplify such an effort, but is not a main driver in achieving that final result you see in PLoS. This is especially the case given even the debate concerning file formats here. For PLoS, the solution that is currently successful is the one that worked to solve today's immediate local need with today's tools. Cheers, Mark p.s. Finally, on the reference of moving repositories such as EPrints and DSpace towards supporting semantic markup of their contents. Being somewhat of a participant in LoD on the DSpace side, I note that these efforts are inherently just Repository Centric, describing the structure of the repository (i.e., Collections of Items), not the semantic structure contained within the Item contents (articles, citations, formulas, data tables, figures, ideas). In both platforms, these capabilities are in their infancy; lacking any rendering other than offering the original file for download, they ultimately suffer from the absence of semantic structure in the content going into them. -- Mark R. Diggory
Re: scientific publishing process (was Re: Cost and access)
On 2014-10-06 06:59, Ivan Herman wrote: Of course, I could expect a Web-technology-related crowd to use HTML source editing directly, but the experience by Daniel and myself with the World Wide Web conference(!) is that people do not want to do that. (Researchers in, say, Web Search have proven to be unable or unwilling to edit HTML source. It was a real surprise...). I.e., the authoring tool offerings are still limited. Can you please elaborate on that? When was that, and what tools were available or used? Do you have any documentation on the landscape from that time that we can use or learn from? My understanding is that you've experienced some issues about a decade ago and your reasoning is clouded by that. Do you think that it would be fair to revisit the situation based on today's landscape and see how it will play out? From my perspective, we should have a bit more faith in the SW community, because then we might actually strive to deliver, as opposed to walking away from the problem. Like I said in my previous emails (which I'm sure you've read), the current workshops on SW/LD research publishing did not deliver. Why do you have so much faith in waiting it out, hoping that they will deliver? They might, and I hope they do. But I'm not putting all my chips on that option alone. I would rather see grass-roots efforts in parallel, e.g., http://csarven.ca/call-for-linked-research What's the number of human hours spent on CfPs on Linked Science + Semantic Publishing so far? How has the delivery of machine- and human-friendly research changed or evolved? What's visible or countable? On that front, what can we do right now that wasn't possible 5-10 years ago? In the meantime, if the conferences and workshops can get back on track and motivate people (at least), we would not only see more value drawn out of the SW research, but also growing funding opportunities, and faster progress across the field. I am disappointed by the fact that instead of addressing the core issue (can the conferences allow or encourage the Web stack?) we are discussing distractions, e.g., perfection in authoring tools. Every user has their own preferences, i.e., some will code, some will use tool X. What you are suggesting is that we wait it out because the developments may reveal the perfect authoring tooling. If that were ever the case, we'd see it in the general market, not something that might one day emerge out of SW/LD workshops. I will bet that if the requirements evolve towards Webby submissions, within 3-5 years' time, we'd see a notable change in how we collect, document and mine scientific research in SW. This is not just being hopeful. I believe that if all of the newcomers into the (academic) research scene start from HTML (and friends) instead of LaTeX/Word (and friends), we wouldn't be having this discussion. If the newcomers are told to deal with LaTeX/Word (regardless of hand coding or using a WYSIWYG editor) today, they are going to do exactly that. That basically pushes the date further for a complete switch-over to Webby tools, because the majority of those researchers would have to be flushed out of the system before the next wave of Webby users can have their chance. Even if we have all of the perfect or appropriate tooling (which I think is the wrong thing to aim for) right now, it will still take a few years to flush out the current LaTeX/Word users or have them evolve. I would rather see the smallest change happen right now than nothing at all. *AGAIN*, technology is not the problem.
#DIY -Sarven http://csarven.ca/#i smime.p7s Description: S/MIME Cryptographic Signature
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: One problem with allowing HTML submission is ensuring that reviewers can correctly view the submission as the authors intended it to be viewed. How would you feel if your paper was rejected because one of the reviewers could not view portions of it? At least with PDF there is a reasonably good chance that every paper can be correctly viewed by all its reviewers, even if they have to print it out. I don't think that the same claim can be made for HTML-based systems. I don't think this is a valid point. It is certainly possible to write HTML that will not look good on every machine, but these days it is easier to write HTML that does. The same is true with PDF. Font problems used to be routine. And, as other people have said, it's very hard to write a PDF that looks good on anything other than paper. Further, why should there be any technical preference for HTML at all? (Yes, HTML is an open standard and PDF is a closed one, but is there anything else besides that?) Web conferences vitally use the web in their reviewing and publishing processes. Doesn't that show their allegiance to the web? Would the use of HTML make a conference more webby? PDF is, I think, open these days. But, yes, I do think that conferences should dogfood. I mean, what would you think if W3C produced all of their documents in PDF? Would that make sense? Phil
Re: scientific publishing process (was Re: Cost and access)
Luca Matteis lmatt...@gmail.com writes: On Sun, Oct 5, 2014 at 4:34 PM, Ivan Herman i...@w3.org wrote: The real problem is still the missing tooling. Authors, even if technically savvy like this community, want to do what they set out to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. But are they still very poor? I mean, I think there are more tools for rendering HTML than there are for rendering Latex. In fact there are probably more tools for rendering HTML than anything else out there, because HTML is used more than anything else. Because HTML powers the Web! You can write in Word, and export in HTML. You can write in Markdown and export in HTML. You can probably write in Latex and export in HTML as well :) Yes, you can. Most of the publishers use XML at some point in their process, and latex gets exported to that. I am quite happy to keep LaTeX as a user interface, because it's very nice, and the tools for it are mature for academic documents (in practice, this means cross-referencing and bibliographies). So, as well as providing an LNCS stylesheet, we'd need an htlatex cf.cfg, and one CSS, and it's done. Be good to have another CSS for on-screen viewing; LNCS's back-of-a-postage-stamp layout is very poor for that. Phil
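For the on-screen CSS, even something this small would already help (a minimal sketch; the values are illustrative, not derived from any LNCS specification):

  <style>
    /* Constrain the measure for on-screen reading. */
    body { max-width: 40em; margin: 0 auto; line-height: 1.5; }
    /* Let print fall back to the publisher's layout. */
    @media print { body { max-width: none; line-height: normal; } }
  </style>

The point is only that screen and print need not share one layout; two stylesheets against the same HTML are cheap.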
Re: scientific publishing process (was Re: Cost and access)
Sarven Capadisli i...@csarven.ca writes: I will bet that if the requirements evolve towards Webby submissions, within 3-5 years' time, we'd see a notable change in how we collect, document and mine scientific research in SW. This is not just being hopeful. I believe that if all of the newcomers into the (academic) research scene start from HTML (and friends) instead of LaTeX/Word (and friends), we wouldn't be having this discussion. If the newcomers are told to deal with LaTeX/Word (regardless of hand coding or using a WYSIWYG editor) today, they are going to do exactly that. I would look at an environment which has less external force. The free software engineering community produces its documents in a very wide range of formats. If you peruse github, the key characteristics are, I think: that they are text formats, because they are easy to version with source and are hackable; and mostly they dump to HTML. PDFs are very rare these days. It would be fun to see what the most used are. Markdown is a big contender, as well as language-specific formats (Python and reStructuredText, for example). I don't believe that HTML is a good authoring format any more than PDF is. I don't see this as a huge problem. HTML needs to be part of the tool-chain, not all of it. Phil
Re: scientific publishing process (was Re: Cost and access)
On 10/6/14 7:43 AM, Phillip Lord wrote: I don't believe that HTML is a good authoring format any more than PDF is. I don't see this as a huge problem. HTML needs to be part of the tool-chain, not all of it. +1 -- Regards, Kingsley Idehen Founder & CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog 1: http://kidehen.blogspot.com Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen Twitter Profile: https://twitter.com/kidehen Google+ Profile: https://plus.google.com/+KingsleyIdehen/about LinkedIn Profile: http://www.linkedin.com/in/kidehen Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this smime.p7s Description: S/MIME Cryptographic Signature
Re: scientific publishing process (was Re: Cost and access)
Hello, My apologies if this is a repost (errors were encountered and my last post bounced from the listserv)... On Sun, Oct 5, 2014 at 1:19 PM, Luca Matteis lmatt...@gmail.com wrote: On Sun, Oct 5, 2014 at 4:34 PM, Ivan Herman i...@w3.org wrote: The real problem is still the missing tooling. Authors, even if technically savvy like this community, want to do what they set out to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. But are they still very poor? I mean, I think there are more tools for rendering HTML than there are for rendering Latex. In fact there are probably more tools for rendering HTML than anything else out there, because HTML is used more than anything else. Because HTML powers the Web! You can write in Word, and export in HTML. You can write in Markdown and export in HTML. You can probably write in Latex and export in HTML as well :) The tools are not the problem. The problem to me is the printing afterwards. Conferences/workshops need to print the publications. Printing consistent Latex/PDF templates is a lot easier than printing inconsistent (layout wise) HTML pages. Best, Luca There are tools; for example, there's already a bit of work to provide a plugin for semantic markup in Microsoft Word (https://ucsdbiolit.codeplex.com/) and similar efforts on the Latex side (https://trac.kwarc.info/sTeX/). But this is not a question of technology available to authors, but of requirements defined by publishers. If authors are too busy for this effort, then publishers facilitate that added value when it is in their best interest. For example, PLoS has published format guidelines using Word and LaTeX (http://www.plosone.org/static/guidelines), a workflow for semantically structuring the resulting output, and their final output is well structured and available in XML based on a known standard (http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd), PDF, and the published HTML on their website (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011233). This results in semantically meaningful XML that is transformed to HTML: http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0011233&representation=XML Clearly the publication process can support solutions when it's in the best interest of the publisher. They will adopt and drive their own markup processes to meet external demand. Providing tools that both the publisher and the author may use independently could simplify such an effort, but is not a main driver in achieving that final result you see in PLoS. This is especially the case given that both file formats and efforts to produce the ideal solution are inherently localized, competitive and diverse, not collaborative in nature. For PLoS, the solution that is currently successful is the one that worked to solve today's immediate local need with today's tools, not the one that was perfectly designed to meet all tomorrow's hypothetical requirements. Cheers, Mark Diggory p.s. Finally, on the reference of moving repositories such as EPrints and DSpace towards supporting semantic markup of their contents.
Being somewhat of a participant in LoD on the DSpace side, I note that these efforts are inherently just Repository Centric, describing the structure of the repository (i.e., collections of files), not the semantic structure contained within those files (ideas, citations, formulas, data tables, figures). In both cases, these capabilities are in their infancy; without any strict format- and content-driven publication workflow, and lacking any rendering other than to offer the file for download, they ultimately suffer from the same need for a common Semantic Document format that can be leveraged for rendering, referencing and indexing. -- *Mark Diggory* *2888 Loker Avenue East, Suite 315, Carlsbad, CA. 92010* *Esperantolaan 4, Heverlee 3001, Belgium* http://www.atmire.com
Re: scientific publishing process (was Re: Cost and access)
Hello Community, On Sun, Oct 5, 2014 at 1:19 PM, Luca Matteis lmatt...@gmail.com wrote: On Sun, Oct 5, 2014 at 4:34 PM, Ivan Herman i...@w3.org wrote: The real problem is still the missing tooling. Authors, even if technically savvy like this community, want to do what they set out to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. But are they still very poor? I mean, I think there are more tools for rendering HTML than there are for rendering Latex. In fact there are probably more tools for rendering HTML than anything else out there, because HTML is used more than anything else. Because HTML powers the Web! You can write in Word, and export in HTML. You can write in Markdown and export in HTML. You can probably write in Latex and export in HTML as well :) The tools are not the problem. The problem to me is the printing afterwards. Conferences/workshops need to print the publications. Printing consistent Latex/PDF templates is a lot easier than printing inconsistent (layout wise) HTML pages. There are tools; for example, there's already a bit of work to provide a plugin for semantic markup in Microsoft Word (https://ucsdbiolit.codeplex.com/) and similar efforts on the Latex side (https://trac.kwarc.info/sTeX/). But this is not a question of technology available to authors, but of requirements defined by publishers. If authors are too busy for this effort, then publishers facilitate that added value when it is in their best interest. For example, PLoS has published format guidelines using Word and LaTeX (http://www.plosone.org/static/guidelines), a workflow for semantically structuring the resulting output, and their final output is well structured and available in XML based on a known standard (http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd), PDF, and the published HTML on their website (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011233). This results in semantically meaningful XML that is transformed to HTML: http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0011233&representation=XML Clearly the publication process can support solutions when it's in the best interest of the publisher. They will adopt and drive their own markup processes to meet external demand. Providing tools that both the publisher and the author may use independently could simplify such an effort, but is not a main driver in achieving that final result you see in PLoS. This is especially the case given that both file formats and efforts to produce the ideal solution are inherently localized, competitive and diverse, not collaborative in nature. For PLoS, the solution that is currently successful is the one that worked to solve today's immediate local need with today's tools, not the one that was perfectly designed to meet all tomorrow's hypothetical requirements. Cheers, Mark Diggory p.s. Finally, on the reference of moving repositories such as EPrints and DSpace towards supporting semantic markup of their contents. Being somewhat of a participant in LoD on the DSpace side, I note that these efforts are inherently just Repository Centric, describing the structure of the repository (i.e., collections of files), not the semantic structure contained within those files (ideas, citations, formulas, data tables, figures).
In both cases, these capabilities are in their infancy; without any strict format- and content-driven publication workflow, and lacking any rendering other than to offer the file for download, they ultimately suffer from the same need for a common Semantic Document format that can be leveraged for rendering, referencing and indexing. -- *Mark Diggory* *2888 Loker Avenue East, Suite 315, Carlsbad, CA. 92010* *Esperantolaan 4, Heverlee 3001, Belgium* http://www.atmire.com
Re: scientific publishing process (was Re: Cost and access)
Frankly I don't see the reason for the hate on PDF files. I do a lot of reading on a tablet these days because I can take it to the gym or on a walk or in the car. Network reliability is not universal when I leave the house (even if I had a $10 a GB LTE plan) so downloaded PDFs are my document format of choice. There might be a lot of hypothetical problems with PDFs, and I am sure there is a better way to view files on a small screen, but practically I have no trouble reading papers from arXiv.org, books from oreilly.com, be these produced by TeX-derived or Word-derived toolchains, or a toolchain that involves a real page layout tool for that matter. On Sun, Oct 5, 2014 at 5:43 PM, Mark Diggory mdigg...@atmire.com wrote: On Sun, Oct 5, 2014 at 2:39 PM, Mark Diggory mdigg...@atmire.com wrote: Hello Community, On Sun, Oct 5, 2014 at 1:19 PM, Luca Matteis lmatt...@gmail.com wrote: On Sun, Oct 5, 2014 at 4:34 PM, Ivan Herman i...@w3.org wrote: The real problem is still the missing tooling. Authors, even if technically savvy like this community, want to do what they set out to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. But are they still very poor? I mean, I think there are more tools for rendering HTML than there are for rendering Latex. In fact there are probably more tools for rendering HTML than anything else out there, because HTML is used more than anything else. Because HTML powers the Web! You can write in Word, and export in HTML. You can write in Markdown and export in HTML. You can probably write in Latex and export in HTML as well :) The tools are not the problem. The problem to me is the printing afterwards. Conferences/workshops need to print the publications. Printing consistent Latex/PDF templates is a lot easier than printing inconsistent (layout wise) HTML pages. There are tools; for example, there's already a bit of work to provide a plugin for semantic markup in Microsoft Word (https://ucsdbiolit.codeplex.com/) and similar efforts on the Latex side (https://trac.kwarc.info/sTeX/). But this is not a question of technology available to authors, but of requirements defined by publishers. If authors are too busy for this effort, then publishers facilitate that added value when it is in their best interest. For example, PLoS has published format guidelines using Word and LaTeX (http://www.plosone.org/static/guidelines), a workflow for semantically structuring the resulting output, and their final output is well structured and available in XML based on a known standard (http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd), PDF, and the published HTML on their website (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011233). This results in semantically meaningful XML that is transformed to HTML: http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0011233&representation=XML Clearly the publication process can support solutions when it's in the best interest of the publisher. They will adopt and drive their own markup processes to meet external demand. Providing tools that both the publisher and the author may use independently could simplify such an effort, but is not a main driver in achieving that final result you see in PLoS.
This is especially the case given that both file formats and efforts to produce the ideal solution are inherently localized, competitive and diverse, not collaborative in nature. For PLoS, the solution that is currently successful is the one that worked to solve today's immediate local need with today's tools, not the one that was perfectly designed to meet all tomorrow's hypothetical requirements. Cheers, Mark Diggory p.s. Finally, on the reference of moving repositories such as EPrints and DSpace towards supporting semantic markup of their contents. Being somewhat of a participant in LoD on the DSpace side, I note that these efforts are inherently just Repository Centric, describing the structure of the repository (i.e., collections of files), not the semantic structure contained within those files (ideas, citations, formulas, data tables, figures). In both cases, these capabilities are in their infancy; without any strict format- and content-driven publication workflow, and lacking any rendering other than to offer the file for download, they ultimately suffer from the same need for a common Semantic Document format that can be leveraged for rendering, referencing and indexing. -- *Mark Diggory* *2888 Loker Avenue East, Suite 315, Carlsbad, CA. 92010* *Esperantolaan 4, Heverlee 3001, Belgium* http://www.atmire.com -- *Mark Diggory* *2888 Loker Avenue East, Suite
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 04:15 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: One problem with allowing HTML submission is ensuring that reviewers can correctly view the submission as the authors intended it to be viewed. How would you feel if your paper was rejected because one of the reviewers could not view portions of it? At least with PDF there is a reasonably good chance that every paper can be correctly viewed by all its reviewers, even if they have to print it out. I don't think that the same claim can be made for HTML-based systems. I don't think this is a valid point. It is certainly possible to write HTML that will not look good on every machine, but these days it is easier to write HTML that does. The same is true with PDF. Font problems used to be routine. And, as other people have said, it's very hard to write a PDF that looks good on anything other than paper. My aesthetics are different. I routinely view PDFs on my laptop, and find that they indeed look great. As I said before, I prefer PDF to HTML for viewing just about any technical material on my computers. Yes, on limited displays two-column PDF may not be viewable at all. Single-column PDF should look good on displays with resolution of HD or better. When I view HTML documents, even the ones I have written, I have to do a lot of adjusting to get something that looks even half-decent on the screen. And when I print HTML documents, the result is invariably bad, and often very bad. However, my point was not about looking good. It was about being able to see the paper in the way that the author intended. My experience is that this is generally possible with PDF, but generally not possible with HTML. I do write papers with considerable math in them, so my experience may not be typical, but whenever I have tried to produce HTML versions of my papers, I have ended up quite frustrated because even I cannot get them to display the way I want them to. It may be that there are now good tools for producing HTML that carries the intent of the author. htlatex has been mentioned in this thread. A solution that uses htlatex would have the benefit of building on much of the work that has been done to make latex a reasonable technology for producing papers. If someone wants to create the necessary infrastructure to make htlatex work as well as pdflatex does, then feel free. Further, why should there be any technical preference for HTML at all? (Yes, HTML is an open standard and PDF is a closed one, but is there anything else besides that?) Web conferences vitally use the web in their reviewing and publishing processes. Doesn't that show their allegiance to the web? Would the use of HTML make a conference more webby? PDF is, I think, open these days. But, yes, I do think that conferences should dogfood. I mean, what would you think if W3C produced all of their documents in PDF? Would that make sense? Actually, I would have been very happy if W3C had produced all its technical documents in PDF. It would have made my life much easier. Phil peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 04:27 AM, Phillip Lord wrote: [On using htlatex for conferences.] So, as well as providing an LNCS stylesheet, we'd need an htlatex cf.cfg and one CSS, and it's done. It'd be good to have another CSS for on-screen viewing; LNCS's back-of-a-postage-stamp layout is very poor for that. Phil I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same; the other one caused an undefined control sequence and only produced one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. Many non-scalable images were included, even for simple math. My carefully designed layout for examples was modified in ways that made the examples harder to understand. The footnotes did not show up at all in the printed version. That said, the result was better than I expected. If someone upgrades htlatex to work well I'm quite willing to use it, but I expect that a lot of work is going to be needed. peter
Re: scientific publishing process (was Re: Cost and access)
On 10/6/14 10:25 AM, Paul Houle wrote: Frankly I don't see the reason for the hate on PDF files. I do a lot of reading on a tablet these days because I can take it to the gym or on a walk or in the car. Network reliability is not universal when I leave the house (even if I had a $10 a GB LTE plan) so downloaded PDFs are my document format of choice. There might be a lot of hypothetical problems with PDFs, and I am sure there is a better way to view files on a small screen, but practically I have no trouble reading papers from arXiv.org or books from oreilly.com, be they produced by TeX-derived or Word-derived toolchains, or a toolchain that involves a real page layout tool for that matter. Paul, As I see it, the issue here is more to do with PDF being the only option, rather than no PDFs at all. Put differently, we are not using our horses-for-courses technology (the Web that emerges from AWWW exploitation) to produce horses-for-courses conference artifacts. Instead, we continue to impose (overtly or covertly) specific options that are contradictory, and of diminishing value. Conferences (associated with themes like Semantic Web and Linked Open Data) should accept submissions that provide open access to relevant research data. In a sense, imagine if PDFs were submitted without bibliographic references. Basically, that's what's happening here with research data circa 2014, where we have a functioning Web of Linked (Open) Data, which is based on AWWW. Loosely coupling the print-friendly documents (PDF, LaTeX etc.), browser-friendly documents (HTML), and actual raw data references (which take the form of 5-Star Linked Open Data) is a practical starting point. Adding experiment workflow (which is also becoming the norm in the bioinformatics realm) is a nice bonus, as already demonstrated by examples provided by Hugh Glaser (see: this weekend's thread). Kingsley On Sun, Oct 5, 2014 at 5:43 PM, Mark Diggory mdigg...@atmire.com wrote: On Sun, Oct 5, 2014 at 2:39 PM, Mark Diggory mdigg...@atmire.com wrote: Hello Community, On Sun, Oct 5, 2014 at 1:19 PM, Luca Matteis lmatt...@gmail.com wrote: On Sun, Oct 5, 2014 at 4:34 PM, Ivan Herman i...@w3.org wrote: The real problem is still the missing tooling. Authors, even if technically savvy like this community, want to do what they set out to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. But are they still very poor? I mean, I think there are more tools for rendering HTML than there are for rendering LaTeX. In fact there are probably more tools for rendering HTML than anything else out there, because HTML is used more than anything else. Because HTML powers the Web! You can write in Word, and export in HTML. You can write in Markdown and export in HTML. You can probably write in LaTeX and export in HTML as well :) The tools are not the problem. The problem to me is the printing afterwards. Conferences/workshops need to print the publications. Printing consistent LaTeX/PDF templates is a lot easier than printing inconsistent (layout-wise) HTML pages.
There are tools. For example, there's already a bit of work to provide a plugin for semantic markup in Microsoft Word (https://ucsdbiolit.codeplex.com/) and similar efforts on the LaTeX side (https://trac.kwarc.info/sTeX/). But this is not a question of technology available to authors, but of requirements defined by publishers. If authors are too busy for this effort, then publishers facilitate that added value when it is in their best interest. For example, PLoS has published format guidelines for Word and LaTeX (http://www.plosone.org/static/guidelines), a workflow for semantically structuring their resulting output, and their final output is well structured and available in XML based on a known standard (http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd), PDF, and the published HTML on their website (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011233). This results in semantically meaningful XML that is transformed to HTML.
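As a minimal sketch of what that structured XML enables, here is how one might pull the title and reference count out of a JATS/NLM-style article file in Python (the filename is hypothetical; the element paths follow the journal publishing DTD linked above):

# Sketch: read basic structure out of an NLM/JATS journal article XML.
# "journal.pone.0011233.xml" is a hypothetical local filename.
import xml.etree.ElementTree as ET

tree = ET.parse("journal.pone.0011233.xml")
root = tree.getroot()

# Element paths follow the NLM journal publishing DTD; findtext returns
# only the direct text, ignoring any inline markup inside the title.
title = root.findtext("front/article-meta/title-group/article-title")
refs = root.findall("back/ref-list/ref")

print("Title:", title)
print("References:", len(refs))

Once the structure is explicit like this, rendering, referencing and indexing can all be driven from the same source.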
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: However, my point was not about looking good. It was about being able to see the paper in the way that the author intended. Yes, I understand this. It's not something that I consider at all important, which perhaps represents our different viewpoints. Readers have different preferences. I prefer reading in inverse video; I like to be able to change font size to zoom in and out. I quite like fixed-width fonts. Other people like the two-column thing. Other people want things read to them. Who cares what the authors intend? I mean, they are not reading the paper, are they? I do write papers with considerable math in them, so my experience may not be typical, but whenever I have tried to produce HTML versions of my papers, I have ended up quite frustrated because even I cannot get them to display the way I want them to. I've been using mathjax on my website for a long time and it seems to work well, although I am not maths heavy. It may be that there are now good tools for producing HTML that carries the intent of the author. htlatex has been mentioned in this thread. A solution that uses htlatex would have the benefit of building on much of the work that has been done to make latex a reasonable technology for producing papers. If someone wants to create the necessary infrastructure to make htlatex work as well as pdflatex does, then feel free. It's more to make htlatex work as well as lncs.sty works. htlatex produces reasonable, if dull, HTML off the bat. Phil
Re: scientific publishing process (was Re: Cost and access)
On Mon, Oct 6, 2014 at 5:29 PM, Phillip Lord phillip.l...@newcastle.ac.uk wrote: Who cares what the authors intend? I mean, they are not reading the paper, are they? Authors might have adjusted things that way specifically to deliver their message. I think being able to have consistent layouts *as the authors intend it* is a very important thing. It's also important on the Web: people want their site's look and feel to be very specific and consistent.
Re: scientific publishing process (was Re: Cost and access)
This is an incredibly rich and interesting conversation. I think there are two separate themes:
1. What is required and/or asked for by the conference organizers...
a. ...that is needed for the review process
b. ...that is needed to implement value-added services for the conference
c. ...that contributes to the body of work
2. What is required and/or asked for by the publisher?
All of (1) is about the meat of the contributions, including establishing a long-term legacy. (2) is about (presumably) prestigious output. What added services could EasyChair in particular provide that would go beyond 1.a and contribute to 1.b, 1.c, etc.? Are there any EasyChair committers watching this thread? ;) John On Mon, Oct 6, 2014 at 11:17 AM, Kingsley Idehen kide...@openlinksw.com wrote: [...]
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same, the other one caused an undefined control sequence and only produced one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. The footnote thing is pretty strange, I have to agree. Although footnotes are a fairly alien concept wrt the web. Probably hover-overs would be a reasonable presentation for this. Many non-scalable images were included, even for simple math. It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. My carefully designed layout for examples was modified in ways that made the examples harder to understand. Perhaps this is a key difference between us. I don't care about the layout, and want someone to do it for me; it's one of the reasons I use latex as well. That said, the result was better than I expected. If someone upgrades htlatex to work well I'm quite willing to use it, but I expect that a lot of work is going to be needed. Which gets us back to the chicken and egg situation. I would probably do this; but, at the moment, ESWC and ISWC won't let me submit it. So, I'll end up with the PDF output anyway. This is why it is important that web conferences allow HTML, which is where the argument started. If you want something that prints just right, PDF is the thing for you. If you want to read your papers in the bath, likewise, PDF is the thing for you. And that's fine by me (so long as you don't mind me reading your papers in the bath!). But it needs to not be the only option. Phil
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 08:38 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same, the other one caused an undefined control sequence and only produced one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. The footnote thing is pretty strange, I have to agree. Although footnotes are a fairly alien concept wrt the web. Probably hover-overs would be a reasonable presentation for this. Many non-scalable images were included, even for simple math. It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. I don't know what the way to do this right would be, I just know that the version of htlatex for Fedora 20 fails to reasonably handle the math in this paper. My carefully designed layout for examples was modified in ways that made the examples harder to understand. Perhaps this is a key difference between us. I don't care about the layout, and want someone to do it for me; it's one of the reasons I use latex as well. There are many cases where line breaks and indentation are important for understanding. Getting this sort of presentation right in latex is a pain for starters, but when it has been done, having the htlatex toolchain mess it up is a failure. That said, the result was better than I expected. If someone upgrades htlatex to work well I'm quite willing to use it, but I expect that a lot of work is going to be needed. Which gets us back to the chicken and egg situation. I would probably do this; but, at the moment, ESWC and ISWC won't let me submit it. So, I'll end up with the PDF output anyway. Well, I'm with ESWC and ISWC here. The review process should be designed to make reviewing easy for reviewers. Until viewing HTML output is as trouble-free as viewing PDF output, PDF should be the required format. This is why it is important that web conferences allow HTML, which is where the argument started. If you want something that prints just right, PDF is the thing for you. If you want to read your papers in the bath, likewise, PDF is the thing for you. And that's fine by me (so long as you don't mind me reading your papers in the bath!). But it needs to not be the only option. Why? What are the benefits of HTML reviewing, right now? What are the benefits of HTML publishing, right now? If there were HTML-based tools that worked well for preparing, reviewing, and reading scientific papers, then maybe conferences would use them. However, conference organizers and reviewers have limited time, and are thus going for the simplest solution that works well. If some group thinks that a good HTML-based solution is possible, then let them produce this solution. If the group can get pre-approval of some conference, then more power to them. However, I'm not going to vote for any pre-approval of some future solution when the current situation is satisficing. Phil peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 08:29 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: However, my point was not about looking good. It was about being able to see the paper in the way that the author intended. Yes, I understand this. It's not something that I consider at all important, which perhaps represents our different viewpoints. Readers have different preferences. I prefer reading in inverse video; I like to be able to change font size to zoom in and out. I quite like fixed-width fonts. Other people like the two-column thing. Other people want things read to them. Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. I do write papers with considerable math in them, so my experience may not be typical, but whenever I have tried to produce HTML versions of my papers, I have ended up quite frustrated because even I cannot get them to display the way I want them to. I've been using mathjax on my website for a long time and it seems to work well, although I am not maths heavy. It may be that there are now good tools for producing HTML that carries the intent of the author. htlatex has been mentioned in this thread. A solution that uses htlatex would have the benefit of building on much of the work that has been done to make latex a reasonable technology for producing papers. If someone wants to create the necessary infrastructure to make htlatex work as well as pdflatex does, then feel free. It's more to make htlatex work as well as lncs.sty works. htlatex produces reasonable, if dull, HTML off the bat. My experience is that htlatex produces very bad output. Phil peter
Re: scientific publishing process (was Re: Cost and access)
Dear Peter, please show me how to query PDFs with SPARQL. Then I'll believe there are no benefits of XHTML+RDFa over PDF. Addressing the issue from the reviewer perspective only is too narrow, don't you think? Martynas On Mon, Oct 6, 2014 at 6:08 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: [...]
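To make the comparison concrete, here is a minimal sketch of the kind of query Martynas has in mind, in Python with rdflib, assuming the paper's RDFa has already been extracted to Turtle (the document URI, properties and triples below are illustrative, not from any real paper):

# Sketch: SPARQL over paper metadata, assuming the XHTML+RDFa has
# already been distilled into Turtle. All data here is illustrative.
from rdflib import Graph

ttl = """
@prefix dc: <http://purl.org/dc/terms/> .
<http://example.org/paper> dc:title "An Example Paper" ;
    dc:creator "A. Author" ;
    dc:references <http://example.org/cited-work> .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

q = """
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?title ?cited WHERE {
    ?paper dc:title ?title ;
           dc:references ?cited .
}
"""
for row in g.query(q):
    print(row.title, row.cited)

The point is that the query runs over structure the document itself asserts (title, citations), rather than over text heuristically recovered from a layout.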
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. Yeah, you have to tell it to do mathml. The problem is that older versions of the browsers don't render mathml, and image rendering was the only option. I don't know what the way to do this right would be, I just know that the version of htlatex for Fedora 20 fails to reasonably handle the math in this paper. There are many cases where line breaks and indentation are important for understanding. Getting this sort of presentation right in latex is a pain for starters, but when it has been done, having the htlatex toolchain mess it up is a failure. Indeed. I believe that there are plans in future versions of HTML to introduce a pre tag which preserves indentation and line breaks. Which gets us back to the chicken and egg situation. I would probably do this; but, at the moment, ESWC and ISWC won't let me submit it. So, I'll end up with the PDF output anyway. Well, I'm with ESWC and ISWC here. The review process should be designed to make reviewing easy for reviewers. I *only* use PDF when reviewing. I never use it for viewing anything else. I only use it for reviewing since I am forced to. Experiences differ, so I find this a far from compelling argument. This is why it is important that web conferences allow HTML, which is where the argument started. Why? What are the benefits of HTML reviewing, right now? What are the benefits of HTML publishing, right now? Well, we've been through this before, so I'll not repeat myself. Phil
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? Of course, this is an extreme example, although not an unrealistic one. Is it fundamentally any different from my desire, as I get older, to be able to change font size and refill paragraphs with ease? I see a difference of scale, that is all. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. Yes, I agree. Which is why, I believe, the rendering of a paper should be up to the reader. Phil
Re: scientific publishing process (was Re: Cost and access)
It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. No, I don't think that viewing this issue from the reviewer perspective is too narrow. Reviewers form a vital part of the scientific publishing process. Anything that makes their jobs harder or the results that they produce worse is going to have to have very large benefits over the current setup. In any case, I haven't been looking at the reviewer perspective only, even in the message quoted below. peter PS: This is *not* to say that I think that the reviewing process is anywhere near ideal. On the contrary, I think that the reviewing process has many problems, particularly as it is performed in CS conferences. On 10/06/2014 09:19 AM, Martynas Jusevičius wrote: Dear Peter, please show me how to query PDFs with SPARQL. Then I'll believe there are no benefits of XHTML+RDFa over PDF. Addressing the issue from the reviewer perspective only is too narrow, don't you think? Martynas [...]
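A minimal sketch of the extract-and-convert step Peter describes, using pypdf for the document information dictionary and rdflib for the RDF (the filename, the document URI and the choice of Dublin Core predicates are assumptions for illustration, not a fixed workflow):

# Sketch: extract PDF document metadata and turn it into RDF.
# "paper.pdf" is a hypothetical filename; predicates are Dublin Core.
from pypdf import PdfReader
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/terms/")

reader = PdfReader("paper.pdf")
meta = reader.metadata  # the PDF's document information dictionary

g = Graph()
doc = URIRef("http://example.org/paper")
if meta is not None and meta.title:
    g.add((doc, DC.title, Literal(meta.title)))
if meta is not None and meta.author:
    g.add((doc, DC.creator, Literal(meta.author)))

print(g.serialize(format="turtle"))

Note that this captures only what the producing toolchain bothered to record; a PDF whose metadata was never filled in yields an almost empty graph.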
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 09:28 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. Yeah, you have to tell it to do mathml. The problem is that older versions of the browsers don't render mathml, and image rendering was the only option. Well, then someone is going to have to tell people how to do this. What I saw for htlatex was that it just did the right thing. I don't know what the way to do this right would be, I just know that the version of htlatex for Fedora 20 fails to reasonably handle the math in this paper. There are many cases where line breaks and indentation are important for understanding. Getting this sort of presentation right in latex is a pain for starters, but when it has been done, having the htlatex toolchain mess it up is a failure. Indeed. I believe that there are plans in future versions of HTML to introduce a pre tag which preserves indentation and line breaks. Which gets us back to the chicken and egg situation. I would probably do this; but, at the moment, ESWC and ISWC won't let me submit it. So, I'll end up with the PDF output anyway. Well, I'm with ESWC and ISWC here. The review process should be designed to make reviewing easy for reviewers. I *only* use PDF when reviewing. I never use it for viewing anything else. I only use it for reviewing since I am forced to. Experiences differ, so I find this a far from compelling argument. It may not be a compelling argument when choosing between two new alternatives, but it is a much more compelling argument against change. This is why it is important that web conferences allow HTML, which is where the argument started. Why? What are the benefits of HTML reviewing, right now? What are the benefits of HTML publishing, right now? Well, we've been through this before, so I'll not repeat myself. Phil Yes, and I haven't seen any benefits over the current setup. peter
Re: scientific publishing process (was Re: Cost and access)
Following the same logic, we could still be using paper submissions? All you have to do is to scan them to turn them into PDFs. It's been a while since I was in the university, but wasn't dissemination an important part of science? What about dogfooding after all? Martynas On Mon, Oct 6, 2014 at 6:48 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: [...]
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. Of course, this is an extreme example, although not an unrealistic one. Is it fundamentally any different from my desire, as I get older, to be able to change font size and refill paragraphs with ease? I see a difference of scale, that is all. I see these as completely different. There are some aspects of rendering that generally do not interfere with intent. There are other aspects of rendering that can easily interfere with intent. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. Yes, I agree. Which is why, I believe, the rendering of a paper should be up to the reader. Phil And this is why I believe that the authors should be able to specify the rendering of their paper to the extent that they feel is needed to convey the intent of the paper. peter
Re: scientific publishing process (was Re: Cost and access)
I would be much more generic here: show me how to query a bunch of PDFs with anything... Of course, the answer will go like: you can extract the text and do A and then B and then get a relatively decent text depending on A, B and C. Then someone else will chime in and say this is just because people don't know how to generate PDFs; if one generates a PDF using Adobe tools like A, B and C then the PDF will be perfect for text mining and bla bla bla. PDF is OK for a consistent layout; HTML is great for what it was created for. But neither of those formats, AFAIK, was conceived or engineered for scientific papers: executable, self-describing, embedded within the web of data, etc. On Mon, Oct 6, 2014 at 9:19 AM, Martynas Jusevičius marty...@graphity.org wrote: Dear Peter, please show me how to query PDFs with SPARQL. Then I'll believe there are no benefits of XHTML+RDFa over PDF. Addressing the issue from the reviewer perspective only is too narrow, don't you think? Martynas [...] -- Alexander Garcia http://www.alexandergarcia.name/ http://www.usefilm.com/photographer/75943.html http://www.linkedin.com/in/alexgarciac
Re: scientific publishing process (was Re: Cost and access)
It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. In the age of the web of data, why should I restrict my search just to metadata? I want the full content, open access or not. Once I have the document I should be able to mine its content. I don't want to limit my search to simple metadata. On Mon, Oct 6, 2014 at 9:48 AM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: [...]
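As a sketch of the full-content mining Alexander is asking for, text extraction is the usual starting point, here with pypdf (the filename and search term are hypothetical, and extraction quality depends heavily on how the PDF was generated, which is exactly his complaint):

# Sketch: mine the full text of a PDF, not just its metadata.
# "paper.pdf" and the search term are hypothetical.
from pypdf import PdfReader

reader = PdfReader("paper.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Naive content query: sentences mentioning a term of interest.
hits = [s.strip() for s in text.split(".") if "ontology" in s.lower()]
print(len(hits), "matching sentences")

Everything past this point (sentence splitting, table recovery, formula recognition) is heuristic reconstruction of structure the PDF never stored, which is the asymmetry with a semantic document format.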
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:28 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. Yeah, you have to tell it to do mathml. The problem is that older versions of the browsers don't render mathml, and image rendering was the only option. Well, then someone is going to have to tell people how to do this. What I saw for htlatex was that it just did the right thing. So, htlatex is part of TeX4ht, which does HTML. If you do xhmlatex then you get XHTML with, indeed, math mode in MathML. So, for example, this output comes with the default xhmlatex: <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>e</mi><mo class="MathClass-rel">=</mo><mi>m</mi><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup></math> tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which supply default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil
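For reference, the LaTeX input that xhmlatex turns into the MathML above is just ordinary inline math, e.g.:

% Inline math in the LaTeX source; xhmlatex emits the MathML shown above.
$e = mc^{2}$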
Re: scientific publishing process (was Re: Cost and access)
I don't think that scanning a printout retains any metadata that was in the electronic source so, no, this would not follow using the same logic. I do agree that dissemination of results is one of the most important parts of the scientific process. The argument here is, I think, about what is the best way to support dissemination. Eating your own dog food is a separate matter, I think. Eating your own dog food may help with uptake, but on the other hand it may interfere with dissemination, by making preparation of papers harder or making them harder to review or read. peter On 10/06/2014 10:09 AM, Martynas Jusevičius wrote: Following the same logic, we could still be using paper submissions? All you have to do is to scan them to turn them into PDFs. It's been a while since I was in the university, but wasn't dissemination an important part of science? What about dogfooding after all? Martynas On Mon, Oct 6, 2014 at 6:48 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: [...]
Re: scientific publishing process (was Re: Cost and access)
Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. It *is* exactly what you are talking about. If I want to render your document to speech, then why should I not? What I am saying is that you, the author, should not wish to constrain the rendering, only really the content. Effectively, if you are using latex, you are already doing this, since latex defines the layout and not you. But I think we are talking in too abstract terms here. Should you be able to constrain indentation for code blocks? Yes, of course you should. But a quick look at the web shows that people do this all the time. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. Yes, I agree. Which is why, I believe, the rendering of a paper should be up to the reader. And this is why I believe that the authors should be able to specify the rendering of their paper to the extent that they feel is needed to convey the intent of the paper. For scientific papers, I think this really is not very far. I mean, a scientific paper is not a fashion store; it's a story designed to persuade with data. I would like to see papers which are in the hands of the reader as much as possible. Citation format should be for the reader. Math presentation too. Graphs should be interactive and zoomable, with the data underneath as CSV. All of these are possible and routine with HTML now. I want to be free to choose the organisation of my papers so that I can convey what I want. At the moment, I cannot. PDF is not reasonable for all of this, maybe not even most of it. But some. Phil
Re: scientific publishing process (was Re: Cost and access)
Sure. So extract the text from the PDF and query that. It also would be nice to have access to the LaTeX sources. What HTML publishing *might* have that is better than the above is to more easily embed some extra information into papers that can be queried. Is this just metadata that could also be easily injected into PDFs? If given this capability will a significant number of authors use it? Is it instead better to have a separate document that has the information and not use HTML for publishing? peter On 10/06/2014 10:42 AM, Alexander Garcia Castro wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. in the age of the web of data why should I restrict my search just to metadata? I want the full content, open access or not once I have the document I should be able to mine the content of the document. I dont want to limit my search just to simple metadata. On Mon, Oct 6, 2014 at 9:48 AM, Peter F. Patel-Schneider pfpschnei...@gmail.com mailto:pfpschnei...@gmail.com wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. No, I don't think that viewing this issue from the reviewer perspective is too narrow. Reviewers form a vital part of the scientific publishing process. Anything that makes their jobs harder or the results that they produce worse is going to have to have very large benefits over the current setup. In any case, I haven't been looking at the reviewer perspective only, even in the message quoted below. peter PS: This is *not* to say that I think that the reviewing process is anywhere near ideal. On the contrary, I think that the reviewing process has many problems, particularly as it is performed in CS conferences. On 10/06/2014 09:19 AM, Martynas Jusevičius wrote: Dear Peter, please show me how to query PDFs with SPARQL. Then I'll believe there are no benefits of XHTML+RDFa over PDF. Addressing the issue from the reviewer perspective only is too narrow, don't you think? Martynas On Mon, Oct 6, 2014 at 6:08 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com mailto:pfpschnei...@gmail.com wrote: On 10/06/2014 08:38 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com mailto:pfpschnei...@gmail.com writes: I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same, the other one caused an undefined control sequence and only produced one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. The footnote thing is pretty strange, I have to agree. Although footnotes are a fairly alien concept wrt to the web. Probably hover overs would be a reasonable presentation for this. Many non-scalable images were included, even for simple math. It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. 
I don't know what the right way to do this would be; I just know that the version of htlatex for Fedora 20 fails to reasonably handle the math in this paper. My carefully designed layout for examples was modified in ways that made the examples harder to understand. Perhaps this is a key difference between us. I don't care about the layout, and want someone to do it for me; it's one of the reasons I use LaTeX as well. There are many cases where line breaks and indentation are important for understanding. Getting this sort of presentation right in LaTeX is a pain for starters, but when it has been done,
Re: scientific publishing process (was Re: Cost and access)
On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. Peter, Having had 200+ {some-non-rdf-doc} to RDF document transformers built under my direct guidance, there are issues with your claim above: 1. The extractors are platform specific -- AWWW is about platform agnosticism (I don't want to mandate an OS for experiencing the power of Linked Open Data transformers / rdfizers) 2. It isn't solely about metadata -- we also have raw data inside these documents confined to tables and paragraphs of sentences 3. If querying a PDF was marginally simple, I would be demonstrating that using a SPARQL results URL in response to this post :-) Possible != Simple and Productive. We want to leverage the productivity and simplicity that AWWW brings to data representation, access, interaction, and integration. -- Regards, Kingsley Idehen Founder & CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog 1: http://kidehen.blogspot.com Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen Twitter Profile: https://twitter.com/kidehen Google+ Profile: https://plus.google.com/+KingsleyIdehen/about LinkedIn Profile: http://www.linkedin.com/in/kidehen Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 10:44 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:28 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow PNG files are being produced for some math, which is a failure. Yeah, you have to tell it to do MathML. The problem is that older versions of the browsers don't render MathML, and image rendering was the only option. Well, then someone is going to have to tell people how to do this. What I saw for htlatex was that it just did the right thing. So, htlatex is part of TeX4Ht which does HTML. If you do xhmlatex then you get XHTML with, indeed, math mode in MathML. So, for example, this output comes with the default xhmlatex: <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>e</mi> <mo class="MathClass-rel">=</mo> <mi>m</mi><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup></math> tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which do default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil So someone has to package this up so that it can be easily used. Before then, how can it be required for conferences? I have tex4ht installed, but there is no xhmlatex file to be found. I managed to find what appears to be a good command line: htlatex schema-org-analysis.tex xhtml,mathml -cunihtf -cvalidate This looks better when viewed, but the resultant HTML is unintelligible. There is definitely more work needed here before this can be considered as a potential solution. peter
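If all one wants is the LaTeX-math-to-MathML step on its own, outside the tex4ht machinery, a small script can do it. A minimal sketch, assuming the third-party latex2mathml Python package (not something the thread itself uses); the input string is the e = mc^2 example above:

import latex2mathml.converter

# convert a LaTeX math fragment into a MathML element suitable for
# client-side rendering, instead of baking it into a PNG
mathml = latex2mathml.converter.convert(r"e = mc^2")
print(mathml)  # <math ...><mi>e</mi><mo>=</mo><mi>m</mi><msup>...</msup></math>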
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 11:00 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should, for example, be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. It *is* exactly what you are talking about. Well, maybe I was not being clear, but I thought that I was talking about rendering changes interfering with comprehension of the authors' intent. peter [...]
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 11:00 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should, for example, be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. It *is* exactly what you are talking about. If I want to render your document to speech, then why should I not? What I am saying is that you, the author, should not wish to constrain the rendering, only really the content. Effectively, if you are using LaTeX, you are already doing this, since LaTeX defines the layout and not you. But I think we are talking in too abstract terms here. Should you be able to constrain indentation for code blocks? Yes, of course, you should. But, a quick look at the web shows that people do this all the time. Sure, and htlatex appears to interfere with this indentation. At least it does in my ISWC paper. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. Yes, I agree. Which is why, I believe, the rendering of a paper should be up to the reader. And this is why I believe that the authors should be able to specify the rendering of their paper to the extent that they feel is needed to convey the intent of the paper. For scientific papers, I think this really is not very far. I mean, a scientific paper is not a fashion store; it's a story designed to persuade with data. I would like to see papers which are in the hands of the reader as much as possible. Citation format should be for the reader. Math presentation too. Graphs should be interactive and zoomable, with the data underneath as CSV. All of these are possible and routine with HTML now. I want to be free to choose the organisation of my papers so that I can convey what I want. At the moment, I cannot. The PDF is not reasonable for all, maybe not even most of this. But some. Phil So, you believe that there is an excellent set of tools for preparing, reviewing, and reading scientific publications. Package them up and make them widely available. If they are good, people will use them. Convince those who run conferences. If these people are convinced, then they will allow their use in conferences or maybe even require their use. I'm not convinced by what I'm seeing right now, however. peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 11:03 AM, Kingsley Idehen wrote: On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. Peter, Having had 200+ {some-non-rdf-doc} to RDF document transformers built under my direct guidance, there are issues with your claim above: Huh? Every single PDF reader that I use can extract the PDF metadata and display it. The metadata that I see in PDF documents uses a core set of properties that are easy to transform into RDF. Of course, this core set is very small (title, author, and a few other things) so you don't get all that much out of the core set. 1. The extractors are platform specific -- AWWW is about platform agnosticism (I don't want to mandate an OS for experiencing the power of Linked Open Data transformers / rdfizers) Well, the extractors would be specific to PDF, but that's hardly surprising, I think. 2. It isn't solely about metadata -- we also have raw data inside these documents confined to tables and paragraphs of sentences Well, sure, but is extracting information directly from the figures or tables or text being considered here? I sure would like this to be possible. How would it work in an HTML context? 3. If querying a PDF was marginally simple, I would be demonstrating that using a SPARQL results URL in response to this post :-) I'm not saying that it is so simple. You do have to find the metadata block in the PDF and then look for the /Title, /Author, ... stuff. Possible != Simple and Productive. Yes, but there are lots of tools that display PDF metadata, so there are some who believe that the benefit is greater than the cost. We want to leverage the productivity and simplicity that AWWW brings to data representation, access, interaction, and integration. Sure, but the additional costs, if any, on paper authors, reviewers, and readers have to be considered. If these costs are eliminated or at least minimized then this good is much more likely to be realized. peter
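To make the extract-then-query recipe concrete, here is a minimal sketch, assuming the pypdf and rdflib Python packages; the file name paper.pdf is hypothetical. It covers only the small /Title, /Author-style core set Peter describes, not the tables and running text Kingsley is after:

from pypdf import PdfReader
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

# pull the core document-information metadata out of the PDF
reader = PdfReader("paper.pdf")
meta = reader.metadata

# turn it into RDF...
g = Graph()
doc = URIRef("file:paper.pdf")
if meta.title:
    g.add((doc, DC.title, Literal(meta.title)))
if meta.author:
    g.add((doc, DC.creator, Literal(meta.author)))

# ...and query it with SPARQL
for row in g.query("SELECT ?title WHERE { ?doc dc:title ?title }", initNs={"dc": DC}):
    print(row.title)

This is possible, as Peter says; whether it is simple and productive enough is the point under dispute.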
Re: scientific publishing process (was Re: Cost and access)
Luca Matteis lmatt...@gmail.com writes: On Mon, Oct 6, 2014 at 5:29 PM, Phillip Lord wrote: Who cares what the authors intend? I mean, they are not reading the paper, are they? Authors might have adjusted things that way specifically to deliver their message. I think being able to have consistent layouts *as the authors intend it* is a very important thing. It's also important on the Web: people want their site to look and feel in a very specific and consistent way. Well, it's also why we now have things like the Stylish and Greasemonkey add-ons for Firefox, and the http://userstyles.org/ resource on the Web (not to mention the whole world of “unusual” Web browsers, such as Lynx.) That is: the /readers/ too want to tailor that “look and feel” to /their/ tastes, to get rid of the poor design choices of the Web publishers, and to thus improve their “Web reading experience.” -- FSF associate member #7257 http://boycottsystemd.org/ … 3013 B6A0 230E 334A
Re: scientific publishing process (was Re: Cost and access)
On 10/6/14 2:19 PM, Alexander Garcia Castro wrote: querying PDFs is NOT simple and requires a lot of work -and usually produces lots of errors. Yes, I believe I indicated that in my response to Peter i.e., it isn't simple or productive. just querying metadata is not enough. Yes, I said that too i.e., we want access to raw data. As I said before, I understand the PDF as something that gives me a uniform layout. that is ok and necessary, but not enough or sufficient within the context of the web of data and scientific publications. I would like to have the content readily available for mining purposes. if I pay for the publication I should get access to the publication in every format it is available. the content should be presented in a way so that it makes sense within the web of data. if it is the full content of the paper represented in RDF or XML fine. also, I would like to have well annotated content, this is simple and something that could quite easily be part of existing publication workflows. it may also be part of the guidelines for authors -for instance, identify and annotate rhetorical structures. Modulo any confusing typos in my earlier posts, I don't see where we are disagreeing :-) Kingsley On Mon, Oct 6, 2014 at 11:03 AM, Kingsley Idehen kide...@openlinksw.com wrote: [...] -- Regards, Kingsley Idehen Founder & CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog 1: http://kidehen.blogspot.com Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen Twitter Profile: https://twitter.com/kidehen Google+ Profile: https://plus.google.com/+KingsleyIdehen/about LinkedIn Profile: http://www.linkedin.com/in/kidehen Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
Re: scientific publishing process (was Re: Cost and access)
Greetings. On 2014 Oct 6, at 19:19, Alexander Garcia Castro alexgarc...@gmail.com wrote: querying PDFs is NOT simple and requires a lot of work -and usually produces lots of errors. just querying metadata is not enough. As I said before, I understand the PDF as something that gives me a uniform layout. that is ok and necessary, but not enough or sufficient within the context of the web of data and scientific publications. I would like to have the content readily available for mining purposes. if I pay for the publication I should get access to the publication in every format it is available. the content should be presented in a way so that it makes sense within the web of data. if it is the full content of the paper represented in RDF or XML fine. also, I would like to have well annotated content, this is simple and something that could quite easily be part of existing publication workflows. it may also be part of the guidelines for authors -for instance, identify and annotate rhetorical structures. The following might add something to this conversation. It illustrates getting the metadata from a LaTeX file, putting it into an XMP packet in a PDF, and getting it out of the PDF as RDF. Pace Peter's mention of /Author, /Title, etc, this just focuses on the XMP packet. This has the document metadata, the abstract, and an illustrative bit of argumentation. Adding details about the document structure, and (RDF) pointers to any figures would be feasible, as would, I suspect, incorporating CSV files directly into the PDF. Incorporating \begin{tabular} tables would be rather tricky, but not impossible. I can't help feeling that the XHTML+RDFa equivalent would be longer and need more documentation to instruct the author where to put the RDFa magic. It's not very fancy, and still has rough edges, but it only took me 100 minutes, from a standing start. Generating and querying this PDF seems pretty simple to me.

$ cat test-xmp.tex
\documentclass{article}
\usepackage{xmp-management}
\title{This is a test file}
\author{Norman Gray}
\date{2014 October 6}
\begin{document}
\maketitle
\abstract{It's easy to include metadata in \LaTeX\ files. That's because there's plenty of metadata in there already.}
There is text and metatext within files.
\section{Further details}
In this section we could potentially discuss moving information around. I think we can assert that \claim{it is easy to move information around}, and, further, that \claim{making metadata readily available is a Good Thing}. I hope that clears that up.
\end{document}

$ cat xmp-management.sty
\ProvidesPackage{xmp-management}[2014/10/06]
\newwrite\xmp@ttlfile
\def\xmp@open{\immediate\openout\xmp@ttlfile \jobname.ttl \let\xmp@open\relax}
\long\def\xmp@stmt#1#2{%
  \xmp@open
  \write\xmp@ttlfile{ #1 #2.}}
\let\xmp@origtitle\title
\def\title#1{\xmp@stmt{dc:title}{#1}\xmp@origtitle{#1}}
\let\xmp@origauthor\author
\def\author#1{\xmp@stmt{dc:creator}{#1}\xmp@origauthor{#1}}
\let\xmp@origdate\date
\def\date#1{\xmp@stmt{dc:created}{#1}\xmp@origdate{#1}}
\long\def\abstract#1{
  \xmp@stmt{dc:abstract}{#1}
  \begin{quotation}\textbf{Abstract:} #1\end{quotation}}
\def\claim#1{
  \xmp@stmt{xmpinfo:claim}{#1}
  \emph{#1}}
\let\xmp@origsection\section
\def\section#1{\xmp@stmt{xmpinfo:has_section}{#1} \xmp@origsection{#1}}
\usepackage{xmpincl}
\AtBeginDocument{\includexmp{info}}

$ pdflatex test-xmp
This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012) restricted \write18 enabled.
entering extended mode
(./test-xmp.tex LaTeX2e 2011/06/27 [...BLAH...]
Output written on test-xmp.pdf (1 page, 75667 bytes).
Transcript written on test-xmp.log.

$ cat test-xmp.ttl
 dc:title This is a test file.
 dc:creator Norman Gray.
 dc:created 2014 October 6.
 dc:abstract It's easy to include metadata in \LaTeX \ files. \par That's because there's plenty of metadata in there already..
 xmpinfo:has_section Further details.
 xmpinfo:claim it is easy to move information around.
 xmpinfo:claim making metadata readily available is a Good Thing.

$ make info.xmp
sed 's/\\//g' test-xmp.ttl | \
  cat prefix.ttl - | \
  rapper -iturtle -ordfxml-xmp -q - file:test-xmp.pdf | \
  sed '/<?xpacket/d' > info.xmp.tmp
mv info.xmp.tmp info.xmp

$ pdflatex test-xmp
This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012) restricted \write18 enabled.
entering extended mode
(./test-xmp.tex LaTeX2e 2011/06/27 [...BLAH...]
Output written on test-xmp.pdf (1 page, 77069 bytes).
Transcript written on test-xmp.log.

$ make extract-xmp
cc -Wall -o extract-xmp extract-xmp.c

$ ./extract-xmp test-xmp.pdf
<rdf:RDF xmlns:cc="http://creativecommons.org/ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:xapRights="http://ns.adobe.com/xap/1.0/rights/"
         xmlns:xmpinfo="http://example.org/xmpinfo"
         xml:base="file:test-xmp.pdf">
  <rdf:Description rdf:about="">
    <cc:license
Re: scientific publishing process (was Re: Cost and access)
Sorry to jump into this once again but when it comes to typesetting nothing really comes close to LaTeX/PDF: http://tex.stackexchange.com/questions/120271/alternatives-to-latex - not even HTML/CSS/JavaScript On Tue, Oct 7, 2014 at 12:18 AM, Norman Gray nor...@astro.gla.ac.uk wrote: [...]
Re: scientific publishing process (was Re: Cost and access)
Neat. This could be extended to putting a full table of contents into the metadata, and in lots of other ways. The other nice thing about it is that it would be possible to push the same data through a LaTeX to HTML toolchain for those who want HTML output. peter On 10/06/2014 03:18 PM, Norman Gray wrote: [...]
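Peter's table-of-contents idea is already within reach of Norman's example, since the package records xmpinfo:has_section statements. A sketch, assuming rdflib; the file name is hypothetical (the saved output of ./extract-xmp), and the xmpinfo namespace is the one shown in Norman's output:

from rdflib import Graph, Namespace

XMPINFO = Namespace("http://example.org/xmpinfo")

# parse the RDF/XML packet extracted from the PDF
g = Graph()
g.parse("test-xmp.xmp", format="xml")

# list the recorded sections as a crude table of contents
for _, _, section in g.triples((None, XMPINFO.has_section, None)):
    print(section)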
Re: scientific publishing process (was Re: Cost and access)
On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote: On 10/06/2014 11:03 AM, Kingsley Idehen wrote: On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. Peter, Having had 200+ {some-non-rdf-doc} to RDF document transformers built under my direct guidance, there are issues with your claim above: Huh? Every single PDF reader that I use can extract the PDF metadata and display it. Again, this isn't about metadata. The metadata that I see in PDF documents uses a core set of properties that are easy to transform into RDF. Metadata isn't the issue at hand. Of course, this core set is very small (title, author, and a few other things) so you don't get all that much out of the core set. See my comments above :) 1. The extractors are platform specific -- AWWW is about platform agnosticism (I don't want to mandate an OS for experiencing the power of Linked Open Data transformers / rdfizers) Well, the extractors would be specific to PDF, but that's hardly surprising, I think. 2. It isn't solely about metadata -- we also have raw data inside these documents confined to tables and paragraphs of sentences Well, sure, but is extracting information directly from the figures or tables or text being considered here? I sure would like this to be possible. How would it work in an HTML context? Each table is a Class. Each table record is an instance of the Class represented by the table. Each table field is a property of the Class represented by the table. Each table field value's data type can be used to discern the range of each Class property. Depending on what the sentences and paragraphs are about, you can make an RDF statement per sentence. 3. If querying a PDF was marginally simple, I would be demonstrating that using a SPARQL results URL in response to this post :-) I'm not saying that it is so simple. You do have to find the metadata block in the PDF and then look for the /Title, /Author, ... stuff. But it could be simple if PDF didn't have the issues I outlined in regards to extraction technology. Funnily enough, there's a massive opportunity for Adobe to solve this problem, especially as they've now ventured heavily into cloud-enabling their technologies. If they provide APIs from the cloud, this problem could become much simpler to address in regards to productive solutions, where PDFs become less of the data silos that they are today. Possible != Simple and Productive. Yes, but there are lots of tools that display PDF metadata, so there are some who believe that the benefit is greater than the cost. Metadata isn't the fundamental quest here. We want to leverage the productivity and simplicity that AWWW brings to data representation, access, interaction, and integration. Sure, but the additional costs, if any, on paper authors, reviewers, and readers have to be considered. If these costs are eliminated or at least minimized then this good is much more likely to be realized. With some help from Adobe we can have the best of all worlds here. I am going to take a look at their latest cloud offerings and associated APIs.
peter -- Regards, Kingsley Idehen Founder & CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog 1: http://kidehen.blogspot.com Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen Twitter Profile: https://twitter.com/kidehen Google+ Profile: https://plus.google.com/+KingsleyIdehen/about LinkedIn Profile: http://www.linkedin.com/in/kidehen Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
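For what Kingsley's table mapping would look like in practice, here is a minimal sketch, assuming rdflib; the table name, fields, and values are all hypothetical:

from rdflib import RDF, RDFS, Graph, Literal, Namespace

EX = Namespace("http://example.org/paper/")

g = Graph()
g.add((EX.Measurement, RDF.type, RDFS.Class))  # the table is a class

header = ["sample", "temperature"]
rows = [("s1", 273.15), ("s2", 300.0)]

for i, row in enumerate(rows):
    record = EX[f"measurement/{i}"]  # each record is an instance
    g.add((record, RDF.type, EX.Measurement))
    for field, value in zip(header, row):
        # each field is a property; the value's datatype hints at
        # the range of that property
        g.add((record, EX[field], Literal(value)))

print(g.serialize(format="turtle"))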
Re: scientific publishing process (was Re: Cost and access)
Hi Adobe lurkers, Kingsley has just handed you a valuable means to keep users tied to your technologies: On 10/6/2014 8:18 PM, Kingsley Idehen wrote: On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote: On 10/06/2014 11:03 AM, Kingsley Idehen wrote: On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote: I'm not saying that it is so simple. You do have to find the metadata block in the PDF and then look for the /Title, /Author, ... stuff. But it could be simple if PDF didn't have the issues I outlined in regards to extraction technology. Funnily enough, there's a massive opportunity for Adobe to solve this problem, especially as they've now ventured heavily into cloud-enabling their technologies. If they provide APIs from the cloud, this problem could become much simpler to address in regards to productive solutions, where PDFs become less of the data silos that they are today. Of course, it probably makes sense for Adobe to do the work, but there is also enough known in open source about PDFs for a third party to do this as well. Good idea, K! Mike
Re: scientific publishing process (was Re: Cost and access)
Hello Paul, On Sat, Oct 04, 2014 at 06:47:19PM -0500, Paul Tyson wrote: I certainly was not suggesting this. It would indeed be silly to publish large collections of empirical quantitative propositions in RDF. Yes. And describing such collections with RDF on a level above basic metadata is not so silly but very difficult in many cases - as I tried to show with my example. Connecting those propositions to significant conclusions through sound arguments is the more important problem. They will attempt to do so, presumably, by creating monographs in an electronic source format that has more or less structure to it. The structure will support many useful operations, including formatting the content for different media, hyperlinking to other resources, indexing, and metadata gleaning. The structure will most likely *not* support any programmatic operations to expose the logical form of the arguments in such a way that another person could extract them and put them into his own logic machine to confirm, deny, strengthen, or weaken the arguments. Take for example a research paper whose argument proceeded along the lines of: All men are mortal; Socrates is a man; therefore Socrates is mortal. Along comes a skeptic who purports to have evidence that Socrates is not a man. He publishes the evidence in such a way that other users can, if they wish, insert the conclusion from such evidence in place of the minor premise in the original researcher's argument. Then the conclusion cannot be affirmed. The original researcher must either find a different form of argument to prove his conclusion, overturn the skeptic's evidence (by further argument, also machine-processable), or withdraw his conclusion. This simple model illustrates how human knowledge has progressed for millennia, mediated solely by oral, written, and visual and diagrammatic communication. I am suggesting we enlist computers to do something more for us in this realm than just speeding up the millennia-old mechanisms. Can you express this argument with triples? I would not be able to do that. Maybe if I devoted my life to it - starting with the famous "the cat sat on a mat" example. The end result would be incomprehensible to others and absolutely useless. I even doubt that science works the way you describe it. Mathematics works this way and there are good reasons that formal proofs are absolute exceptions in this field ca. 2014. Basic metadata is good. Publishing datasets with the paper is good. Having typed links in the paper is good. But I would not demand to go further. Regards, Michael Brunnbauer -- ++ Michael Brunnbauer ++ netEstate GmbH ++ Geisenhausener Straße 11a ++ 81379 München ++ Tel +49 89 32 19 77 80 ++ Fax +49 89 32 19 77 89 ++ E-Mail bru...@netestate.de ++ http://www.netestate.de/ ++ ++ Sitz: München, HRB Nr.142452 (Handelsregister B München) ++ USt-IdNr. DE221033342 ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
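For concreteness: the toy syllogism itself does fit in RDFS. A minimal sketch, assuming rdflib and a hypothetical namespace; whether a real paper's argument fits is exactly what is in dispute here:

from rdflib import RDF, RDFS, Graph, Namespace

EX = Namespace("http://example.org/argument#")

g = Graph()
g.add((EX.Man, RDFS.subClassOf, EX.Mortal))  # all men are mortal
g.add((EX.Socrates, RDF.type, EX.Man))       # Socrates is a man

# Under RDFS entailment, (EX.Socrates, RDF.type, EX.Mortal) follows.
# The skeptic's move is to assert a competing triple against the
# minor premise; plain RDF offers no machinery to arbitrate that.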
Re: scientific publishing process (was Re: Cost and access)
On 5 Oct 2014, at 11:07, Michael Brunnbauer bru...@netestate.de wrote: ... Basic metadata is good. Publishing datasets with the paper is good. Having typed links in the paper is good. But I would not demand to go further. +1 ++1 - the dataset publishing can include the workflow, tools etc, and metadata about that. -- Hugh Glaser 20 Portchester Rise Eastleigh SO50 4QS Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
Re: scientific publishing process (was Re: Cost and access)
Further to Hugh's comment about the non-techie world I found this interesting quote on the Web. The web is more a social creation than a technical one. I designed it for a social effect — to help people work together — and not as a technical toy. The ultimate goal of the Web is to support and improve our weblike existence in the world. I like this blue sky thinking and it seems to suggest (to me) that sometimes constantly moving technical engineering is not always productive or collaborative. Dominic (I will have a look at e-prints :-)) From: Hugh Glaser h...@glasers.org To: Daniel Schwabe dschw...@inf.puc-rio.br Cc: SW-forum Web semantic-...@w3.org; Linking Open Data public-lod@w3.org; Phillip Lord phillip.l...@newcastle.ac.uk; Eric Prud'hommeaux e...@w3.org; Peter F. Patel-Schneider pfpschnei...@gmail.com; Bernadette Hyland bhyl...@3roundstones.com Sent: Saturday, October 4, 2014 12:14 PM Subject: Re: scientific publishing process (was Re: Cost and access) Executive summary: 1) Bring up an ePrints repository for “our” conferences, and a myExperiment instance, or equivalents; 2) Start to contribute to the Open Source community. Please, please, let’s not build anything ourselves - if we are to do anything, then let’s choose and join suitable existing activity and make it better for everyone. Longer version. I too have a deep sense of deja vu all over yet again :-) But I have learned something - no-one seems to collaborate with people outside the techie world. Most documents for me start as a (set of) collaborative Google Doc (unmentioned) or a Word or OpenOffice document (not mentioned much) on Dropbox. And the collaborators couldn’t possibly help me build a LaTeX document or even any interesting HTML. Anyway… I see quite a few different things in this discussion, and all of them deeply important for the research publishing world at the moment. a) Document format; b) Metadata about the publication, both superficial and deep; c) Data, systems and workflow about the research. But starting almost everything from scratch (the existing standards and a few tools) is rarely the way to go in this webby world. There is real stuff out there (as I have said more than once before), that could really benefit from the sort of activity that Bernadette describes. I know about a number of things, but there will be others. (a) and (b) Repositories (because that is what we are talking about) http://eprints.org is an Open Source Linked Data publishing platform for publications that handles the document (in any format) and the shallow metadata, but could easily have deep as well if people generated it. Eg http://eprints.soton.ac.uk/id/eprint/271458 I even have an existing endpoint with all the ePrints RDF in it - http://foreign.rkbexplorer.com, with currently 24G / 182854666 triples, so such software can be used. What would be wrong with bringing up such a repository for SemWeb/Web conferences, one for all, or one for each series? And require the authors to enter their data into the site - it’s not hard, and there is existing documentation of what to do. It is mature technology with 100s of person-years invested. And perhaps most importantly, it has the buy-in of the library and similar communities, and has been field tested with users. It would certainly be more maintainable than the DogFood site - and it would be a trivialish task to move the great DogFood efforts over to it. DogFood really is something of a silo - exactly what Linked Data is meant to avoid.
And “we” might actually contribute to the wider community by enhancing the Open Source Project with Linked Data enhancements that were useful out there! Or a more challenging thing would be to make http://www.dspace.org do what we want (https://wiki.duraspace.org/display/DSPACE/Linked+Open+Data+for+DSpace)! (c) Workflows and Datasets I have mentioned http://www.myexperiment.org before, but can’t remember if I have mentioned http://www.wf4ever-project.org Again, these are Linked Data platforms for publishing; in this case workflows and datasets etc. They are seriously mature, certainly compared with what we might build - see, for example https://github.com/wf4ever/ro And exactly the same as the Repositories. What would be wrong with bringing up such a repository for SemWeb/Web conferences, one for all, or one for each series? …ditto… Who knows, maybe the Crawl, as well as the Challenge entries, might be able to usefully describe what they did using these ontologies etc.? Please, please, let’s not build anything ourselves - if we are to do anything, then let’s choose and join suitable existing activity and make it better for everyone. Hugh On 4 Oct 2014, at 03:14, Daniel Schwabe dschw...@inf.puc-rio.br wrote: As is often the case on the Internet, this discussion gives me a terrible sense of dejá vu. We've had this discussion many times before. Some years back the IW3C2
Re: scientific publishing process (was Re: Cost and access)
This is not a direct answer to Daniel, but rather expanding on what he said. Actually, he and I were (and still are) in the same IW3C2 committee, ie, we share the experience; and I was one of those (although the credit really goes to Bob Hopgood, actually, who was pushing that the most) who tried to come up with a proper XHTML template. The real problem is still the missing tooling. Authors, even if technically savvy like this community, want to do what they set out to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. This in spite of the fact that many realize that PDF is really not the format for our age; we need much more than a reproduction of a printed page digitally (as someone referred to in the thread, I really suffer when I have to read, let alone review, an article in PDF on my iPad...). But I do see an evolution that might change things in the coming years. Laura dropped the magic word in the early phases of this thread: ePub. ePub is a packaged (zip archived) HTML site, with some additional information. It is the format that most of the ebook readers understand (hey, it can even be converted into a Kindle format:-). Both Firefox and Chrome have ePub reader extensions available and Mac OS comes with a free ebook reader (iBook) that is based on it. I expect (hope) that the convergence between ePub and browsers will bring these even closer in the coming years. Because ePub is a packaged web site, with the core content in HTML5 (or SVG), metadata can be added to the content in RDFa, microdata, embedded JSON-LD; in fact, metadata can also be added to the archive as a separate file so if you are crazy enough you can even add RDF data in RDF/XML (no, please, don't do it:-). And, of course, it can be as much of a hypertext as you can just master:-) Tooling? No, not yet:-( Well, not yet for lambda users. But there, too, there is an evolution. The fact is that publishers are working on XML first (or HTML first) workflows. O'Reilly's Atlas tool[1] means that authors prepare their documents in, essentially, HTML (well, a restricted profile thereof), and the output is then produced in EPUB, PDF, or pure HTML at the end. Companies are created that do similar things and where small(er) publishers can develop full projects (Metrodigi, Inkling, Hachette, ...; but I do not think it is possible to use these for a big conference, although, who knows?). Importantly to this community, these tools also include annotation facilities, akin to MS Word's commenting tools. Where does it take us _now_? Much against my instinct and with a bleeding heart I have to accept that conferences of the size of WWW, but even ISWC or ESWC, cannot reasonably ask their submitters to submit in ePub (or HTML). Yet. Not today. It is a chicken and egg problem, and change may come only with events, as well as more progressive scholarly publishers, experimenting with this. Just like Daniel (and Bernadette) I would love to see that happening for smaller workshops (if budget allows, I could imagine a workshop teaming up with, say, Metrodigi to produce the workshop's proceedings). But I am optimistic that the change will happen within a foreseeable time and our community (as any scholarly community, I believe) will have to prepare itself for a change in this area. Adding my 2¢ to Daniel's:-) Ivan P.S.
For LaTeX users: I guess the main advantage of LaTeX is the math part. And this is the saddest story of all: MathML has been around for a long time, and it is, actually, part of ePUB as well, but authoring proper mathematics is the toughest with the tools out there. Sigh... P.S.2 B.t.w., W3C has just started work on Web Annotations. Watch that space... [1] https://atlas.oreilly.com [2] http://metrodigi.com [3] https://www.inkling.com On 04 Oct 2014, at 04:14 , Daniel Schwabe dschw...@inf.puc-rio.br wrote: As is often the case on the Internet, this discussion gives me a terrible sense of dejá vu. We've had this discussion many times before. Some years back the IW3C2 (the steering committee for the WWW conference series, of which I am part) first tried to require HTML for the WWW conference paper submissions, then was forced to make it optional because authors simply refused to write in HTML, and eventually dropped it because NO ONE (ok, very very few hardy souls) actually sent in HTML submissions. Our conclusion at the time was that the tools simply were not there, and it was too much of a PITA for people to produce HTML instead of using the text editors they are used to. Things don't seem to have changed much since. And this is simply looking at formatting the pages, never mind the whole issue of actually producing hypertext (ie., turning the article's text into linked hypertext), beyond the easily
Re: scientific publishing process (was Re: Cost and access)
I think I mentioned previously, Ivan, but perhaps not on this thread - Hugh McGuire has developed a Wordpress tool called PressBooks which allows you to write a book in HTML and export it as an EPUB file. He even supports schema.org markup in a separate plugin. (http://www.pressbooks.com) On 10/5/14, 10:34 AM, Ivan Herman i...@w3.org wrote: [...]
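Ivan's description of ePub as a packaged web site is easy to verify, because an EPUB is just a zip archive. A sketch using only the Python standard library; the file name is hypothetical, and META-INF/container.xml is where an EPUB points at its package (metadata) document:

import zipfile

with zipfile.ZipFile("proceedings.epub") as epub:
    # an EPUB is a zipped web site: list its HTML, CSS, and images
    for name in epub.namelist():
        print(name)
    # the container file locates the package document with the metadata
    print(epub.read("META-INF/container.xml").decode("utf-8"))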
Re: scientific publishing process (was Re: Cost and access)
metadata, sure, it is a must. BUT it should be designed and thought out for the web of data, not for paper-based collections. From my experience it is not so much about representing everything from the paper as triples. there will be statements that won't be representable; also, such an approach may not be efficient. why don't we just go a little bit further up from the lowest-hanging fruit and start talking about self-describing documents? well annotated documents with well structured metadata that are interoperable. this is easy, achievable, requires little tooling, does not put any burden on the author, delivers interoperability beyond just simple hyperlinks, and it is much more elegant than adhering to HTML, etc. On Sun, Oct 5, 2014 at 3:19 AM, Hugh Glaser h...@glasers.org wrote: On 5 Oct 2014, at 11:07, Michael Brunnbauer bru...@netestate.de wrote: ... Basic metadata is good. Publishing datasets with the paper is good. Having typed links in the paper is good. But I would not demand to go further. +1 ++1 - the dataset publishing can include the workflow, tools etc, and metadata about that. -- Hugh Glaser 20 Portchester Rise Eastleigh SO50 4QS Mobile: +44 75 9533 4155, Home: +44 23 8061 5652 -- Alexander Garcia http://www.alexandergarcia.name/ http://www.usefilm.com/photographer/75943.html http://www.linkedin.com/in/alexgarciac
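One way to read "self-describing" in practice: the paper's HTML carries its own machine-readable description, e.g. as embedded JSON-LD, which any consumer can pull back out. A sketch assuming the bs4 (BeautifulSoup) package and a hypothetical paper.html:

import json
from bs4 import BeautifulSoup

with open("paper.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# collect every embedded JSON-LD block and print a couple of fields
for block in soup.find_all("script", type="application/ld+json"):
    metadata = json.loads(block.string)
    print(metadata.get("name"), metadata.get("author"))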
Re: scientific publishing process (was Re: Cost and access)
+1 John http://Bresl.in On 5 Oct 2014, at 15:39, Ivan Herman i...@w3.org wrote: [...]
Re: scientific publishing process (was Re: Cost and access)
On 05 Oct 2014, at 16:47 , Laura Dawson laura.daw...@bowker.com wrote: I think I mentioned previously, Ivan, but perhaps not on this thread - Hugh McGuire has developed a Wordpress tool called PressBooks which allows you to write a book in HTML and export it as an EPUB file. He even supports schema.org markup in a separate plugin. (http://www.pressbooks.com) Indeed, I forgot! The problem with this service (but also for the others I guess) is that, at least through the standard offers on the sites), they may not be appropriate for a workshop, that would require leaving access to a large(r) numbers of submitters in the submission phase, followed by a selection process to end up in a small number of the submissions in the final book. This does not really fit in the business models. It should be up to the scholarly publishers to pick this up... (But I guess we digress greatly from the main topic of this mailing list, ie, semantic web...) Ivan On 10/5/14, 10:34 AM, Ivan Herman i...@w3.org wrote: This is not a direct answer to Daniel, but rather expanding on what he said. Actually, he and I were (and still are) in the same IW3C2 committee, ie, we share the experience; and I was one of those (although the credit really goes to Bob Hopgood, actually, who was pushing that the most) who tried to come up with a proper XHTML template. The real problem is still the missing tooling. Authors, even if technically savy like this community, want to do what they set up to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. This in spite of the fact that many realize that PDF is really not the format for our age; we need much more than a reproduction of a printed page digitally (as someone referred to in the thread I really suffer when I have to read, let alone review, an article in PDF on my iPad...). But I do see an evolution that might change in the coming years. Laura dropped the magic word on the early phases if this thread: ePub. ePub is a packaged (zip archived) HTML site, with some additional information. It is the format that most of the ebook readers understand (hey, it can even be converted into a Kindle format:-). Both Firefox and Chrome have ePub reader extensions available and Mac OS comes with a free ebook reader (iBook) that is based on it. I expect (hope) that the convergence between ePub and browsers will bring these even closer in the coming years. Because ePub is a packaged web site, with the core content in HTML5 (or SVG), metadata can be added to the content in RDFa, microdata, embedded JSON-LD; in fact, metadata can also be added to the archive as a separate file so if you are crazy enough you can even add RDF data in RDF/XML (no, please, don't do it:-). And, of course, it can be as much as a hypertext as you can just master:-) Tooling? No, not yet:-( Well, not yet for lambda users. But there, too, there is an evolution. The fact is that publishers are working on XML first (or HTML first) workflows. O'Reilly's Atlas tool[1] means that authors prepare their documents in, essentially, HTML (well, a restricted profile thereof), and the output is then produced in EPUB, PDF, or pure HTML at the end. 
Companies have been created that do similar things, where small(er) publishers can develop full projects (Metrodigi, Inkling, Hachette, ...; but I do not think it is possible to use these for a big conference, although, who knows?). Importantly for this community, these tools also include annotation facilities, akin to MS Word's commenting tools. Where does it take us _now_? Much against my instinct and with a bleeding heart, I have to accept that conferences of the size of WWW, but even ISWC or ESWC, cannot reasonably ask their submitters to submit in ePub (or HTML). Yet. Not today. It is a chicken-and-egg problem, and change may come only with events, as well as more progressive scholarly publishers, experimenting with this. Just like Daniel (and Bernadette) I would love to see that happening for smaller workshops (if budget allows, I could imagine a workshop teaming up with, say, Metrodigi to produce the workshop's proceedings). But I am optimistic that the change will happen within a foreseeable time, and our community (as any scholarly community, I believe) will have to prepare itself for a change in this area. Adding my 2¢ to Daniel's:-) Ivan P.S. For LaTeX users: I guess the main advantage of LaTeX is the math part. And this is the saddest story of all: MathML has been around for a long time, and it is, actually, part of ePUB as well, but authoring proper mathematics is the toughest with the tools out there. Sigh... P.S.2 B.t.w., W3C has just started work on Web Annotations. Watch that space... [1] https://atlas.oreilly.com [2] http://metrodigi.com [3] https://www.inkling.com
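Since the RDFa/microdata/JSON-LD remark above is the crux for this list, here is a minimal sketch of what it looks like in practice; the vocabulary and all names/values are illustrative, not prescribed by EPUB or by any tool mentioned here:

    <article vocab="http://schema.org/" typeof="ScholarlyArticle" resource="#article">
      <h1 property="name">An Example Article</h1>
      <p>By <span property="author" typeof="Person">
           <span property="name">A. N. Author</span></span>,
         <time property="datePublished" datetime="2014-10-05">5 October 2014</time>.</p>
      <!-- The same triples could instead live in a JSON-LD script block,
           or in a separate metadata file inside the EPUB zip archive. -->
    </article>

The point is that the statements travel inside the very HTML the reader renders, so the paper and its machine-readable description can never drift apart.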
Re: scientific publishing process (was Re: Cost and access)
Hi Peter Yes, these tags are semantic, in the context of a document. One could declare a document section instead of saying that there's a container. This way one can easily make a table of contents of several documents. Not semantic in the sense that they describe the knowledge in that document - that's what RDF and OWL are for. cheers -- diogo patrão On Fri, Oct 3, 2014 at 7:04 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: Hmm. Are these semantic? All these seem to do is to signal parts of a document. What I would consider to be semantic would be a way of extracting the mathematical content of a document. peter On 10/03/2014 02:32 PM, Diogo FC Patrao wrote: html5 has so-called semantic tags, like header, section. -- diogo patrão On Fri, Oct 3, 2014 at 6:01 PM, john.nj.dav...@bt.com wrote: Yes, but what makes HTML better for being webby than PDF? Because it is a mark-up language (albeit largely syntactic), which makes it much more amenable to machine processing? -Original Message- From: Peter F. Patel-Schneider [mailto:pfpschnei...@gmail.com] Sent: 03 October 2014 21:15 To: Diogo FC Patrao Cc: Phillip Lord; semantic-...@w3.org; public-lod@w3.org Subject: Re: scientific publishing process (was Re: Cost and access) On 10/03/2014 10:25 AM, Diogo FC Patrao wrote: On Fri, Oct 3, 2014 at 1:38 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: One problem with allowing HTML submission is ensuring that reviewers can correctly view the submission as the authors intended it to be viewed. How would you feel if your paper was rejected because one of the reviewers could not view portions of it? At least with PDF there is a reasonably good chance that every paper can be correctly viewed by all its reviewers, even if they have to print it out. I don't think that the same claim can be made for HTML-based systems. The majority of journals I'm familiar with mandate a certain format for submission: font size, figure format, etc. So, in an HTML-format submission, there should be rules as well: a standard CSS and the right elements and classes. No different from getting a Word or LaTeX template. This might help. However, someone has to do this, and ensure that the result is generally viewable. Web conferences vitally use the web in their reviewing and publishing processes. Doesn't that show their allegiance to the web? Would the use of HTML make a conference more webby? As someone said, this is leading by example. Yes, but what makes HTML better for being webby than PDF? dfcp peter
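A small sketch of the distinction Diogo is drawing (element content and the id are made up for illustration): the HTML5 tags say what part of the document something is, which is enough to assemble a table of contents across documents mechanically, but they say nothing about the knowledge inside:

    <article>
      <header><h1>On Webby Publishing</h1></header>
      <section id="results">
        <h2>Results</h2>
        <p>A ToC builder can find this section by its heading,
           but nothing in the markup exposes the claim the section makes;
           for that you would layer RDF/OWL annotations on top.</p>
      </section>
    </article>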
Re: scientific publishing process (was Re: Cost and access)
Hi Alexander, On 5 Oct 2014, at 15:57, Alexander Garcia Castro alexgarc...@gmail.com wrote: Metadata, sure, it is a must. BUT good metadata, thought out for the web of data, not designed for paper-based collections. From my experience it is not so much about representing everything from the paper as triples; there will be statements that won't be representable, and such an approach may not be efficient. Why don't we just go a little bit further up from the lowest-hanging fruit and start talking about self-describing documents? Well-annotated documents with well-structured metadata that are interoperable. This is easy, achievable, requires little tooling, does not put any burden on the author, delivers interoperability beyond just simple hyperlinks, and is much more elegant than adhering to HTML, etc. You lost me here. Who or what does the “well annotated documents” and “well structured metadata”, if it isn’t any burden for the authors? Easy and little tooling - I wonder what methods and tools you have in mind? These have proved to be hard problems - otherwise we wouldn’t be having this painful discussion. Best Hugh On Sun, Oct 5, 2014 at 3:19 AM, Hugh Glaser h...@glasers.org wrote: On 5 Oct 2014, at 11:07, Michael Brunnbauer bru...@netestate.de wrote: ... Basic metadata is good. Publishing datasets with the paper is good. Having typed links in the paper is good. But I would not demand to go further. +1 ++1 - the dataset publishing can include the workflow, tools etc, and metadata about that. -- Hugh Glaser 20 Portchester Rise Eastleigh SO50 4QS Mobile: +44 75 9533 4155, Home: +44 23 8061 5652 -- Alexander Garcia http://www.alexandergarcia.name/ http://www.usefilm.com/photographer/75943.html http://www.linkedin.com/in/alexgarciac -- Hugh Glaser 20 Portchester Rise Eastleigh SO50 4QS Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
Re: scientific publishing process (was Re: Cost and access)
Hi Ivan, On 5 Oct 2014, at 16:42, Ivan Herman i...@w3.org wrote: On 05 Oct 2014, at 16:47 , Laura Dawson laura.daw...@bowker.com wrote: I think I mentioned previously, Ivan, but perhaps not on this thread - Hugh McGuire has developed a WordPress tool called PressBooks which allows you to write a book in HTML and export it as an EPUB file. He even supports schema.org markup in a separate plugin. (http://www.pressbooks.com) Indeed, I forgot! The problem with this service (but also for the others, I guess) is that, at least through the standard offers on their sites, they may not be appropriate for a workshop, which would require leaving access open to a large(r) number of submitters in the submission phase, followed by a selection process to end up with a small number of the submissions in the final book. This does not really fit their business models. It should be up to the scholarly publishers to pick this up… Yes, we must keep remembering that the documents are simply one bit of a social machine, long before they get anywhere near (the unlikely event of them) being published. (But I guess we digress greatly from the main topic of this mailing list, i.e., the semantic web…) We did that quite a while ago, I think :-) But in the end you just gotta go with the flow, man. Best Hugh Ivan On 10/5/14, 10:34 AM, Ivan Herman i...@w3.org wrote: [Ivan's message quoted in full; trimmed here - see above]
Re: scientific publishing process (was Re: Cost and access)
On Sun, Oct 5, 2014 at 4:34 PM, Ivan Herman i...@w3.org wrote: The real problem is still the missing tooling. Authors, even if technically savvy like this community, want to do what they set out to do: write their papers as quickly as possible. They do not want to spend their time going through some esoteric CSS massaging, for example. Let us face it: we are not yet there. The tools for authoring are still very poor. But are they still very poor? I mean, I think there are more tools for rendering HTML than there are for rendering LaTeX. In fact there are probably more tools for rendering HTML than anything else out there, because HTML is used more than anything else. Because HTML powers the Web! You can write in Word, and export to HTML. You can write in Markdown and export to HTML. You can probably write in LaTeX and export to HTML as well :) The tools are not the problem. The problem to me is the printing afterwards. Conferences/workshops need to print the publications. Printing consistent LaTeX/PDF templates is a lot easier than printing inconsistent (layout-wise) HTML pages. Best, Luca
Re: scientific publishing process (was Re: Cost and access)
On 10/5/14 6:19 AM, Hugh Glaser wrote: On 5 Oct 2014, at 11:07, Michael Brunnbauer bru...@netestate.de wrote: ... Basic metadata is good. Publishing datasets with the paper is good. Having typed links in the paper is good. But I would not demand to go further. +1 ++1 - the dataset publishing can include the workflow, tools etc, and metadata about that. +1. For context: hence my +1 for Hugh's detailed example, which also veers towards building on a variety of existing efforts rather than ripping and replacing, etc. The data behind these papers doesn't need to be locked in tables, in PDFs. Neither do the descriptions of the data in question (the so-called metadata), or the workflows involved. -- Regards, Kingsley Idehen Founder CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog 1: http://kidehen.blogspot.com Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen Twitter Profile: https://twitter.com/kidehen Google+ Profile: https://plus.google.com/+KingsleyIdehen/about LinkedIn Profile: http://www.linkedin.com/in/kidehen Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
Re: scientific publishing process (was Re: Cost and access)
On 10/5/14 9:55 AM, Dominic Oldman wrote: Further to Hugh's comment about the non-techy world, I found this interesting quote on the Web: The web is more a social creation than a technical one. I designed it for a social effect — to help people work together — and not as a technical toy. The ultimate goal of the Web is to support and improve our weblike existence in the world. I like this blue-sky thinking, and it seems to suggest (to me) that constantly moving technical engineering is not always productive or collaborative. Dominic (I will have a look at e-prints :-)) Yes! The Web is fundamentally about collaboration (which is social) and data flow (even when this data is subject to data access policies and access control lists etc.). -- Regards, Kingsley Idehen Founder CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog 1: http://kidehen.blogspot.com Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen Twitter Profile: https://twitter.com/kidehen Google+ Profile: https://plus.google.com/+KingsleyIdehen/about LinkedIn Profile: http://www.linkedin.com/in/kidehen Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
Re: scientific publishing process (was Re: Cost and access)
Word adds all sorts of horrible tags to things and makes the HTML virtually unrenderable. On 10/5/14, 4:19 PM, Luca Matteis lmatt...@gmail.com wrote: [Luca's message quoted in full; trimmed here - see above]
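For anyone who has not looked inside a Word export lately, the flavour is roughly this; an illustrative fragment written from memory, not taken from any particular document:

    <!-- What a word processor's "Save as HTML" tends to emit... -->
    <p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">
      <span style="font-family:&quot;Calibri&quot;,sans-serif">Hello, world.</span>
    </p>

    <!-- ...versus the markup the author actually meant: -->
    <p>Hello, world.</p>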
Re: scientific publishing process (was Re: Cost and access)
On 05 Oct 2014, at 22:19 , Luca Matteis lmatt...@gmail.com wrote: [Luca's message quoted in full; see above] Interestingly, my experience is just about the opposite. Sorry:-) Yes, tools to _render_ HTML are around. But the issue is the _production_ of those pages (and, to take one step further alongside my original mail, to produce an ePub once the HTML pages are around). Word (as Laura remarked) produces nearly useless HTML; OpenOffice/LibreOffice is not much better, I am afraid. Markdown is fine indeed, and Markdown editors like Mou produce proper HTML, but the markup (sic!) facilities of Markdown are limited. It is all right for simple books, but I suspect it would be more of a problem for scientific articles. (But yes, that is an avenue to explore.) WYSIWYG HTML editors exist by now, but I am not sure they are satisfactory either (I use BlueGriffon often, but I still have to switch back and forth between source mode and WYSIWYG mode, which defeats the purpose). Of course, I could expect a Web-technology-related crowd to use HTML source editing directly, but the experience of Daniel and myself with the World Wide Web conference(!) is that people do not want to do that. (Researchers in, say, Web Search have proven to be unable or unwilling to edit HTML source. It was a real surprise...) I.e., the authoring tool offerings are still limited. On the other hand... how long do we want to care about printing? The WWW conference (to stay with that example) has given up on printed proceedings for a while. The proceedings are published by the ACM and offered through their digital library, and the individual papers are available on-line on the conference site. I know that ISWC and (I believe) ESWC still produce printed Springer proceedings, but I wonder for how long; who needs those in print? I must admit that I have not picked up a printed proceedings or journal article for many years; I look for the online versions instead. Of course, I may print a single paper because I want to read it while, for example, on the train, but then I do not really care about the way it looks. And, with tablets, even this usage is becoming less significant. That being said, producing a proper PDF from HTML is again not a problem; CSS has a number of page/print-specific features and is being actively worked on in this respect. 
Cheers Ivan Ivan Herman, W3C Digital Publishing Activity Lead Home: http://www.w3.org/People/Ivan/ mobile: +31-641044153 GPG: 0x343F1A3D WebID: http://www.ivan-herman.net/foaf#me
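To make Ivan's last point concrete, the kind of page/print-specific CSS he alludes to looks like this; the page size and margins are illustrative guesses, not the actual LNCS or ACM values:

    <style>
      @page {
        size: A4;           /* paged-media page size for the PDF run */
        margin: 25mm 20mm;  /* illustrative margins, not any template's */
      }
      @media print {
        h1, h2 { page-break-after: avoid; }  /* keep headings with their text */
        a[href]::after { content: " (" attr(href) ")"; }  /* expose link targets on paper */
      }
    </style>

Feed HTML plus a sheet like this to a paged-media formatter (or a browser's print-to-PDF), and a fixed-layout PDF falls out of the same source that serves the web version.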
Re: scientific publishing process (was Re: Cost and access)
On 2014-10-04 04:14, Daniel Schwabe wrote: As is often the case on the Internet, this discussion gives me a terrible sense of déjà vu. We've had this discussion many times before. Some years back the IW3C2 (the steering committee for the WWW conference series, of which I am part) first tried to require HTML for the WWW conference paper submissions, then was forced to make it optional because authors simply refused to write in HTML, and eventually dropped it because NO ONE (ok, very very few hardy souls) actually sent in HTML submissions. Our conclusion at the time was that the tools simply were not there, and it was too much of a PITA for people to produce HTML instead of using the text editors they are used to. Things don't seem to have changed much since. Hi Daniel, here is my long reply as usual, and I hope you'll give it a shot :) I've offered *a* solution that is compatible with the existing workflow without asking for any extra work from the OC/PCs, with the exception that Web-native technologies for the submissions are officially encouraged. They will get their PDF in the end to cater to the existing pipeline. In the meantime, the community retains higher-quality research documents. And this is simply looking at formatting the pages, never mind the whole issue of actually producing hypertext (i.e., turning the article's text into linked hypertext), beyond the easily automated ones (e.g., links to authors, references to papers, etc.). Producing good hypertext, and consuming it, is much harder than writing plain text. And most authors are not trained in producing this kind of content. Making this actually semantic in some sense is still, in my view, a research topic, not a routine reality. Until we have robust tools that make it as easy for authors to write papers with the advantages afforded by PDF, without its shortcomings, I do not see this changing. I disagree that we don't have sufficient or robust tools to author and publish web pages. I find it ironic that we are still debating this issue as if we were in the early-to-mid 90s. Or ignoring [2], or the possibility of using a service which offers [3] to publish (pardon me for saying it) a friggin' web page. If it is about coding, I find it unreasonable or unprofessional to think that a Computer/Web Scientist in 2014 who is publicly funded for their academic endeavors is incapable of grokking HTML. But, somehow, LaTeX is presumed to be okay for the new post-graduate that's coming in. Really? Or is the real reason that no one is asking them to do otherwise? They can randomly pick a WYSIWYG editor tool or an existing publishing service. No one is forcing anyone to hand-code anything. Just as no one is forced to hand-code LaTeX. We have the tools and even services to help us do all of that. Both from and outside of SW. We had them for a long time. What was lacking was a continuous green light to use them. That light stopped flashing, as you've mentioned. But again, our core problems are not technical in nature. I would love to see experiments (e.g., certain workshops) to try it out before making this a requirement for whole conferences. I disagree. The fact that workshops or tracks on linked science or semantic publishing didn't deliver is a clear sign that they have the wrong process at the root. When those workshops ask for submissions to be in PDF, that's the definition of irony. There are no useful machine-friendly research objects! Opportunity lost at every single CfP. 
Yet, we eloquently describe hypothetical systems or tools that will one day do all the magic for us, instead of taking a good look at what's right in front of us. So, let's talk about putting the cart before the horse. A lot of time and energy (e.g., public funding) could have been better used simply by actually *having the data*, and then figuring out how to utilize it. There is no data, so what's there to analyze or learn from? Some research trying to figure out what to do with trivial and limited metadata, e.g., title, abstract, authors, subjects? Is data.semanticweb.org (dog food) the best we can show for our dogfooding ability? I can't search/query for research knowledge on topic T, that used variables X, Y, which implemented a workflow step S, that's cited by or used those exact parameters, that happens to use the datasets that I'm planning to use in my research. Reproducibility: 0. Comparability: 0. Discovery: 0. Reuse: 0. H-Index: +1? Bernadette's suggestions are a good step in this direction, although I suspect it is going to be harder than it looks (again, I'd love to be proven wrong ;-)). Nothing is stopping us from doing things in parallel, and we are in fact. Close-by efforts range from workshops to Force11, public-dwbp-wg, public-digipub-ig, … to recommendations, e.g., PROV-O, OPMW, SIO, SPAR, besides the whole SW/LD stack, which benefits scientific research communication and […]
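As a gloss on what "machine-friendly research objects" can mean at the cheap end, here is a sketch of a JSON-LD block a submission could carry; the vocabulary (schema.org) and all names/URLs are illustrative, not what Sarven's actual templates emit:

    <script type="application/ld+json">
    {
      "@context": "http://schema.org/",
      "@type": "ScholarlyArticle",
      "name": "An Example Submission",
      "author": { "@type": "Person", "name": "A. N. Author" },
      "about": "topic T",
      "isBasedOn": {
        "@type": "Dataset",
        "name": "dataset D",
        "url": "http://example.org/datasets/d"
      },
      "citation": "http://example.org/papers/prior-work"
    }
    </script>

With even this much in every accepted paper, queries like "which papers on topic T used dataset D?" stop being hypothetical.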
Re: scientific publishing process (was Re: Cost and access)
Hello Paul, On Fri, Oct 03, 2014 at 04:05:07PM -0500, Paul Tyson wrote: Yes. We are setting the bar too low. The field of knowledge computing will only reach maturity when authors can publish their theses in such a manner that one can programmatically extract the concepts, propositions, and arguments; I thought Kingsley was the only one seriously suggesting that we communicate in triples. Let's take one step back to the proposal of making research datasets machine-readable with RDF. Please go to http://crcns.org/NWB Have a look at an example dataset: http://crcns.org/data-sets/hc/hc-3/about-hc-3 The total size of the data is about 433 GB compressed. Even if you do not use triples for all of that (which would be insane), specifying a structured data container is a very difficult task. So instead of talking about setting the bar higher, why not just help the people over there with their problem? Regards, Michael Brunnbauer -- ++ Michael Brunnbauer ++ netEstate GmbH ++ Geisenhausener Straße 11a ++ 81379 München ++ Tel +49 89 32 19 77 80 ++ Fax +49 89 32 19 77 89 ++ E-Mail bru...@netestate.de ++ http://www.netestate.de/ ++ ++ Sitz: München, HRB Nr.142452 (Handelsregister B München) ++ USt-IdNr. DE221033342 ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
Re: scientific publishing process (was Re: Cost and access)
Executive summary: 1) Bring up an ePrints repository for “our” conferences, and a myExperiment instance, or equivalents; 2) Start to contribute to the Open Source community. Please, please, let’s not build anything ourselves - if we are to do anything, then let’s choose and join suitable existing activity and make it better for everyone. Longer version. I too have a deep sense of déjà vu all over yet again :-) But I have learned something - no-one seems to collaborate with people outside the techie world. Most documents for me start as a (set of) collaborative Google Docs (unmentioned) or a Word or OpenOffice document (not mentioned much) on Dropbox. And the collaborators couldn’t possibly help me build a LaTeX document or even any interesting HTML. Anyway… I see quite a few different things in this discussion, all of them deeply important for the research publishing world at the moment: a) Document format; b) Metadata about the publication, both superficial and deep; c) Data, systems and workflow about the research. But starting almost everything from scratch (the existing standards and a few tools) is rarely the way to go in this webby world. There is real stuff out there (as I have said more than once before) that could really benefit from the sort of activity that Bernadette describes. I know about a number of things, but there will be others. (a) and (b) Repositories (because that is what we are talking about). http://eprints.org is an Open Source Linked Data publishing platform for publications that handles the document (in any format) and the shallow metadata, but could easily have deep metadata as well if people generated it. E.g. http://eprints.soton.ac.uk/id/eprint/271458 I even have an existing endpoint with all the ePrints RDF in it - http://foreign.rkbexplorer.com, currently with 24G / 182,854,666 triples, so such software can be used. What would be wrong with bringing up such a repository for SemWeb/Web conferences, one for all, or one for each series? And require the authors to enter their data into the site - it’s not hard, and there is existing documentation of what to do. It is mature technology with 100s of person-years invested. And perhaps most importantly, it has the buy-in of the library and similar communities, and has been field-tested with users. It would certainly be more maintainable than the DogFood site - and it would be a trivialish task to move the great DogFood efforts over to it. DogFood really is something of a silo - exactly what Linked Data is meant to avoid. And “we” might actually contribute to the wider community by enhancing the Open Source project with Linked Data enhancements that were useful out there! Or a more challenging thing would be to make http://www.dspace.org do what we want (https://wiki.duraspace.org/display/DSPACE/Linked+Open+Data+for+DSpace)! (c) Workflows and Datasets. I have mentioned http://www.myexperiment.org before, but can’t remember if I have mentioned http://www.wf4ever-project.org Again, these are Linked Data platforms for publishing; in this case workflows and datasets etc. They are seriously mature, certainly compared with what we might build - see, for example, https://github.com/wf4ever/ro And exactly the same as the Repositories: what would be wrong with bringing up such a repository for SemWeb/Web conferences, one for all, or one for each series? …ditto… Who knows, maybe the Crawl, as well as the Challenge entries, might be able to usefully describe what they did using these ontologies etc.? 
Please, please, let’s not build anything ourselves - if we are to do anything, then let’s choose and join suitable existing activity and make it better for everyone. Hugh On 4 Oct 2014, at 03:14, Daniel Schwabe dschw...@inf.puc-rio.br wrote: [Daniel's message quoted in full; trimmed here - see earlier in the thread]
Re: scientific publishing process (was Re: Cost and access)
PDFs are surprisingly flexible and open containers for transporting around Stuff Hi, I'm feeling tempted to add something provocative ;-) PDFs are surprisingly mature in disguising all the 'bla bla' and making it look nice... = http://tractatus-online.appspot.com/Tractatus/jonathan/index.html wkr turnguard | Jürgen Jakobitsch, | Software Developer | Semantic Web Company GmbH | Mariahilfer Straße 70 / Neubaugasse 1, Top 8 | A - 1070 Wien, Austria | Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22 COMPANY INFORMATION | web : http://www.semantic-web.at/ | foaf : http://company.semantic-web.at/person/juergen_jakobitsch PERSONAL INFORMATION | web : http://www.turnguard.com | foaf : http://www.turnguard.com/turnguard | g+ : https://plus.google.com/111233759991616358206/posts | skype : jakobitsch-punkt | xmlns:tg = http://www.turnguard.com/turnguard#; 2014-10-04 14:47 GMT+02:00 Norman Gray nor...@astro.gla.ac.uk: Bernadette, hello. On 2014 Oct 4, at 00:36, Bernadette Hyland bhyl...@3roundstones.com wrote: ... a really useful message which pulls several of these threads together. The following is a rather fragmentary response. As a reference point, I tend to think publication = LaTeX -> PDF. To pre-dispel a misconception here: I'm not being a cheerleader for PDF below, but a fair fraction of the antagonism directed towards PDF in this thread is, I think, misplaced -- PDF is not the problem. We'd do ourselves a huge favor if we showed (STM) publishing executives why this Linked Data stuff matters anyway. They know. A surprisingly large fraction of the Article Processing Charge we pay to them goes on extracting, managing and sharing metadata. That includes DOIs, Crossref feeds, ScienceDirect, and so on and so on, and so (it seems) on. It also includes conversion to XML: if you submit a LaTeX file to a big publisher, the first thing they'll do is convert it to XML+MathML (using workflows based on, for example, LaTeXML or TeX4ht) and preserve that; several of them then re-generate LaTeX for final production. To a large extent, I suspect publishers now regard metadata management as their Job -- in the sense of their contribution to the scholarly endeavour -- and they could do without the dead trees. If you can offer them a way of making metadata _insertion_ easier, which is cost-effective, can be scaled up, and which a _broad_ range of authors will accept (the hard bit), they'll rip your arm off. 1) PDF works well for (STM) publishers who require fixed page display; Yes, and for authors. Given an alternative between an HTML version of a paper and a PDF version, I will _always_ choose the PDF, because it's zero-hassle, more reliably faithful to the author's original, more readable, and I can read it in the bath. 2) PDF doesn't take advantage of the advances we've made in machine readability; If by this you mean RDF, then yes, the naive ways of generating PDFs are not RDF-aware. So we shouldn't be naive... XMP is an ISO standard (as PDF is, and like it originating from Adobe) and is a type of RDF (well, an irritatingly 90% profile of RDF, but let that pass). Though it's not trivial, it's not hard to generate an XMP packet and get it into a PDF, and once there, the metadata job is mostly done. 3) In fact, PDFs suck on eBook readers, which are all about flexible page layout; and Sure, but they're not intended for e-book readers, so of course they're poor at that. 4) We already have the necessary Web Standards to address the problem, so no need to recreate the wheel. 
If, again, you mean RDF, then I agree completely. -- Produce a Web-based tool that allows researchers to share their [privately | publicly] funded knowledge and produces a variety of outputs: LaTeX, PDF, and carries with it a machine-readable representation. Well, not web-based: I'd want something I can run on my own machine. Do people agree with the following SOLUTION approach? The international standards to solve this exist. Standards from W3C and the International Digital Publishing Forum (IDPF).[2] Use (X)HTML for generalized document creation/rendering. Use CSS for styling. Use MathML for formulas. Use JS for action. Use RDF to model the metadata within HTML. PDF and XMP are both ISO standards, too. LaTeX isn't a Standard standard, but it's pretty damn stable. MathML one would _not_ want to type. The only ways of generating MathML that I'm even slightly familiar with start with TeX syntax. There are presumably GUI-based ones, too *shudder*. I propose a 'walk before we run' approach, but do better than basic metadata (i.e., title, author name, institution, abstract). Link to other scholarly communities/projects such as Vivo.[3] I generate Atom feeds for my PDF lecture notes. The feed content is extracted from the XMP and from the /Author, /Title, etc, metadata within the PDF. That metadata gets there […]
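For concreteness, a minimal XMP packet of the sort Norman describes is just RDF/XML in a fixed envelope; the title and name below are placeholders. Embed it in the PDF and generic tools can read the metadata back out:

    <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>
            <rdf:Alt><rdf:li xml:lang="x-default">An Example Paper</rdf:li></rdf:Alt>
          </dc:title>
          <dc:creator>
            <rdf:Seq><rdf:li>A. N. Author</rdf:li></rdf:Seq>
          </dc:creator>
        </rdf:Description>
      </rdf:RDF>
    </x:xmpmeta>
    <?xpacket end="w"?>

(The id attribute is the fixed magic string the XMP spec mandates; the dc: properties are the same Dublin Core terms the rest of this thread keeps reaching for.)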
Re: scientific publishing process (was Re: Cost and access)
On 10/4/14 7:14 AM, Hugh Glaser wrote: Executive summary: 1) Bring up an ePrints repository for “our” conferences, and a myExperiment instance, or equivalents; 2) Start to contribute to the Open Source community. Please, please, let’s not build anything ourselves - if we are to do anything, then let’s choose and join suitable existing activity and make it better for everyone. [Hugh's longer version quoted in full; trimmed here - see above]
+1 -- Regards, Kingsley Idehen Founder CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog 1: http://kidehen.blogspot.com Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen Twitter Profile: https://twitter.com/kidehen Google+ Profile: https://plus.google.com/+KingsleyIdehen/about LinkedIn Profile: http://www.linkedin.com/in/kidehen Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this