Re: eLyXer for Document Parsing

2012-02-09 Thread Alex Fernandez
Hi Steve,

On 2/5/12, Rob Oakes  wrote:
> Extremely good point, I'm also more comfortable with the HTML export
> available in LyX. I initially was interested in eLyXer because I thought I
> might be able to use it to help with an import filter as well. I'm not sure
> that it can, though. As you note in your email, it doesn't create a document
> model.

I am not sure what you mean by "document model". For the record,
eLyXer creates an in-memory representation of the complete LyX
document since version 0.36 (released back in 2009):
  http://elyxer.nongnu.org/changelog.html
When using the --lowmem option, this in-memory representation is
created and flushed for each document block independently. Otherwise
you load the entire document in memory.

Alex.


Re: eLyXer for Document Parsing

2012-02-09 Thread Steve Litt
On Thu, 9 Feb 2012 15:13:48 +0100
Alex Fernandez  wrote:

> Hi Steve,
> 
> On 2/5/12, Rob Oakes  wrote:
> > Extremely good point, I'm also more comfortable with the HTML export
> > available in LyX. I initially was interested in eLyXer because I
> > thought I might be able to use it to help with an import filter as
> > well. I'm not sure that it can, though. As you note in your email,
> > it doesn't create a document model.
> 
> I am not sure what you mean by "document model". For the record,
> eLyXer creates an in-memory representation of the complete LyX
> document since version 0.36 (released back in 2009):
>   http://elyxer.nongnu.org/changelog.html
> When using the --lowmem option, this in-memory representation is
> created and flushed for each document block independently. Otherwise
> you load the entire document in memory.
> 
> Alex.

I'm pretty sure he meant "Document Object Model", or DOM, the object
hierarchy used to express HTML web pages.

http://en.wikipedia.org/wiki/Document_Object_Model

http://www.w3.org/DOM/

DOM forms an in-memory tree and has functions to locate the current
node's first child, last child, next sibling, previous sibling, and
parent. If you think of the current node as a "checker" you can move
throughout the tree: you can get anywhere using those five functions.
What's also ultra-cool about it is that you can recurse the tree with
an iterative loop. Using DOM, instead of doing complex algorithms with
stacks in order to do everything in one pass, you can take a second,
third, fourth, infinitieth bite of the apple, traversing the tree over
and over again, doing one little task each time, often with later
traversals dependent on the changes made by the earlier ones.

But each of these nodes takes memory, and using DOM as opposed to SAX
(an event-driven parsing method for HTML/XML) will run out of memory at
a certain document size.
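
For illustration, a minimal sketch of that five-function traversal using
Python's standard xml.dom.minidom (the sample document is made up, and this
isn't code from any of the tools discussed):

from xml.dom import minidom

# Parse a tiny document and walk the children of <body> iteratively,
# moving the "checker" sideways with nextSibling.
doc = minidom.parseString(
    "<body><h1>Title</h1><p>First <em>paragraph</em></p><p>Second</p></body>")
node = doc.documentElement.firstChild
while node is not None:
    if node.nodeType == node.ELEMENT_NODE:
        text = node.firstChild.nodeValue if node.firstChild else ""
        print(node.tagName, text)
    node = node.nextSibling

# parentNode, lastChild and previousSibling move you back up or backwards,
# so making several cheap passes over the same tree is easy.
print(doc.documentElement.lastChild.parentNode.tagName)  # -> body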

HTH

SteveT


Re: eLyXer for Document Parsing

2012-02-09 Thread Alex Fernandez
On 2/9/12, Steve Litt  wrote:
>> I am not sure what you mean by "document model". For the record,
>> eLyXer creates an in-memory representation of the complete LyX
>> document since version 0.36 (released back in 2009):

> I'm pretty sure he meant "Document Object Model", or DOM, the object
> hierarchy used to express HTML web pages.

Ah, OK. Always hated DOM. eLyXer's in-memory representation is of the
LyX document, not of the resulting HTML document. Much tighter this
way, IMHO.

Alex.


Re: eLyXer for Document Parsing

2012-02-09 Thread Rob Oakes
On 2/9/2012 11:42 AM, Alex Fernandez wrote:
> Ah, OK. Always hated DOM. eLyXer's in-memory representation is for the
> LyX document, not of the resulting HTML document. Much tighter this
> way, IMHO.

Is there an example of how I might be able to access the in-memory
representation for the LyX document? If possible, I'd like to be able to
get some sort of iterable object that could be used to translate the
structure into the XML structure used by Microsoft Word.

Cheers,

Rob


Re: eLyXer for Document Parsing

2012-02-09 Thread Alex Fernandez
On 2/9/12, Rob Oakes  wrote:
> On 2/9/2012 11:42 AM, Alex Fernandez wrote:
>> Ah, OK. Always hated DOM. eLyXer's in-memory representation is for the
>> LyX document, not of the resulting HTML document. Much tighter this
>> way, IMHO.
>
> Is there an example of how I might be able to access the in-memory
> representation for the LyX document? If possible, I'd like to be able to
> get some sort of iterable object that could be used to translate the
> structure into the XML structure used by Microsoft Word.

I don't know of any examples outside eLyXer. The source code should be
quite readable. I would point you to main.convert.eLyXerConverter and
proc.process.Processor as starting points; there are actually a few
ways to do what you want (iterate over Containers). If you want I can
give you further explanations offlist.
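
A rough sketch of what iterating over Containers could look like; the
"contents" attribute and the way the root is obtained are assumptions about
eLyXer's internals, so check the real classes before relying on this:

# Assumption: eLyXer parses a .lyx file into a tree of Container objects
# and each Container keeps its children in a list (called "contents" here
# as a guess).
def walk(container, depth=0):
    """Visit a Container and everything nested inside it."""
    print("  " * depth + container.__class__.__name__)
    for child in getattr(container, "contents", []) or []:
        walk(child, depth + 1)

# Hypothetical usage: however the root Container is obtained (e.g. via the
# converter and processor classes mentioned above), hand it to walk() and
# emit the matching Word XML element for each node instead of printing it.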

Alex.


Re: eLyXer for Document Parsing

2012-02-05 Thread Abdelrazak Younes

On 04/02/2012 19:07, slitt wrote:

> One more question: You sure you want to go in-memory? What happens if a
> guy has a 1200 page book with 100 chapters each containing 10 sections,
> each containing 10 subsections, and tries to parse it on a machine with
> 512 MB RAM? You in a heap of trouble son.


I am almost sure LyX is able to do that.

Abdel.


Re: eLyXer for Document Parsing

2012-02-05 Thread Abdelrazak Younes

On 04/02/2012 18:03, Rob Oakes wrote:

> Dear eLyXer Users and Developers,
>
> I'm still at work on the import/export module for Microsoft Word
> documents. I'm making pretty good progress. I've got a rough prototype
> that works pretty well and I'm now starting to refine it.
>
> My approach up to now has been to use regular expressions to match
> portions of the document and then use a library to translate those to
> the corresponding Word XML structures. It's working pretty well with my
> simple test documents.
>
> Before going too far with this approach, though, I wanted to post
> (another general query).
>
> In the eLyXer library, there is already a robust set of tools used for
> converting LyX documents to HTML. Does anyone know if the library is
> written in such as way that getting a generic in-memory representation
> of the document would be possible? It would be awesome to re-use as much
> existing code for the Word document export as possible. That would allow
> me to support a broader number of features, and gives me a framework for
> working with maths.

Strong suggestion: use LyX proper. I am quite sure you already know that
because I saw some patches from you in this area but I'll explain
anyway: LyX's own HTML export is so good and fast because it effectively
knows the in-memory representation of the document. You can't be faster
or more accurate than that. I mean, unless you want to rewrite LyX in
python.


IIUC you want a single module in python for both import and export in 
python. But I don't think this is a valid argument. As for the word to 
lyx format conversion, if you want to use this epub library there must 
be a way to use that in C++ I'm sure...



> Any thoughts Alex (and others)? I've downloaded the sources and have
> begun to work through them, but before spending hours to days trying to
> wrap my head around them, I thought I would ask.


AFAIK, eLyXer doesn't construct a document model. So you'd better spend 
this time reading the C++ code for exporting to html/xhtml ;-)


Abdel.



Re: eLyXer for Document Parsing

2012-02-05 Thread Alex Fernandez
Hi all,

I am currently travelling so excuse my android top-posting. Actually
building a reusable in-memory representation for Python scripting of LyX
documents was a requisite for eLyXer. You should not have trouble with
large documents as my puny netbook eats 1000 page documents for lunch. Look
at the Container class, and best of luck! Please ask in private any further
questions.

Alex.
On 04/02/2012 18:03, "Rob Oakes"  wrote:

> Dear eLyXer Users and Developers,
>
> I'm still at work on the import/export module for Microsoft Word
> documents. I'm making pretty good progress. I've got a rough prototype that
> works pretty well and I'm now starting to refine it.
>
> My approach up to now has been to use regular expressions to match
> portions of the document and then use a library to translate those to the
> corresponding Word XML structures. It's working pretty well with my simple
> test documents.
>
> Before going too far with this approach, though, I wanted to post (another
> general query).
>
> In the eLyXer library, there is already a robust set of tools used for
> converting LyX documents to HTML. Does anyone know if the library is
> written in such as way that getting a generic in-memory representation of
> the document would be possible? It would be awesome to re-use as much
> existing code for the Word document export as possible. That would allow me
> to support a broader number of features, and gives me a framework for
> working with maths.
>
> Any thoughts Alex (and others)? I've downloaded the sources and have begun
> to work through them, but before spending hours to days trying to wrap my
> head around them, I thought I would ask.
>
> Cheers,
>
> Rob


Re: eLyXer for Document Parsing

2012-02-05 Thread Rob Oakes

On Feb 5, 2012, at 2:04 AM, Abdelrazak Younes wrote:

> Strong suggestion: use LyX proper. I am quite sure you already know that 
> because I saw some patches from you in this area but I'll explain anyway: 
> LyX's html own export is so good and fast because it effectively knows the 
> in-memory representation of the document. You can't be faster nor more 
> accurate than that. I mean, unless you want to rewrite LyX in python.

Extremely good point, I'm also more comfortable with the HTML export available 
in LyX. I initially was interested in eLyXer because I thought I might be able 
to use it to help with an import filter as well. I'm not sure that it can, 
though. As you note in your email, it doesn't create a document model.

> IIUC you want a single module in python for both import and export in python. 
> But I don't think this is a valid argument. As for the word to lyx format 
> conversion, if you want to use this epub library there must be a way to use 
> that in C++ I'm sure…

I thought about using Python because I'd found a tool capable of generating
docx for me. After working with it a little more, though, I'm less enamored
with it. docx is a pretty straightforward file format, and there are quite a
few things in it that are sloppily implemented.

> AFAIK, eLyXer doesn't construct a document model. So you'd better spend this 
> time reading the C++ code for exporting to html/xhtml ;-)

Following Steve's suggestion, I decided to try the "easy" way and directly 
parse the XHTML created by eLyXer. Turns out that it's not only easy, but 
probably the best way forward. There are some excellent libraries for reading 
XML in python. Using lxml, in particular, looks like a good solution. You 
generate the XHTML, parse it with lxml, and then iterate over the elements, 
translating as you go. My current script is about 50 lines long, and can be 
used with either native XHTML or eLyXer. To add new features, you add 
additional cases describing how to translate the XHTML.
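
A minimal sketch of that parse-and-iterate loop with lxml; the file name
and the tag-to-style mapping are made up for illustration, and this is not
the actual 50-line script:

from lxml import etree

# Map a few XHTML tags to Word paragraph styles (illustrative only).
STYLE_MAP = {"h1": "Heading1", "h2": "Heading2", "p": "Normal"}

tree = etree.parse("document.xhtml")
for element in tree.iter():
    if not isinstance(element.tag, str):
        continue                                  # skip comments and PIs
    tag = etree.QName(element.tag).localname      # drop the XHTML namespace
    if tag in STYLE_MAP:
        text = "".join(element.itertext()).strip()
        print(STYLE_MAP[tag], "->", text)         # real code would emit w:p here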

Which brings us to an important point: there's already a pretty good LyX -> 
XHTML -> LibreOffice -> Word pathway for translating documents. Unless I 
directly implement Word as another backend (which, while a lot of work, isn't 
difficult), I'm not sure there's much reason for a direct MS Word export. The 
real need seems to be for an MS Word import, anyway.

Cheers,

Rob

Re: eLyXer for Document Parsing

2012-02-05 Thread Abdelrazak Younes

On 05/02/2012 17:48, Rob Oakes wrote:

> My current script is about 50 lines long, and can be used with either
> native XHTML or eLyXer. To add new features, you add additional cases
> describing how to translate the XHTML.
>
> Which brings us to an important point: there's already a pretty good
> LyX -> XHTML -> LibreOffice -> Word pathway for translating documents.
> Unless I directly implement Word as another backend (which, while a lot
> of work, isn't difficult), I'm not sure there's much reason for a direct
> MS Word export. The real need seems to be for an MS Word import, anyway.


The native MSWord backend would be very interesting and useful to have,
and much better than anything you could produce with your python script.
But I agree with you that the docx import looks more useful. And if the 
thing can be extended to pptx, it will be even more useful :-)


Cheers,
Abdel.



eLyXer for Document Parsing

2012-02-04 Thread Rob Oakes
Dear eLyXer Users and Developers,

I'm still at work on the import/export module for Microsoft Word documents. I'm 
making pretty good progress. I've got a rough prototype that works pretty well 
and I'm now starting to refine it.

My approach up to now has been to use regular expressions to match portions of 
the document and then use a library to translate those to the corresponding 
Word XML structures. It's working pretty well with my simple test documents.
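
For illustration only, a toy version of that regex-and-translate idea; the
LyX snippet is heavily simplified and the Word XML printed is the bare
minimum for a styled paragraph, not a complete WordprocessingML document:

import re

# A stripped-down stand-in for a real .lyx file (illustrative only).
lyx_source = """\\begin_layout Section
Introduction
\\end_layout
\\begin_layout Standard
Some body text.
\\end_layout
"""

STYLES = {"Section": "Heading1", "Standard": "Normal"}
layout_re = re.compile(r"\\begin_layout (\w+)\n(.*?)\n\\end_layout", re.S)

# Match each layout block and print a minimal Word paragraph for it.
for layout, text in layout_re.findall(lyx_source):
    style = STYLES.get(layout, "Normal")
    print('<w:p><w:pPr><w:pStyle w:val="%s"/></w:pPr>'
          '<w:r><w:t>%s</w:t></w:r></w:p>' % (style, text.strip()))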

Before going too far with this approach, though, I wanted to post (another 
general query).

In the eLyXer library, there is already a robust set of tools used for 
converting LyX documents to HTML. Does anyone know if the library is written in 
such a way that getting a generic in-memory representation of the document 
would be possible? It would be awesome to re-use as much existing code for the 
Word document export as possible. That would allow me to support a broader 
number of features, and gives me a framework for working with maths.

Any thoughts Alex (and others)? I've downloaded the sources and have begun to 
work through them, but before spending hours to days trying to wrap my head 
around them, I thought I would ask.

Cheers,

Rob

Re: eLyXer for Document Parsing

2012-02-04 Thread slitt
On Sat, 4 Feb 2012 10:03:00 -0700
Rob Oakes  wrote:

> Dear eLyXer Users and Developers,
> 
> I'm still at work on the import/export module for Microsoft Word
> documents. I'm making pretty good progress. I've got a rough
> prototype that works pretty well and I'm now starting to refine it.
> 
> My approach up to now has been to use regular expressions to match
> portions of the document and then use a library to translate those to
> the corresponding Word XML structures. It's working pretty well with
> my simple test documents.
> 
> Before going too far with this approach, though, I wanted to post
> (another general query).
> 
> In the eLyXer library, there is already a robust set of tools used
> for converting LyX documents to HTML. Does anyone know if the library
> is written in such as way that getting a generic in-memory
> representation of the document would be possible? It would be awesome
> to re-use as much existing code for the Word document export as
> possible. That would allow me to support a broader number of
> features, and gives me a framework for working with maths.
> 
> Any thoughts Alex (and others)? I've downloaded the sources and have
> begun to work through them, but before spending hours to days trying
> to wrap my head around them, I thought I would ask.


This is obviously an Alex question, so I'll go ahead and answer it :-)

Not only possible but easy if you do things the Steve Litt way. eLyXer
quickly punches out HTML that's clean enough to read with an XML
parser, I think. So, eLyXer converts to HTML, and then your program's
DOMbuilder module converts that HTML to in-memory DOM. No muss, no
fuss, no bother, no picking apart eLyXer code (it's big and not
immediately obvious, not a single weekend task).

One more question: You sure you want to go in-memory? What happens if a
guy has a 1200-page book with 100 chapters each containing 10 sections,
each containing 10 subsections, and tries to parse it on a machine with
512 MB RAM? You in a heap of trouble, son. He'll be swapped halfway into
the next century. If instead you used an event parser (e.g. SAX) with a
few stacks, it will probably be slower, and it will be much harder to
write, but for practical purposes there won't be an upper limit on input
file size.
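
A small sketch of that event-driven alternative: Python's standard xml.sax
with a stack of open elements, so memory use stays roughly flat no matter
how big the input is (the file name and the heading filter are illustrative):

import xml.sax

class OutlineHandler(xml.sax.ContentHandler):
    """Print an indented outline of headings without building a tree."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # tags currently open

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name in ("h1", "h2", "h3"):
            print("  " * (len(self.stack) - 1) + name)

    def endElement(self, name):
        self.stack.pop()

xml.sax.parse("document.xhtml", OutlineHandler())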

SteveT


Re: eLyXer for Document Parsing

2012-02-04 Thread Rob Oakes
Hi Steve,

> Not only possible but easy if you do things the Steve Litt way. eLyXer
> quickly punches out HTML that's clean enough to read with an XML
> parser, I think. So, eLyXer converts to HTML, and then your program's
> DOMbuilder module converts that HTML to in-memory DOM. No muss, no
> fuss, no bother, no picking apart eLyXer code (it's big and not
> immediately obvious, not a single weekend task).

Thanks for the recommendations. I'll need to look into this further. It's 
definitely the easiest way to go, and easy is usually the best. So says the Zen 
of Python (sort of):

If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

I was hoping for a slightly more direct route, though. That would allow me to 
maintain some of the internal data, such as cross-links. But, as I don't have 
months to implement, easy is always better than hard.

> One more question: You sure you want to go in-memory? What happens if a
> guy has a 1200 page book with 100 chapters each containing 10 sections,
> each containing 10 subsections, and tries to parse it on a machine with 512 
> MB RAM? 

I pity this poor man's decision to convert the whole mess to Word, rather than 
splitting it out into individual chapters.

But, I appreciate the voice of reason, sanity, and best practice. Short
answer: no, I'm not convinced that I want to go in memory. My first pass
was just to become comfortable with eLyXer to see if it might meet my
needs. I'm still trying to get comfortable with the structure of LyX
documents and .docx documents. I've found a nice little python library
with support for basic docx features and was going to try and refine that
into something slightly more usable.

> You in a heap of trouble son. He'll be swapped half way into the next 
> century. If
> instead you used an event parser (e.g SAX) with a few stacks, it will
> probably be slower, and it will be much more hard to write, but for
> practical purposes there won't be an upper limit on input file size.

Good points. The python library makes use of lxml, which supports SAX. After 
I've got a better handle on my constraints, I'll spend the time required to 
design something more robust. 

Cheers,

Rob

Re: eLyXer for Document Parsing

2012-02-04 Thread slitt
On Sat, 4 Feb 2012 14:00:24 -0700
Rob Oakes  wrote:

> Hi Steve,
[clip]
> > One more question: You sure you want to go in-memory? What happens
> > if a guy has a 1200 page book with 100 chapters each containing 10
> > sections, each containing 10 subsections, and tries to parse it on
> > a machine with 512 MB RAM? 
> 
> I pity this poor man's decision to convert the whole mess to Word,
> rather than splitting it out into individual chapters.
> 
> But, I appreciate the voice for reason answer sanity and best
> practice. Short answer, no, not convinced that I want to go in
> memory. My first pass was to just to become comfortable with eLyXer
> to see if it might meet my needs. I'm still try to get comfortable
> with the structure of LyX documents and .docx documents. I've found a
> nice little python library with support for basic docx features and
> was going to try and refine that to something slightly more usable.
> 
> > You in a heap of trouble son. He'll be swapped half way into the
> > next century. If instead you used an event parser (e.g SAX) with a
> > few stacks, it will probably be slower, and it will be much more
> > hard to write, but for practical purposes there won't be an upper
> > limit on input file size.
> 
> Good points. The python library makes use of lxml, which supports
> sax. After I've got a better handle on my constraints, I'll spend the
> time required to design something more robust. 

On my lyx2kindle program
(http://www.troubleshooters.com/projects/lyx2kindle/) I used Python's
HTMLParser event parser tool. It was easy, though I think your lxml
idea is faster with big documents. For my 11K-word book "Rules of the
Happiness Highway", conversion was maybe a second. Anyway, my
lyx2kindle.py illustrates the use of HTMLParser and of a stack to keep
track of levels and maintain a poor man's state machine, and another
part of it implements the kludge of the century.
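
A small sketch of that HTMLParser-plus-stack pattern (not the actual
lyx2kindle.py code; the sample markup and the heading filter are made up):

from html.parser import HTMLParser   # the module is just HTMLParser in Python 2

class HeadingGrabber(HTMLParser):
    """Use the stack of open tags as a poor man's state machine."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        while self.stack and self.stack.pop() != tag:
            pass

    def handle_data(self, data):
        if self.stack and self.stack[-1] in ("h1", "h2") and data.strip():
            print("/".join(self.stack), "->", data.strip())

HeadingGrabber().feed("<body><h1>Title</h1><p>text</p><h2>Part one</h2></body>")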

SteveT