Re: aligning SGML to text

2006-06-19 Thread Gerard Flanagan

Steven Bethard wrote:
 Gerard Flanagan wrote:
  Steven Bethard wrote:
  I have some plain text data and some SGML markup for that text that I
  need to align.  (The SGML doesn't maintain the original whitespace, so I
  have to do some alignment; I can't just calculate the indices directly.)
For example, some of my text looks like:
[...]
 
  Steve
 
  This is probably an abuse of itertools...
 
[snip hammering]

 Thanks for taking a look.  Yeah, the alignment's a big part of the
 problem.  It'd be really nice if the thing that gives me SGML didn't add
 whitespace haphazardly. ;-)

 STeVe

I see, the problem was different than I thought. When all you have is a
hammer... :-)

Gerard

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: aligning SGML to text

2006-06-18 Thread Gerard Flanagan
Steven Bethard wrote:
 I have some plain text data and some SGML markup for that text that I
 need to align.  (The SGML doesn't maintain the original whitespace, so I
 have to do some alignment; I can't just calculate the indices directly.)
   For example, some of my text looks like:

 TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
 cytoplasmic translocation and concomitant formation of an intracellular
 signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.

 And the corresponding SGML looks like:

 <PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
 </PROTEIN> , resulting in cytoplasmic translocation and concomitant
 formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
 comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
 <PROTEIN> TRAF2 </PROTEIN> , and AIPl .

 Note that the SGML inserts spaces not only within the SGML elements, but
 also around punctuation.


 I need to determine the indices in the original text that each SGML
 element corresponds to.  Here's some working code to do this, based on a
 suggestion for a related problem by Fredrik Lundh[1]::

  def align(text, sgml):
      sgml = sgml.replace('&', '&amp;')
      tree = etree.fromstring('<xml>%s</xml>' % sgml)
      words = []
      if tree.text is not None:
          words.extend(tree.text.split())
      word_indices = []
      for elem in tree:
          elem_words = elem.text.split()
          start = len(words)
          end = start + len(elem_words)
          word_indices.append((start, end, elem.tag))
          words.extend(elem_words)
          if elem.tail is not None:
              words.extend(elem.tail.split())
      expr = '\s*'.join('(%s)' % re.escape(word) for word in words)
      match = re.match(expr, text)
      assert match is not None
      for word_start, word_end, label in word_indices:
          start = match.start(word_start + 1)
          end = match.end(word_end)
          yield label, start, end

[...]
  >>> list(align(text, sgml))
  [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
  ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
  ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]

 The problem is, this doesn't work when my text is long (which it is)
 because regular expressions are limited to 100 groups.  I get an error
 like::
[...]

Steve

This is probably an abuse of itertools...

---8<---
text = '''TNF binding induces release of AIP1 (DAB2IP) from
TNFR1, resulting in cytoplasmic translocation and concomitant
formation of an intracellular signaling complex comprised of TRADD,
RIP1, TRAF2, and AIPl.'''

sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
<PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
and concomitant formation of an <PROTEIN> intracellular signaling
complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
<PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
'''

import itertools as it
import string

def scan(line):
    if not line: return
    line = line.strip()
    parts = string.split(line, '>', maxsplit=1)
    return parts[0]

def align(txt,sml):
    i = 0
    for k,g in it.groupby(sml.split('<'),scan):
        g = list(g)
        if not g[0]: continue
        text = g[0].split('>')[1]#.replace('\n','')
        if k.startswith('/'):
            i += len(text)
        else:
            offset = len(text.strip())
            yield k, i, i+offset
            i += offset

print list(align(text,sgml))



[('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]

It's off, possibly because of the punctuation; I can't figure it out.
Maybe you can tweak it?
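
One possible tweak, offered only as a sketch in Python 3: the offsets above are counted in the SGML, which contains extra spaces around tags and punctuation, so they drift away from the original text. Walking both strings in lockstep and skipping whitespace on each side avoids that. The name align_lockstep is hypothetical, and the sketch assumes attribute-free tags of the form <NAME> and </NAME>, non-empty elements, no entity references, and that the SGML equals the text once tags and whitespace are ignored.

```python
import re

def align_lockstep(text, sgml):
    """Yield (label, start, end) character spans into `text` (a sketch)."""
    pos = 0      # index into the original text
    end = 0      # index just past the last character matched in `text`
    stack = []   # open elements as [label, start]; start stays None until
                 # the element's first non-space character has been matched
    # With a capturing group, re.split returns text pieces at even
    # positions and the tags themselves at odd positions.
    for n, piece in enumerate(re.split(r'(</?\w+>)', sgml)):
        if n % 2 == 0:
            for ch in piece:
                if ch.isspace():
                    continue              # whitespace in the SGML is ignored
                while text[pos].isspace():
                    pos += 1              # whitespace in the text is skipped
                assert text[pos] == ch, (pos, text[pos], ch)
                for entry in stack:
                    if entry[1] is None:
                        entry[1] = pos
                pos += 1
                end = pos
        elif piece.startswith('</'):
            label, start = stack.pop()
            yield label, start, end
        else:
            stack.append([piece[1:-1], None])

demo_text = 'AIP1 (DAB2IP) from TNFR1, and'
demo_sgml = ('<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from '
             '<PROTEIN> TNFR1 </PROTEIN> , and')
print(list(align_lockstep(demo_text, demo_sgml)))
# [('PROTEIN', 0, 4), ('PROTEIN', 6, 12), ('PROTEIN', 19, 24)]
```

Because the original text is consulted for every character, the spans index into the text rather than into the SGML.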

hth

Gerard



Re: aligning SGML to text

2006-06-18 Thread Steven Bethard
Gerard Flanagan wrote:
 It's off because of the punctuation possibly, can't figure it out.

Thanks for taking a look.  Yeah, the alignment's a big part of the 
problem.  It'd be really nice if the thing that gives me SGML didn't add 
whitespace haphazardly. ;-)
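
To make the whitespace issue concrete, here is a minimal Python 3 sketch of the property any alignment has to rely on: once the tags and all whitespace are dropped, the SGML and the text should be the same string. The helper name squash is hypothetical, and the tag pattern assumes attribute-free tags.

```python
import re

def squash(s, strip_tags=False):
    """Drop all whitespace, and optionally simple <TAG> and </TAG> markers."""
    if strip_tags:
        s = re.sub(r'</?\w+>', '', s)
    return ''.join(s.split())

text = 'AIP1 (DAB2IP) from TNFR1, and'
sgml = ('<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from '
        '<PROTEIN> TNFR1 </PROTEIN> , and')

# The two strings differ only in tags and whitespace.
print(squash(text) == squash(sgml, strip_tags=True))  # True
```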

STeVe


Re: aligning SGML to text

2006-06-18 Thread Steven Bethard
Steven Bethard wrote:
 I have some plain text data and some SGML markup for that text that I 
 need to align.  (The SGML doesn't maintain the original whitespace, so I 
 have to do some alignment; I can't just calculate the indices directly.) 
[snip]
 Note that the SGML inserts spaces not only within the SGML elements, but 
 also around punctuation.
[snip]
 I need to determine the indices in the original text that each SGML 
 element corresponds to.

Ok, below is a working version that doesn't use regular expressions. 
It's far from concise, but at least it doesn't fail like re does when I 
have more than 100 words. =)

>>> import elementtree.ElementTree as etree
>>> def align(text, sgml):
...     # convert SGML tree to words, and assemble a list of the
...     # start word index and end word index for each SGML element
...     sgml = sgml.replace('&', '&amp;')
...     tree = etree.fromstring('<xml>%s</xml>' % sgml)
...     words = []
...     if tree.text is not None:
...         words.extend(tree.text.split())
...     word_spans = []
...     for elem in tree:
...         elem_words = elem.text.split()
...         start = len(words)
...         end = start + len(elem_words)
...         word_spans.append((start, end, elem.tag))
...         words.extend(elem_words)
...         if elem.tail is not None:
...             words.extend(elem.tail.split())
...     # determine the start character index and end character index
...     # for each word from the SGML
...     char_spans = []
...     start = 0
...     for word in words:
...         while text[start:start + 1].isspace():
...             start += 1
...         end = start + len(word)
...         assert text[start:end] == word, (text[start:end], word)
...         char_spans.append((start, end))
...         start = end
...     # convert the word indices for each SGML element to
...     # character indices
...     for word_start, word_end, label in word_spans:
...         start, _ = char_spans[word_start]
...         _, end = char_spans[word_end - 1]
...         yield label, start, end
...
>>> text = '''TNF binding induces release of AIP1 (DAB2IP) from TNFR1,
... resulting in cytoplasmic translocation and concomitant formation of an
... intracellular signaling complex comprised of TRADD, RIP1, TRAF2, and
... AIPl.'''
>>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN>
... TNFR1 </PROTEIN> , resulting in cytoplasmic translocation and
... concomitant formation of an <PROTEIN> intracellular signaling complex
... </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1
... </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
... '''
>>> list(align(text, sgml))
[('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43), 
('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178), 
('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
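
For readers on Python 3, where ElementTree ships in the standard library as xml.etree.ElementTree, roughly the same approach might be sketched as below. The name align_py3 is hypothetical. The sketch assumes flat (non-nested) elements with non-empty text, as in the example, and that escaping '&' is the only fix-up needed before the SGML parses as XML.

```python
import xml.etree.ElementTree as etree

def align_py3(text, sgml):
    """Yield (label, start, end) character spans into `text` (a sketch)."""
    # Escape bare ampersands and wrap the fragment in a root element.
    tree = etree.fromstring('<xml>%s</xml>' % sgml.replace('&', '&amp;'))
    # Flatten the tree to words, recording each element's word range.
    words = (tree.text or '').split()
    word_spans = []
    for elem in tree:
        elem_words = (elem.text or '').split()
        word_spans.append((len(words), len(words) + len(elem_words), elem.tag))
        words.extend(elem_words)
        words.extend((elem.tail or '').split())
    # Locate every word in the original text, skipping whitespace.
    char_spans = []
    start = 0
    for word in words:
        while text[start:start + 1].isspace():
            start += 1
        end = start + len(word)
        assert text[start:end] == word, (text[start:end], word)
        char_spans.append((start, end))
        start = end
    # Convert word ranges to character ranges.
    for word_start, word_end, label in word_spans:
        yield label, char_spans[word_start][0], char_spans[word_end - 1][1]

demo_text = 'AIP1 (DAB2IP) from TNFR1, and'
demo_sgml = ('<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from '
             '<PROTEIN> TNFR1 </PROTEIN> , and')
for label, start, end in align_py3(demo_text, demo_sgml):
    # Slicing the original text with each span is a quick sanity check.
    print(label, start, end, repr(demo_text[start:end]))
```

Each printed slice should be exactly the element's content with the SGML's extra whitespace removed.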

STeVe