Re: Suggestions for how to approach this problem?

2007-05-10 Thread John Salerno
James Stroud wrote:

 I included code in my previous post that will parse the entire bib, 
 making use of the numbering and eliminating the most probable, but still 
 fairly rare, potential ambiguity. You might want to check out that code, 
 as my testing it showed that it worked with your example.

Thanks. It looked a little involved so I hadn't started to work through 
it yet, but I'll do that now before I actually try to write something 
from scratch. :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Suggestions for how to approach this problem?

2007-05-10 Thread John Salerno
James Stroud wrote:

 import re
 records = []
 record = None
 counter = 1
 regex = re.compile(r'^(\d+)\. (.*)')
 for aline in lines:
     m = regex.search(aline)
     if m is not None:
         recnum, rest = m.groups()
         if int(recnum) == counter:
             if record is not None:
                 records.append(record)
             record = [rest.strip()]
             counter += 1
             continue
     record.append(aline.strip())
 
 if record is not None:
     records.append(record)
 
 records = [' '.join(r) for r in records]

What do I need to do to get this to run against the text that I have? Is 
'lines' meant to be a list of the lines from the original citation file?
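
For reference, the loop only iterates over 'lines' and matches each item against a pattern anchored with ^, so a plain list of strings, one per line of the file, appears to be what it expects. A minimal sketch of building such a list follows; the file name citations.txt and the helper load_lines are made up for illustration and are not from the original post:

```python
import os
import tempfile


def load_lines(path):
    """Return the file's lines as a list of strings, without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


# Demo with a throwaway file standing in for the real citation file.
sample = (
    "1.  Levy, S.B. (1964)  Isologous interference\n"
    "irradiated\n"
    "2.  Levy, S.B. and T. Watanabe (1966)\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "citations.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    lines = load_lines(path)

print(len(lines))
```

With 'lines' defined this way, the quoted loop can be pasted below it unchanged.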


Re: Suggestions for how to approach this problem?

2007-05-09 Thread John Salerno
Necmettin Begiter wrote:

 Is this what the text looks like:
 
 123
 some information
 
 124 some other information
 
 126(tab here)something else
 
 If this is the case (the numbers are at the beginning, and after the numbers 
 there is either a newline or a tab), the logic might be this simple:

They all seem to be a little different. One consistency is that each 
number is followed by two spaces. There is nothing separating each 
reference except a single newline, which I want to preserve. But within 
each reference there might be a combination of spaces, tabs, or newlines.


Re: Suggestions for how to approach this problem?

2007-05-09 Thread John Salerno
Dave Hansen wrote:

 Questions:
 
 1) Do the citation numbers always begin in column 1?

Yes, that's one consistency at least. :)

 2) Are the citation numbers always followed by a period and then at
 least one whitespace character?

Yes, it seems to be either one or two whitespaces.

 find the beginning of each cite.  then I would output each cite
 through a state machine that would reduce consecutive whitespace
 characters (space, tab, newline) into a single character, separating
 each cite with a newline.

Interesting idea! I'm not sure what a state machine is, but it sounds 
like you are suggesting that I more or less separate each reference, 
process it, and then rewrite it to a new file in the cleaner format? 
That might work pretty well.
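
For anyone else wondering about the term: a state machine here just means a loop that remembers a small piece of state between characters. A minimal sketch, assuming the state is a single flag saying "the previous characters were whitespace"; the function name collapse_whitespace is invented for this example:

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs and newlines with one space."""
    out = []
    in_whitespace = False  # the machine's single bit of state
    for ch in text:
        if ch in " \t\n":
            # Remember that we are inside a whitespace run; emit nothing yet.
            in_whitespace = True
        else:
            # Leaving a run: emit exactly one space (but not at the very start).
            if in_whitespace and out:
                out.append(" ")
            in_whitespace = False
            out.append(ch)
    return "".join(out)


cite = "1.  Levy, S.B. (1964)  Isologous interference \nirradiated\n\tbacteriophage T2."
print(collapse_whitespace(cite))
```

Note that this also turns the double spaces after periods into single ones. If those should survive, only tabs and newlines would need to be treated as whitespace.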


Re: Suggestions for how to approach this problem?

2007-05-09 Thread John Salerno
James Stroud wrote:

 If you can count on the person not skipping any numbers in the 
 citations, you can take an AI approach to hopefully weed out the rare 
 circumstance that a number followed by a period starts a line in the 
 middle of the citation.

I don't think any numbers are skipped, but there are some cases where a 
number is followed by a period within a citation. But this might not 
matter since each reference number begins at the start of the line, so I 
could use the RE to start at the beginning.
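
A small sketch of what anchoring at the start of the line looks like, assuming the whole file is held in one string; the sample text is shortened from the example posted earlier in the thread:

```python
import re

text = (
    "3.  Takano, I. (1966)  Episomic resistance factors in\n"
    "Enterobacteriaceae.  34.  The specific effects\n"
    "4.  Levy, S.B. (1967)  Blood safari into Kenya.\n"
)

# re.MULTILINE makes ^ match at the start of every line, not just of the string.
start_of_cite = re.compile(r"^(\d+)\.\s", re.MULTILINE)
numbers = [int(m.group(1)) for m in start_of_cite.finditer(text)]
print(numbers)
```

The "34." in the middle of the second line is ignored. A wrapped line that happens to begin with a number and a period would still match, which is the rare case James's counter is meant to catch.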


Re: Suggestions for how to approach this problem?

2007-05-09 Thread John Salerno
John Salerno wrote:

 So I need to remove the line breaks too, but of course not *all* of them 
 because each reference still needs a line break between it.

After doing a bit of search and replace for tabs with my text editor, I
think I've narrowed down the problem to just this:

I need to remove all newline characters that are not at the end of a
citation (and replace them with a single space). That is, those that are
not followed by the start of a new numbered citation. This seems to
involve a look-ahead RE, but I'm not sure how to write those. This is
what I came up with:


\n(?=(\d)+)

(I can never remember if I need parentheses around '\d' or if the + 
should be inside it or not!)
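
A note on the pattern: (?=...) is a positive look-ahead, so \n(?=(\d)+) matches the newlines that ARE followed by a digit, which are the ones to keep. The negative form (?!...) picks out the others. No parentheses are needed around \d, since + applies to the token just before it. A minimal sketch, assuming a citation always starts with digits, a period and whitespace, as described elsewhere in the thread:

```python
import re

text = (
    "1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray\n"
    "irradiated\n"
    "bacteriophage T2.  J. Bacteriol. 87:1330-1338.\n"
    "2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R\n"
    "factor.  Lancet 2:1138.\n"
)

# \n(?!\d+\.\s) matches a newline NOT followed by "digits, period, whitespace".
# (?!\Z) additionally leaves the final newline of the text alone.
joiner = re.compile(r"\n(?!\d+\.\s)(?!\Z)")
fixed = joiner.sub(" ", text)
print(fixed)
```

This has the same blind spot as any purely pattern-based approach: a continuation line that itself begins with something like "34. " would be treated as a new citation.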


Re: Suggestions for how to approach this problem?

2007-05-09 Thread James Stroud
John Salerno wrote:
 John Salerno wrote:
 
 So I need to remove the line breaks too, but of course not *all* of 
 them because each reference still needs a line break between it.
 
 
 After doing a bit of search and replace for tabs with my text editor, I
 think I've narrowed down the problem to just this:
 
 I need to remove all newline characters that are not at the end of a
 citation (and replace them with a single space). That is, those that are
 not followed by the start of a new numbered citation. This seems to
 involve a look-ahead RE, but I'm not sure how to write those. This is
 what I came up with:
 
 
 \n(?=(\d)+)
 
 (I can never remember if I need parentheses around '\d' or if the + 
 should be inside it or not!

I included code in my previous post that will parse the entire bib, 
making use of the numbering and eliminating the most probable, but still 
fairly rare, potential ambiguity. You might want to check out that code, 
as my testing it showed that it worked with your example.

James


Re: Suggestions for how to approach this problem?

2007-05-08 Thread John Salerno
John Salerno wrote:


 typed, there are often line breaks at the end of each line

Also, there are sometimes tabs used to indent the subsequent lines of 
citation, but I assume with that I can just replace the tab with a space.


Re: Suggestions for how to approach this problem?

2007-05-08 Thread Marc 'BlackJack' Rintsch
In [EMAIL PROTECTED], John Salerno wrote:

 I have a large list of publication citations that are numbered. The 
 numbers are simply typed in with the rest of the text. What I want to do 
 is remove the numbers and then put bullets instead. Now, this alone 
 would be easy enough, with a little Python and a little work by hand, 
 but the real issue is that because of the way these citations were 
 typed, there are often line breaks at the end of each line -- in other 
 words, the person didn't just let the line flow to the next line, they 
 manually pressed Enter. So inserting bullets at this point would put a 
 bullet at each line break.
 
 So I need to remove the line breaks too, but of course not *all* of them 
 because each reference still needs a line break between it. So I'm 
 hoping I could get an idea or two for approaching this. I figure regular 
 expressions will be needed, and maybe it would be good to remove the 
 line breaks first and *not* remove a line break that comes before the 
 numbers (because that would be the proper place for one), and then 
 finally remove the numbers.

I think I have a vague idea of what the input looks like, but it would be
helpful if you showed some example input and the wanted output.

Ciao,
Marc 'BlackJack' Rintsch


Re: Suggestions for how to approach this problem?

2007-05-08 Thread John Salerno
Marc 'BlackJack' Rintsch wrote:

 I think I have a vague idea of what the input looks like, but it would be
 helpful if you showed some example input and the wanted output.

Good idea. Here's what it looks like now:

1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray 
irradiated
bacteriophage T2.  J. Bacteriol. 87:1330-1338.
2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R 
factor.  Lancet 2:1138.
3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic 
resistance factors in
Enterobacteriaceae.  34.  The specific effects of the inhibitors of DNA 
synthesis on the
transfer of R factor and F factor.  Med. Biol. (Tokyo)  73:79-83.
4.  Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician 16:50-54.
5.  Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of 
diverticular disease of the
colon:  Evaluation of an eleven-year period.  Annals Surg.  166:947-955.

As you can see, any single citation is broken over several lines as a 
result of a line break. I want it to look like this:

1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray
 irradiated bacteriophage T2.  J. Bacteriol. 87:1330-1338.
2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R
 factor.  Lancet 2:1138.
3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic
 resistance factors in Enterobacteriaceae.  34.  The specific effects
 of the inhibitors of DNA synthesis on the
 transfer of R factor and F factor.  Med. Biol. (Tokyo)  73:79-83.
4.  Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician
 16:50-54.
5.  Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of
 diverticular disease of the colon:  Evaluation of an eleven-year
 period.  Annals Surg.  166:947-955.

Now, since this is pasted, it might not even look good to you. But in 
the second example, the numbers are meant to be bullets and so the 
indentation would happen automatically (in Word). But for now they are 
just typed.
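
Once each citation sits on one line, swapping the typed number for a bullet is a small step. A minimal sketch, assuming the citations are already joined; the "* " marker is only a stand-in for whatever bullet Word will apply, and the sample entries are shortened:

```python
import re

cites = [
    "1.  Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician 16:50-54.",
    "2.  Takano, I. (1966)  Enterobacteriaceae.  34.  The specific effects.",
]

# Strip only the leading "N." label; numbers inside the citation are untouched.
leading_number = re.compile(r"^\d+\.\s+")
bulleted = ["* " + leading_number.sub("", cite, count=1) for cite in cites]

for line in bulleted:
    print(line)
```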


Re: Suggestions for how to approach this problem?

2007-05-08 Thread Necmettin Begiter
On Tuesday 08 May 2007 22:23:31 John Salerno wrote:
 John Salerno wrote:
  typed, there are often line breaks at the end of each line

 Also, there are sometimes tabs used to indent the subsequent lines of
 citation, but I assume with that I can just replace the tab with a space.

Is this what the text looks like:

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers 
there is either a newline or a tab), the logic might be this simple:

Get the numbers at the beginning of the line. Check for \n and \t after the 
number; if either exists, remove it or replace it with a space or 
whatever you prefer, and there you have it. Also, how are the records 
separated? By empty lines? If so, \n\n is an empty line in a string, like 
this:

some text here\n
\n
some other text here\n



Re: Suggestions for how to approach this problem?

2007-05-08 Thread Dave Hansen
On May 8, 3:00 pm, John Salerno [EMAIL PROTECTED] wrote:
 Marc 'BlackJack' Rintsch wrote:
  I think I have a vague idea of what the input looks like, but it would be
  helpful if you showed some example input and the wanted output.

 Good idea. Here's what it looks like now:

 1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray
 irradiated
 bacteriophage T2.  J. Bacteriol. 87:1330-1338.
 2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R
 factor.  Lancet 2:1138.
 3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic
 resistance factors in
 Enterobacteriaceae.  34.  The specific effects of the inhibitors of DNA
 synthesis on the
 transfer of R factor and F factor.  Med. Biol. (Tokyo)  73:79-83.

Questions:

1) Do the citation numbers always begin in column 1?

2) Are the citation numbers always followed by a period and then at
least one whitespace character?

If so, I'd probably use a regular expression like ^[0-9]+\.[ \t] to
find the beginning of each cite.  then I would output each cite
through a state machine that would reduce consecutive whitespace
characters (space, tab, newline) into a single character, separating
each cite with a newline.

Final formatting can be done with paragraph styles in Word.
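
The approach described above can be sketched roughly as follows. This is one possible reading, not tested against the full file; the function name split_cites is invented, and str.split() stands in for the state machine because it already splits on any run of whitespace:

```python
import re

CITE_START = re.compile(r"^[0-9]+\.[ \t]")  # the pattern suggested above


def split_cites(lines):
    """Group raw lines into cites; a new cite starts where CITE_START matches."""
    cites = []
    for line in lines:
        if CITE_START.match(line) or not cites:
            cites.append([line])
        else:
            cites[-1].append(line)
    # Joining the pieces of str.split() with " " collapses every run of
    # spaces, tabs and newlines into a single space.
    return [" ".join(" ".join(chunk).split()) for chunk in cites]


raw = [
    "1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray",
    "irradiated",
    "\tbacteriophage T2.  J. Bacteriol. 87:1330-1338.",
    "2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R",
    "factor.  Lancet 2:1138.",
]
for cite in split_cites(raw):
    print(cite)
```

Each cite comes out on its own line, ready to be pasted into Word and styled there.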

HTH,
   -=Dave




Re: Suggestions for how to approach this problem?

2007-05-08 Thread James Stroud
John Salerno wrote:
 Marc 'BlackJack' Rintsch wrote:
 Here's what it looks like now:
 
 1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray 
 irradiated
 bacteriophage T2.  J. Bacteriol. 87:1330-1338.
 2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R 
 factor.  Lancet 2:1138.
 3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic 
 resistance factors in
 Enterobacteriaceae.  34.  The specific effects of the inhibitors of DNA 
 synthesis on the
 transfer of R factor and F factor.  Med. Biol. (Tokyo)  73:79-83.
 4.  Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician 
 16:50-54.
 5.  Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of 
 diverticular disease of the
 colon:  Evaluation of an eleven-year period.  Annals Surg.  166:947-955.
 
 As you can see, any single citation is broken over several lines as a 
 result of a line break. I want it to look like this:
 
 1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray
 irradiated bacteriophage T2.  J. Bacteriol. 87:1330-1338.
 2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R
 factor.  Lancet 2:1138.
 3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic
 resistance factors in Enterobacteriaceae.  34.  The specific effects
 of the inhibitors of DNA synthesis on the
 transfer of R factor and F factor.  Med. Biol. (Tokyo)  73:79-83.
 4.  Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician
 16:50-54.
 5.  Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of
 diverticular disease of the colon:  Evaluation of an eleven-year
 period.  Annals Surg.  166:947-955.
 
 Now, since this is pasted, it might not even look good to you. But in 
 the second example, the numbers are meant to be bullets and so the 
 indentation would happen automatically (in Word). But for now they are 
 just typed.

If you can count on the person not skipping any numbers in the 
citations, you can take an AI approach to hopefully weed out the rare 
circumstance that a number followed by a period starts a line in the 
middle of the citation. This is not failsafe, say if you were on 
citation 33 and it was in chapter 34 and that 34 happened to start a new 
line. But, then again, even a human would take a little time to figure 
that one out--and probably wouldn't be 100% accurate either. I'm sure 
there is an AI word for the type of parser that could parse something 
like this unambiguously and I'm sure that it has been proven to be 
impossible to create:

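import re

# 'lines' is assumed to be a list of strings, one per line of the bib,
# and the first line is assumed to start citation 1.
records = []
record = None
counter = 1
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
    m = regex.search(aline)
    if m is not None:
        recnum, rest = m.groups()
        # Only the next expected number starts a new record.
        if int(recnum) == counter:
            if record is not None:
                records.append(record)
            record = [rest.strip()]
            counter += 1
            continue
    # Continuation line (including one that merely starts with some
    # other number); keep it whole.
    record.append(aline.strip())

if record is not None:
    records.append(record)

records = [' '.join(r) for r in records]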


py> import re
py> records = []
py> record = None
py> counter = 1
py> regex = re.compile(r'^(\d+)\. (.*)')
py> for aline in lines:
...     m = regex.search(aline)
...     if m is not None:
...         recnum, rest = m.groups()
...         if int(recnum) == counter:
...             if record is not None:
...                 records.append(record)
...             record = [rest.strip()]
...             counter += 1
...             continue
...     record.append(aline.strip())
...
py> if record is not None:
...     records.append(record)
...
py> records = [' '.join(r) for r in records]
py> records

['Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray 
irradiated bacteriophage T2.  J. Bacteriol. 87:1330-1338.',
  'Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R 
factor.  Lancet 2:1138.',
  'Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic 
resistance factors in Enterobacteriaceae.  34.  The specific effects of 
the inhibitors of DNA synthesis on the transfer of R factor and F 
factor.  Med. Biol. (Tokyo)  73:79-83.',
  'Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician 
16:50-54.',
  'Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of 
diverticular disease of the colon:  Evaluation of an eleven-year period. 
  Annals Surg.  166:947-955.']


James