Re: how to split this kind of text into sections

2014-04-26 Thread oyster
First of all, thank you all for your answers. I receive the python
mailing list as a daily digest, so it is not easy for me to quote your
mails separately.

I will try to explain my situation to my best, but English is not my
native language, I don't know whether I can make it clear at last.

Every SECTION starts with 2 special lines; these 2 lines is special
because they have some same characters (the length is not const for
different section) at the beginning; these same characters is called
the KEY for this section. For every 2 neighbor sections, they have
different KEYs.

After these 2 special lines, some paragraph is followed. Paragraph
does not have any KEYs.

So, a section = 2 special lines with KEYs at the beginning + some
paragraph without KEYs

However there maybe some paragraph before the first section, which I
do not need and want to drop it

I need a method to split the whole text into SECTIONs and to know all the KEYs

I have tried to solve this problem via re module, but failed. Maybe I
can make you understand me clearly by showing the regular expression
object
reobj = re.compile(r"(?P<bookname>[^\r\n]*?)[^\r\n]*?\r\n(?P=bookname)[^\r\n]*?\r\n.*?",
re.DOTALL)
which can get the first 2 lines of a section, but fails to get the rest
of the section, which does not have any KEY at the beginning. The hard
part for me is to express that a paragraph does not have KEYs.

Even if I can get the first 2 lines, I think a regular expression is
expensive for my text.

That is all. I hope to get some more suggestions. Thanks.

[demo text starts]
a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...

let's continue to
let's continue, yeah
.(and here goes many other text)...

I am using python
I am using perl
.(and here goes many other text)...

Programming is hard
Programming is easy
How do you thing?
I do’t know
[demo text ends]

the above text should be split into a LIST with 4 items, and I also
need to know the KEY for LIST is ['I am section ', 'let's continue',
'I am using ', 'Programming is ']:
lst=[
'''a line we do not need
I am section axax
I am section bbb
(and here goes many other text)... ''',

'''let's continue to
let's continue, yeah
.(and here goes many other text)... ''',

'''I am using python
I am using perl
.(and here goes many other text)... ''',

'''Programming is hard
Programming is easy
How do you thing?
I do’t know'''
]
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: how to split this kind of text into sections

2014-04-26 Thread Tim Chase
On 2014-04-26 23:53, oyster wrote:
> I will try to explain my situation to my best, but English is not my
> native language, I don't know whether I can make it clear at last.

Your follow-up reply made much more sense and your written English is
far better than many native speakers'. :-)

> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called
> the KEY for this section. For every 2 neighbor sections, they have
> different KEYs.

I suspect you have a minimum number of characters (or words) to
consider, otherwise a single character duplicated at the beginning of
the line would delimit a section, such as

 abcd
 afgh

because they share the commonality of an "a".  The code I provided
earlier should give you what you describe.  I've tweaked and tested,
and provided it below.  Note that I require a minimum overlap of 6
characters (MIN_LEN).  It also gathers the initial stuff (that you
want to discard) under the empty key, so you can either delete that,
or ignore it.

> I need a method to split the whole text into SECTIONs and to know
> all the KEYs
>
> I have tried to solve this problem via re module

I don't think the re module will be as much help here.

-tkc


# Python 2 (uses itertools.izip and the print statement)
from collections import defaultdict
import itertools as it
MIN_LEN = 6
def overlap(s1, s2):
    "Given 2 strings, return the initial overlap between them"
    return ''.join(
        c1
        for c1, c2
        in it.takewhile(
            lambda pair: pair[0] == pair[1],
            it.izip(s1, s2)
            )
        )
prevline = "" # the initial key under which preamble gets stored
output = defaultdict(list)
key = None
with open("data.txt") as f:
    for line in f:
        if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
            key = overlap(prevline, line)
        output[key].append(line)
        prevline = line
for k,v in output.items():
    print str(k).center(60,'=')
    print ''.join(v)
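
The listing above relies on Python 2 idioms (`it.izip` and the `print` statement). What follows is a rough Python 3 sketch of the same logic, not Tim's own code. The function name `group_lines`, the in-memory `demo` list standing in for `data.txt`, and the choice to start with an empty-string key for the preamble are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import takewhile

MIN_LEN = 6  # minimum shared prefix length, as in the post


def overlap(s1, s2):
    """Return the leading characters that s1 and s2 have in common."""
    pairs = takewhile(lambda pair: pair[0] == pair[1], zip(s1, s2))
    return "".join(c1 for c1, _ in pairs)


def group_lines(lines, min_len=MIN_LEN):
    """Group lines under the key of the most recent matching pair."""
    output = defaultdict(list)
    key = ""        # assumption: the preamble is stored under the empty key
    prevline = ""
    for line in lines:
        if len(line) >= min_len and prevline[:min_len] == line[:min_len]:
            key = overlap(prevline, line)
        output[key].append(line)
        prevline = line
    return output


# Hypothetical stand-in for the contents of data.txt.
demo = [
    "a line we do not need\n",
    "I am section axax\n",
    "I am section bbb\n",
    "(and here goes many other text)...\n",
    "let's continue to\n",
    "let's continue, yeah\n",
    ".(and here goes many other text)...\n",
]
groups = group_lines(demo)
for k, v in groups.items():
    print(k.center(60, "="))
    print("".join(v))
```

One thing this sketch makes visible: a match is only detected on the second line of a pair, so the first line of each pair is stored under the previous key.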










Re: how to split this kind of text into sections

2014-04-26 Thread Steven D'Aprano
On Sat, 26 Apr 2014 23:53:14 +0800, oyster wrote:

> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called the
> KEY for this section. For every 2 neighbor sections, they have different
> KEYs.
>
> After these 2 special lines, some paragraph is followed. Paragraph does
> not have any KEYs.
>
> So, a section = 2 special lines with KEYs at the beginning + some
> paragraph without KEYs
>
> However there maybe some paragraph before the first section, which I do
> not need and want to drop it
>
> I need a method to split the whole text into SECTIONs and to know all
> the KEYs

Let me try to describe how I would solve this, in English.

I would look at each pair of lines (1st + 2nd, 2nd + 3rd, 3rd + 4th, 
etc.) looking for a pair of lines with matching prefixes. E.g.:

This line matches the next
This line matches the previous

do match, because they both start with "This line matches the ".

Question: how many characters in common counts as a match?

This line matches the next
That previous line matches this line

have a common prefix of "Th", two characters. Is that a match?

So let me start with a function to extract the matching prefix, if there 
is one. It returns '' if there is no match, and the prefix (the KEY) if 
there is one:

def extract_key(line1, line2):
    """Return the key from two matching lines, or '' if not matching."""
    # Assume they need five characters in common.
    if line1[:5] == line2[:5]:
        return line1[:5]
    return ''


I'm pretty much guessing that this is how you decide there's a match. I 
don't know if five characters is too many or too few, or if you need a 
more complicated test. It seems that you want to match as many characters 
as possible. I'll leave you to adjust this function to work exactly as 
needed.

Now we iterate over the text in pairs of lines. We need somewhere to hold 
the lines in each section, so I'm going to use a dict of lists of 
lines. As a bonus, I'm going to collect the ignored lines using a key of 
None. However, I do assume that all keys are unique. It should be easy 
enough to adjust the following to handle non-unique keys. (Use a list of 
lists, rather than a dict, and save the keys in a separate list.)

Lastly, the way it handles lines at the beginning of a section is not 
exactly the way you want it. This puts the *first* line of the section as 
the *last* line of the previous section. I will leave you to sort out 
that problem.


from collections import OrderedDict
section = []
sections = OrderedDict()
sections[None] = section
lines = iter(text.split('\n'))
prev_line = ''
for line in lines:
    key = extract_key(prev_line, line)
    if key == '':
        # No match, so we're still in the same section as before.
        section.append(line)
    else:
        # Match, so we start a new section.
        section = [line]
        sections[key] = section
    prev_line = line
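
The two problems left open above (the first header line landing in the previous section, and non-unique keys) could be handled by looking ahead one line instead of back, and by keeping keys and sections in parallel lists. This is only a sketch under assumptions: `common_prefix`, `split_sections`, and the `min_len=5` threshold are illustrative names and values, and the key is taken to be the longest common prefix of the two header lines.

```python
def common_prefix(a, b):
    """Return the longest common leading substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def split_sections(lines, min_len=5):
    """Return (keys, sections); lines before the first header pair are dropped."""
    keys, sections = [], []
    current = None  # stays None until the first header pair is seen
    i = 0
    while i < len(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        key = common_prefix(lines[i], nxt)
        if len(key) >= min_len:
            # Both header lines open the new section together.
            current = [lines[i], nxt]
            keys.append(key)
            sections.append(current)
            i += 2
        else:
            if current is not None:
                current.append(lines[i])
            i += 1
    return keys, sections


text = "\n".join([
    "a line we do not need",
    "I am section axax",
    "I am section bbb",
    "(and here goes many other text)...",
    "I am using python",
    "I am using perl",
    ".(and here goes many other text)...",
])
keys, sections = split_sections(text.split("\n"))
print(keys)
```

Note that with a longest-common-prefix rule the second key comes out as 'I am using p' rather than 'I am using ', because "python" and "perl" share their first letter. Whether that is acceptable depends on how the keys are really defined.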



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


how to split this kind of text into sections

2014-04-25 Thread oyster
I have a long text, which should be splitted into some sections, where
all sections have a pattern like following with different KEY. And the /n/r
can not be used to split

I don't know whether this can be done easily, for example by using RE module

[demo text starts]
a line we do not need
I am section axax
I am section bbb, we can find that the first 2 lines of this section all
startswith 'I am section'
.(and here goes many other text)...
let's continue to
 let's continue, yeah
 .(and here goes many other text)...
I am using python
I am using perl
 .(and here goes many other text)...
[demo text ends]

the above text should be splitted as a LIST with 3 items, and I also need
to know the KEY for LIST is ['I am section', 'let's continue', 'I am
using']:
lst=[
 '''I am section axax
I am section bbb, we can find that the first 2 lines of this section all
startswith 'I am section'
.(and here goes many other text)...''',

'''let's continue to
 let's continue, yeah
 .(and here goes many other text)...''',


'''I am using python
I am using perl
 .(and here goes many other text)...'''
]

I hope I have state myself clear.

Regards


Re: how to split this kind of text into sections

2014-04-25 Thread Roy Smith
In article <mailman.9492.1398431281.18130.python-l...@python.org>,
 oyster <lepto.pyt...@gmail.com> wrote:

> I have a long text, which should be splitted into some sections, where
> all sections have a pattern like following with different KEY. And the /n/r
> can not be used to split
>
> I don't know whether this can be done easily, for example by using RE module
>
> [demo text starts]
> a line we do not need
> I am section axax
> I am section bbb, we can find that the first 2 lines of this section all
> startswith 'I am section'
> .(and here goes many other text)...
> let's continue to
>  let's continue, yeah
>  .(and here goes many other text)...
> I am using python
> I am using perl
>  .(and here goes many other text)...
> [demo text ends]

This kind of looks like a standard INI file.  Check out 
https://docs.python.org/2/library/configparser.html, it may do what you 
need.


Re: how to split this kind of text into sections

2014-04-25 Thread Chris Angelico
On Fri, Apr 25, 2014 at 11:07 PM, oyster <lepto.pyt...@gmail.com> wrote:
> the above text should be splitted as a LIST with 3 items, and I also need to
> know the KEY for LIST is ['I am section', 'let's continue', 'I am using']:

It's not perfectly clear, but I think I have some idea of what you're
trying to do. Let me restate what I think you want, and you can tell
me if it's correct.

You have a file which consists of a number of lines. Some of those
lines begin with the string "I am section", others begin "let's
continue", and others begin "I am using". You want to collect those
three sets of lines; inside each collection, every line will have that
same prefix.

Is that correct? If so, we can certainly help you with that. If not,
please clarify. :)
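
Under that reading, where the prefixes are known in advance, the collection step might look roughly like the sketch below. The names `PREFIXES` and `collect_by_prefix` are made up for illustration, and lines that match none of the prefixes are simply dropped, which may or may not be what is wanted.

```python
# The three prefixes named in the question; assumed to be known up front.
PREFIXES = ("I am section", "let's continue", "I am using")


def collect_by_prefix(lines, prefixes=PREFIXES):
    """Map each known prefix to the lines that start with it."""
    groups = {p: [] for p in prefixes}
    for line in lines:
        for p in prefixes:
            if line.startswith(p):
                groups[p].append(line)
                break  # a line belongs to at most one group
    return groups


lines = [
    "a line we do not need",
    "I am section axax",
    "I am section bbb",
    "let's continue to",
    "let's continue, yeah",
    "I am using python",
    "I am using perl",
]
groups = collect_by_prefix(lines)
for prefix, members in groups.items():
    print(prefix, "->", members)
```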

ChrisA


Re: how to split this kind of text into sections

2014-04-25 Thread Tim Chase
On 2014-04-25 23:31, Chris Angelico wrote:
> On Fri, Apr 25, 2014 at 11:07 PM, oyster <lepto.pyt...@gmail.com>
> wrote:
>> the above text should be splitted as a LIST with 3 items, and I
>> also need to know the KEY for LIST is ['I am section', 'let's
>> continue', 'I am using']:
>
> It's not perfectly clear, but I think I have some idea of what
> you're trying to do. Let me restate what I think you want, and you
> can tell me if it's correct.
>
> You have a file which consists of a number of lines. Some of those
> lines begin with the string "I am section", others begin "let's
> continue", and others begin "I am using". You want to collect those
> three sets of lines; inside each collection, every line will have
> that same prefix.
>
> Is that correct? If so, we can certainly help you with that. If not,
> please clarify. :)

My reading of it (and it took me several tries) was that two
subsequent lines would begin with the same N words.  Something like
the following regexp:

  ^(\w.{8,}).*\n\1.*

as the delimiter (choosing 6 arbitrarily as an indication of a
minimum match length to).

A naive (and untested) bit of code might look something like

  from collections import defaultdict

  MIN_LEN = 6
  def overlap(s1, s2):
    chars = []
    for c1, c2 in zip(s1,s2):
      if c1 != c2: break
      chars.append(c1)
    return ''.join(chars)
  prevline = ""
  output_number = 1
  output = defaultdict(list)
  key = None
  with open("input.txt") as f:
    for line in f:
      if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
        key = overlap(prevline, line)
      output[key].append(line)
      prevline = line

There are some edge-cases such as when multiple sections are
delimited by the same overlap, but this should build up a defaultdict
keyed by the delimiters with the corresponding lines as the values.

-tkc
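
For comparison, the regexp quoted above can be tried directly as a delimiter. The sketch below uses the pattern exactly as written in the post (one word character plus at least eight more, so a minimum shared prefix of nine characters) together with `re.MULTILINE` so that `^` matches at each line start. The names `HEADER` and `split_sections` and the sample text are assumptions for illustration, and the backtracking cost on large inputs has not been measured.

```python
import re

# Pattern as given in the post; group 1 is the prefix shared by two
# consecutive lines.
HEADER = re.compile(r"^(\w.{8,}).*\n\1.*", re.MULTILINE)


def split_sections(text):
    """Return [(key, section_text), ...]; text before the first match is dropped."""
    matches = list(HEADER.finditer(text))
    starts = [m.start() for m in matches] + [len(text)]
    return [
        (m.group(1), text[begin:end])
        for m, begin, end in zip(matches, starts, starts[1:])
    ]


text = (
    "a line we do not need\n"
    "I am section axax\n"
    "I am section bbb\n"
    "body one\n"
    "\n"
    "let's continue to\n"
    "let's continue, yeah\n"
    "body two\n"
)
result = split_sections(text)
for key, section in result:
    print(repr(key))
    print(section)
```

Because the group is greedy, the engine tries the longest candidate prefix first and backs off one character at a time, so group 1 ends up being the longest common prefix of the two lines.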





Re: how to split this kind of text into sections

2014-04-25 Thread Jussi Piitulainen
oyster writes:

> I have a long text, which should be splitted into some sections, where
> all sections have a pattern like following with different KEY.

itertools.groupby, if you know how to extract a key from a given line.

> And the /n/r can not be used to split

Yet you seem to want to have each line as a unit? You could group
lines straight from some file object using itertools.groupby and then
''.join each group.

(It's \n and \r, and \r\n when they are both there, but you can just
let Python read the lines.)
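
As a rough illustration of the `itertools.groupby` idea: `groupby` collects runs of consecutive items whose key function returns the same value, so a run of two or more lines sharing a prefix marks a section header. The sketch assumes a fixed-length key (the first `PREFIX_LEN` characters), which is a simplification since the real keys vary in length. `PREFIX_LEN` and `header_runs` are hypothetical names.

```python
from itertools import groupby

PREFIX_LEN = 6  # assumed fixed key length, for illustration only


def header_runs(lines, prefix_len=PREFIX_LEN):
    """Yield (prefix, run) for each run of 2+ consecutive lines sharing a prefix."""
    for prefix, run in groupby(lines, key=lambda line: line[:prefix_len]):
        run = list(run)
        # Skip short lines (including blank ones), whose "prefix" is too short.
        if len(run) >= 2 and len(prefix) == prefix_len:
            yield prefix, run


lines = [
    "a line we do not need",
    "I am section axax",
    "I am section bbb",
    "body",
    "",
    "",
    "let's continue to",
    "let's continue, yeah",
    "body",
    "I am using python",
    "I am using perl",
    "body",
]
headers = list(header_runs(lines))
for prefix, run in headers:
    print(repr(prefix), run)
```

The same grouping works on lines read straight from a file object, and `''.join(run)` then rebuilds the text of each group.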


Re: how to split this kind of text into sections

2014-04-25 Thread Steven D'Aprano
On Fri, 25 Apr 2014 21:07:53 +0800, oyster wrote:

> I have a long text, which should be splitted into some sections, where
> all sections have a pattern like following with different KEY. And the
> /n/r can not be used to split
>
> I don't know whether this can be done easily, for example by using RE
> module

[... snip example ...]

> I hope I have state myself clear.

Clear as mud.

I'm afraid I have no idea what you mean. Can you explain the decision 
that you make to decide whether a line is included, or excluded, or part 
of a section?



> [demo text starts]
> a line we do not need

How do we decide whether the line is ignored? Is it the literal text "a 
line we do not need"?

for line in lines:
    if line == "a line we do not need\n":
        # ignore this line
        continue


> I am section axax
> I am section bbb, we can find that the first 2 lines of this section all
> startswith 'I am section'


Again, is this the *literal* text that you expect?

> .(and here goes many other text)... let's continue to
>  let's continue, yeah
>  .(and here goes many other text)...
> I am using python
> I am using perl
>  .(and here goes many other text)...
> [demo text ends]
>
> the above text should be splitted as a LIST with 3 items, and I also
> need to know the KEY for LIST is ['I am section', 'let's continue', 'I
> am using']:

How do you decide that they are the keys?


> lst=[
>  '''I am section axax
> I am section bbb, we can find that the first 2 lines of this section all
> startswith 'I am section'
> .(and here goes many other text)...''',
>
> '''let's continue to
>  let's continue, yeah
>  .(and here goes many other text)...''',
>
>
> '''I am using python
> I am using perl
>  .(and here goes many other text)...'''
> ]

Perhaps it would be better if you show a more realistic example.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: how to split this kind of text into sections

2014-04-25 Thread Steven D'Aprano
On Fri, 25 Apr 2014 09:18:22 -0400, Roy Smith wrote:

> In article <mailman.9492.1398431281.18130.python-l...@python.org>,
>  oyster <lepto.pyt...@gmail.com> wrote:
>
>> [demo text starts]
>> a line we do not need
>> I am section axax
>> I am section bbb, we can find that the first 2 lines of this section
>> all startswith 'I am section'
>> .(and here goes many other text)... let's continue to
>>  let's continue, yeah
>>  .(and here goes many other text)...
>> I am using python
>> I am using perl
>>  .(and here goes many other text)...
>> [demo text ends]
>
> This kind of looks like a standard INI file.

I don't think so. INI files are a collection of KEY=VALUE or KEY:VALUE 
pairs, and the example above shows nothing like that. The only thing 
which is even vaguely ini-like is the header [demo text starts], and my 
reading of that is that it is *not* part of the file, but just an 
indication that the OP is giving a demo.




-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: how to split this kind of text into sections

2014-04-25 Thread Terry Reedy

On 4/25/2014 9:07 AM, oyster wrote:

> I have a long text, which should be splitted into some sections, where
> all sections have a pattern like following with different KEY.


Computers are worse at reading your mind than humans. If you can write 
rules that another person could follow, THEN we could help you translate 
the rules to Python.


If you have 1 moderate length file or a few short files, I would edit 
them by hand to remove ignore lines and put a blank line between 
sections. A program to do the rest would then be easy.
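
To give an idea of how small that remaining program could be, here is a sketch that assumes the file has already been edited by hand so that sections are separated by exactly one blank line, contain no blank lines themselves, and the unwanted lines are gone. `parse_sections` is a hypothetical name. `os.path.commonprefix` compares strings character by character, so it serves here to recover the key from the first two lines of each section.

```python
import os.path


def parse_sections(text):
    """Split hand-cleaned text on blank lines; return [(key, lines), ...]."""
    result = []
    for block in text.strip().split("\n\n"):
        lines = block.split("\n")
        # The key is whatever the first two lines of the section share.
        key = os.path.commonprefix(lines[:2])
        result.append((key, lines))
    return result


cleaned = (
    "I am section axax\n"
    "I am section bbb\n"
    "(and here goes many other text)...\n"
    "\n"
    "let's continue to\n"
    "let's continue, yeah\n"
    ".(and here goes many other text)...\n"
)
parsed = parse_sections(cleaned)
for key, lines in parsed:
    print(repr(key), len(lines))
```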



> the above text should be splitted as a LIST with 3 items, and I also
> need to know the KEY for LIST is ['I am section', 'let's continue', 'I
> am using']:


This suggests that the rule for keys is 'first 3 words of a line, with 
contractions counted as 2 words'. Correct?


Another possible rule is 'a member of the following list: ...', as you 
gave above but presumably expanded.


--
Terry Jan Reedy
