Re: how to split this kind of text into sections
First of all, thank you all for your answers. I received python mail-list in a daily digest, so it is not easy for me to quote your mail separately.

I will try to explain my situation to my best, but English is not my native language, I don't know whether I can make it clear at last.

Every SECTION starts with 2 special lines; these 2 lines is special because they have some same characters (the length is not const for different section) at the beginning; these same characters is called the KEY for this section. For every 2 neighbor sections, they have different KEYs.

After these 2 special lines, some paragraph is followed. Paragraph does not have any KEYs.

So, a section = 2 special lines with KEYs at the beginning + some paragraph without KEYs

However there maybe some paragraph before the first section, which I do not need and want to drop it

I need a method to split the whole text into SECTIONs and to know all the KEYs

I have tried to solve this problem via re module, but failed. Maybe I can make you understand me clearly by showing the regular expression object

    reobj = re.compile(r"(?P<bookname>[^\r\n]*?)[^\r\n]*?\r\n(?P=bookname)[^\r\n]*?\r\n.*?", re.DOTALL)

which can get the first 2 lines of a section, but fail to get the rest of a section which does not have any KEYs at the begin. The hard part for me is to express "paragraph does not have KEYs".

Even I can get the first 2 line, I think regular expression is expensive for my text.

That is all. I hope get some more suggestions. Thanks.

[demo text starts]
a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...
let's continue to
let's continue, yeah
.(and here goes many other text)...
I am using python
I am using perl
.(and here goes many other text)...
Programming is hard
Programming is easy
How do you thing?
I do’t know
[demo text ends]

the above text should be splited to a LIST with 4 items, and I also need to know the KEY for LIST is ['I am section ', 'let's continue', 'I am using ', ' Programming is ']:

    lst=[
    '''a line we do not need
    I am section axax
    I am section bbb
    (and here goes many other text)...
    ''',
    '''let's continue to
    let's continue, yeah
    .(and here goes many other text)...
    ''',
    '''I am using python
    I am using perl
    .(and here goes many other text)...
    ''',
    '''Programming is hard
    Programming is easy
    How do you thing?
    I do’t know'''
    ]

--
https://mail.python.org/mailman/listinfo/python-list
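The rules stated above (two consecutive lines sharing a prefix open a section, later lines belong to it, text before the first section is dropped) can be sketched without the re module. This is only a minimal Python 3 sketch, not the poster's method: the function name split_sections and the MIN_LEN threshold of 6 are assumptions, and os.path.commonprefix is used purely as a character-by-character prefix helper.

```python
from os.path import commonprefix

MIN_LEN = 6  # assumed minimum number of shared leading characters


def split_sections(text, min_len=MIN_LEN):
    """Return (keys, sections); text before the first section is dropped."""
    lines = text.splitlines(keepends=True)
    keys, sections = [], []
    i = 0
    while i < len(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        key = commonprefix([lines[i], nxt])
        if len(key) >= min_len:
            # Two consecutive lines share a long enough prefix: new section.
            keys.append(key)
            sections.append([lines[i], nxt])
            i += 2
        else:
            if sections:  # ordinary paragraph line inside a section
                sections[-1].append(lines[i])
            i += 1
    return keys, ["".join(s) for s in sections]


demo = (
    "a line we do not need\n"
    "I am section axax\n"
    "I am section bbb\n"
    "(and here goes many other text)...\n"
    "let's continue to\n"
    "let's continue, yeah\n"
    ".(and here goes many other text)...\n"
    "I am using python\n"
    "I am using perl\n"
    ".(and here goes many other text)...\n"
    "Programming is hard\n"
    "Programming is easy\n"
    "How do you thing?\n"
    "I do’t know\n"
)

keys, sections = split_sections(demo)
```

On the demo text this gives four sections, with keys 'I am section ', "let's continue", 'I am using p' and 'Programming is '. The third key comes out as 'I am using p' because "python" and "perl" share a letter, so getting exactly 'I am using ' would need an extra rule such as trimming the key back to a word boundary.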
Re: how to split this kind of text into sections
On 2014-04-26 23:53, oyster wrote:
> I will try to explain my situation to my best, but English is not my native language, I don't know whether I can make it clear at last.

Your follow-up reply made much more sense and your written English is far better than many native speakers'. :-)

> Every SECTION starts with 2 special lines; these 2 lines is special because they have some same characters (the length is not const for different section) at the beginning; these same characters is called the KEY for this section. For every 2 neighbor sections, they have different KEYs.

I suspect you have a minimum number of characters (or words) to consider, otherwise a single character duplicated at the beginning of the line would delimit a section, such as

    abcd
    afgh

because they share the commonality of an "a".

The code I provided earlier should give you what you describe. I've tweaked and tested, and provided it below. Note that I require a minimum overlap of 6 characters (MIN_LEN). It also gathers the initial stuff (that you want to discard) under the empty key, so you can either delete that, or ignore it.

> I need a method to split the whole text into SECTIONs and to know all the KEYs
>
> I have tried to solve this problem via re module

I don't think the re module will be as much help here.

-tkc

    from collections import defaultdict
    import itertools as it
    MIN_LEN = 6
    def overlap(s1, s2):
        "Given 2 strings, return the initial overlap between them"
        return ''.join(
            c1
            for c1, c2
            in it.takewhile(
                lambda pair: pair[0] == pair[1],
                it.izip(s1, s2)
                )
            )
    prevline = "" # the initial key under which preamble gets stored
    output = defaultdict(list)
    key = None
    with open("data.txt") as f:
        for line in f:
            if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
                key = overlap(prevline, line)
            output[key].append(line)
            prevline = line
    for k,v in output.items():
        print str(k).center(60,'=')
        print ''.join(v)
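To see why a minimum overlap matters, here is a small Python 3 sketch of the same prefix test applied to the "abcd" / "afgh" pair and to two lines from the demo text. The helper is_section_start is a hypothetical name introduced only for this illustration; overlap follows the approach in the code above, with the built-in zip in place of itertools.izip.

```python
from itertools import takewhile

MIN_LEN = 6  # same arbitrary threshold as in the message above


def overlap(s1, s2):
    """Return the leading characters that s1 and s2 have in common."""
    pairs = takewhile(lambda pair: pair[0] == pair[1], zip(s1, s2))
    return "".join(c1 for c1, _ in pairs)


def is_section_start(prevline, line, min_len=MIN_LEN):
    """Hypothetical helper: do two consecutive lines open a section?"""
    return len(overlap(prevline, line)) >= min_len


short = overlap("abcd", "afgh")                            # only "a" is shared
long_ = overlap("I am section axax", "I am section bbb")   # "I am section "
```

With the threshold in place, the one-character overlap of "abcd" and "afgh" is rejected, while the 13-character overlap "I am section " is accepted as a key.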
Re: how to split this kind of text into sections
On Sat, 26 Apr 2014 23:53:14 +0800, oyster wrote:

> Every SECTION starts with 2 special lines; these 2 lines is special because they have some same characters (the length is not const for different section) at the beginning; these same characters is called the KEY for this section. For every 2 neighbor sections, they have different KEYs.
>
> After these 2 special lines, some paragraph is followed. Paragraph does not have any KEYs.
>
> So, a section = 2 special lines with KEYs at the beginning + some paragraph without KEYs
>
> However there maybe some paragraph before the first section, which I do not need and want to drop it
>
> I need a method to split the whole text into SECTIONs and to know all the KEYs

Let me try to describe how I would solve this, in English.

I would look at each pair of lines (1st + 2nd, 2nd + 3rd, 3rd + 4th, etc.) looking for a pair of lines with matching prefixes. E.g.:

    This line matches the next
    This line matches the previous

do match, because they both start with "This line matches the ".

Question: how many characters in common counts as a match?

    This line matches the next
    That previous line matches this line

have a common prefix of "Th", two characters. Is that a match?

So let me start with a function to extract the matching prefix, if there is one. It returns '' if there is no match, and the prefix (the KEY) if there is one:

    def extract_key(line1, line2):
        """Return the key from two matching lines, or '' if not matching."""
        # Assume they need five characters in common.
        if line1[:5] == line2[:5]:
            return line1[:5]
        return ''

I'm pretty much guessing that this is how you decide there's a match. I don't know if five characters is too many or too few, or if you need a more complicated test. It seems that you want to match as many characters as possible. I'll leave you to adjust this function to work exactly as needed.

Now we iterate over the text in pairs of lines. We need somewhere to hold the lines in each section, so I'm going to use a dict of lists of lines. As a bonus, I'm going to collect the ignored lines using a key of None. However, I do assume that all keys are unique. It should be easy enough to adjust the following to handle non-unique keys. (Use a list of lists, rather than a dict, and save the keys in a separate list.)

Lastly, the way it handles lines at the beginning of a section is not exactly the way you want it. This puts the *first* line of the section as the *last* line of the previous section. I will leave you to sort out that problem.

    from collections import OrderedDict
    section = []
    sections = OrderedDict()
    sections[None] = section
    lines = iter(text.split('\n'))
    prev_line = ''
    for line in lines:
        key = extract_key(prev_line, line)
        if key == '':
            # No match, so we're still in the same section as before.
            section.append(line)
        else:
            # Match, so we start a new section.
            section = [line]
            sections[key] = section
        prev_line = line

--
Steven D'Aprano
http://import-that.dreamwidth.org/
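The two adjustments left open above (non-unique keys, and the first line of a section landing in the previous one) could be handled roughly as follows. This is one possible Python 3 sketch under stated assumptions, not the author's code: split_sections is a hypothetical name, extract_key gains a length check so that blank or very short lines never match, and a repeat of the current key is treated as part of the same section because neighbouring sections are said to have different keys.

```python
def extract_key(line1, line2, min_len=5):
    """Five-character rule as above; '' means no match (assumed length check)."""
    if len(line2) >= min_len and line1[:min_len] == line2[:min_len]:
        return line1[:min_len]
    return ''


def split_sections(text):
    """Return (keys, sections) as two parallel lists; keys may repeat."""
    keys = [None]      # None labels the ignored lines before the first section
    sections = [[]]    # one list of lines per entry in keys
    prev_line = ''
    for line in text.split('\n'):
        key = extract_key(prev_line, line)
        if key and key != keys[-1]:
            # prev_line was filed under the previous section; move it over.
            sections[-1].pop()
            keys.append(key)
            sections.append([prev_line, line])
        else:
            sections[-1].append(line)
        prev_line = line
    return keys, sections


text = ("junk\n"
        "I am section axax\nI am section bbb\nbody\n"
        "let's continue to\nlet's continue, yeah\nmore")
keys, sections = split_sections(text)
```

Because the keys are kept in a list parallel to the sections, the same key can appear more than once without one section overwriting another, and sections[0] holds whatever preceded the first section.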
how to split this kind of text into sections
I have a long text, which should be splitted into some sections, where all sections have a pattern like following with different KEY. And the /n/r can not be used to split

I don't know whether this can be done easily, for example by using RE module

[demo text starts]
a line we do not need
I am section axax
I am section bbb, we can find that the first 2 lines of this section all startswith 'I am section'
.(and here goes many other text)...
let's continue to
let's continue, yeah
.(and here goes many other text)...
I am using python
I am using perl
.(and here goes many other text)...
[demo text ends]

the above text should be splitted as a LIST with 3 items, and I also need to know the KEY for LIST is ['I am section', 'let's continue', 'I am using']:

    lst=[
    '''I am section axax
    I am section bbb, we can find that the first 2 lines of this section all startswith 'I am section'
    .(and here goes many other text)...''',
    '''let's continue to
    let's continue, yeah
    .(and here goes many other text)...''',
    '''I am using python
    I am using perl
    .(and here goes many other text)...'''
    ]

I hope I have state myself clear.

Regards
Re: how to split this kind of text into sections
In article <mailman.9492.1398431281.18130.python-l...@python.org>, oyster <lepto.pyt...@gmail.com> wrote:

> I have a long text, which should be splitted into some sections, where all sections have a pattern like following with different KEY. And the /n/r can not be used to split
>
> I don't know whether this can be done easily, for example by using RE module
>
> [demo text starts]
> a line we do not need
> I am section axax
> I am section bbb, we can find that the first 2 lines of this section all startswith 'I am section'
> .(and here goes many other text)...
> let's continue to
> let's continue, yeah
> .(and here goes many other text)...
> I am using python
> I am using perl
> .(and here goes many other text)...
> [demo text ends]

This kind of looks like a standard INI file. Check out https://docs.python.org/2/library/configparser.html, it may do what you need.
Re: how to split this kind of text into sections
On Fri, Apr 25, 2014 at 11:07 PM, oyster <lepto.pyt...@gmail.com> wrote:
> the above text should be splitted as a LIST with 3 items, and I also need to know the KEY for LIST is ['I am section', 'let's continue', 'I am using']:

It's not perfectly clear, but I think I have some idea of what you're trying to do. Let me restate what I think you want, and you can tell me if it's correct.

You have a file which consists of a number of lines. Some of those lines begin with the string "I am section", others begin "let's continue", and others begin "I am using". You want to collect those three sets of lines; inside each collection, every line will have that same prefix.

Is that correct? If so, we can certainly help you with that. If not, please clarify. :)

ChrisA
Re: how to split this kind of text into sections
On 2014-04-25 23:31, Chris Angelico wrote:
> On Fri, Apr 25, 2014 at 11:07 PM, oyster <lepto.pyt...@gmail.com> wrote:
>> the above text should be splitted as a LIST with 3 items, and I also need to know the KEY for LIST is ['I am section', 'let's continue', 'I am using']:
>
> It's not perfectly clear, but I think I have some idea of what you're trying to do. Let me restate what I think you want, and you can tell me if it's correct.
>
> You have a file which consists of a number of lines. Some of those lines begin with the string "I am section", others begin "let's continue", and others begin "I am using". You want to collect those three sets of lines; inside each collection, every line will have that same prefix.
>
> Is that correct? If so, we can certainly help you with that. If not, please clarify. :)

My reading of it (and it took me several tries) was that two subsequent lines would begin with the same N words. Something like the following regexp:

    ^(\w.{8,}).*\n\1.*

as the delimiter (choosing 6 arbitrarily as an indication of a minimum match length). A naive (and untested) bit of code might look something like

    from collections import defaultdict
    MIN_LEN = 6
    def overlap(s1, s2):
        chars = []
        for c1, c2 in zip(s1,s2):
            if c1 != c2: break
            chars.append(c1)
        return ''.join(chars)
    prevline = ""
    output_number = 1
    output = defaultdict(list)
    key = None
    with open("input.txt") as f:
        for line in f:
            if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
                key = overlap(prevline, line)
            output[key].append(line)
            prevline = line

There are some edge-cases such as when multiple sections are delimited by the same overlap, but this should build up a defaultdict keyed by the delimiters with the corresponding lines as the values.

-tkc
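For comparison, the regexp above can be tried directly as a delimiter. This is a rough Python 3 sketch, assuming the pattern is compiled with re.MULTILINE so that ^ matches at every line start; split_by_regex and sample are hypothetical names used only here. Greedy matching plus backtracking makes group 1 the longest prefix the two lines share, provided it is at least 9 characters long.

```python
import re

# The pattern as posted: a word character plus at least 8 more characters,
# repeated at the start of the following line.
DELIM = re.compile(r"^(\w.{8,}).*\n\1.*", re.MULTILINE)


def split_by_regex(text):
    """Return (keys, sections), cutting the text at each delimiter match."""
    matches = list(DELIM.finditer(text))
    keys = [m.group(1) for m in matches]
    starts = [m.start() for m in matches] + [len(text)]
    sections = [text[a:b] for a, b in zip(starts, starts[1:])]
    return keys, sections


sample = (
    "a line we do not need\n"
    "I am section axax\n"
    "I am section bbb\n"
    ".(and here goes many other text)...\n"
    "let's continue to\n"
    "let's continue, yeah\n"
    ".(and here goes many other text)...\n"
)
keys, sections = split_by_regex(sample)
```

On this sample the keys are 'I am section ' and "let's continue", and the leading line is dropped because it precedes the first match. Backtracking cost grows with line length, so the line-by-line loop is probably the safer choice for large inputs.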
Re: how to split this kind of text into sections
oyster writes:

> I have a long text, which should be splitted into some sections, where all sections have a pattern like following with different KEY.

itertools.groupby, if you know how to extract a key from a given line.

> And the /n/r can not be used to split

Yet you seem to want to have each line as a unit? You could group lines straight from some file object using itertools.groupby and then ''.join each group.

(It's \n and \r, and \r\n when they are both there, but you can just let Python read the lines.)
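A sketch of the groupby idea in Python 3, under a strong assumption: the possible keys are known in advance (KNOWN_KEYS below is made up from the demo text). Since a paragraph line carries no key of its own, the key function here is stateful and lets such a line inherit the most recent key; SectionKey is a hypothetical helper, not part of itertools.

```python
from itertools import groupby

# Assumption: the keys are known beforehand; taken from the demo text.
KNOWN_KEYS = ("I am section", "let's continue", "I am using")


class SectionKey:
    """Stateful key function: a line without a known prefix keeps the current key."""

    def __init__(self, prefixes):
        self.prefixes = prefixes
        self.current = None  # None until the first section starts

    def __call__(self, line):
        for prefix in self.prefixes:
            if line.startswith(prefix):
                self.current = prefix
                break
        return self.current


lines = [
    "a line we do not need\n",
    "I am section axax\n",
    "I am section bbb\n",
    ".(and here goes many other text)...\n",
    "let's continue to\n",
    "let's continue, yeah\n",
    ".(and here goes many other text)...\n",
]
# Each group is joined before groupby advances to the next one.
grouped = [(key, "".join(group)) for key, group in groupby(lines, SectionKey(KNOWN_KEYS))]
```

The first group has the key None and holds the line to be discarded; the remaining groups are the sections. A file object could be passed in place of the list, since iterating over it yields lines in the same way.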
Re: how to split this kind of text into sections
On Fri, 25 Apr 2014 21:07:53 +0800, oyster wrote:

> I have a long text, which should be splitted into some sections, where all sections have a pattern like following with different KEY. And the /n/r can not be used to split
>
> I don't know whether this can be done easily, for example by using RE module

[... snip example ...]

> I hope I have state myself clear.

Clear as mud. I'm afraid I have no idea what you mean. Can you explain the decision that you make to decide whether a line is included, or excluded, or part of a section?

> [demo text starts]
> a line we do not need

How do we decide whether the line is ignored? Is it the literal text "a line we do not need"?

    for line in lines:
        if line == "a line we do not need\n":
            # ignore this line
            continue

> I am section axax
> I am section bbb, we can find that the first 2 lines of this section all startswith 'I am section'

Again, is this the *literal* text that you expect?

> .(and here goes many other text)...
> let's continue to
> let's continue, yeah
> .(and here goes many other text)...
> I am using python
> I am using perl
> .(and here goes many other text)...
> [demo text ends]
>
> the above text should be splitted as a LIST with 3 items, and I also need to know the KEY for LIST is ['I am section', 'let's continue', 'I am using']:

How do you decide that they are the keys?

> lst=[
> '''I am section axax
> I am section bbb, we can find that the first 2 lines of this section all startswith 'I am section'
> .(and here goes many other text)...''',
> '''let's continue to
> let's continue, yeah
> .(and here goes many other text)...''',
> '''I am using python
> I am using perl
> .(and here goes many other text)...'''
> ]

Perhaps it would be better if you show a more realistic example.

--
Steven D'Aprano
http://import-that.dreamwidth.org/
Re: how to split this kind of text into sections
On Fri, 25 Apr 2014 09:18:22 -0400, Roy Smith wrote:

> In article <mailman.9492.1398431281.18130.python-l...@python.org>, oyster <lepto.pyt...@gmail.com> wrote:
>
>> [demo text starts]
>> a line we do not need
>> I am section axax
>> I am section bbb, we can find that the first 2 lines of this section all startswith 'I am section'
>> .(and here goes many other text)...
>> let's continue to
>> let's continue, yeah
>> .(and here goes many other text)...
>> I am using python
>> I am using perl
>> .(and here goes many other text)...
>> [demo text ends]
>
> This kind of looks like a standard INI file.

I don't think so. INI files are a collection of KEY=VALUE or KEY:VALUE pairs, and the example above shows nothing like that. The only thing which is even vaguely ini-like is the header [demo text starts], and my reading of that is that it is *not* part of the file, but just an indication that the OP is giving a demo.

--
Steven D'Aprano
http://import-that.dreamwidth.org/
Re: how to split this kind of text into sections
On 4/25/2014 9:07 AM, oyster wrote:
> I have a long text, which should be splitted into some sections, where all sections have a pattern like following with different KEY.

Computers are worse at reading your mind than humans. If you can write rules that another person could follow, THEN we could help you translate the rules to Python.

If you have 1 moderate length file or a few short files, I would edit them by hand to remove ignore lines and put a blank line between sections. A program to do the rest would then be easy.

> the above text should be splitted as a LIST with 3 items, and I also need to know the KEY for LIST is ['I am section', 'let's continue', 'I am using']:

This suggests that the rule for keys is 'first 3 words of a line, with contractions counted as 2 words'. Correct?

Another possible rule is 'a member of the following list: ...', as you gave above but presumably expanded.

--
Terry Jan Reedy