Re: efficiently splitting up strings based on substrings
On Sep 5, 5:29 pm, per perfr...@gmail.com wrote:
> On Sep 5, 7:07 pm, Rhodri James rho...@wildebst.demon.co.uk wrote:
> > On Sat, 05 Sep 2009 23:54:08 +0100, per perfr...@gmail.com wrote:
> > > On Sep 5, 6:42 pm, Rhodri James rho...@wildebst.demon.co.uk wrote:
> > > > On Sat, 05 Sep 2009 22:54:41 +0100, per perfr...@gmail.com wrote:
> > > > > I'm trying to efficiently split strings based on what substrings they are made up of. I have a set of strings that are composed of known substrings. For example, a, b, and c are substrings that are not identical to each other, e.g.:
> > > > >
> > > > >     a = "0" * 5
> > > > >     b = "1" * 5
> > > > >     c = "2" * 5
> > > > >
> > > > > Then my_string might be:
> > > > >
> > > > >     my_string = a + b + c
> > > > >
> > > > > I am looking for an efficient way to solve the following problem. Suppose I have a short string x that is a substring of my_string. I want to split the string x into blocks based on what substrings (i.e. a, b, or c) chunks of x fall into.
> > > > >
> > > > > To illustrate this, suppose x = "00111". Then I can detect where x starts in my_string using my_string.find(x). But I don't know how to partition x into blocks depending on the substrings. What I want to get out in this case is: "00", "111". If x were "00122", I'd want to get out "00", "1", "22".
> > > > >
> > > > > Is there an easy way to do this? I can't simply split x on a, b, or c because these might not be contained in x. I want to avoid doing something inefficient like looking at all substrings of my_string etc.
> > > > >
> > > > > I wouldn't mind using regular expressions for this but I cannot think of an easy regular expression for this problem. I looked at the string module in the library but did not see anything that seemed related, but I might have missed it.
> > > >
> > > > I'm not sure I understand your question exactly. You seem to imply that the order of the substrings of x is consistent. If that's the case, this ought to help:
> > > >
> > > >     import re
> > > >     x = "00122"
> > > >     m = re.match(r"(0*)(1*)(2*)", x)
> > > >     m.groups()  # ('00', '1', '22')
> > > >     y = "00111"
> > > >     m = re.match(r"(0*)(1*)(2*)", y)
> > > >     m.groups()  # ('00', '111', '')
> > > >
> > > > You'll have to filter out the empty groups for yourself, but that's no great problem.
> > >
> > > The order of the substrings is consistent but what if it's not 0, 1, 2 but a more complicated string? e.g. a = "1030405", b = "1babcf", c = "fUUIUP". Then the substring x might be "4051ba", in which case using a regexp with (1*) will not work since both a and b substrings begin with the character 1.
> >
> > Right. This looks approximately nothing like what I thought your problem was. Would I be right in thinking that you want to match substrings of your potential substrings against the string x? I'm sufficiently confused that I think I'd like to see what your use case actually is before I make more of a fool of myself.
> >
> > --
> > Rhodri James *-* Wildebeest Herder to the Masses
>
> It's exactly the same problem, except there are no constraints on the strings. So the problem is, like you say, matching the substrings against the string x. In other words, finding out where x aligns to the ordered substrings abc, and then determining what chunk of x belongs to a, what chunk belongs to b, and what chunk belongs to c.
>
> So in the example I gave above, the substrings are: a = "1030405", b = "1babcf", c = "fUUIUP", so abc = "10304051babcffUUIUP".
>
> Given a substring like "4051ba", I'd want to split it into the chunks a, b, and c. In this case, I'd want the result to be: ["405", "1ba"] -- i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk that belongs to b. In this case, there are no chunks of c. If x instead were "4051babcffUU", the right output is: ["405", "1babcf", "fUU"], which are the corresponding chunks of a, b, and c that make up x respectively.
>
> I'm not sure how to approach this. Any ideas/tips would be greatly appreciated. Thanks again.

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

# where data begins inside abc
data_start = abc.find(data)
# where b and c begin, measured from the start of data
b_start = abc.find(b) - data_start
c_start = abc.find(c) - data_start

print(data[:b_start])
print(data[b_start:c_start])
print(data[c_start:])

--output:--
405
1babcf
fU

--
http://mail.python.org/mailman/listinfo/python-list
Re: efficiently splitting up strings based on substrings
On Sep 6, 1:14 am, 7stud bbxx789_0...@yahoo.com wrote:
> a = "1030405"
> b = "1babcf"
> c = "fUUIUP"
> abc = "10304051babcffUUIUP"
> data = "4051babcffU"
>
> data_start = abc.find(data)
> b_start = abc.find(b) - data_start
> c_start = abc.find(c) - data_start
>
> print(data[:b_start])
> print(data[b_start:c_start])
> print(data[c_start:])
>
> --output:--
> 405
> 1babcf
> fU

...or maybe this is easier to follow:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

# drop everything in abc that comes before data
data_start = abc.find(data)
new_abc = abc[data_start:]

print(new_abc)
print(data)
print("-" * 10)

--output:--
4051babcffUUIUP
4051babcffU
----------

# positions in new_abc line up with positions in data
b_start = new_abc.find(b)
c_start = new_abc.find(c)

print(data[:b_start])
print(data[b_start:c_start])
print(data[c_start:])

--output:--
405
1babcf
fU
Re: efficiently splitting up strings based on substrings
On Sep 6, 1:23 am, 7stud bbxx789_0...@yahoo.com wrote:
> On Sep 6, 1:14 am, 7stud bbxx789_0...@yahoo.com wrote:
> > a = "1030405"
> > b = "1babcf"
> > c = "fUUIUP"
> > abc = "10304051babcffUUIUP"
> > data = "4051babcffU"
> >
> > data_start = abc.find(data)
> > b_start = abc.find(b) - data_start
> > c_start = abc.find(c) - data_start
> >
> > print(data[:b_start])
> > print(data[b_start:c_start])
> > print(data[c_start:])
> >
> > --output:--
> > 405
> > 1babcf
> > fU
>
> ...or maybe this is easier to follow:
>
> a = "1030405"
> b = "1babcf"
> c = "fUUIUP"
> abc = "10304051babcffUUIUP"
> data = "4051babcffU"
>
> data_start = abc.find(data)
> new_abc = abc[data_start:]
>
> print(new_abc)
> print(data)
> print("-" * 10)
>
> --output:--
> 4051babcffUUIUP
> 4051babcffU
> ----------
>
> b_start = new_abc.find(b)
> c_start = new_abc.find(c)
>
> print(data[:b_start])
> print(data[b_start:c_start])
> print(data[c_start:])
>
> --output:--
> 405
> 1babcf
> fU

Nope. My solutions have problems with:

data = "cffU"

To handle that
Re: efficiently splitting up strings based on substrings
On Sat, 05 Sep 2009 22:54:41 +0100, per perfr...@gmail.com wrote:
> I'm trying to efficiently split strings based on what substrings they are made up of. I have a set of strings that are composed of known substrings. For example, a, b, and c are substrings that are not identical to each other, e.g.:
>
>     a = "0" * 5
>     b = "1" * 5
>     c = "2" * 5
>
> Then my_string might be:
>
>     my_string = a + b + c
>
> I am looking for an efficient way to solve the following problem. Suppose I have a short string x that is a substring of my_string. I want to split the string x into blocks based on what substrings (i.e. a, b, or c) chunks of x fall into.
>
> To illustrate this, suppose x = "00111". Then I can detect where x starts in my_string using my_string.find(x). But I don't know how to partition x into blocks depending on the substrings. What I want to get out in this case is: "00", "111". If x were "00122", I'd want to get out "00", "1", "22".
>
> Is there an easy way to do this? I can't simply split x on a, b, or c because these might not be contained in x. I want to avoid doing something inefficient like looking at all substrings of my_string etc.
>
> I wouldn't mind using regular expressions for this but I cannot think of an easy regular expression for this problem. I looked at the string module in the library but did not see anything that seemed related, but I might have missed it.

I'm not sure I understand your question exactly. You seem to imply that the order of the substrings of x is consistent. If that's the case, this ought to help:

    import re
    x = "00122"
    m = re.match(r"(0*)(1*)(2*)", x)
    m.groups()  # ('00', '1', '22')
    y = "00111"
    m = re.match(r"(0*)(1*)(2*)", y)
    m.groups()  # ('00', '111', '')

You'll have to filter out the empty groups for yourself, but that's no great problem.

--
Rhodri James *-* Wildebeest Herder to the Masses
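For the "filter out the empty groups" step mentioned above, something along these lines would probably do. This is only a minimal sketch: the helper name nonempty_groups is made up for illustration, and it only makes sense while the pieces are runs of single characters in a fixed order, as in the 0/1/2 example.

```python
import re


def nonempty_groups(pattern, text):
    """Return the non-empty capture groups of pattern matched at the start of text.

    Hypothetical helper, not part of the re module.
    """
    m = re.match(pattern, text)
    if m is None:
        return []
    # An empty string (or None, for a group that did not take part) is dropped.
    return [g for g in m.groups() if g]


print(nonempty_groups(r"(0*)(1*)(2*)", "00111"))
print(nonempty_groups(r"(0*)(1*)(2*)", "00122"))
```

The first call drops the trailing empty group and leaves the "00" and "111" pieces; the second keeps all three pieces because none of them is empty.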
Re: efficiently splitting up strings based on substrings
On Sep 5, 6:42 pm, Rhodri James rho...@wildebst.demon.co.uk wrote:
> On Sat, 05 Sep 2009 22:54:41 +0100, per perfr...@gmail.com wrote:
> > I'm trying to efficiently split strings based on what substrings they are made up of. I have a set of strings that are composed of known substrings. For example, a, b, and c are substrings that are not identical to each other, e.g.:
> >
> >     a = "0" * 5
> >     b = "1" * 5
> >     c = "2" * 5
> >
> > Then my_string might be:
> >
> >     my_string = a + b + c
> >
> > I am looking for an efficient way to solve the following problem. Suppose I have a short string x that is a substring of my_string. I want to split the string x into blocks based on what substrings (i.e. a, b, or c) chunks of x fall into.
> >
> > To illustrate this, suppose x = "00111". Then I can detect where x starts in my_string using my_string.find(x). But I don't know how to partition x into blocks depending on the substrings. What I want to get out in this case is: "00", "111". If x were "00122", I'd want to get out "00", "1", "22".
> >
> > Is there an easy way to do this? I can't simply split x on a, b, or c because these might not be contained in x. I want to avoid doing something inefficient like looking at all substrings of my_string etc.
> >
> > I wouldn't mind using regular expressions for this but I cannot think of an easy regular expression for this problem. I looked at the string module in the library but did not see anything that seemed related, but I might have missed it.
>
> I'm not sure I understand your question exactly. You seem to imply that the order of the substrings of x is consistent. If that's the case, this ought to help:
>
>     import re
>     x = "00122"
>     m = re.match(r"(0*)(1*)(2*)", x)
>     m.groups()  # ('00', '1', '22')
>     y = "00111"
>     m = re.match(r"(0*)(1*)(2*)", y)
>     m.groups()  # ('00', '111', '')
>
> You'll have to filter out the empty groups for yourself, but that's no great problem.
>
> --
> Rhodri James *-* Wildebeest Herder to the Masses

The order of the substrings is consistent but what if it's not 0, 1, 2 but a more complicated string? e.g. a = "1030405", b = "1babcf", c = "fUUIUP". Then the substring x might be "4051ba", in which case using a regexp with (1*) will not work since both a and b substrings begin with the character 1.

Your solution works if that weren't a possibility, so what you wrote is definitely the kind of solution I am looking for. I am just not sure how to solve it in the general case where the substrings might be similar to each other (but not similar enough that you can't tell where the substring came from).
Re: efficiently splitting up strings based on substrings
On Sat, 05 Sep 2009 23:54:08 +0100, per perfr...@gmail.com wrote:
> On Sep 5, 6:42 pm, Rhodri James rho...@wildebst.demon.co.uk wrote:
> > On Sat, 05 Sep 2009 22:54:41 +0100, per perfr...@gmail.com wrote:
> > > I'm trying to efficiently split strings based on what substrings they are made up of. I have a set of strings that are composed of known substrings. For example, a, b, and c are substrings that are not identical to each other, e.g.:
> > >
> > >     a = "0" * 5
> > >     b = "1" * 5
> > >     c = "2" * 5
> > >
> > > Then my_string might be:
> > >
> > >     my_string = a + b + c
> > >
> > > I am looking for an efficient way to solve the following problem. Suppose I have a short string x that is a substring of my_string. I want to split the string x into blocks based on what substrings (i.e. a, b, or c) chunks of x fall into.
> > >
> > > To illustrate this, suppose x = "00111". Then I can detect where x starts in my_string using my_string.find(x). But I don't know how to partition x into blocks depending on the substrings. What I want to get out in this case is: "00", "111". If x were "00122", I'd want to get out "00", "1", "22".
> > >
> > > Is there an easy way to do this? I can't simply split x on a, b, or c because these might not be contained in x. I want to avoid doing something inefficient like looking at all substrings of my_string etc.
> > >
> > > I wouldn't mind using regular expressions for this but I cannot think of an easy regular expression for this problem. I looked at the string module in the library but did not see anything that seemed related, but I might have missed it.
> >
> > I'm not sure I understand your question exactly. You seem to imply that the order of the substrings of x is consistent. If that's the case, this ought to help:
> >
> >     import re
> >     x = "00122"
> >     m = re.match(r"(0*)(1*)(2*)", x)
> >     m.groups()  # ('00', '1', '22')
> >     y = "00111"
> >     m = re.match(r"(0*)(1*)(2*)", y)
> >     m.groups()  # ('00', '111', '')
> >
> > You'll have to filter out the empty groups for yourself, but that's no great problem.
>
> The order of the substrings is consistent but what if it's not 0, 1, 2 but a more complicated string? e.g. a = "1030405", b = "1babcf", c = "fUUIUP". Then the substring x might be "4051ba", in which case using a regexp with (1*) will not work since both a and b substrings begin with the character 1.

Right. This looks approximately nothing like what I thought your problem was. Would I be right in thinking that you want to match substrings of your potential substrings against the string x? I'm sufficiently confused that I think I'd like to see what your use case actually is before I make more of a fool of myself.

--
Rhodri James *-* Wildebeest Herder to the Masses
Re: efficiently splitting up strings based on substrings
On Sep 5, 7:07 pm, Rhodri James rho...@wildebst.demon.co.uk wrote:
> On Sat, 05 Sep 2009 23:54:08 +0100, per perfr...@gmail.com wrote:
> > The order of the substrings is consistent but what if it's not 0, 1, 2 but a more complicated string? e.g. a = "1030405", b = "1babcf", c = "fUUIUP". Then the substring x might be "4051ba", in which case using a regexp with (1*) will not work since both a and b substrings begin with the character 1.
>
> Right. This looks approximately nothing like what I thought your problem was. Would I be right in thinking that you want to match substrings of your potential substrings against the string x? I'm sufficiently confused that I think I'd like to see what your use case actually is before I make more of a fool of myself.
>
> --
> Rhodri James *-* Wildebeest Herder to the Masses

It's exactly the same problem, except there are no constraints on the strings. So the problem is, like you say, matching the substrings against the string x. In other words, finding out where x aligns to the ordered substrings abc, and then determining what chunk of x belongs to a, what chunk belongs to b, and what chunk belongs to c.

So in the example I gave above, the substrings are: a = "1030405", b = "1babcf", c = "fUUIUP", so abc = "10304051babcffUUIUP".

Given a substring like "4051ba", I'd want to split it into the chunks a, b, and c. In this case, I'd want the result to be: ["405", "1ba"] -- i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk that belongs to b. In this case, there are no chunks of c. If x instead were "4051babcffUU", the right output is: ["405", "1babcf", "fUU"], which are the corresponding chunks of a, b, and c that make up x respectively.

I'm not sure how to approach this. Any ideas/tips would be greatly appreciated. Thanks again.
Re: efficiently splitting up strings based on substrings
On Sun, 06 Sep 2009 00:29:14 +0100, per perfr...@gmail.com wrote:
> It's exactly the same problem, except there are no constraints on the strings. So the problem is, like you say, matching the substrings against the string x. In other words, finding out where x aligns to the ordered substrings abc, and then determining what chunk of x belongs to a, what chunk belongs to b, and what chunk belongs to c.
>
> So in the example I gave above, the substrings are: a = "1030405", b = "1babcf", c = "fUUIUP", so abc = "10304051babcffUUIUP".
>
> Given a substring like "4051ba", I'd want to split it into the chunks a, b, and c. In this case, I'd want the result to be: ["405", "1ba"] -- i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk that belongs to b. In this case, there are no chunks of c. If x instead were "4051babcffUU", the right output is: ["405", "1babcf", "fUU"], which are the corresponding chunks of a, b, and c that make up x respectively.
>
> I'm not sure how to approach this. Any ideas/tips would be greatly appreciated. Thanks again.

I see, I think. Let me explain it back to you, just to be sure. You have a string x, and three component strings a, b and c. x is a substring of the concatenation of a, b and c (i.e. a+b+c). You want to find out how x overlaps a, b and c.

Assuming I've understood this right, you're overthinking the problem. All you need to do is find the start of x in a+b+c, then do some calculations based on the string lengths and slice appropriately. I'd scribble some example code, but it's nearly 1am and I'd be sure to commit fence-post errors at this time of night.

--
Rhodri James *-* Wildebeest Herder to the Masses
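The "find the start, then slice by lengths" idea described above could be sketched roughly as follows. This is one possible reading, not code from the thread: the function name split_by_components is made up for illustration, and it assumes x really does occur in a+b+c and that the first occurrence is the one wanted.

```python
def split_by_components(x, components):
    """Split x into the pieces that fall inside each component string.

    Hypothetical helper. Assumes x is a substring of "".join(components);
    if x occurs more than once, the first occurrence is used.
    """
    whole = "".join(components)
    start = whole.find(x)
    if start == -1:
        raise ValueError("x is not a substring of the joined components")
    end = start + len(x)  # one past the last index of x inside whole

    chunks = []
    offset = 0  # index in whole where the current component begins
    for comp in components:
        comp_end = offset + len(comp)
        # Overlap of the half-open ranges [start, end) and [offset, comp_end).
        lo = max(start, offset)
        hi = min(end, comp_end)
        if lo < hi:
            chunks.append(whole[lo:hi])
        offset = comp_end
    return chunks


parts = ["1030405", "1babcf", "fUUIUP"]
print(split_by_components("4051ba", parts))
print(split_by_components("4051babcffUU", parts))
print(split_by_components("cffU", parts))
```

Working with half-open ranges keeps the fence-post arithmetic in one place, and a component that x does not touch simply contributes nothing, so a string such as "cffU" that starts in the middle of b needs no special case.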