Re: Browsing text ; Python the right tool?
On 25 Jan 2005 09:40:35 -0800, [EMAIL PROTECTED] wrote:
> Here is an elementary suggestion. It would not be difficult to write a Python script to make a csv file from your text files, adding commas at the appropriate places to separate fields. Then the csv file can be browsed in Excel (or some other spreadsheet).

I'd create text files like someone else suggested, because I'm more comfortable with at least three text editors/viewers than with Excel. But the bottom line is that it's a waste of time to design a new GUI around a file format, when you can tweak the data enough to reuse something that exists, and /has/ all the features you will eventually want.

> A0 and C1 records could be written to separate csv files.

(Assuming that's OK, I wonder why they shared a file to begin with. Is the order between A0 and C1 records important?)

/Jorgen

--
  // Jorgen Grahn jgrahn@       Ph'nglui mglw'nafh Cthulhu
\X/                algonet.se   R'lyeh wgah'nagl fhtagn!
-- http://mail.python.org/mailman/listinfo/python-list
Re: Browsing text ; Python the right tool?
Sorry to reply this late guys - I cannot access news from work, and Google Groups cannot reply to a message, so I had to do it at home. Let me address a few of the remarks and questions you guys asked:

First of all, the example I gave was just that - an example. Yes, I know Python starts with 0, and I know that you cannot fit a 4-digit number in 2 positions; this was just to give the idea. To clarify, at THIS moment I need to browse text files of 1-80 Mb. At this moment, I have 16 different record definitions, numbered A, B, C1-C8, D-H. Each record definition has 20-60 different attributes. Not only that, but these formats change regularly, and I want to create or use something I can use on *other* applications or sites as well. As I said, I have encountered the type of problem I've described in numerous places already.

John wrote:
> I have a Python script that takes layout info and an input file and can produce an output file in one of two formats:

Yes John, I was thinking along these lines myself. The problem is that I have to parse several of these large files each day (debugging), and browsing converted output seems just too tedious and inefficient. I would REALLY like a GUI, and preferably something portable I can re-use later on.

> This should be pretty easy. If each record is CRLF terminated, then you can get one record at a time simply by iterating over the file (for line in open('myfile.dat'): ...).

Jeff, this was indeed the way I was thinking. But instead of iterating I need the ability to browse forward and backward.

> You can have a dictionary of classes or factory functions, one for each record type, keyed off of the 2-character identifier. Each class/factory would know the layout of that record type, and return a(n) instance/dictionary with fields separated out into attributes/items.

This is of course a clean approach, but would mean re-coding every time a record is changed - frequently! I really would like to edit only a data definition file.

> The trickiest part would be in displaying the data; you could potentially use COM to insert it into a Word or Excel document, or code your own GUI in Python. The former would be pretty easy if you're happy with fairly simple formatting; the latter would require a bit more effort, but if you used one of Python's RAD tools (Boa Constructor, or maybe PythonCard, as examples) you'd be able to get very nice results.

I will at least look into Boa and PythonCard. Thanks for the hint.

> This is plausible only under the condition that Santa Claus is paying you $X per class/factory or per line of code, or you are so speed-crazy that you are machine-generating C code for the factories.

Unfortunately, neither is the case :)

> I'd suggest data driven

Yeah!

> Then you need a function to load this layout file into dictionaries, and build cross-references field_name -> field_number (0,1,2,...) and vice versa. As your record name is not in a fixed position in the record, you will also need to supply a function (file_type, record_string) -> record_name.

I thought about supplying a flat ASCII definition such as:

[record type] TAB [fieldname] TAB [start] TAB [end]

> Then you have *ONE* function that takes a file_type, a record_name, and a record_string, and gives you a list of the values. That is all you need for a generic browser application.

I like this.

> You *don't* have to hand-craft a class for each record type. And you wouldn't want to, if you were dealing with files whose spec keeps on having fields added and fields obsoleted.

Exactly.

> I think that's overly pessimistic. I *was* presuming a case where the number of record types was fairly small, and the definitions of those records reasonably constant. For ~10 or fewer types whose spec doesn't change, hand-coding the conversion would probably be quicker and/or more straightforward than writing a spec-parser as you suggest.

Unfortunately, all wrong :) Lots of records, lots of changes, lots of different record types - hardcoding doesn't seem the right way.

> Parse? No parsing, and not much code at all: The routine to load (not parse) the layout from the layout.csv file into dicts of dicts is only 35 lines of Python code. The routine to take an input line and serve up an object instance is about the same. It does more than the OP's browsing requirement already. The routine to take an object and serve up a correctly formatted output line is only 50 lines of which 1/4 is comment or blank.

John, do you have suggestions where I can find examples of these functions? I can program, but not being proficient in Python, any help or examples I can adapt would be nice.

> Also, files used to create printed pages by an external company (especially by a company that had leaseplan in its e-mail address) would indicate many and complicated to me.

How right you are. Think about production runs of 150.000 invoices, each invoice consisting of 2-10
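Because every record has the same length, browsing forward and backward does not need iteration: record n starts at a byte offset that can be computed and passed to seek(). The following is only a minimal sketch of that idea combined with the TAB-separated definition proposed above. The function names, the tiny 12-character demo records, and the assumption that the identifier sits in the first two columns are all illustrative and do not come from the thread.

```python
import io

def load_layout(text):
    """Parse 'rectype TAB fieldname TAB start TAB end' lines into a dict."""
    layout = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        rectype, name, start, end = line.split("\t")
        # Spec columns are 1-based and inclusive; Python slices are 0-based
        # and exclude the end, so only the start needs shifting.
        layout.setdefault(rectype, []).append((name, int(start) - 1, int(end)))
    return layout

def read_record(f, n, line_len):
    """Return record number n (0-based) from a file opened in binary mode.

    line_len is the record length INCLUDING the CR LF terminator.
    """
    f.seek(n * line_len)
    return f.read(line_len).rstrip(b"\r\n").decode("ascii")

def describe(record, layout):
    """Render a record as 'Attributename: Value' strings."""
    rectype = record[:2]  # assumption: identifier in columns 1-2
    return ["%s: %s" % (name, record[s:e].rstrip())
            for name, s, e in layout[rectype]]

# Demo with 12-character records (a real file would use 800 plus CR LF).
LAYOUT = load_layout("A0\trecordnumber\t3\t6\nA0\tname\t7\t12\n")
DATA = io.BytesIO(b"A00001Smith \r\nA00002Jones \r\n")

for line in describe(read_record(DATA, 1, 14), LAYOUT):
    print(line)
```

A GUI "next" or "previous" button would then only have to change n and call read_record again, so the whole file never needs to be held in memory.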
Re: Browsing text ; Python the right tool?
John Machin wrote:
> Jeff Shannon wrote:
> [...]
>> If each record is CRLF terminated, then you can get one record at a time simply by iterating over the file (for line in open('myfile.dat'): ...).
>> You can have a dictionary of classes or factory functions, one for each record type, keyed off of the 2-character identifier. Each class/factory would know the layout of that record type,
>
> This is plausible only under the condition that Santa Claus is paying you $X per class/factory or per line of code, or you are so speed-crazy that you are machine-generating C code for the factories.

I think that's overly pessimistic. I *was* presuming a case where the number of record types was fairly small, and the definitions of those records reasonably constant. For ~10 or fewer types whose spec doesn't change, hand-coding the conversion would probably be quicker and/or more straightforward than writing a spec-parser as you suggest.

If, on the other hand, there are many record types, and/or those record types are subject to changes in specification, then yes, it'd be better to parse the specs from some sort of data file.

The O.P. didn't mention anything either way about how dynamic the record specs are, nor the number of record types expected. I suspect that we're both assuming a case similar to our own personal experiences, which are different enough to lead to different preferred solutions. ;)

Jeff Shannon
Technician/Programmer
Credit International
Re: Browsing text ; Python the right tool?
Jeff Shannon wrote:
> John Machin wrote:
>> Jeff Shannon wrote:
>> [...]
>>> If each record is CRLF terminated, then you can get one record at a time simply by iterating over the file (for line in open('myfile.dat'): ...).
>>> You can have a dictionary of classes or factory functions, one for each record type, keyed off of the 2-character identifier. Each class/factory would know the layout of that record type,
>>
>> This is plausible only under the condition that Santa Claus is paying you $X per class/factory or per line of code, or you are so speed-crazy that you are machine-generating C code for the factories.
>
> I think that's overly pessimistic. I *was* presuming a case where the number of record types was fairly small, and the definitions of those records reasonably constant. For ~10 or fewer types whose spec doesn't change, hand-coding the conversion would probably be quicker and/or more straightforward than writing a spec-parser as you suggest.

I didn't suggest writing a spec-parser. No (mechanical) parsing is involved. The specs that I'm used to dealing with set out the record layouts in a tabular fashion. The only hassle is extracting that from a MSWord document or a PDF.

> If, on the other hand, there are many record types, and/or those record types are subject to changes in specification, then yes, it'd be better to parse the specs from some sort of data file.

Parse? No parsing, and not much code at all: The routine to load (not parse) the layout from the layout.csv file into dicts of dicts is only 35 lines of Python code. The routine to take an input line and serve up an object instance is about the same. It does more than the OP's browsing requirement already. The routine to take an object and serve up a correctly formatted output line is only 50 lines of which 1/4 is comment or blank.

> The O.P. didn't mention anything either way about how dynamic the record specs are, nor the number of record types expected.

My reasoning: He did mention A0 and C1, hence one could guess from that he maybe had 6 at least. Also, files used to create printed pages by an external company (especially by a company that had leaseplan in its e-mail address) would indicate many and complicated to me.

> I suspect that we're both assuming a case similar to our own personal experiences, which are different enough to lead to different preferred solutions. ;)

Indeed. You seem to have led a charmed life; may the wizards and the rangers ever continue to protect you from the dark riders! :-)

My personal experiences and attitudes:
(1) extreme aversion to having to type (correctly) lots of numbers (column positions and lengths), and to having to mentally translate start = 663, len = 13 to [662:675] or having ugliness like [663-1:663+13-1]
(2) cases like 17 record types and 112 fields in one file, 8 record types and 86 fields in a second -- this being a new, relatively clean, simple exercise in exchanging files with a government department
(3) Past history of this govt dept is that there are at least another 7 file types in regular use and they change the _major_ version number of each file type about once a year on average
(4) These things tend to start out deceptively small and simple and turn into monsters.

Cheers,
John
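The translation from a documented start column and length to a Python slice can be done once in a helper rather than in one's head. This is only an illustrative sketch; the name to_slice is made up here and is not taken from anyone's code in this thread.

```python
def to_slice(start, length):
    """Convert a 1-based start column and a length into a Python slice."""
    begin = start - 1              # documents count from 1, Python from 0
    return slice(begin, begin + length)

# The example from the message above: start = 663, len = 13.
print(to_slice(663, 13))

# Applied to a record string built for the demonstration.
record = "." * 662 + "ABCDEFGHIJKLM" + "." * 125
print(record[to_slice(663, 13)])
```

With such a helper the layout table can keep the numbers exactly as they appear in the vendor's document, which avoids both the mental arithmetic and expressions like [663-1:663+13-1].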
Re: Browsing text ; Python the right tool?
John Machin wrote:
> Jeff Shannon wrote:
> [...]
>> For ~10 or fewer types whose spec doesn't change, hand-coding the conversion would probably be quicker and/or more straightforward than writing a spec-parser as you suggest.
>
> I didn't suggest writing a spec-parser. No (mechanical) parsing is involved. The specs that I'm used to dealing with set out the record layouts in a tabular fashion. The only hassle is extracting that from a MSWord document or a PDF.

The specs I'm used to dealing with are inconsistent enough that it's more work to massage them into strict tabular format than it is to retype and verify them. Typically it's one or two file types, with one or two record types each, from each vendor -- and of course no vendor uses anything similar to any other, nor is there a standardized way for them to specify what they *do* use. Everything is almost completely ad-hoc.

>> If, on the other hand, there are many record types, and/or those record types are subject to changes in specification, then yes, it'd be better to parse the specs from some sort of data file.
>
> Parse? No parsing, and not much code at all: The routine to load (not parse) the layout from the layout.csv file into dicts of dicts is only 35 lines of Python code. The routine to take an input line and serve up an object instance is about the same. It does more than the OP's browsing requirement already. The routine to take an object and serve up a correctly formatted output line is only 50 lines of which 1/4 is comment or blank.

There's a tradeoff between the effort involved in writing multiple custom record-type classes, and the effort necessary to write the generic loading routines plus the effort to massage or coerce the specifications into a regular, machine-readable format.

I suppose that parsing may not precisely be the correct term here, but I was using it in parallel to, say, ConfigParser and Optparse. Either you're writing code to translate some sort of received specification into a usable format, or you're manually pushing bytes around to get them into a format that your code *can* translate. I'd say that my creation of custom classes is just a bit further along a continuum than your massaging of specification data -- I'm just massaging it into Python code instead of CSV tables.

>> I suspect that we're both assuming a case similar to our own personal experiences, which are different enough to lead to different preferred solutions. ;)
>
> Indeed. You seem to have led a charmed life; may the wizards and the rangers ever continue to protect you from the dark riders! :-)

Hardly charmed -- more that there's so little regularity in what I'm given that massaging it to a standard format is almost as much work as just buckling down and retyping it. My one saving grace is that I'm usually able to work with delimited files, rather than column-width-specified files. I'll spare you the rant about my many job-related frustrations, but trust me, there ain't no picnics here!

Jeff Shannon
Technician/Programmer
Credit International
Browsing text ; Python the right tool?
I need a tool to browse text files with a size of 10-20 Mb. These files have a fixed record length of 800 bytes (CR/LF), and contain records used to create printed pages by an external company. Each line (record) contains a 2-character identifier, like 'A0' or 'C1'. The identifier identifies the record format for the line, thereby allowing different record formats to be used in a text file. For example:

An A0 record may consist of:
  recordnumber [1:4]
  name [5:25]
  filler [26:800]

while a C1 record consists of:
  recordnumber [1:4]
  phonenumber [5:15]
  zipcode [16:20]
  filler [21:800]

As you see, all records have a fixed column format. I would like to build a utility which allows me (in a Windows environment) to open a text file and browse through the records (ideally with a search option), where each record type is displayed according to its record format ('Attributename: Value' format). This would mean that browsing from an A0 to a C1 record results in a different list of attributes + values on the screen, allowing me to analyze the generated data a lot more easily than I do now, browsing in a text editor with a stack of printed record formats at hand.

This is of course quite a common way of encoding data in text files. I've tried to find a generic text-based browser which allows me to do just this, but cannot find anything. Enter Python; I know the language by name, I know it handles text just fine, but I am not really interested in learning Python just now, I just need a tool to do what I want.

What I would REALLY like is a way to define standard record formats in a separate definition, like:
- defining a common record length;
- defining the different record formats (attributes, position in the line);
- and defining when a specific record format is to be used, dependent on 1 or more identifiers in the record.

I CAN probably build something from scratch, but if I can (re)use something that already exists it would be so much better and faster... And a utility to do what I just described would be REALLY useful in LOTS of environments.

This means I have the following questions:
1. Does anybody know of a generic tool (not necessarily Python based) that does the job I've outlined?
2. If not, is there some framework or widget in Python I can adapt to do what I want?
3. If not, should I consider building all this just from scratch in Python - which would probably mean not only learning Python, but some other GUI related modules?
4. Or should I forget about Python and build something in another environment?

Any help would be appreciated.
Re: Browsing text ; Python the right tool?
Paul Kooistra wrote:
> I need a tool to browse text files with a size of 10-20 Mb. These files have a fixed record length of 800 bytes (CR/LF), and contain records used to create printed pages by an external company. Each line (record) contains a 2-character identifier, like 'A0' or 'C1'. The identifier identifies the record format for the line, thereby allowing different record formats to be used in a text file. For example:
>
> An A0 record may consist of:
>   recordnumber [1:4]
>   name [5:25]
>   filler [26:800]

1. Python syntax calls these [0:4], [4:25], etc. One has to get into the habit of deducting 1 from the start column position given in a document.

2. So where's the A0? Are the records really 804 bytes wide -- A0 plus the above plus CR LF? What is recordnumber -- can't be a line number (4 digits -> max 10k; 10k * 800 -> only 8Mb); looks too small to be a customer identifier; is it the key to a mapping that produces A0, C1, etc?

> while a C1 record consists of:
>   recordnumber [1:4]
>   phonenumber [5:15]
>   zipcode [16:20]
>   filler [21:800]
>
> As you see, all records have a fixed column format. I would like to build a utility which allows me (in a Windows environment) to open a text file and browse through the records (ideally with a search option), where each record type is displayed according to its record format ('Attributename: Value' format). This would mean that browsing from an A0 to a C1 record results in a different list of attributes + values on the screen, allowing me to analyze the generated data a lot more easily than I do now, browsing in a text editor with a stack of printed record formats at hand.
>
> This is of course quite a common way of encoding data in text files. I've tried to find a generic text-based browser which allows me to do just this, but cannot find anything. Enter Python; I know the language by name, I know it handles text just fine, but I am not really interested in learning Python just now, I just need a tool to do what I want.
>
> What I would REALLY like is a way to define standard record formats in a separate definition, like:
> - defining a common record length;
> - defining the different record formats (attributes, position in the line);

Add in the type, number of decimal places, etc as well ..

> - and defining when a specific record format is to be used, dependent on 1 or more identifiers in the record.
>
> I CAN probably build something from scratch, but if I can (re)use something that already exists it would be so much better and faster... And a utility to do what I just described would be REALLY useful in LOTS of environments.
>
> This means I have the following questions:
> 1. Does anybody know of a generic tool (not necessarily Python based) that does the job I've outlined?

No, but please post if you hear of one.

> 2. If not, is there some framework or widget in Python I can adapt to do what I want?
> 3. If not, should I consider building all this just from scratch in Python - which would probably mean not only learning Python, but some other GUI related modules?

The approach I use is along the lines of what you suggested, but w/o the GUI. I have a Python script that takes layout info and an input file and can produce an output file in one of two formats:

Format 1: something like:

Rec:A0 recordnumber:0001 phonenumber:(123) 555-1234 zipcode:12345

This is usually much shorter than the fixed length record, because you leave out the fillers (after checking they are blank!), and strip trailing spaces from alphanumeric fields. Whether you leave integers, money, date etc fields as per file or translated into human-readable form depends on who will be reading it.

You then use a robust text editor (preferably one which supports regular expressions in its find function) to browse the output file.

Format 2:

Rec:A0
recordnumber:0001
etc etc

i.e. one field per line. Why, you ask? If you are a consumer of such files, so that you can take small chunks of this and drop it into Excel; testers take a copy, make lots of juicy test data, and run it through another script which makes a flat file out of it.

> 4. Or should I forget about Python and build something in another environment?

No way!
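The two output formats described above could be produced by one small function. The script itself was not posted, so the following is only a guess at its shape: the function name dump_record, the convention that unused columns are named "filler", and the sample values are all assumptions made for illustration.

```python
def dump_record(rectype, fields, one_per_line=False):
    """Render (name, raw_value) pairs as 'name:value' text.

    Fillers are dropped, but only after checking that they are blank.
    Trailing spaces are stripped from every other field.
    """
    parts = ["Rec:" + rectype]
    for name, raw in fields:
        if name == "filler":
            if raw.strip():
                raise ValueError("non-blank filler: %r" % raw)
            continue
        parts.append("%s:%s" % (name, raw.rstrip()))
    separator = "\n" if one_per_line else " "
    return separator.join(parts)

# Hypothetical C1 record, already split into raw fixed-width fields.
fields = [
    ("recordnumber", "0001"),
    ("phonenumber", "(123) 555-1234 "),
    ("zipcode", "12345"),
    ("filler", "   "),
]

print(dump_record("C1", fields))                     # Format 1: one line
print(dump_record("C1", fields, one_per_line=True))  # Format 2: one field per line
```

Raising an error on a non-blank filler is one possible policy; a browser might prefer to display the stray data instead, since it usually means the layout is out of date.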
Re: Browsing text ; Python the right tool?
Paul Kooistra wrote:
> 1. Does anybody know of a generic tool (not necessarily Python based) that does the job I've outlined?
> 2. If not, is there some framework or widget in Python I can adapt to do what I want?

Not that I know of, but...

> 3. If not, should I consider building all this just from scratch in Python - which would probably mean not only learning Python, but some other GUI related modules?

This should be pretty easy. If each record is CRLF terminated, then you can get one record at a time simply by iterating over the file (for line in open('myfile.dat'): ...). You can have a dictionary of classes or factory functions, one for each record type, keyed off of the 2-character identifier. Each class/factory would know the layout of that record type, and return a(n) instance/dictionary with fields separated out into attributes/items.

The trickiest part would be in displaying the data; you could potentially use COM to insert it into a Word or Excel document, or code your own GUI in Python. The former would be pretty easy if you're happy with fairly simple formatting; the latter would require a bit more effort, but if you used one of Python's RAD tools (Boa Constructor, or maybe PythonCard, as examples) you'd be able to get very nice results.

Jeff Shannon
Technician/Programmer
Credit International
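A rough sketch of the dictionary-of-factories idea follows. The column positions are invented for the demonstration, and it assumes the 2-character identifier occupies the first two columns of each line, which the original post does not actually state.

```python
import io

def parse_a0(line):
    """Factory for A0 records: knows the A0 layout and returns a dict."""
    return {"recordnumber": line[2:6], "name": line[6:26].rstrip()}

def parse_c1(line):
    """Factory for C1 records."""
    return {
        "recordnumber": line[2:6],
        "phonenumber": line[6:17].rstrip(),
        "zipcode": line[17:22],
    }

# One factory per record type, keyed off the 2-character identifier.
FACTORIES = {"A0": parse_a0, "C1": parse_c1}

def parse_file(f):
    """Yield (record_type, fields) for each line of an open text file."""
    for line in f:
        line = line.rstrip("\r\n")
        rectype = line[:2]
        yield rectype, FACTORIES[rectype](line)

# In-memory stand-in for open('myfile.dat').
sample = io.StringIO(
    "A00001" + "Smith".ljust(20) + "\n"
    + "C10002" + "5551234".ljust(11) + "12345\n"
)
records = list(parse_file(sample))
for rectype, fields in records:
    print(rectype, fields)
```

This works well while the set of record types is small and stable; every layout change means editing one of the factory functions by hand.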
Re: Browsing text ; Python the right tool?
Jeff Shannon wrote:
> Paul Kooistra wrote:
>> 1. Does anybody know of a generic tool (not necessarily Python based) that does the job I've outlined?
>> 2. If not, is there some framework or widget in Python I can adapt to do what I want?
>
> Not that I know of, but...
>
>> 3. If not, should I consider building all this just from scratch in Python - which would probably mean not only learning Python, but some other GUI related modules?
>
> This should be pretty easy. If each record is CRLF terminated, then you can get one record at a time simply by iterating over the file (for line in open('myfile.dat'): ...). You can have a dictionary of classes or factory functions, one for each record type, keyed off of the 2-character identifier. Each class/factory would know the layout of that record type,

This is plausible only under the condition that Santa Claus is paying you $X per class/factory or per line of code, or you are so speed-crazy that you are machine-generating C code for the factories.

I'd suggest data driven -- you grab the .doc or .pdf that describes your layouts, ^A^C, fire up Excel, paste special, massage it, so you get one row per field, with start and end posns, type, dec places, optional/mandatory, field name, whatever else you need. Insert a column with the record name. Save it as a CSV file.

Then you need a function to load this layout file into dictionaries, and build cross-references field_name -> field_number (0,1,2,...) and vice versa.

As your record name is not in a fixed position in the record, you will also need to supply a function (file_type, record_string) -> record_name.

Then you have *ONE* function that takes a file_type, a record_name, and a record_string, and gives you a list of the values. That is all you need for a generic browser application.

For working on a _specific_ known file_type, you can _then_ augment that to give you record objects that you use like a0.zipcode or record dictionaries that you use like a0['zipcode'].

You *don't* have to hand-craft a class for each record type. And you wouldn't want to, if you were dealing with files whose spec keeps on having fields added and fields obsoleted.

Notice: in none of the above do you ever have to type in a column position, except if you manually add updates to your layout file.

Then contemplate how productive you will be when/if you need to _create_ such files -- you will push everything through one function which will format each field correctly in the correct column positions (and chuck an exception if it won't fit). Slightly better than an approach that uses something like

nbytes = sprintf(buffer, "%04d%-20s%-5s", a0_num, a0_phone, a0_zip);

HTH,
John
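The data-driven approach described above might look roughly like the sketch below. It is not the code referred to in this thread, which was never posted. The CSV column names, the sample layout, and the function names are assumptions, the file_type argument is left out for brevity, and the record identifier is assumed to sit in columns 1-2.

```python
import csv
import io

# Hypothetical layout file: one row per field, 1-based inclusive columns.
LAYOUT_CSV = """\
record,field,start,end
A0,recordnumber,3,6
A0,name,7,26
C1,recordnumber,3,6
C1,phonenumber,7,17
C1,zipcode,18,22
"""

def load_layout(csv_text):
    """Load the layout into {record_name: [(field, begin, end), ...]}."""
    layouts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = layouts.setdefault(row["record"], [])
        # Convert 1-based inclusive columns to a 0-based Python slice.
        fields.append((row["field"], int(row["start"]) - 1, int(row["end"])))
    return layouts

def field_index(layouts, record_name):
    """Cross-reference field_name -> field_number (0, 1, 2, ...)."""
    return {name: i for i, (name, _b, _e) in enumerate(layouts[record_name])}

def split_record(layouts, record_name, record_string):
    """The ONE generic function: return the list of raw field values."""
    return [record_string[b:e] for _name, b, e in layouts[record_name]]

def format_record(layouts, record_name, values, width):
    """Build an output line; raise if a value does not fit its columns."""
    buf = [" "] * width
    buf[0:2] = list(record_name)  # assumption: identifier in columns 1-2
    for (name, b, e), value in zip(layouts[record_name], values):
        if len(value) > e - b:
            raise ValueError("%s.%s: %r exceeds %d columns"
                             % (record_name, name, value, e - b))
        buf[b:e] = list(value.ljust(e - b))
    return "".join(buf)

layouts = load_layout(LAYOUT_CSV)
line = format_record(layouts, "C1", ["0002", "5551234", "12345"], 22)
values = split_record(layouts, "C1", line)
print(repr(line))
print(values[field_index(layouts, "C1")["zipcode"]])
```

When a field is added or moved, only the CSV changes. Numeric fields would need right-justification or zero padding driven by a type column, which this sketch leaves out.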