RE: [Jprogramming] Scanning a large file
In Office 2003 for Windows, it eventually opens the file fine, only curses beforehand. SYLK appears to be a very important format for exchange between spreadsheet(-agnostic) software, like RTF for documents and WMF for pictures. Its clipboard code, CF_SYLK, is number 4, just after CF_METAFILEPICT.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwui/html/msdn_ddeole.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwui/html/msdn_ddesystp.asp
That's why it has priority over CSV, which is a rather amorphous format anyway -- there is no such thing as a CSV specification. Microsoft also likes to honor tradition, like their BASIC. SYLK also happens to be the file format of MultiPlan.

--- Joey K Tuttle [EMAIL PROTECTED] wrote:
At 20:54 -0700 2006/05/16, Oleg Kobchenko wrote:
http://support.microsoft.com/kb/215591/
ID,NAME
666,MS
Don' B H8N

Yes - I knew the workaround, and even puzzled out that the origin of the bug is that SYLK files begin with ID;. You would think that some bright programmer could decide that if the third character isn't the expected ; then it might be just an ordinary text file. Interesting how difficult it seems to fix such a simple thing. BTW, it also (used to) fail in Windows too. Also, it is any text file, not just csv.
- joey
--
For information about J forums see http://www.jsoftware.com/forums.htm
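The fix Joey wishes for (treat a file as SYLK only if the third character is the expected ;) can be sketched in Python; looks_like_sylk is a hypothetical helper name, not Excel's actual detection logic:

```python
def looks_like_sylk(first_bytes: bytes) -> bool:
    # A SYLK file begins with the record "ID;...". A CSV that merely
    # starts with the letters "ID" continues with a comma or other
    # text, never the SYLK field separator ";".
    return first_bytes.startswith(b"ID;")
```

A header such as b"ID,NAME" would then fall through and be opened as ordinary text.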
Re: [Jprogramming] Scanning a large file
Oleg Kobchenko wrote: We need general purpose read-line functionality. It is common in the C runtime and in other languages. Although it is possible to do in J, it's better not to redo the low-level stuff every time.

I suggest that we add two new definitions to the files script. One is Joey's verb to read an LF-terminated block from a file, the other is Oleg's adverb to apply a function to each line of a file. In each case, the file is assumed to be in lines terminated by LF, and a trailing LF is assumed if not present. CR is removed. Blocksize is hardcoded at 1e6. Definitions are:

NB.*freadblock v read block from file
NB. y is filename;start position
NB. returns: block;new start position
freadblock=: 3 : 0
'f p'=. y
f=. 8 u: f                NB. for j601
s=. 1!:4 :: _1: <f
if. s = _1 do. return. end.
if. p < s do.
  dat=. 1!:11 f;p,1e6 <. s-p
  len=. 1 + dat i: LF
  p=. p + len
  if. len > #dat do.
    if. p > s do.
      dat=. dat, LF
    else.
      'file not in LF-delimited lines' 13!:8[3
    end.
  else.
    dat=. len {. dat
  end.
else.
  dat=. ''
end.
(dat -. CR);p
)

NB.*fapplylines a apply verb to lines in file delimited by LF
fapplylines=: 1 : 0
y=. 8 u: y                NB. for j601
s=. 1!:4 :: _1: <y
if. s = _1 do. return. end.
p=. 0
while. p < s do.
  dat=. 1!:11 y;p,1e6 <. s-p
  len=. 1 + dat i: LF
  p=. p + len
  if. len > #dat do.
    if. p > s do.
      dat=. dat, LF
    else.
      'file not in LF-delimited lines' 13!:8[3
    end.
  else.
    dat=. len {. dat
  end.
  u ;._2 dat -. CR
end.
)

With these definitions, Yoel's problem would have solutions like the following:

getcsn=: 3 : 0
ptr=. 0
res=. i. 0 0
while.
  'dat ptr'=. freadblock y;ptr
  # dat=. <;._2 dat
do.
  res=. ~. res, 4 }."1 > dat #~ (<'csn ') = 4 {. each dat
end.
res
)

readcsn=: 3 : 0
CSN=: i. 0 0
readcsn1 fapplylines y
CSN
)

readcsn1=: 3 : 0
if. 'csn ' -: 4 {. y do. CSN=: ~. CSN, 4 }. y end.
0
)
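For comparison, the same strategy (read up to one block, cut at the last LF, carry the remainder forward, tolerate a missing final LF, reject lines longer than a block) can be sketched in Python. The names and the 1e6 default mirror the J definitions above but are otherwise illustrative:

```python
def apply_lines(path, fn, blocksize=1_000_000):
    """Call fn on each LF-terminated line of the file; CR is removed.

    A trailing LF is assumed if the file does not end with one, and a
    line longer than one block is rejected, as in the J definitions.
    """
    carry = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            carry += block
            cut = carry.rfind(b"\n")
            if cut == -1:
                if len(carry) > blocksize:
                    raise ValueError("file not in LF-delimited lines")
                continue                     # line spans blocks; read more
            chunk, carry = carry[:cut], carry[cut + 1:]
            for line in chunk.split(b"\n"):
                fn(line.replace(b"\r", b""))
    if carry:                                # final line without a trailing LF
        fn(carry.replace(b"\r", b""))
```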
Re: [Jprogramming] Scanning a large file
I have tried it on a 1.2GB file. Since my laptop has only 1GB RAM, I killed the process when it consumed 500MB (and rising). Yoel

On 5/15/06, Henry Rich [EMAIL PROTECTED] wrote: Try x ([: I. E.) y
RE: [Jprogramming] Scanning a large file
Chris Burke wrote:
if. len > #dat do.
  if. p > s do.
    dat=. dat, LF
  else.
    'file not in LF-delimited lines' 13!:8[3

Note that this assumes that the last line of the file is terminated by a line feed. Otherwise, there can be a spurious error if the file is slightly larger than an even multiple of 1e6. At minimum, this assumption should be documented.
-- Raul
RE: [Jprogramming] Scanning a large file
At 09:38 -0400 2006/05/16, Miller, Raul D wrote:
Chris Burke wrote:
if. len > #dat do.
  if. p > s do.
    dat=. dat, LF
  else.
    'file not in LF-delimited lines' 13!:8[3
Note that this assumes that the last line of the file is terminated by a line feed. Otherwise, there can be a spurious error if the file is slightly larger than an even multiple of 1e6. At minimum, this assumption should be documented.

Actually, it needs to be dealt with. Some programs produce files without a final end of line -- e.g. M$ Excel text files. I have never understood how they could do that in good conscience...
- joey
Re: [Jprogramming] Scanning a large file
It is all relative. The LF can be seen (as you do) as an end of line or as a new line. In the first case, all lines should end with an end of line. In the second, LF cuts one line from another. When editing a text file, and requesting to place the cursor at the end of the file: with no LF at the end, the cursor is placed after the last character, somewhere to the right at the end of the line. With an LF at the end, it is placed at the beginning of an empty line at the end. I am not sure it is an M$ problem.

Joey K Tuttle wrote: ... Actually, it needs to be dealt with. Some programs produce files without a final end of line -- e.g. M$ Excel text files. I have never understood how they could do that in good conscience... - joey
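The two readings can be seen side by side in Python, which implements both: split treats LF as a separator (a trailing LF yields an empty final field), while splitlines treats it as a terminator:

```python
terminated = "one\ntwo\n"    # LF as end of line
unterminated = "one\ntwo"    # LF as separator between lines

# Separator view: the terminated file appears to have a third, empty field.
assert terminated.split("\n") == ["one", "two", ""]
assert unterminated.split("\n") == ["one", "two"]

# Terminator view: both files have exactly two lines.
assert terminated.splitlines() == ["one", "two"]
assert unterminated.splitlines() == ["one", "two"]
```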
Re: [Jprogramming] Scanning a large file
Certainly, in my experience, LF, CR, or CRLF are considered as EOL (in *IX, MAC, PC OSs). Going way back, these things came from input devices such as the IBM 1050, which was an early typewriter terminal. It had the charming attribute that the return key did just that (returned the carriage, as on a typewriter). Then, of course, a line feed was needed to start a new line on the typewriter... To get the input line entered into the computer, one had to explicitly enter an EOT character - what fun... I think this idea of a typewriter crept into DOS, but it was considered convenient to imply that the return indicated EOT as well... This clear thinking was likely a result of people not looking at things outside of IBM (a mistake IMHO)... All of these early input devices ended a line to indicate taking action in an attached processor - the fact that such input was streamed into memory (and maybe saved in a file) would indicate that all lines -- including the last one -- ended with a designated character (or two, on the 1050 and PC). Nowhere in the history of how files evolved do I see/remember a different view - do you know of some thread of computer evolution that was different and leads you to say it is relative?

Two memories related to this amuse me. One was in my very early days using APL from a 1050. The APL system was the original one at IBM Yorktown Heights Research. My 1050 was in Boulder, Colorado. APL\360 had a command to send an immediate message, e.g.
)OPR WHY IS THE SYSTEM SO SLOW?
would post a message to the system operator console. On the 1050, I could send a multi-line message in a single go by not adding the EOT signal until after the last of the lines. As I type this story, I realize that I do not know/remember if the 1050 EOT had to immediately follow a return (and maybe that is just such a thread as I asked if you knew of!). Such multi-line operator messages confused the operator, who wondered how it was possible...

The other instance I know of about EOL being strange is in the TIFF type 2 (FAX) file structure definition. The standard for that states that all (scan) lines of the document shall begin (not end) with a New Line character. I ran into cases where programs didn't do that, and while the authors admitted that it was a bug, the loss of the first scan line on a FAX was considered acceptable instead of fixing the programs... Maybe there is some logic like that behind Excel not producing files with a terminating line end - but you must admit that not having a line end on the last line certainly could cause one to wonder if the file was complete, or was the victim of an accidental ending... Of course, having a line end doesn't ensure that there wasn't an explosion at the source of the data just as the last EOL was put in place but before the file was completed - but EOL just before EOF does provide comfort (not to mention convenience) that things are OK.

As an example of why I consider the Excel behavior a bug, consider trying to catenate two Excel text output files together, then using them as input to Excel. The missing line end becomes an issue...

In any case, because of programs like Excel, any line reading program should do its best to provide all the data - and should likely alert the user that things didn't end cleanly...
- joey

At 10:32 -0400 2006/05/16, Alain Miville de Chêne wrote: It is all relative. The LF can be seen (as you do) as an end of line or as a new line. In the first case, all lines should end with an end of line. In the second, LF cuts one line from another. When editing a text file, and requesting to place the cursor at the end of the file: with no LF at the end, the cursor is placed after the last character, somewhere to the right at the end of the line. With an LF at the end, it is placed at the beginning of an empty line at the end. I am not sure it is an M$ problem.

Joey K Tuttle wrote: ... Actually, it needs to be dealt with. Some programs produce files without a final end of line -- e.g. M$ Excel text files. I have never understood how they could do that in good conscience...
Re: [Jprogramming] Scanning a large file
These are interesting stories about line terminators. I agree on providing all the data. But I think the absence of a final terminator is more a stylistic issue (or a matter of choice) than a defect. Hence, it is more like truthfully conveying the data than alerting about cleanliness. Here it is on cygwin:

[EMAIL PROTECTED] ~ $ cat t1.txt
one
two
[EMAIL PROTECTED] ~ $ cat t2.txt
one
two
[EMAIL PROTECTED] ~ $ od -c t1.txt
0000000   o   n   e  \r  \n   t   w   o  \r  \n
0000012
[EMAIL PROTECTED] ~ $ od -c t2.txt
0000000   o   n   e  \r  \n   t   w   o
0000010

P.S. Unless it's just an excuse to bash Microsoft again: picking on Excel, that $ in the name... If you don't like it -- don't use it. Any program can do that: you can either put EOL at the end or not, so the chance is 50-50. :-)

--- Joey K Tuttle [EMAIL PROTECTED] wrote: In any case, because of programs like Excel, any line reading program should do its best to provide all the data - and should likely alert the user that things didn't end cleanly...
Re: [Jprogramming] Scanning a large file
OK, MS (not bashing women :) Excel - the problem is, one often doesn't have the choice not to use it, in the sense that people send files exported from Excel... A case where you can choose not to use it includes things like trying to use Excel to open a text file that starts with the ascii characters ID (or a tab in place of that blank) -- actually the choice is made for you in that case, since Excel rejects the file. But I imagine there are many dark corners like that - and of course you are right, any program may choose to elide a common-sense line ending. Still, that seems a bit irresponsible in most cases.
- joey

At 10:12 -0700 2006/05/16, Oleg Kobchenko wrote: These are interesting stories about line terminators. I agree on providing all the data. But I think the absence of a final terminator is more a stylistic issue (or a matter of choice) than a defect. Hence, it is more like truthfully conveying the data than alerting about cleanliness. Here it is on cygwin:

[EMAIL PROTECTED] ~ $ cat t1.txt
one
two
[EMAIL PROTECTED] ~ $ cat t2.txt
one
two
[EMAIL PROTECTED] ~ $ od -c t1.txt
0000000   o   n   e  \r  \n   t   w   o  \r  \n
0000012
[EMAIL PROTECTED] ~ $ od -c t2.txt
0000000   o   n   e  \r  \n   t   w   o
0000010

P.S. Unless it's just an excuse to bash Microsoft again: picking on Excel, that $ in the name... If you don't like it -- don't use it. Any program can do that: you can either put EOL at the end or not, so the chance is 50-50. :-)

--- Joey K Tuttle [EMAIL PROTECTED] wrote: In any case, because of programs like Excel, any line reading program should do its best to provide all the data - and should likely alert the user that things didn't end cleanly...
Re: [Jprogramming] Scanning a large file
Our company uses OpenOffice exclusively. It is a mature replacement for MS Office.

Joey K Tuttle wrote: OK, MS (not bashing women :) Excel - the problem is, one often doesn't have the choice not to use it, in the sense that people send files exported from Excel... ...
RE: [Jprogramming] Scanning a large file
At 15:29 -0400 2006/05/16, Miller, Raul D wrote:
Joey K Tuttle wrote: OK, MS (not bashing women :) Excel - the problem is, one often doesn't have the choice not to use it, in the sense that people send files exported from Excel...
And sometimes those files are broken or virus infected, etc. When the files are well formed, typically a person could use OpenOffice Calc to read them and re-export them in a more convenient format. Alternatively, you could ask the original user for a copy of the file in some other format. I've gotten quite a bit of mileage from asking people to save the file as CSV. In the typical case, CSV is more than adequate. CSV tends to be much easier to process programmatically (assuming you aren't using some simple thing in Excel for your program -- a reasonable assumption for the case where the user is exporting the file and you are working with it in J). Failing that, asking the user to save the spreadsheet as XML ("retains all excel features") might be easier to deal with than the default binary format. However, this is not as simple as CSV.

Raul,
The files that caused me trouble were requested and supplied in text or csv format, not binary - the fact that the last line of those files is sans EOL was always an annoyance (especially if using cut in j ...) I just did a little test to see if MS Excel still saves files that way, and indeed files saved as .txt, .csv, .htm, .prn and .dif all end unceremoniously with no EOL (in my case no CR, since I have Mac Excel). I used my other bug as a test example - here is the complete .csv file:

ID,NAME
666,MS

(of course this example does have an EOL on both lines) I learned that the behavior of Excel has changed when trying to open the above file - it used to say "Invalid File" - now it says "SYLK: file format is not valid." and then crashes when you acknowledge the error dialog.
I suppose that may mean that they are moving towards a fix for the bug that has been in every version of Excel I have looked at. Long live SYLK.
- joey
PS - my version of Excel doesn't include save as XML.
Re: [Jprogramming] Scanning a large file
Miller, Raul D wrote:
Chris Burke wrote:
if. len > #dat do.
  if. p > s do.
    dat=. dat, LF
  else.
    'file not in LF-delimited lines' 13!:8[3
Note that this assumes that the last line of the file is terminated by a line feed. Otherwise, there can be a spurious error if the file is slightly larger than an even multiple of 1e6. At minimum, this assumption should be documented.

This looks OK to me. The line after the if. statement should handle a file which is not terminated by LF. The line after the else. statement should handle a file where a line is longer than 1e6 bytes, and so is inappropriate for this function.
Re: [Jprogramming] Scanning a large file
Oleg Kobchenko wrote: It's a great idea to include line reading in a standard library. Here are a few comments. There are two differences from the original readlines:
- overlapped reading (not once and only once) (with asserting presence of LF in the current block)
- automatic removal of terminators

Agreed on leaving in the LF, in fapplylines. Do you agree on removing the CR, or think this should be left in as well?

I am in two minds on the buffer. It does impact performance, though not by much. But it means that after the block of 1e6 bytes is read in, it is immediately copied, because it is appended to the tail of the previous block. So the question is whether this performance hit is worthwhile to permit the code to be used for stdin or sockets. I don't feel strongly on this and wonder if there are other opinions on it.
RE: [Jprogramming] Scanning a large file
Chris Burke wrote: I am in two minds on the buffer. It does impact performance, though not by much. But it means that after the block of 1e6 bytes is read in, it is immediately copied, because it is appended to the tail of the previous block. So the question is whether this performance hit is worthwhile to permit the code to be used for stdin or sockets. I don't feel strongly on this and wonder if there are other opinions on it.

If you want to avoid that copy, you could special-case the handling of the line which spans two blocks. In the long run, I suspect the issue with this copy would be latency, not performance. Usually the processing of a line involves an order of magnitude more time than copying that line, and the cost of the 1e6-byte copy gets spread over a lot of lines. And if latency is an issue, the proper solution probably involves reducing the buffer size (since filling the buffer will also involve a lot more work than making a copy of it).
-- Raul
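Raul's special case (hand whole blocks through untouched and join only the single line that spans a block boundary) might look like the following Python sketch; the names are hypothetical, and fn here receives runs of LF-terminated lines rather than single lines:

```python
def apply_lines_nocopy(path, fn, blocksize=1_000_000):
    """Apply fn to LF-terminated chunks of the file.

    Whole blocks are sliced, not re-concatenated; only the one line
    spanning two blocks is ever joined.
    """
    tail = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            cut = block.rfind(b"\n")
            if cut == -1:
                tail += block            # rare: no LF in this block
                continue
            head, rest = block[:cut + 1], block[cut + 1:]
            nl = head.find(b"\n")
            fn(tail + head[:nl + 1])     # join only the spanning line
            if nl + 1 < len(head):
                fn(head[nl + 1:])        # pass the rest of the block through
            tail = rest
    if tail:
        fn(tail)                         # final line without a trailing LF
```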
Re: [Jprogramming] Scanning a large file
I am not sure about overlapped either. Raul's idea about special-casing sounds good. And the discussion on the spread of the copy. In my test, the impact was 5-7% or so -- a good price for streaming. I think the bottleneck is in looping in u;._2 and the line processing itself. I ran the UNIX wc, and it felt like 100x faster. Then I ran jpm on wc, and the line processing takes the bulk of the time.

As for removing CR/LF, I would suggest making it an option, with the default that both are removed. For simplicity, handle them as one option, because turning them on is low-level stuff, to be handled as such in user code. For example:

lineproc fapplylines fname     NB. terminators removed
1 lineproc fapplylines fname   NB. terminators preserved

I just had another idea: besides the adverb, to have a conjunction with an additional verb to insert between line results. Then wc will become:

lwc2=: 1 , #@;:@(CRLF-.~]) , #
1 lwc2 finsertlines + fn

--- Chris Burke [EMAIL PROTECTED] wrote: Oleg Kobchenko wrote: It's a great idea to include line reading in a standard library. Here are a few comments. There are two differences from the original readlines: - overlapped reading (not once and only once) (with asserting presence of LF in the current block) - automatic removal of terminators
Agreed on leaving in the LF, in fapplylines. Do you agree on removing the CR, or think this should be left in as well?
I am in two minds on the buffer. It does impact performance, though not by much. But it means that after the block of 1e6 bytes is read in, it is immediately copied, because it is appended to the tail of the previous block. So the question is whether this performance hit is worthwhile to permit the code to be used for stdin or sockets. I don't feel strongly on this and wonder if there are other opinions on it.
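A rough Python analogue of the lwc2 idea, with illustrative names: each line yields the triple (1, word count, character count), and the per-line triples are summed the way finsertlines would insert + between line results:

```python
def lwc(line: str):
    # per-line counts in wc order: lines, words, characters
    return (1, len(line.split()), len(line))

def wc(lines):
    # sum the per-line triples elementwise
    totals = (0, 0, 0)
    for line in lines:
        totals = tuple(a + b for a, b in zip(totals, lwc(line)))
    return totals
```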
RE: [Jprogramming] Scanning a large file
Try x ([: I. E.) y to get the list of places where the string x occurs. This uses special code and doesn't create the entire result of E. .
Henry Rich

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yoel Jacobsen
Sent: Monday, May 15, 2006 10:09 AM
To: Programming forum
Subject: Re: [Jprogramming] Scanning a large file

It won't work for large files. E. returns a 'limit error'. Yoel

On 5/14/06, Joey K Tuttle [EMAIL PROTECTED] wrote: Yoel, Some of the feedback you got suggested mapped files, others suggested just reading the file. My own habits lean towards reading the file, and I have a utility verb that gets lines while not exceeding a buffer size limit. I find that buffer sizes over 100 Kbytes generally make almost no difference in processing time - in fact, processing can take longer on larger chunks. Actually, the gain after 40 Kbytes is minor indeed. But in your responses you indicated that you were interested in not using (explicit) loops and doing it in a j style, yet being able to handle large files. j mapped files are certainly needed in that case. There was also a suggestion of regex, but my experience calling regex from j has been less than satisfactory. In my opinion, these things usually require some thought and knowledge of the data and the objectives. If the pattern you are searching for is nice (like your keyword 'csn') then there are usually pretty good ways to have j gather the data.

To find an actual example to illustrate, I catenated the past 8 weeks' worth of sendmail logs on my linux system to create a file maillogs - here is some experimenting with it -

[EMAIL PROTECTED] mqueue]$ wc maillogs
  564175  6987478 75395162 maillogs

that is, the file is 75 Mbytes with 564,175 lines

[EMAIL PROTECTED] mqueue]$ ja    # starting jconsole
   version ''
j504/2005-03-16/15:30
Running in: Linux
   host 'cat /proc/cpuinfo'
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 399.071
cache size : 512 KB
   NB. not a very fast machine, but it does have 1 Gbyte ram available
   require 'jmf'
   JCHAR map_jmf_ 'mls';'maillogs';'';1
   NB. HIGHLY recommended to map read only... that is the 1 at the
   NB. end of the mapping expression. There is a vicious side effect
   NB. (IMHO a BUG) in setting an alias of a mapped name within a verb.
   NB. My example is to get the size of messages that passed through
   NB. sendmail. Typically there is a phrase like size=1234, in
   NB. the log. The following is based on that.
   delim =: ','
   tag =: 'size='
   timex 'tagis =: I. tag E. mls'    NB. time and space to get indexes
3.49947 1.34481e8
   timex 'sizes =: delim (_1: ". (] i."1 [) {."0 1 ]) (tagis +/ (#tag)+i. 12){mls'
0.431585 1.37452e7
   $sizes
43947
   +/ x: sizes
11572953524

Maybe these are some ideas you can use to attack your problem.
- joey

At 11:01 +0300 2006/05/14, Yoel Jacobsen wrote: Hello, I'm new to J so please forgive me if this is a FAQ. I wrote some short sentences to parse a log file. I want to retrieve all the unique values of some attribute. The way it shows in the log file is attribute-name SPACE attribute-value, such as
. csn 92892849893284 ...
My initial (brute force) program is:
text =: 1!:1 <'/tmp/logfile'
words =: cutopen text
bv =: (<'csn') = words
srbv =: _1 |.!.0 bv
csns =: ~. srbv # words
Now csns holds the unique values as requested. The program works fine for small files (few megabytes). My question is, what should be done to make it work for large files (say, 1GB or more)? I guess it involves memory mapped files, but I have no clue where to continue from here. Further, is there any notion of 'laziness' (evaluate only when the data is really needed) in J? Can a verb be declared as a lazy verb? Thanks, Yoel
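Joey's mapped-file scan (locate each size= tag, keep the text up to the delimiter from the 12 characters that follow, and total the numbers) has a close analogue in Python's standard mmap module; sum_sizes is an illustrative name, and like the J phrase it assumes the delimiter appears within 12 bytes of each tag:

```python
import mmap

def sum_sizes(path, tag=b"size=", delim=b",", width=12):
    """Sum the numbers that follow each tag, read up to the delimiter."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            i = m.find(tag)
            while i != -1:
                start = i + len(tag)
                field = m[start:start + width]      # 12 bytes after the tag
                field = field[:field.index(delim)]  # keep up to the comma
                total += int(field)
                i = m.find(tag, start)
    return total
```

The operating system pages the file in on demand, so the working set stays small even for a 75 Mbyte log.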
Re: [Jprogramming] Scanning a large file
At 12:14 -0300 2006/05/15, Randy MacDonald wrote: "limit is only 2GB" A phrase I thought I'd _never_ hear
--- indeed ... and presumably not applicable on 64-bit systems...
Re: [Jprogramming] Scanning a large file
Some of the clients of the company I'm working for are working with up to terabyte-long files, usually in physics, life science, simulation, etc. The new file system in Solaris (ZFS) is a 128-bit FS. Anyway, data mining from log files is an important use of a language for me. I am very pleased with the interactivity and conciseness of the process in J. I just have to understand how to parse the largest files elegantly (i.e. with user code which is at least as elegant as the 3-statement python program from my previous post). Thanks for all the enlightening replies! Yoel

On 5/15/06, Randy MacDonald [EMAIL PROTECTED] wrote: "limit is only 2GB" A phrase I thought I'd _never_ hear
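For what it's worth, the csn extraction stays memory-bounded in Python simply by iterating the file object, which reads buffered lines rather than the whole file; unique_csns and its token handling are illustrative, mirroring the original cutopen/shift/select program:

```python
def unique_csns(path):
    """Unique values that follow the word 'csn', streaming line by line."""
    seen = {}
    with open(path, "rb") as f:
        for line in f:                    # buffered; never loads the whole file
            words = line.split()
            for i, w in enumerate(words[:-1]):
                if w == b"csn":
                    seen[words[i + 1]] = None   # dict keeps first-seen order
    return list(seen)
```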
Re: [Jprogramming] Scanning a large file
Answers to your questions are, as you yourself point out, in memory mapped files. Read the labs and experiment with them.

2006/5/14, Yoel Jacobsen [EMAIL PROTECTED]: Hello, I'm new to J so please forgive me if this is a FAQ. I wrote some short sentences to parse a log file. I want to retrieve all the unique values of some attribute. The way it shows in the log file is attribute-name SPACE attribute-value, such as
. csn 92892849893284 ...
My initial (brute force) program is:
text =: 1!:1 <'/tmp/logfile'
words =: cutopen text
bv =: (<'csn') = words
srbv =: _1 |.!.0 bv
csns =: ~. srbv # words
Now csns holds the unique values as requested. The program works fine for small files (few megabytes). My question is, what should be done to make it work for large files (say, 1GB or more)? I guess it involves memory mapped files, but I have no clue where to continue from here. Further, is there any notion of 'laziness' (evaluate only when the data is really needed) in J? Can a verb be declared as a lazy verb? Thanks, Yoel

--
Björn Helgason, Engineer
FuglFiskur ehf, Þerneyjarsund 23, 801 Grímsnes
Skype: gosiminn, gsm: +3546985532, e-mail: [EMAIL PROTECTED]
Landscaping and excavation services
http://groups.google.com/group/J-Programming
Technical skill handles the complex; creativity is the master of simplicity. A good teacher can step on toes without taking the shine off the shoes.
With a light heart, today becomes even better than yesterday.
Re: [Jprogramming] Scanning a large file
I probably was not clear. My question is not how to use mapped files, but where to go from there. Mapped files do not solve the problem directly, since I can't use the same algorithm on them. For instance, cutopen would take tremendous time and space. Moreover, since the length of the lines is not fixed, I can't state the number of columns when mapping the file. In a scalar language (for instance python) I would do:

for line in file.readlines():
    handle_line(line)

This is very efficient spacewise since readlines() reads several blocks at a time. But: 1) As far as I understand, walking over the lines is not the J way. 2) Even if I want that, I didn't find the equivalent of Python's readlines() in the docs. Yoel

On 5/14/06, Björn Helgason [EMAIL PROTECTED] wrote: Answers to your questions are, as you yourself point out, in memory mapped files. Read the labs and experiment with them.
2006/5/14, Yoel Jacobsen [EMAIL PROTECTED]: Hello, I'm new to J so please forgive me if this is a FAQ. I wrote some short sentences to parse a log file. I want to retrieve all the unique values of some attribute. The way it shows in the log file is attribute-name SPACE attribute-value, such as
. csn 92892849893284 ...
My initial (brute force) program is:
text =: 1!:1 <'/tmp/logfile'
words =: cutopen text
bv =: (<'csn') = words
srbv =: _1 |.!.0 bv
csns =: ~. srbv # words
Now csns holds the unique values as requested. The program works fine for small files (few megabytes). My question is, what should be done to make it work for large files (say, 1GB or more)? I guess it involves memory mapped files, but I have no clue where to continue from here. Further, is there any notion of 'laziness' (evaluate only when the data is really needed) in J? Can a verb be declared as a lazy verb? Thanks, Yoel
Re: [Jprogramming] Scanning a large file
You may not have understood what mapped files are. You do not read them into the workarea, and opening a mapped file takes a very short time. The cutopen you mention probably reads all the data into the workarea, and that is not the way mapped files will help your case. As you see in the mapped file labs, mapped files stay outside the workarea; you only bring in the bits you need, when you need them.

2006/5/14, Yoel Jacobsen [EMAIL PROTECTED]: I probably was not clear. My question is not how to use mapped files, but where to go from there. Mapped files do not solve the problem directly, since I can't use the same algorithm on them. [...]
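Björn's point, that a mapped file stays outside the workarea and only the scanned pages are touched, can be illustrated with Python's mmap module (a minimal sketch, not the J mapped-file facility; the file contents are invented, and it assumes every 'csn ' occurrence is followed by a LF-terminated value):

```python
import mmap, os, tempfile

# Write an invented sample log to a temporary file
path = os.path.join(tempfile.mkdtemp(), "logfile")
with open(path, "wb") as f:
    f.write(b"abc qweqwe\ncsn 1234\ndef 123123\ncsn 87654\n")

csns = []
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    pos = mm.find(b"csn ")
    while pos != -1:
        end = mm.find(b"\n", pos)       # value runs to end of line
        val = mm[pos + 4:end].decode()
        if val not in csns:
            csns.append(val)
        pos = mm.find(b"csn ", end)     # resume scan after this line
    mm.close()

print(csns)                             # -> ['1234', '87654']
```

The OS pages in only the regions find() actually scans; nothing is copied into the interpreter until the small value slices are taken.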
Re: [Jprogramming] Scanning a large file
Yoel Jacobsen wrote: My initial (brute force) program is:

  text =: 1!:1 '/tmp/logfile'
  words =: cutopen text
  bv =: (<'csn') = words
  srbv =: _1 |.!.0 bv
  csns =: ~. srbv # words

The program works fine for small files (few megabytes). [...]

Probably the simplest way to handle this is to read the file in large blocks, and chop the blocks into lines. Since lines are of uneven length, a block will likely not end on a line separator, so it needs to be truncated. You don't need to memory map the file. The following example assumes each line ends in LF:

getcsn=: 3 : 0
siz=. fsize y
blk=. 1e7
ptr=. 0
res=. ''
while. ptr < siz do.
  len=. blk <. siz - ptr
  dat=. fread y;ptr,len
  lfx=. 1 + dat i: LF
  ptr=. ptr + lfx
  dat=. <;._2 lfx {. dat
  key=. (dat i.&> ' ') {.each dat
  msk=. key = <'csn'
  res=. ~. res, msk # dat
end.
4 }.each res
)

A=: 0 : 0
abc qweqwe
csn 1234
def 123123
csn 87654
)
   A fwrites F=: jpath '~temp/t1.dat'
41
   getcsn F
+----+-----+
|1234|87654|
+----+-----+
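For comparison, Chris's block-and-truncate scheme (read a large block, cut it back to the last LF, process the whole lines, resume at the cut point) reads naturally in Python too. This is a sketch with invented sample data and a deliberately tiny block size to force several reads; it assumes every line fits within one block:

```python
import os, tempfile

# Invented sample file
path = os.path.join(tempfile.mkdtemp(), "t1.dat")
with open(path, "wb") as f:
    f.write(b"abc qweqwe\ncsn 1234\ndef 123123\ncsn 87654\n")

blk = 16                              # tiny block size, just for the demo
res = []
siz = os.path.getsize(path)
ptr = 0
with open(path, "rb") as f:
    while ptr < siz:
        f.seek(ptr)
        dat = f.read(blk)
        lfx = dat.rfind(b"\n") + 1    # truncate after the last complete line
        ptr += lfx                    # resume from the cut point
        for line in dat[:lfx].split(b"\n")[:-1]:
            if line.startswith(b"csn "):
                v = line[4:].decode()
                if v not in res:
                    res.append(v)

print(res)                            # -> ['1234', '87654']
```

Memory use is bounded by the block size, regardless of file size, just as in the J version.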
Re: [Jprogramming] Scanning a large file
If the file is really large, I prefer regex instead.
Re: [Jprogramming] Scanning a large file
Even for regex, I don't see how to avoid manually reading the file in chunks, which is too imperative a style for me. Again, consider the Python example:

  for line in file.readlines():
      match_object = re.search(r'(?<=csn )\w+', line)
      if match_object:
          process(match_object.group(0))

The regex can be precompiled as well. This works on a 5GB file as well as on small files, since readlines() takes care of reading the file in chunks. Is there a concise way to do this in J?

Yoel

On 5/14/06, bill lam [EMAIL PROTECTED] wrote: If the file is really large, I prefer regex instead.
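Yoel's fragment, filled out runnably: a precompiled lookbehind pattern applied line by line. Iterating the file object directly streams buffered chunks, rather than loading everything as readlines() does; the sample data and the process step (here, collecting into a list) are invented for the demo:

```python
import io, re

pat = re.compile(r"(?<=csn )\w+")     # value preceded by the literal 'csn '
log = io.StringIO("abc qweqwe\ncsn 1234\ndef 123123\ncsn 87654\n")

found = []
for line in log:                      # buffered line-by-line iteration
    m = pat.search(line)
    if m:
        found.append(m.group(0))      # stand-in for process(...)

print(found)                          # -> ['1234', '87654']
```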
Re: [Jprogramming] Scanning a large file
We need a general purpose read-line facility. It is common in the C runtime and in other languages. Although it is possible to do in J, it's better not to redo the low-level stuff every time. Chris has shown how to do it in a way specific to a concrete example. It is suggested to separate the reading part from the processing, so that the reading can be reused. Here is a list of constraints:

- it's OK to assume LF line separators only (no CR)
- read every byte of the file once and only once
- process empty lines
- process a non-terminated last line
- be fast and lean

Here is an approach that keeps the state of file management out of the user code by means of a callback for each line. It calculates wc for a 1MB file on a 2.8GHz Pentium in 1.7 sec.

   (wc FN) , ts'wc FN'
8 20 99 1.6866 95808

NB. =========================================
NB. readlines -- line reader

require 'files'

SB=: 1e6  NB. read block size

readlines=: 1 : 0
assert fexist y
S=. fsize y
P=. 0
B=. ''
while. P < S do.
  B=. B,fread y ; P,SR=. SB<.S-P
  P=. P+SR
  if. (#B) >: L=. 1 + B i:LF do.
    u ;.2 L {. B
    B=. L }. B
  end.
end.
if. #B do. u B end.
)

NB. =========================================
NB. user code

lwc=: 3 : 0
LC=: LC + 1
WC=: WC + #@;: }:^:(LF={:)y
CC=: CC + #y
)

wc=: 3 : 0
LC=: WC=: CC=: 0
lwc readlines y
LC , WC , CC
)

ts=: 6!:2 , 7!:2@]

A=: 2 ((* #) $ ]) 0 : 0
one two three
four five six
seven eight nine
ten
)

0 : 0
(}:A) fwrite FN=: jpath '~temp/t1.txt'
(wc FN) , ts'wc FN'
)

NB. =========================================

--- Chris Burke [EMAIL PROTECTED] wrote: Probably the simplest way to handle this is to read the file in large blocks, and chop the blocks into lines. [...]
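Oleg's separation of a stateful reader from a per-line callback maps directly onto Python. A sketch with invented names and sample data: the reader owns the file position and the carry-over buffer, the callback owns the wc-style counters. Lines are delivered with their LF kept (as with ;.2), and a non-terminated last line is delivered as-is:

```python
import io

def readlines_cb(f, u, blocksize=1 << 20):
    """Read f in blocks; call u once per line, LF kept (like J's ;.2)."""
    buf = b""
    while True:
        chunk = f.read(blocksize)
        if not chunk:
            break
        buf += chunk
        cut = buf.rfind(b"\n") + 1            # last complete line boundary
        for line in buf[:cut].splitlines(keepends=True):
            u(line)
        buf = buf[cut:]                       # carry the partial line over
    if buf:
        u(buf)                                # non-terminated last line

counts = [0, 0, 0]                            # lines, words, chars

def lwc(line):
    counts[0] += 1
    counts[1] += len(line.split())
    counts[2] += len(line)

# Tiny block size forces the carry-over path to be exercised
readlines_cb(io.BytesIO(b"one two\nthree\nfour five six"), lwc, blocksize=8)
print(counts)                                 # -> [3, 6, 27]
```

Every byte is read once, empty lines are delivered, and the buffer never grows past one block plus one partial line, matching the stated constraints.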