RE: [Jprogramming] Scanning a large file

2006-05-17 Thread Oleg Kobchenko
In Office 2003 for Windows, it eventually opens
the file fine, only curses beforehand.

SYLK appears to be a very important format for
exchange between spreadsheet(-gnostic) software,
like RTF for documents and WMF for pictures.

It has clipboard format code number 4 (CF_SYLK), just after CF_METAFILEPICT.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwui/html/msdn_ddeole.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwui/html/msdn_ddesystp.asp

That's why it has priority over CSV, which is
a rather amorphous format anyway -- there is no
such thing as CSV Specification.

Microsoft also likes to honor tradition, like their
BASIC. SYLK also happens to be the file format of
MultiPlan.



--- Joey K Tuttle [EMAIL PROTECTED] wrote:

 At 20:54  -0700 2006/05/16, Oleg Kobchenko wrote:
 http://support.microsoft.com/kb/215591/
 
   ID,NAME
   666,MS
 
 Don' B H8N
 
 Yes - I knew the workaround and even puzzled out that
 the origination of the bug is that SYLK files begin with
 ID;. You would think that some bright programmer could
 decide that if the third character isn't the expected ;
 then it might be just an ordinary text file. Interesting
 how difficult it seems to fix such a simple thing.
 
 BTW, it also (used to) fail in Windows too. Also, it is
 any text file, not just csv.
 
 - joey
 


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 
--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Chris Burke
Oleg Kobchenko wrote:
 We need a general purpose read line functionality.
 It is common in C runtime and in other languages.
 Although it is possible to do in J, it's better not
 to do the low-level stuff every time.

I suggest that we add two new definitions to the files script. One is
Joey's verb to read a LF-terminated block from a file, the other is
Oleg's adverb to apply a function to each line of a file.

In each case, the file is assumed to be in lines terminated by LF, and a
trailing LF is assumed if not present. CR is removed. Blocksize is
hardcoded at 1e6.

Definitions are:

NB.*freadblock v read block from file
NB. y is filename;start position
NB. returns: block;new start position
freadblock=: 3 : 0
'f p'=. y
f=. 8 u: f              NB. for j601
s=. 1!:4 f
if. s = _1 do. return. end.
if. p < s do.
  dat=. 1!:11 f;p,1e6<.s-p
  len=. 1 + dat i: LF
  p=. p + len
  if. len > #dat do.
    if. p > s do.
      dat=. dat, LF
    else.
      'file not in LF-delimited lines' 13!:8[3
    end.
  else.
    dat=. len {. dat
  end.
else.
  dat=. ''
end.
(dat -. CR);p
)

NB.*fapplylines a apply verb to lines in file delimited by LF
fapplylines=: 1 : 0
y=. 8 u: y              NB. for j601
s=. 1!:4 y
if. s = _1 do. return. end.
p=. 0
while. p < s do.
  dat=. 1!:11 y;p,1e6<.s-p
  len=. 1 + dat i: LF
  p=. p + len
  if. len > #dat do.
    if. p > s do.
      dat=. dat, LF
    else.
      'file not in LF-delimited lines' 13!:8[3
    end.
  else.
    dat=. len {. dat
  end.
  u ;._2 dat -. CR
end.
)

With these definitions, Yoel's problem would have solutions like the
following:

getcsn=: 3 : 0
ptr=. 0
res=. i. 0 0
while.
  'dat ptr'=. freadblock y;ptr
  # dat=. <;._2 dat do.
  res=. ~. res, 4 }."1 > dat #~ (<'csn ') = 4 {. each dat
end.
)

readcsn=: 3 : 0
CSN=: i.0 0
readcsn1 fapplylines y
CSN
)

readcsn1=: 3 : 0
if. 'csn ' -: 4 {. y do. CSN=: ~. CSN, 4 }. y end. 0
)
--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Yoel Jacobsen

I have tried it on a 1.2GB file. Since my laptop has only 1GB RAM, I
killed the process when it consumed 500MB (and rising).

Yoel


On 5/15/06, Henry Rich [EMAIL PROTECTED] wrote:


Try

x ([: I. E.) y



--
For information about J forums see http://www.jsoftware.com/forums.htm


RE: [Jprogramming] Scanning a large file

2006-05-16 Thread Miller, Raul D
Chris Burke wrote:
   if. len > #dat do.
 if. p > s do.
   dat=. dat, LF
 else.
   'file not in LF-delimited lines' 13!:8[3

Note that this assumes that the last line of the file is
terminated by a line feed.  Otherwise, there can be a
spurious error if the file is slightly larger than an
even multiple of 1e6.

At minimum, this assumption should be documented.

-- 
Raul
--
For information about J forums see http://www.jsoftware.com/forums.htm


RE: [Jprogramming] Scanning a large file

2006-05-16 Thread Joey K Tuttle

At 09:38  -0400 2006/05/16, Miller, Raul D wrote:

Chris Burke wrote:

   if. len > #dat do.
 if. p > s do.
   dat=. dat, LF
 else.
   'file not in LF-delimited lines' 13!:8[3


Note that this assumes that the last line of the file is
terminated by a line feed.  Otherwise, there can be a
spurious error if the file is slightly larger than an
even multiple of 1e6.

At minimum, this assumption should be documented.



Actually, it needs to be dealt with. Some programs produce
files without a final end of line -- e.g. M$ Excel text files.
I have never understood how they could do that with good
conscience...

- joey
--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Alain Miville de Chêne

It is all relative.
The LF can be seen (as you do) as end of line or as new line.
In the first case, all lines should end with end of line.
In the second, LF cuts one line from another.

When editing a text file, and requesting to place the cursor at end of 
file, with no LF at the end the cursor is placed after the last 
character, somewhere to the right at the end of the line. With an LF at 
the end, it is placed at the beginning of an empty line at the end.


I am not sure it is a M$ problem.

Joey K Tuttle wrote:
...

Actually, it needs to be dealt with. Some programs produce
files without a final end of line -- e.g. M$ Excel text files.
I have never understood how they could do that with good
conscience...

- joey
--
For information about J forums see http://www.jsoftware.com/forums.htm

--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Joey K Tuttle

Certainly, in my experience, LF, CR, or CRLF are considered
as EOL (in ..IX, MAC, PC OSs). Going way back, these things
came from input devices such as the IBM 1050 which was an
early typewriter terminal. It had the charming attribute
that the return key did just that (returned the carriage
as on a typewriter). Then, of course, a line feed was needed
to start a new line on the typewriter... To get the input line entered into
the computer one had to explicitly enter an EOT character - what
fun... I think this idea of a typewriter crept into DOS but it
was considered convenient to imply that the return indicated
EOT as well... This clear thinking was likely a result of
people not looking at things outside of IBM (a mistake IMHO)...

All of these early input devices ended a line to indicate
taking action in an attached processor - the fact that such
input was streamed into memory (and maybe saved on a file)
would indicate that all lines -- including the last one --
ended with a designated character (or two on the 1050 and PC).
Nowhere in the history of how files evolved do I see/remember
a different view - do you know of some thread of computer
evolution that was different and leads you to say it is
relative?

Two memories related to this amuse me. One was in my very
early days using APL from a 1050. The APL system was the
original one in IBM Yorktown Heights Research. My 1050
was in Boulder Colorado. APL\360 had a command to do iMsg e.g.

   )OPR  WHY IS THE SYSTEM SO SLOW?

would post a message to the system operator console. On the 1050
I could send a multi-line message in a single go by not adding
the EOT signal until after the last of the lines. As I type
this story, I realize that I do not know/remember if the 1050 EOT
had to immediately follow a return - (and maybe that is just
such a thread as I asked if you knew of!) Such multi-line operator
messages confused the operator who wondered how it was possible...

The other instance I know of about EOL being strange is in the
TIFF type 2 (FAX) file structure definition. The standard for
that states that all (scan) lines of the document shall begin
(not end) with a New Line character. I ran into cases where
programs didn't do that and while the authors admitted that
it was a bug, the loss of the first scan line on a FAX was
considered acceptable instead of fixing the programs...

Maybe there is some logic like that behind Excel not producing
files with a terminating line end - but you must admit that not
having a line end on the last line certainly could cause one to
wonder if the file was complete, or was the victim of an
accidental ending... Of course, having a line end doesn't ensure
that there wasn't an explosion at the source of the data just as
the last EOL was put in place but before the file was completed -
but EOL just before EOF does provide comfort (not to mention
convenience) that things are OK.

As an example of why I consider the Excel behavior a bug, consider
trying to catenate two Excel text output files together, then using
them as input to Excel. The missing line end becomes an issue...
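The catenation hazard Joey describes is easy to demonstrate. This small Python sketch (the sample data is invented) shows the last record of one EOL-less export fusing with the header of the next, and one way to repair the missing terminator before joining:

```python
# Two hypothetical Excel-style exports, each missing the final EOL.
a = "ID,NAME\n666,MS"        # no trailing newline
b = "ID,NAME\n667,IBM"

naive = a + b                # last line of a fuses with the header of b
assert "666,MSID,NAME" in naive

# A catenation that repairs the missing terminator first:
safe = "".join(s if s.endswith("\n") else s + "\n" for s in (a, b))
assert safe.splitlines()[1:3] == ["666,MS", "ID,NAME"]
```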

In any case, because of programs like Excel, any line reading
program should do its best to provide all the data - and should
likely alert the user that things didn't end cleanly...

- joey


At 10:32  -0400 2006/05/16, Alain Miville de Chêne wrote:

It is all relative.
The LF can be seen (as you do) as end of line or as new line.
In the first case, all lines should end with end of line.
In the second, LF cuts one line from another.

When editing a text file, and requesting to 
place the cursor at end of file,  with no LF at 
the end the cursor is placed after the last 
character somewhere to the right at the end of 
line. With an LF at the end, it is placed at the 
beginning of an empty line at the end.


I am not sure it is a M$ problem.

Joey K Tuttle wrote:
...

Actually, it needs to be dealt with. Some programs produce
files without a final end of line -- e.g. M$ Excel text files.
I have never understood how they could do that with good
conscience...


--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Oleg Kobchenko
These are interesting stories about line terminators.
I agree on providing all the data.
But I think absence of final terminator is more
a stylistic issue (or a matter of choice) than a defect.

Hence, it is more a matter of truthfully conveying the data
than of alerting about cleanliness.

Here's on cygwin:

[EMAIL PROTECTED] ~
$ cat > t1.txt
one
two

[EMAIL PROTECTED] ~
$ cat > t2.txt
one
two
[EMAIL PROTECTED] ~
$ od -c t1.txt
0000000   o   n   e  \r  \n   t   w   o  \r  \n
0000012

[EMAIL PROTECTED] ~
$ od -c t2.txt
0000000   o   n   e  \r  \n   t   w   o
0000010


P.S. Unless it's just an excuse to bash Microsoft again: 
picking on Excel, that $ in the name... If you don't
like it -- don't use it. Any program can do that: 
you can either put EOL at the end or not,
so the chance is 50-50. :-)



--- Joey K Tuttle [EMAIL PROTECTED] wrote:

 In any case, because of programs like Excel, any line reading
 program should do its best to provide all the data - and should
 likely alert the user that things didn't end cleanly...


--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Joey K Tuttle

OK, MS (not bashing women :) Excel - the problem is,
one often doesn't have the choice not to use it in
the sense that people send files exported from Excel...

A case where you can choose not to use it includes things
like trying to use Excel to open a text file that starts
with the ascii characters ID  (or a tab in place of that
blank) -- actually the choice is made for you in that case,
since Excel rejects the file. But I imagine there are many
dark corners like that - and of course you are right, any
program may choose to elide a common sense line ending.
Still, that seems a bit irresponsible in most cases.

- joey

At 10:12  -0700 2006/05/16, Oleg Kobchenko wrote:

These are interesting stories about line terminators.
I agree on providing all the data.
But I think absence of final terminator is more
a stylistic issue (or a matter of choice) than a defect.

Hence, it more like truthful conveying than alerting
cleanliness.

Here's on cygwin:

[EMAIL PROTECTED] ~
$ cat > t1.txt
one
two

[EMAIL PROTECTED] ~
$ cat > t2.txt
one
two
[EMAIL PROTECTED] ~
$ od -c t1.txt
0000000   o   n   e  \r  \n   t   w   o  \r  \n
0000012

[EMAIL PROTECTED] ~
$ od -c t2.txt
0000000   o   n   e  \r  \n   t   w   o
0000010


P.S. Unless it's just an excuse to bash Microsoft again:
picking on Excel, that $ in the name... If you don't
like it -- don't use it. Any program can do that:
you can either put EOL at the end or not,
so the chance is 50-50. :-)



--- Joey K Tuttle [EMAIL PROTECTED] wrote:


 In any case, because of programs like Excel, any line reading
 program should do its best to provide all the data - and should

  likely alert the user that things didn't end cleanly...


--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Alain Miville de Chêne
Our company uses OpenOffice exclusively. It is a mature replacement
for MS Office.


Joey K Tuttle wrote:

OK, MS (not bashing women :) Excel - the problem is,
one often doesn't have the choice not to use it in
the sense that people send files exported from Excel...

...
--
For information about J forums see http://www.jsoftware.com/forums.htm


RE: [Jprogramming] Scanning a large file

2006-05-16 Thread Joey K Tuttle

At 15:29  -0400 2006/05/16, Miller, Raul D wrote:

Joey K Tuttle wrote:

 OK, MS (not bashing women :) Excel - the problem is,
 one often doesn't have the choice not to use it in
 the sense that people send files exported from Excel...


And sometimes those files are broken or virus infected,
etc.

When the files are well formed, typically a person could
use openoffice calc to read them and re-export them in
a more convenient format.

Alternatively, you could ask the original user for a copy
of the file in some other format.

I've gotten quite a bit of mileage from asking people to
save the file as CSV.  In the typical case, CSV is more
than adequate. 


CSV tends to be much easier to process programmatically
(assuming you aren't using some simple thing in excel
for your program -- a reasonable assumption for the case
where the user is exporting the file and you are working
with it in J).

Failing that, asking the user to save the spreadsheet as
XML (which retains all Excel features) might be easier to deal
with than the default binary format.  However, this is not as
simple as CSV.



Raul,

The files that caused me trouble were requested and
supplied in text or csv format, not binary - the fact
that the last line of those files is sans EOL was
always an annoyance (especially if using cut in j ...)

I just did a little test to see if MS Excel still saves
files that way and indeed files saved as .txt .csv .htm
.prn and .dif all end unceremoniously with no EOL (in
my case no CR since I have Mac Excel). I used my other
bug as a test example - here is the complete .csv file:

ID,NAME
666,MS

(of course this example does have an EOL on both lines)

I learned that the behavior of Excel has changed when
trying to open the above file - it used to say "Invalid
File" - now it says "SYLK: file format is not valid."
and then crashes when you acknowledge the error dialog.
I suppose that may mean that they are moving towards a
fix for the bug that has been in every version of Excel
I have looked at. Long live SYLK.

- joey

PS - my version of Excel doesn't include save as XML.
--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Chris Burke
Miller, Raul D wrote:
 Chris Burke wrote:
 
  if. len > #dat do.
    if. p > s do.
      dat=. dat, LF
    else.
      'file not in LF-delimited lines' 13!:8[3
 
 
 Note that this assumes that the last line of the file is
 terminated by a line feed.  Otherwise, there can be a
 spurious error if the file is slightly larger than an
 even multiple of 1e6.
 
 At minimum, this assumption should be documented.

This looks OK to me. The line after the if. statement should handle a
file which is not terminated by LF. The line after the else. statement
should handle a file where a line is longer than 1e6 bytes, and so is
inappropriate for this function.


--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Chris Burke
Oleg Kobchenko wrote:
 It's a great idea to include line reading
 into a standard library. Here are a few comments.
 
 There are two differences from the original
 readlines:
  - overlapped reading (not once and only once)
(with asserting presence of LF in current block)
  - automatic removal of terminators

Agreed on leaving in the LF, in fapplylines. Do you agree on removing
the CR or think this should be left in as well?

I am in two minds on the buffer. It does impact performance, though not
by much. But it means that after the block of 1e6 bytes is read in, it
is immediately copied because it is appended to the tail of the previous
block. So the question is whether this performance hit is worthwhile to
permit the code to be used for stdin or sockets. I don't feel strongly
on this and wonder if there are other opinions on it.
--
For information about J forums see http://www.jsoftware.com/forums.htm


RE: [Jprogramming] Scanning a large file

2006-05-16 Thread Miller, Raul D
Chris Burke wrote:
 I am in two minds on the buffer. It does impact performance, though not
 by much. But it means that after the block of 1e6 bytes is read in, it
 is immediately copied because it is appended to the tail of the previous
 block. So the question is whether this performance hit is worthwhile to
 permit the code to be used for stdin or sockets. I don't feel strongly
 on this and wonder if there are other opinions on it.

If you want to avoid that copy, you could special case the handling
of the line which spans two blocks.
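Raul's special-casing idea can be sketched in Python (the generator name is hypothetical): instead of prepending the carried tail to each whole block, only the single line that spans the block boundary is stitched together, so the bulk of each block is scanned in place.

```python
def lines_no_block_copy(chunks):
    # Yield LF-terminated lines from an iterable of byte blocks, stitching
    # together only the one line that spans each block boundary.
    tail = b""
    for block in chunks:
        cut = block.find(b"\n")
        if cut < 0:                    # whole block is the middle of one line
            tail += block
            continue
        yield tail + block[:cut + 1]   # the spanning line (the only copy made)
        last = block.rfind(b"\n")      # last >= cut here
        if last > cut:                 # complete lines wholly inside the block
            for line in block[cut + 1:last].split(b"\n"):
                yield line + b"\n"
        tail = block[last + 1:]        # start of the next spanning line
    if tail:                           # input did not end with LF
        yield tail
```

Because only the boundary-spanning line is copied, the per-block cost no longer includes duplicating the whole buffer, which is the trade-off Chris and Raul are weighing.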

In the long run, I suspect the issue with this copy would be latency 
not performance.  Usually the processing of a line involves an order
of magnitude more time than copying that line, and the cost of the
1e6 byte copy gets spread over a lot of lines.

And if latency is an issue, the proper solution probably involves
reducing the buffer size (since filling the buffer will also involve
a lot more work than making a copy of it).

-- 
Raul



--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-16 Thread Oleg Kobchenko
I am not sure about overlapped reading either. Raul's idea about
special-casing sounds good, as does the discussion on spreading
the cost of the copy. In my test, the impact was 5-7%
or so -- a good price for streaming.

I think the bottleneck is in the looping in u;._2
and the line processing itself.
I ran the UNIX wc, and it felt like 100x faster.
Then I ran jpm on wc, and the line processing takes
the bulk of the time.

As for removing CR/LF, I would suggest making it an option,
with the default that both are removed. For simplicity, handle
them as one option, because turning them on is
low-level stuff, to be handled as such in user code.

For example:
lineproc fapplylines fname  NB. terminators removed
  1 lineproc fapplylines fname  NB. terminators preserved



I just had another idea: besides the adverb, to have
a conjunction with an additional verb to insert between
line results. Then wc would become:

lwc2=: 1 , #@;:@(CRLF-.~]) , #
1 lwc2 finsertlines + fn



--- Chris Burke [EMAIL PROTECTED] wrote:

 Oleg Kobchenko wrote:
  It's a great idea to include line reading
  into a standard library. Here are a few comments.
  
  There are two differences from the original
  readlines:
   - overlapped reading (not once and only once)
 (with asserting presence of LF in current block)
   - automatic removal of terminators
 
 Agreed on leaving in the LF, in fapplylines. Do you agree on removing
 the CR or think this should be left in as well?
 
 I am in two minds on the buffer. It does impact performance, though not
 by much. But it means that after the block of 1e6 bytes is read in, it
 is immediately copied because it is appended to the tail of the previous
 block. So the question is whether this performance hit is worthwhile to
 permit the code to be used for stdin or sockets. I don't feel strongly
 on this and wonder if there are other opinions on it.
 --
 For information about J forums see http://www.jsoftware.com/forums.htm
 


--
For information about J forums see http://www.jsoftware.com/forums.htm


RE: [Jprogramming] Scanning a large file

2006-05-16 Thread Joey K Tuttle

At 20:54  -0700 2006/05/16, Oleg Kobchenko wrote:

http://support.microsoft.com/kb/215591/


 ID,NAME
 666,MS


Don' B H8N


Yes - I knew the workaround and even puzzled out that
the origination of the bug is that SYLK files begin with
ID;. You would think that some bright programmer could
decide that if the third character isn't the expected ;
then it might be just an ordinary text file. Interesting
how difficult it seems to fix such a simple thing.
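The third-character check Joey proposes is trivial to express. A Python sketch (the SYLK header line shown is illustrative, and the helper name is made up):

```python
def looks_like_sylk(first_bytes):
    # A real SYLK file starts with "ID;" -- "ID" followed by anything
    # else is most likely an ordinary text file, per Joey's suggestion.
    return first_bytes.startswith(b"ID;")

assert looks_like_sylk(b"ID;PWXL;N;E")          # SYLK-style header
assert not looks_like_sylk(b"ID,NAME\n666,MS")  # plain CSV
```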

BTW, it also (used to) fail in Windows too. Also, it is
any text file, not just csv.

- joey
--
For information about J forums see http://www.jsoftware.com/forums.htm


RE: [Jprogramming] Scanning a large file

2006-05-15 Thread Henry Rich
Try

x ([: I. E.) y

to get the list of places where the string x occurs.  This uses
special code and doesn't create the entire result of E. .

Henry Rich
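In a scalar language the same index-of-all-occurrences result can be sketched as follows (the helper name is hypothetical; like E., it reports overlapping matches):

```python
def occurrences(pattern, text):
    # Indices of every (possibly overlapping) occurrence of pattern
    # in text, analogous to x ([: I. E.) y in J.
    out, i = [], text.find(pattern)
    while i >= 0:
        out.append(i)
        i = text.find(pattern, i + 1)  # resume one past the last hit
    return out

assert occurrences("ana", "banana") == [1, 3]
```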

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Yoel Jacobsen
 Sent: Monday, May 15, 2006 10:09 AM
 To: Programming forum
 Subject: Re: [Jprogramming] Scanning a large file
 
 It won't work for large files. E. returns a 'limit error'.
 
 Yoel
 
 On 5/14/06, Joey K Tuttle [EMAIL PROTECTED] wrote:
 
  Yoel,
 
  Some of the feedback you got suggested mapped files, others
  suggested just reading the file. My own habits lean towards
  reading the file and I have a utility verb that gets lines
  while not exceeding a buffer size limit. I find that buffer
  sizes > 100Kbytes generally make almost no difference in
  processing time - in fact, processing can take longer on
  larger chunks. Actually, the gain after 40Kbytes is minor
  indeed.
 
  But in your responses you indicated that you were interested
  in not using (explicit) loops and doing it in a j style yet
  being able to handle large files. j mapped files are certainly
  needed in that case. There was also a suggestion of regex,
  but my experience calling regex from j has been less than
  satisfactory.
 
  In my opinion, these things usually require some thought and
  knowledge of the data and the objectives. If the pattern you
  are searching for is nice (like your keyword 'csn') then
  there are usually pretty good ways to have j gather the data.
  To find an actual example to illustrate, I catenated the past
  8 weeks worth of sendmail logs on my linux system to create
  a file maillogs - here is some experimenting with it -
 
  [EMAIL PROTECTED] mqueue]$ wc maillogs
564175 6987478 75395162 maillogs
 
  that is, the file is 75Mbytes with 564,175 lines
 
  [EMAIL PROTECTED] mqueue]$ ja  # starting jconsole
  version ''
  j504/2005-03-16/15:30
  Running in: Linux
  host 'cat /proc/cpuinfo'
  processor   : 0
  vendor_id   : GenuineIntel
  cpu family  : 6
  model   : 5
  model name  : Pentium II (Deschutes)
  stepping: 2
  cpu MHz : 399.071
  cache size  : 512 KB
 
 
  NB. not a very fast machine, but it does have 1Gbyte ram available
 
  require 'jmf'
  JCHAR map_jmf_ 'mls';'maillogs';'';1
  NB. HIGHLY recommended to map read only... that is the 1 at the
  NB. end of the mapping expression. There is a vicious side effect
  NB. (IMHO a BUG) in setting an alias of a mapped name within a verb.
 
  NB. My example is to get the size of messages that passed through
  NB. sendmail. Typically there is a phrase like   size=1234,  in
  NB. the log. The following is based on that.
 
  delim =: ','
  tag =: 'size='
 
  timex 'tagis =: I. tag E. mls'   NB. time and space to get indexes
  3.49947 1.34481e8
  timex 'sizes =: delim (_1: ". (] i."1 [) {."0 1 ]) (tagis +/ (#tag)+i. 12){mls'
  0.431585 1.37452e7
  $sizes
  43947
  +/ x: sizes
  11572953524
 
  Maybe these are some ideas you can use to attack your problem.
 
  - joey
 
 
  At 11:01  +0300 2006/05/14, Yoel Jacobsen wrote:
  Hello,
  
  I'm new to J so please forgive me if this is a FAQ.
  
  I wrote some short sentences to parse a log file. I want 
 to retrieve all
  the
  unique values of some attribute. The way it shows in the 
 log file is
  <attribute name><SPACE><attribute value>, such as ... csn 
  92892849893284 ...
  ...
  
  My initial (brute force) program is:
  
  text =: 1!:1 < '/tmp/logfile'
  words =: cutopen text
  bv =: (<'csn') = words
  srbv =: _1 |.!.0 bv
  csns =: ~. srbv # words
  
  Now csns holds the unique values as requested.
  
  The program works fine for small files (few megabytes).
  
  My question is, what should be done to make it work for 
 large files (say,
  1GB or more)? I guess it involves memory mapped files but 
 I have no clue
  where to continue from here.
  
  Further, is there any notion of 'laziness' (evaluate only 
 when the data
  is
  really needed) in J? can a verb be declared as a lazy verb?
  
  Thanks,
  
  Yoel
  

--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-15 Thread Yoel Jacobsen

It won't work for large files. E. returns a 'limit error'.

Yoel

On 5/14/06, Joey K Tuttle [EMAIL PROTECTED] wrote:


Yoel,

Some of the feedback you got suggested mapped files, others
suggested just reading the file. My own habits lean towards
reading the file and I have a utility verb that gets lines
while not exceeding a buffer size limit. I find that buffer
sizes > 100Kbytes generally make almost no difference in
processing time - in fact, processing can take longer on
larger chunks. Actually, the gain after 40Kbytes is minor
indeed.

But in your responses you indicated that you were interested
in not using (explicit) loops and doing it in a j style yet
being able to handle large files. j mapped files are certainly
needed in that case. There was also a suggestion of regex,
but my experience calling regex from j has been less than
satisfactory.

In my opinion, these things usually require some thought and
knowledge of the data and the objectives. If the pattern you
are searching for is nice (like your keyword 'csn') then
there are usually pretty good ways to have j gather the data.
To find an actual example to illustrate, I catenated the past
8 weeks worth of sendmail logs on my linux system to create
a file maillogs - here is some experimenting with it -

[EMAIL PROTECTED] mqueue]$ wc maillogs
  564175 6987478 75395162 maillogs

that is, the file is 75Mbytes with 564,175 lines

[EMAIL PROTECTED] mqueue]$ ja  # starting jconsole
version ''
j504/2005-03-16/15:30
Running in: Linux
host 'cat /proc/cpuinfo'
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 5
model name  : Pentium II (Deschutes)
stepping: 2
cpu MHz : 399.071
cache size  : 512 KB
   

NB. not a very fast machine, but it does have 1Gbyte ram available

require 'jmf'
JCHAR map_jmf_ 'mls';'maillogs';'';1
NB. HIGHLY recommended to map read only... that is the 1 at the
NB. end of the mapping expression. There is a vicious side effect
NB. (IMHO a BUG) in setting an alias of a mapped name within a verb.

NB. My example is to get the size of messages that passed through
NB. sendmail. Typically there is a phrase like   size=1234,  in
NB. the log. The following is based on that.

delim =: ','
tag =: 'size='

timex 'tagis =: I. tag E. mls'   NB. time and space to get indexes
3.49947 1.34481e8
timex 'sizes =: delim (_1: ". (] i."1 [) {."0 1 ]) (tagis +/ (#tag)+i. 12){mls'
0.431585 1.37452e7
$sizes
43947
+/ x: sizes
11572953524
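The same extraction can be sketched in Python with a regular expression (the log lines below are illustrative; Joey notes that regex-from-J was unsatisfactory for him, but the pattern itself is simple):

```python
import re

def message_sizes(log_text):
    # Collect the numbers from phrases like " size=1234," in a log,
    # the same extraction Joey performs with tag/delim in J.
    return [int(m) for m in re.findall(r"size=(\d+),", log_text)]

log = "to=<a>, size=1234, ...\nto=<b>, size=99, ...\n"
assert message_sizes(log) == [1234, 99]
assert sum(message_sizes(log)) == 1333
```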

Maybe these are some ideas you can use to attack your problem.

- joey


At 11:01  +0300 2006/05/14, Yoel Jacobsen wrote:
Hello,

I'm new to J so please forgive me if this is a FAQ.

I wrote some short sentences to parse a log file. I want to retrieve all
the
unique values of some attribute. The way it shows in the log file is
<attribute name><SPACE><attribute value>, such as ... csn 92892849893284 ...
...

My initial (brute force) program is:

text =: 1!:1 < '/tmp/logfile'
words =: cutopen text
bv =: (<'csn') = words
srbv =: _1 |.!.0 bv
csns =: ~. srbv # words

Now csns holds the unique values as requested.

The program works fine for small files (few megabytes).

My question is, what should be done to make it work for large files (say,
1GB or more)? I guess it involves memory mapped files but I have no clue
where to continue from here.

Further, is there any notion of 'laziness' (evaluate only when the data
is
really needed) in J? can a verb be declared as a lazy verb?

Thanks,

Yoel


--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-15 Thread Joey K Tuttle

At 12:14  -0300 2006/05/15, Randy MacDonald wrote:

  limit is only 2GB

A phrase I thought I'd _never_ hear
---


indeed ... and presumably not applicable on 64-bit systems...
--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-15 Thread Yoel Jacobsen

Some of the clients of the company I'm working for work with files up to
a terabyte long, usually in physics, life science, simulation, etc.

The new file system in Solaris (ZFS) is a 128-bit FS.

Anyway, data mining from log files is an important use of a language for me.
I am very pleased with the interactivity and conciseness of the process in
J. I just have to understand how to parse the largest files elegantly (i.e.
with user code which is at least as elegant as the 3-statement Python
program from my previous post).

Thanks for all the enlightening replies!

Yoel



On 5/15/06, Randy MacDonald [EMAIL PROTECTED] wrote:


 limit is only 2GB

A phrase I thought I'd _never_ hear



--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-14 Thread Björn Helgason

The answers to your questions are, as you yourself point out, in memory
mapped files. Read the labs and experiment with them.

2006/5/14, Yoel Jacobsen [EMAIL PROTECTED]:


Hello,

I'm new to J so please forgive me if this is a FAQ.

I wrote some short sentences to parse a log file. I want to retrieve all
the
unique values of some attribute. The way it shows in the log file is
<attribute name><SPACE><attribute value>, such as ... csn 92892849893284 ...
...

My initial (brute force) program is:

text =: 1!:1 < '/tmp/logfile'
words =: cutopen text
bv =: (<'csn') = words
srbv =: _1 |.!.0 bv
csns =: ~. srbv # words

Now csns holds the unique values as requested.

The program works fine for small files (few megabytes).

My question is, what should be done to make it work for large files (say,
1GB or more)? I guess it involves memory mapped files but I have no clue
where to continue from here.

Further, is there any notion of 'laziness' (evaluate only when the data is
really needed) in J? Can a verb be declared as a lazy verb?

Thanks,

Yoel
--
For information about J forums see http://www.jsoftware.com/forums.htm





--
Björn Helgason, Verkfræðingur
FuglFiskur ehf, Þerneyjarsund 23,
Skype: gosiminn, gsm: +3546985532
801 Grímsnes ,t-póst: [EMAIL PROTECTED]
Landslags og skrúðgarðagerð, gröfuþjónusta
http://groups.google.com/group/J-Programming


Tæknikunnátta höndlar hið flókna, sköpunargáfa er meistari einfaldleikans

góður kennari getur stigið á tær án þess að glansinn fari af skónum
 /|_  .---.

,'  .\  /  | Með léttri lund verður|
,--'_,'   | Dagurinn í dag |
   /   /   | Enn betri en gærdagurinn  |
  (   -.  |`---'
  | ) |
 (`-.  '--.)
  `. )'
--
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Scanning a large file

2006-05-14 Thread Yoel Jacobsen

I probably was not clear.

My question is not how to use mapped files, but where to go from there.
Mapped files do not solve the problem directly since I can't use the same
algorithm on them.

For instance, cutopen would take tremendous time and space. Moreover, since
the length of the lines is not fixed, I can't state the number of columns
when mapping the file.

In a scalar language (for instance Python) I would do:

for line in file.readlines():
    handle_line(line)

This is very efficient spacewise since readlines() reads several blocks at a
time.

But:
1) As far as I understand, walking over the lines is not the J way.
2) Even if I want that, I didn't find the equivalent of Python's readlines()
in the docs.

Yoel
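[For comparison, here is a sketch of the same csn extraction written around
Python's line streaming; iterating the file object pulls lines from an
internal buffer rather than materializing the whole file. The name
unique_csns is illustrative, not from any library.]

```python
# Sketch: stream a log file line by line and collect the values that
# follow each 'csn' attribute name. Only one line is held in memory
# at a time, so arbitrarily large files are fine.
def unique_csns(path):
    seen = set()
    with open(path) as f:
        for line in f:                     # buffered, line-at-a-time
            parts = line.split()
            # the value is the word following each 'csn' attribute name
            for name, value in zip(parts, parts[1:]):
                if name == "csn":
                    seen.add(value)
    return seen
```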

On 5/14/06, Björn Helgason [EMAIL PROTECTED] wrote:


Answers to your questions are, as you yourself point out, in memory mapped files.
Read the labs and experiment with them.

2006/5/14, Yoel Jacobsen [EMAIL PROTECTED]:

 [...]



--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-14 Thread Björn Helgason

You may not have understood what mapped files are.
You do not read them into the workarea.
Opening a mapped file takes a very short time.
The cutopen you mention probably reads all the data into the workarea, and
that is not the way mapped files will help your case.
As you see in the mapped file labs, the mapped files stay outside the
workarea.
You only bring in the bits you need when you need them.


2006/5/14, Yoel Jacobsen [EMAIL PROTECTED]:


I probably was not clear.

My question is not how to use mapped files, but where to go from there.
Mapped files do not solve the problem directly since I can't use the same
algorithm on them.

For instance, cutopen would take tremendous time and space. Moreover, since
the length of the lines is not fixed, I can't state the number of columns
when mapping the file.

In a scalar language (for instance Python) I would do:

for line in file.readlines():
    handle_line(line)

This is very efficient spacewise since readlines() reads several blocks at a
time.

But:
1) As far as I understand, walking over the lines is not the J way.
2) Even if I want that, I didn't find the equivalent of Python's
readlines()
in the docs.

Yoel

On 5/14/06, Björn Helgason [EMAIL PROTECTED] wrote:

 Answers to your questions are, as you yourself point out, in memory mapped files.
 Read the labs and experiment with them.

 2006/5/14, Yoel Jacobsen [EMAIL PROTECTED]:

  [...]





--
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Scanning a large file

2006-05-14 Thread Chris Burke
Yoel Jacobsen wrote:
 I wrote some short sentences to parse a log file. I want to retrieve all
 the
 unique values of some attribute. The way it shows in the log file is
 <attribute name><SPACE><attribute value>, such as ... csn 92892849893284 ...
 
 My initial (brute force) program is:
 
 text =: 1!:1 < '/tmp/logfile'
 words =: cutopen text
 bv =: (<'csn') = words
 srbv =: _1 |.!.0 bv
 csns =: ~. srbv # words
 
 Now csns holds the unique values as requested.
 
 The program works fine for small files (few megabytes).

Probably the simplest way to handle this is to read the file in large
blocks, and chop the blocks into lines. Since lines are of uneven
length, the blocks will likely not end in a line separator, so they need to
be truncated.

You don't need to memory map the file.

The following example assumes each line ends in LF:

getcsn=: 3 : 0
siz=. fsize y
blk=. 1e7
ptr=. 0
res=. ''
while. ptr < siz do.
  len=. blk <. siz - ptr
  dat=. fread y;ptr,len
  lfx=. 1 + dat i: LF
  ptr=. ptr + lfx
  dat=. <;._2 lfx {. dat
  key=. (dat i.each ' ') {. each dat
  msk=. key = <'csn'
  res=. ~. res, msk # dat
end.
4 }. each res
)

A=: 0 : 0
abc qweqwe
csn 1234
def 123123
csn 87654
)

   A fwrites F=: jpath '~temp/t1.dat'
41

   getcsn F
+----+-----+
|1234|87654|
+----+-----+
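[For readers more comfortable in Python, the same block-read-and-truncate
scheme can be sketched as follows. The names getcsn and blk are illustrative;
the cut at the last LF plays the role of the `lfx` truncation above.]

```python
# Sketch: read fixed-size chunks, keep only whole lines by cutting each
# buffer at its last LF, and carry the remainder into the next read.
def getcsn(path, blk=10**7):
    seen, res = set(), []
    rest = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(blk)
            if not chunk:
                break
            buf = rest + chunk
            cut = buf.rfind(b"\n") + 1          # end of last complete line
            rest, buf = buf[cut:], buf[:cut]
            for line in buf.splitlines():
                key, _, val = line.partition(b" ")
                if key == b"csn" and val not in seen:
                    seen.add(val)
                    res.append(val.decode())
    if rest:                                     # non-terminated last line
        key, _, val = rest.partition(b" ")
        if key == b"csn" and val not in seen:
            res.append(val.decode())
    return res
```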

--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-14 Thread bill lam

If the file is really large, I prefer regex instead.

--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-14 Thread Yoel Jacobsen

Even with regex, I don't see how to avoid manually reading the file in chunks,
which is too imperative a style for me. Again, consider the Python example:

for line in file.readlines():
    match_object = re.search(r"(?<= csn )\w+", line)
    if match_object:
        process(match_object.group(0))

The regex can be precompiled as well.

This works on a 5GB file as well as on small files since readlines() takes
care of reading the file in chunks.

Is there a concise way to do it in J?

Yoel
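[A sketch of how the precompiled-regex idea can be applied chunk-wise rather
than line by line; cutting each chunk at its last newline keeps a match from
straddling a read boundary. The names CSN and scan_csns are illustrative.]

```python
import re

# Precompiled lookbehind: match the word following ' csn '.
CSN = re.compile(r"(?<= csn )\w+")

def scan_csns(path, blk=10**6):
    found, rest = [], ""
    with open(path) as f:
        while True:
            chunk = f.read(blk)
            if not chunk:
                break
            buf = rest + chunk
            cut = buf.rfind("\n") + 1    # only scan complete lines
            rest, buf = buf[cut:], buf[:cut]
            found.extend(CSN.findall(buf))
    found.extend(CSN.findall(rest))      # non-terminated last line
    return found
```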

On 5/14/06, bill lam [EMAIL PROTECTED] wrote:


If the file is really large, I prefer regex instead.



--
For information about J forums see http://www.jsoftware.com/forums.htm


Re: [Jprogramming] Scanning a large file

2006-05-14 Thread Oleg Kobchenko
We need general purpose read-line functionality.
It is common in the C runtime and in other languages.
Although it is possible to do in J, it's better not
to do the low-level stuff every time.

Chris has shown how to do it in a way specific to a concrete example. I
suggest separating the reading part from the processing, so that the
reading can be reused.

Here is a list of constraints:
 - it's OK to assume LF line separators only (no CR)
 - read every byte of the file once and only once
 - process empty lines
 - process a non-terminated last line
 - be fast and lean

Here is an approach that keeps the state of
file management out of the user code by means of
a callback for each line.

It calculates wc for a 1MB file on a 2.8GHz Pentium in 1.7 sec.

   (wc FN) , ts'wc FN'
8 20 99 1.6866 95808


NB. =
NB. readlines -- line reader

require 'files'

SB=: 1

readlines=: 1 : 0
  assert fexist y
  S=. fsize y
  P=. 0
  B=. ''
  while. P < S do.
    B=. B,fread y ; P,SR=. SB<.S-P
    P=. P+SR
    if. (#B) >: L=. 1 + B i:LF do.
      u ;.2 L {. B
      B=.   L }. B
    end.
  end.
  if. #B do. u B end.
)

NB. =
NB. user code

lwc=: 3 : 0
  LC=: LC + 1
  WC=: WC + #@;: }:^:(LF={:)y
  CC=: CC + #y
)

wc=: 3 : 0
  LC=: WC=: CC=: 0
  lwc readlines y
  LC , WC , CC
)

ts=: 6!:2 , 7!:2@]

A=: 2 ((* #) $ ]) 0 : 0
one two three four five

six seven
eight nine ten
)

0 : 0
  (}:A) fwrite FN=: jpath '~temp/t1.txt'
  (wc FN) , ts'wc FN'
)
NB. =
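[The same separation of reading from processing can be sketched in Python
with a callback-based reader: the reader owns the buffering and hands each
line (terminator included) to user code, which never touches file offsets.
The names readlines_cb and wc are illustrative.]

```python
# Reader: accumulate chunks, cut at the last LF, and invoke the callback
# once per line (with its LF). A non-terminated last line is delivered too.
def readlines_cb(path, callback, blk=10**6):
    rest = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(blk)
            if not chunk:
                break
            rest += chunk
            cut = rest.rfind(b"\n") + 1
            if cut:
                for line in rest[:cut].splitlines(keepends=True):
                    callback(line)
                rest = rest[cut:]
    if rest:                         # non-terminated last line
        callback(rest)

# User code: a wc-style line/word/character count built on the reader.
def wc(path):
    counts = [0, 0, 0]               # lines, words, chars
    def tally(line):
        counts[0] += 1
        counts[1] += len(line.split())
        counts[2] += len(line)
    readlines_cb(path, tally)
    return counts
```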


--- Chris Burke [EMAIL PROTECTED] wrote:

 [...]


--
For information about J forums see http://www.jsoftware.com/forums.htm