Re: I found a sort bug! - How to sort big files?

2015-06-24 Thread Jan Stary
On Mar 16 11:36:08, o...@drijf.net wrote:
 On Mon, Mar 16, 2015 at 10:20:07AM +, Stuart Henderson wrote:
 
  On 2015-03-15, Todd C. Miller todd.mil...@courtesan.com wrote:
   On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:
  
   I think the consensus was to try and replace it with another version but
   not sure what happened.
  
   I have a port of the FreeBSD sort but it is slower than our current
   sort (and slower than GNU sort).
  
  Personally I think that is a reasonable trade-off for more actively
  developed code, and when I tried it on some difficult files it coped
  better than our current sort (not that this small sample means much
  in terms of ability to handle every difficult file).
 
 Current sort(1) is unmaintainable in many ways. I say switch.

Incidentally, reading up on UNIX history, I came across this:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6771921



Re: I found a sort bug! - How to sort big files?

2015-03-17 Thread Steve Litt
On Tue, 17 Mar 2015 08:58:56 +1300
worik worik.stan...@gmail.com wrote:

 On 16/03/15 06:43, Steve Litt wrote:
  But IMHO, sorting 60 megalines isn't something I would expect a
  generic sort command to do easily and quickly out of the box.
 
 I would.  These days such files are getting more and more common.
 
 But there is a warning in the man page for sort under BUGS:
 
  To sort files larger than 60MB, use sort -H; files larger than
 704MB must be sorted in smaller pieces, then merged.
 
 So it seems there is a bug in... files larger than 60MB, use sort -H
 since that did not work for the OP.
 
 Worik

Oh, jeez, you put your finger *right* on the problem, Worik. Both I and
the OP read the manpage wrong. sort -H won't work for extremely big
files (more than 704MB). But there's a fairly easy solution...

An average line length can be found by running wc and dividing bytes by
lines. Then figure out how many lines would make about a 10MB file, and
use split -l to split the file into smaller files with that many lines.
Then sort each of those files, with no special options, and finally use
sort -m to merge them all back together into one sorted file.

According to the man page, the preceding should work just fine, and it
can pretty much be automated with a simple shell script, so you can set
it to run and have it work while you do other things.
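
Something like this minimal sketch would do the whole job (untested;
the script name, the 10MB chunk size from above, and the temp-file
prefix are placeholders):

  #!/bin/sh
  # sortbig.sh - split/sort/merge a file too big for one sort(1) run
  # usage: sortbig.sh input output   (hypothetical name and arguments)
  in=$1 out=$2

  # average line length = bytes / lines, both taken from wc
  set -- $(wc -cl "$in")
  lines=$1 bytes=$2

  # number of lines that add up to roughly a 10MB chunk
  chunk=$((10 * 1024 * 1024 * lines / bytes))

  split -l "$chunk" "$in" piece.
  for f in piece.*; do
    sort "$f" -o "$f"        # sort each piece in place
  done
  sort -m piece.* -o "$out"  # merge the sorted pieces
  rm -f piece.*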

SteveT

Steve Litt*  http://www.troubleshooters.com/
Troubleshooting Training  *  Human Performance



Re: I found a sort bug! - How to sort big files?

2015-03-16 Thread Stuart Henderson
On 2015-03-15, Todd C. Miller todd.mil...@courtesan.com wrote:
 On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:

 I think the consensus was to try and replace it with another version but
 not sure what happened.

 I have a port of the FreeBSD sort but it is slower than our current
 sort (and slower than GNU sort).

Personally I think that is a reasonable trade-off for more actively
developed code, and when I tried it on some difficult files it coped
better than our current sort (not that this small sample means much
in terms of ability to handle every difficult file).



Re: I found a sort bug! - How to sort big files?

2015-03-16 Thread Otto Moerbeek
On Mon, Mar 16, 2015 at 10:20:07AM +, Stuart Henderson wrote:

 On 2015-03-15, Todd C. Miller todd.mil...@courtesan.com wrote:
  On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:
 
  I think the consensus was to try and replace it with another version but
  not sure what happened.
 
  I have a port of the FreeBSD sort but it is slower than our current
  sort (and slower than GNU sort).
 
 Personally I think that is a reasonable trade-off for more actively
 developed code, and when I tried it on some difficult files it coped
 better than our current sort (not that this small sample means much
 in terms of ability to handle every difficult file).

Current sort(1) is unmaintainable in many ways. I say switch.

-Otto



Re: I found a sort bug! - How to sort big files?

2015-03-16 Thread Paul Stoeber
 Current sort(1) is unmaintainable in many ways. I say switch.

I've seen with gdb that the current sort(1) somehow manages to make
radixsort(3) do the work when the sort key is somewhere in the middle
of the line. I don't even want to know... (and my reading
comprehension of C is too weak to go and look). Yes, switch, please!



Re: I found a sort bug! - How to sort big files?

2015-03-16 Thread worik
On 16/03/15 06:43, Steve Litt wrote:
 But IMHO, sorting 60 megalines isn't something I would expect a
 generic sort command to do easily and quickly out of the box.

I would.  These days such files are getting more and more common.

But there is a warning in the man page for sort under BUGS:

 To sort files larger than 60MB, use sort -H; files larger than
704MB must be sorted in smaller pieces, then merged.

So it seems there is a bug in... files larger than 60MB, use sort -H
since that did not work for the OP.

Worik
-- 
Why is the legal status of chardonnay different to that of cannabis?
   worik.stan...@gmail.com 021-1680650, (03) 4821804
  Aotearoa (New Zealand)
 I voted for love



Re: Fwd: Re: I found a sort bug! - How to sort big files?

2015-03-16 Thread Stuart Henderson
On 2015-03-15, sort problem sortprob...@safe-mail.net wrote:
 So the default sort command is a big pile of shit when it comes to files 
 bigger than 60 MByte? .. lol

It's probably not the size, but rather the contents of the files.



Fwd: Re: I found a sort bug! - How to sort big files?

2015-03-15 Thread sort problem
Whoops. At least I thought it helped. The default sort with -H ran for 
132 minutes and then said: no space left on /home (which had 111 GBytes 
free before the sort command). And btw, the df command reported the free 
space as -18 GByte, 104%.. what? Some kind of reserved space for root?


Why does it take more than 111 GBytes to sort -u ~600 MByte of files? 
This is nonsense. 


So the default sort command is a big pile of shit when it comes to files 
bigger than 60 MByte? .. lol

I can send the ~600 MByte txt files compressed if needed...

I was surprised... sort is a very old command..


 Original Message 
From: sort problem sortprob...@safe-mail.net
To: andreas.zeilme...@mailbox.org
Cc: misc@openbsd.org
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 08:39:55 -0400

o.m.g. It works. 

Why doesn't sort use this by default on files larger than 60 MByte? 

Thanks!

 Original Message 
From: Andreas Zeilmeier andreas.zeilme...@mailbox.org
Apparently from: owner-misc+m147...@openbsd.org
To: misc@openbsd.org
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 13:16:05 +0100

 On 03/14/15 12:49, sort problem wrote:
  Hello, 
  
  --
  # uname -a
  OpenBSD notebook.lan 5.6 GENERIC.MP#333 amd64
  # 
  # du -sh small/
  663M    small/
  # ls -lah small/*.txt | wc -l
        43
  # 
  # cd small
  # ulimit -n
  1000
  # sysctl | grep -i maxfiles
  kern.maxfiles=10
  # 
  # grep open /etc/login.conf
  :openfiles-cur=10:\
  :openfiles-cur=128:\
  :openfiles-cur=512:\
  # 
  # sort -u *.txt -o out
  Segmentation fault (core dumped)
  # 
  --
  
  This is after about a minute of running.. The txt files have UTF-8 chars 
  too. A line is at most a few tens of chars long in the txt files. All the 
  txt files have UNIX EOLs. There is enough storage, enough RAM, enough CPU. 
  I'm even trying this as the root user. The txt files are about ~60 000 000 
  lines in total.. not a big number... a reboot didn't help. 
  
  
  
  Any ideas how I can use the sort command to actually sort? Please help!
  
  
  
  Thanks, 
  
  btw, this happens on other UNIX OSes too, lol... why do we have the sort 
  command if it doesn't work?
  
 
 Hi,
 
 have you tried the option '-H'?
 The manpage suggested this for files > 60MB.
 
 
 Regards,
 
 Andi



Re: I found a sort bug! - How to sort big files?

2015-03-15 Thread Kenneth Gober
I don't know why sort is giving you such problems.  there may be something
unusual about your specific input that it wasn't designed to handle (or it
might simply be a latent bug that has never been identified and fixed).

when I need to sort large files, I split(1) them into smaller pieces, then
sort(1) the pieces individually, then use sort(1) (with the -m option) to
merge the sorted pieces into a single large result file.  this has always
worked reliably for me (and because I was raised using 8-bit and 16-bit
computers I don't have any special expectations that programs should just
work when given very large inputs).
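
concretely, the workflow is something like this (the file names and the
100000-line chunk size are made up for illustration):

  # split the input into 100000-line pieces named xaa, xab, ...
  split -l 100000 bigfile.txt
  # sort each piece in place
  for f in x??; do sort "$f" -o "$f"; done
  # merge the sorted pieces into a single large result file
  sort -m x?? -o result.txt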

even if you think doing all this is too much bother, try doing it just
once.  you might be able to identify a specific chunk of your input that's
causing the problem, which will help move us all towards a proper solution
(or at least a caveat in the man page).

-ken

On Sun, Mar 15, 2015 at 9:53 AM, sort problem sortprob...@safe-mail.net
wrote:

 Whoops. At least I thought it helped. The default sort with -H ran
 for 132 minutes and then said: no space left on /home (which had 111
 GBytes free before the sort command). And btw, the df command reported
 the free space as -18 GByte, 104%.. what? Some kind of reserved space
 for root?


 Why does it take more than 111 GBytes to sort -u ~600 MByte of
 files? This is nonsense.


 So the default sort command is a big pile of shit when it comes to
 files bigger than 60 MByte? .. lol

 I can send the ~600 MByte txt files compressed if needed...

 I was surprised... sort is a very old command..


  Original Message 
 From: sort problem sortprob...@safe-mail.net
 To: andreas.zeilme...@mailbox.org
 Cc: misc@openbsd.org
 Subject: Re: I found a sort bug! - How to sort big files?
 Date: Sat, 14 Mar 2015 08:39:55 -0400

 o.m.g. It works.

 Why doesn't sort use this by default on files larger than 60 MByte?

 Thanks!

  Original Message 
 From: Andreas Zeilmeier andreas.zeilme...@mailbox.org
 Apparently from: owner-misc+m147...@openbsd.org
 To: misc@openbsd.org
 Subject: Re: I found a sort bug! - How to sort big files?
 Date: Sat, 14 Mar 2015 13:16:05 +0100

  On 03/14/15 12:49, sort problem wrote:
   Hello,
  
   --
   # uname -a
   OpenBSD notebook.lan 5.6 GENERIC.MP#333 amd64
   #
   # du -sh small/
   663M    small/
   # ls -lah small/*.txt | wc -l
         43
   #
   # cd small
   # ulimit -n
   1000
   # sysctl | grep -i maxfiles
   kern.maxfiles=10
   #
   # grep open /etc/login.conf
   :openfiles-cur=10:\
   :openfiles-cur=128:\
   :openfiles-cur=512:\
   #
   # sort -u *.txt -o out
   Segmentation fault (core dumped)
   #
   --
  
   This is after about a minute of running.. The txt files have UTF-8
 chars too. A line is at most a few tens of chars long in the txt files. All
 the txt files have UNIX EOLs. There is enough storage, enough RAM, enough
 CPU. I'm even trying this as the root user. The txt files are about
 ~60 000 000 lines in total.. not a big number... a reboot didn't help.
  
  
  
   Any ideas how I can use the sort command to actually sort? Please
 help!
  
  
  
   Thanks,
  
   btw, this happens on other UNIX OSes too, lol... why do we have the sort
 command if it doesn't work?
  
 
  Hi,
 
  have you tried the option '-H'?
  The manpage suggested this for files > 60MB.
 
 
  Regards,
 
  Andi



Re: I found a sort bug! - How to sort big files?

2015-03-15 Thread Steve Litt
On Sun, 15 Mar 2015 09:53:34 -0400
sort problem sortprob...@safe-mail.net wrote:

 Whoops. At least I thought it helped. The default sort with -H ran
 for 132 minutes and then said: no space left on /home (which had 111
 GBytes free before the sort command). 

That's not surprising. -H implements a merge sort, meaning the input is
split into lots and lots of temporary files, each of which may again be
split into lots of files, etc. It wouldn't surprise me to see a
60-megaline file consume a huge multiple of its own size during a merge
sort.

And of course, the algorithm might be swapping.

 And btw, the df command reported
 the free space as -18 GByte, 104%.. what? Some kind of reserved space
 for root?
 
 
 Why does it take more than 111 GBytes to sort -u ~600 MByte of
 files? This is nonsense. 
 
 
 So the default sort command is a big pile of shit when it comes to
 files bigger than 60 MByte? .. lol

That doesn't surprise me. You originally said you have 60 million
lines. Sorting 60 million items is a difficult task for any algorithm.
You don't say how long each line is, or what they contain, or whether
they're all the same line length.

How would *you* sort so many items, and sort them in a fast yet generic
way? I mean, if RAM and disk space are at a premium, you could always
use a bubble sort, and in-place sort your array in a year or two.

If I were in your shoes, I'd write my own sort routine for the task,
perhaps using qsort() (see
http://calmerthanyouare.org/2013/05/31/qsort-shootout.html). If there's
a way you can convert line contents into a number reflecting
alpha-order, you could even qsort() in RAM, provided you have quite a
bit of it; the last step would then be to run through the sorted list
of numbers and line numbers and write out the original file by line
number. There are probably a thousand other ways to do it.

But IMHO, sorting 60 megalines isn't something I would expect a generic
sort command to do easily and quickly out of the box.

SteveT

Steve Litt*  http://www.troubleshooters.com/
Troubleshooting Training  *  Human Performance



Re: Fwd: Re: I found a sort bug! - How to sort big files?

2015-03-15 Thread Ted Unangst
sort problem wrote:
 So the default sort command is a big pile of shit when it comes to files 
 bigger than 60 MByte? .. lol
 
 I can send the ~600 MByte txt files compressed if needed...
 
 I was surprised... sort is a very old command..

I think you have discovered the answer. :(



Re: I found a sort bug! - How to sort big files?

2015-03-14 Thread Todd C. Miller
On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:

 I think the consensus was to try and replace it with another version but
 not sure what happened.

I have a port of the FreeBSD sort but it is slower than our current
sort (and slower than GNU sort).

 - todd



Re: I found a sort bug! - How to sort big files?

2015-03-14 Thread Stuart Henderson
On 2015-03-14, sort problem sortprob...@safe-mail.net wrote:
 # sort -u *.txt -o out
 Segmentation fault (core dumped)

There are some known bugs in sort; I ran into a file it couldn't cope with a
couple of years ago too, but it doesn't happen all that often.

I think the consensus was to try and replace it with another version but
not sure what happened.

For your current problem you could pkg_add coreutils and try gsort;
maybe it will cope with your files better..
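
For example (GNU sort accepts the same -u and -o options used in your
transcript; file names unchanged):

  # pkg_add coreutils
  # gsort -u *.txt -o out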

 btw, this happens on other UNIX OSes too, lol... why do we have the sort 
 command if it doesn't work?

Normally it does work.



Re: I found a sort bug! - How to sort big files?

2015-03-14 Thread Andreas Zeilmeier
On 03/14/15 12:49, sort problem wrote:
 Hello, 
 
 --
 # uname -a
 OpenBSD notebook.lan 5.6 GENERIC.MP#333 amd64
 # 
 # du -sh small/
 663M    small/
 # ls -lah small/*.txt | wc -l
       43
 # 
 # cd small
 # ulimit -n
 1000
 # sysctl | grep -i maxfiles
 kern.maxfiles=10
 # 
 # grep open /etc/login.conf
 :openfiles-cur=10:\
 :openfiles-cur=128:\
 :openfiles-cur=512:\
 # 
 # sort -u *.txt -o out
 Segmentation fault (core dumped)
 # 
 --
 
 This is after about a minute of running.. The txt files have UTF-8 chars 
 too. A line is at most a few tens of chars long in the txt files. All the 
 txt files have UNIX EOLs. There is enough storage, enough RAM, enough CPU. 
 I'm even trying this as the root user. The txt files are about ~60 000 000 
 lines in total.. not a big number... a reboot didn't help. 
 
 
 
 Any ideas how I can use the sort command to actually sort? Please help!
 
 
 
 Thanks, 
 
 btw, this happens on other UNIX OSes too, lol... why do we have the sort 
 command if it doesn't work?
 

Hi,

have you tried the option '-H'?
The manpage suggested this for files > 60MB.


Regards,

Andi



Re: I found a sort bug! - How to sort big files?

2015-03-14 Thread sort problem
o.m.g. It works. 

Why doesn't sort use this by default on files larger than 60 MByte? 

Thanks!

 Original Message 
From: Andreas Zeilmeier andreas.zeilme...@mailbox.org
Apparently from: owner-misc+m147...@openbsd.org
To: misc@openbsd.org
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 13:16:05 +0100

 On 03/14/15 12:49, sort problem wrote:
  Hello, 
  
  --
  # uname -a
  OpenBSD notebook.lan 5.6 GENERIC.MP#333 amd64
  # 
  # du -sh small/
  663M    small/
  # ls -lah small/*.txt | wc -l
        43
  # 
  # cd small
  # ulimit -n
  1000
  # sysctl | grep -i maxfiles
  kern.maxfiles=10
  # 
  # grep open /etc/login.conf
  :openfiles-cur=10:\
  :openfiles-cur=128:\
  :openfiles-cur=512:\
  # 
  # sort -u *.txt -o out
  Segmentation fault (core dumped)
  # 
  --
  
  This is after about a minute of running.. The txt files have UTF-8 chars 
  too. A line is at most a few tens of chars long in the txt files. All the 
  txt files have UNIX EOLs. There is enough storage, enough RAM, enough CPU. 
  I'm even trying this as the root user. The txt files are about ~60 000 000 
  lines in total.. not a big number... a reboot didn't help. 
  
  
  
  Any ideas how I can use the sort command to actually sort? Please help!
  
  
  
  Thanks, 
  
  btw, this happens on other UNIX OSes too, lol... why do we have the sort 
  command if it doesn't work?
  
 
 Hi,
 
 have you tried the option '-H'?
 The manpage suggested this for files > 60MB.
 
 
 Regards,
 
 Andi