Re: I found a sort bug! - How to sort big files?
On Mar 16 11:36:08, o...@drijf.net wrote:
> On Mon, Mar 16, 2015 at 10:20:07AM +, Stuart Henderson wrote:
>> On 2015-03-15, Todd C. Miller todd.mil...@courtesan.com wrote:
>>> On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:
>>>> I think the consensus was to try and replace it with another
>>>> version but not sure what happened.
>>>
>>> I have a port of the FreeBSD sort but it is slower than our current
>>> sort (and slower than GNU sort).
>>
>> Personally I think that is a reasonable trade-off for more actively
>> developed code, and when I tried it on some difficult files it coped
>> better than our current sort (not that this small sample means much
>> in terms of ability to handle every difficult file).
>
> Current sort(1) is unmaintainable in many ways. I say switch.

Incidentally, reading up on UNIX history, I came across this:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6771921
Re: I found a sort bug! - How to sort big files?
On Tue, 17 Mar 2015 08:58:56 +1300 worik worik.stan...@gmail.com wrote:
> On 16/03/15 06:43, Steve Litt wrote:
>> But IMHO, sorting 60megalines isn't something I would expect a
>> generic sort command to easily and timely do out of the box.
>
> I would. These days such files are getting more and more common.
>
> But there is a warning in the man page for sort under BUGS:
>
>     To sort files larger than 60MB, use sort -H; files larger than
>     704MB must be sorted in smaller pieces, then merged.
>
> So it seems there is a bug in "files larger than 60MB, use sort -H",
> since that did not work for the OP.
>
> Worik

Oh, jeez, you put your finger *right* on the problem, Worik. Both I and
the OP read the manpage wrong. sort -H won't work for extremely big
files (more than 704MB). But there's a fairly easy solution...

An average line length can be found with wc and then dividing. Then
figure out how many lines would make about a 10MB file, and use
split -l to split the file into smaller files with that many lines.
Then sort each of those files, with no arguments, and finally use
sort -m to merge them all back together again into one sorted file.

According to the man page, the preceding should work just fine, and it
can pretty much be automated with a simple shellscript, so you can set
it to run and have it work while you do other things.

SteveT
Steve Litt
http://www.troubleshooters.com/
Troubleshooting Training * Human Performance
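The wc / split -l / sort / sort -m recipe above can be sketched as a
small shell script. This is a minimal sketch, not from the thread: the
awk-generated demo file, the names bigfile.txt and sorted.txt, and the
10MB chunk target are all illustrative.

```shell
#!/bin/sh
# Demo input standing in for the real big file (illustrative only).
awk 'BEGIN { srand(7); for (i = 0; i < 10000; i++) print int(rand() * 1000000) }' > bigfile.txt

tmp=$(mktemp -d)

# Average line length = total bytes / total lines (the wc step).
lines=$(wc -l < bigfile.txt)
bytes=$(wc -c < bigfile.txt)
avg=$((bytes / lines))

# Number of lines that makes roughly a 10 MB chunk.
chunk=$((10 * 1024 * 1024 / avg))
[ "$chunk" -lt 1 ] && chunk=1

# Split, sort each piece in place, then merge the sorted pieces.
split -l "$chunk" bigfile.txt "$tmp/piece."
for f in "$tmp"/piece.*; do
    sort "$f" -o "$f"
done
sort -m "$tmp"/piece.* -o sorted.txt

rm -r "$tmp"
```

For the OP's ~600 MByte of input, a 10 MB target would produce roughly
sixty pieces, each small enough for a plain sort invocation.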
Re: I found a sort bug! - How to sort big files?
On 2015-03-15, Todd C. Miller todd.mil...@courtesan.com wrote:
> On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:
>> I think the consensus was to try and replace it with another version
>> but not sure what happened.
>
> I have a port of the FreeBSD sort but it is slower than our current
> sort (and slower than GNU sort).

Personally I think that is a reasonable trade-off for more actively
developed code, and when I tried it on some difficult files it coped
better than our current sort (not that this small sample means much in
terms of ability to handle every difficult file).
Re: I found a sort bug! - How to sort big files?
On Mon, Mar 16, 2015 at 10:20:07AM +, Stuart Henderson wrote:
> On 2015-03-15, Todd C. Miller todd.mil...@courtesan.com wrote:
>> On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:
>>> I think the consensus was to try and replace it with another
>>> version but not sure what happened.
>>
>> I have a port of the FreeBSD sort but it is slower than our current
>> sort (and slower than GNU sort).
>
> Personally I think that is a reasonable trade-off for more actively
> developed code, and when I tried it on some difficult files it coped
> better than our current sort (not that this small sample means much
> in terms of ability to handle every difficult file).

Current sort(1) is unmaintainable in many ways. I say switch.

	-Otto
Re: I found a sort bug! - How to sort big files?
> Current sort(1) is unmaintainable in many ways. I say switch.

I've seen with gdb that the current sort(1) somehow manages to make
radixsort(3) do the work when the sort key is somewhere in the middle
of the line. I don't even want to know... (and my reading comprehension
of C is too weak to go and look).

Yes, switch, please!
Re: I found a sort bug! - How to sort big files?
On 16/03/15 06:43, Steve Litt wrote:
> But IMHO, sorting 60megalines isn't something I would expect a
> generic sort command to easily and timely do out of the box.

I would. These days such files are getting more and more common.

But there is a warning in the man page for sort under BUGS:

    To sort files larger than 60MB, use sort -H; files larger than
    704MB must be sorted in smaller pieces, then merged.

So it seems there is a bug in "files larger than 60MB, use sort -H",
since that did not work for the OP.

Worik

--
Why is the legal status of chardonnay different to that of cannabis?
worik.stan...@gmail.com 021-1680650, (03) 4821804
Aotearoa (New Zealand)
I voted for love
Re: Fwd: Re: I found a sort bug! - How to sort big files?
On 2015-03-15, sort problem sortprob...@safe-mail.net wrote:
> So the default sort command is a big pile of shit when it comes to
> files bigger than 60 MByte? .. lol

It's probably not the size, rather the contents of the files.
Fwd: Re: I found a sort bug! - How to sort big files?
Whoops. At least I thought it helped.

The default sort with the -H worked for 132 minutes, then said: no
space left in /home (which before the sort command had 111 GBytes
free). And btw, the df command said for free space: -18 GByte, 104%..
what? Some kind of reserved space for root?

Why does it take more than 111 GBytes to sort -u ~600 MByte sized
files? This is nonsense.

So the default sort command is a big pile of shit when it comes to
files bigger than 60 MByte? .. lol

I can send the ~600 MByte txt files compressed if needed... I was
surprised... sort is a very old command..

-------- Original Message --------
From: sort problem sortprob...@safe-mail.net
To: andreas.zeilme...@mailbox.org
Cc: misc@openbsd.org
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 08:39:55 -0400

o.m.g. It works. Why doesn't sort use this by default on files larger
than 60 MByte? Thanks!

-------- Original Message --------
From: Andreas Zeilmeier andreas.zeilme...@mailbox.org
Apparently from: owner-misc+m147...@openbsd.org
To: misc@openbsd.org
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 13:16:05 +0100

On 03/14/15 12:49, sort problem wrote:
> Hello,
>
> --
> # uname -a
> OpenBSD notebook.lan 5.6 GENERIC.MP#333 amd64
> #
> # du -sh small/
> 663M    small/
> # ls -lah small/*.txt | wc -l
>       43
> #
> # cd small
> # ulimit -n 1000
> # sysctl | grep -i maxfiles
> kern.maxfiles=10
> #
> # grep open /etc/login.conf
>         :openfiles-cur=10:\
>         :openfiles-cur=128:\
>         :openfiles-cur=512:\
> #
> # sort -u *.txt -o out
> Segmentation fault (core dumped)
> #
> --
>
> This is after a minute run.. The txt files have UTF-8 chars too. A
> line is maximum a few ten chars long in the txt files. All the txt
> files have UNIX eol's. There is enough storage, enough RAM, enough
> CPU. I'm even trying this with the root user. The txt files are about
> ~60 000 000 lines.. not a big number... a reboot didn't help.
>
> Any ideas how can I use the sort command to actually sort? Please
> help! Thanks, btw, this happens on other UNIX OS too, lol... why do
> we have the sort command if it doesn't work?

Hi,

have you tried the option '-H'? The manpage suggested this for files > 60MB.

Regards,
Andi
Re: I found a sort bug! - How to sort big files?
I don't know why sort is giving you such problems. There may be
something unusual about your specific input that it wasn't designed to
handle (or it might simply be a latent bug that has never been
identified and fixed).

When I need to sort large files, I split(1) them into smaller pieces,
then sort(1) the pieces individually, then use sort(1) (with the -m
option) to merge the sorted pieces into a single large result file.
This has always worked reliably for me (and because I was raised using
8-bit and 16-bit computers I don't have any special expectations that
programs should just work when given very large inputs).

Even if you think doing all this is too much bother, try doing it just
once. You might be able to identify a specific chunk of your input
that's causing the problem, which will help move us all towards a
proper solution (or at least a caveat in the man page).

-ken

On Sun, Mar 15, 2015 at 9:53 AM, sort problem sortprob...@safe-mail.net wrote:
> Whoops. At least I thought it helped.
>
> The default sort with the -H worked for 132 minutes, then said: no
> space left in /home (which before the sort command had 111 GBytes
> free). And btw, the df command said for free space: -18 GByte, 104%..
> what? Some kind of reserved space for root?
>
> Why does it take more than 111 GBytes to sort -u ~600 MByte sized
> files? This is nonsense. So the default sort command is a big pile of
> shit when it comes to files bigger than 60 MByte? .. lol
>
> I can send the ~600 MByte txt files compressed if needed... I was
> surprised... sort is a very old command..
>
> -------- Original Message --------
> From: sort problem sortprob...@safe-mail.net
> To: andreas.zeilme...@mailbox.org
> Cc: misc@openbsd.org
> Subject: Re: I found a sort bug! - How to sort big files?
> Date: Sat, 14 Mar 2015 08:39:55 -0400
>
> o.m.g. It works. Why doesn't sort use this by default on files larger
> than 60 MByte? Thanks!
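Ken's procedure, including using the per-piece sorts to pinpoint a
chunk that sort(1) cannot handle, might look like the sketch below.
This is illustrative, not from the thread: input.txt, the pieces/
directory, and the 5000-line split size are made-up names and values,
and awk generates a small stand-in input.

```shell
#!/bin/sh
# Demo input standing in for the problematic file (illustrative only).
awk 'BEGIN { for (i = 20000; i > 0; i--) print i }' > input.txt

mkdir -p pieces
split -l 5000 input.txt pieces/chunk.

# Sort each piece on its own; a piece on which sort(1) fails
# narrows the problem down to a specific chunk of the input.
ok=1
for f in pieces/chunk.*; do
    if ! sort "$f" -o "$f.sorted"; then
        echo "sort failed on $f" >&2
        ok=0
    fi
done

# Merge the individually sorted pieces into one sorted file.
if [ "$ok" -eq 1 ]; then
    sort -m pieces/chunk.*.sorted -o output.txt
fi
```

If one piece does fail, re-splitting just that piece into smaller
chunks narrows the offending input further.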
Re: I found a sort bug! - How to sort big files?
On Sun, 15 Mar 2015 09:53:34 -0400 sort problem
sortprob...@safe-mail.net wrote:

> Whoops. At least I thought it helped.
>
> The default sort with the -H worked for 132 minutes then said: no
> space left in /home (that had before the sort command: 111 GBytes
> FREE).

That's not surprising. -H implements a merge sort, meaning the file is
split into lots and lots of files, each of which is again split into
lots of files, etc. It wouldn't surprise me to see a 60Mline file
consume a huge multiple of itself during a merge sort. And of course,
the algorithm might be swapping.

> And btw, the df command said for free space: -18 GByte, 104%.. what?
> Some kind of reserved space for root? Why does it take more than 111
> GBytes to sort -u ~600 MByte sized files? This is nonsense. So the
> default sort command is a big pile of shit when it comes to files
> bigger than 60 MByte? .. lol

That doesn't surprise me. You originally said you have 60 million
lines. Sorting 60 million items is a difficult task for any algorithm.
You don't say how long each line is, or what the lines contain, or
whether they're all the same length.

How would *you* sort so many items, and sort them in a fast yet
generic way? I mean, if RAM and disk space are at a premium, you could
always use a bubble sort, and in-place sort your array in a year or
two.

If I were in your shoes, I'd write my own sort routine for the task,
perhaps using qsort() (see
http://calmerthanyouare.org/2013/05/31/qsort-shootout.html). If
there's a way you can convert line contents into a number reflecting
alpha-order, you could even qsort() in RAM if you have quite a bit of
RAM, and then, as a last step, run through the sorted list of numbers
and line numbers and write the original file by line number. There are
probably a thousand other ways to do it.

But IMHO, sorting 60megalines isn't something I would expect a generic
sort command to easily and timely do out of the box.

SteveT
Steve Litt
http://www.troubleshooters.com/
Troubleshooting Training * Human Performance
Re: Fwd: Re: I found a sort bug! - How to sort big files?
sort problem wrote:
> So the default sort command is a big pile of shit when it comes to
> files bigger than 60 MByte? .. lol
>
> I can send the ~600 MByte txt files compressed if needed... I was
> surprised... sort is a very old command..

I think you have discovered the answer. :(
Re: I found a sort bug! - How to sort big files?
On Sat, 14 Mar 2015 12:29:21 -, Stuart Henderson wrote:
> I think the consensus was to try and replace it with another version
> but not sure what happened.

I have a port of the FreeBSD sort but it is slower than our current
sort (and slower than GNU sort).

 - todd
Re: I found a sort bug! - How to sort big files?
On 2015-03-14, sort problem sortprob...@safe-mail.net wrote:
> # sort -u *.txt -o out
> Segmentation fault (core dumped)

There are some known bugs in sort; I ran into a file it couldn't cope
with a couple of years ago too, but it doesn't happen all that often.
I think the consensus was to try and replace it with another version,
but not sure what happened.

For your current problem you could pkg_add coreutils and try gsort,
maybe it will cope with your files better..

> btw, this happens on other UNIX OS too, lol... why do we have the
> sort command if it doesn't work?

Normally it does work.
Re: I found a sort bug! - How to sort big files?
On 03/14/15 12:49, sort problem wrote:
> Hello,
>
> --
> # uname -a
> OpenBSD notebook.lan 5.6 GENERIC.MP#333 amd64
> #
> # du -sh small/
> 663M    small/
> # ls -lah small/*.txt | wc -l
>       43
> #
> # cd small
> # ulimit -n 1000
> # sysctl | grep -i maxfiles
> kern.maxfiles=10
> #
> # grep open /etc/login.conf
>         :openfiles-cur=10:\
>         :openfiles-cur=128:\
>         :openfiles-cur=512:\
> #
> # sort -u *.txt -o out
> Segmentation fault (core dumped)
> #
> --
>
> This is after a minute run.. The txt files have UTF-8 chars too. A
> line is maximum a few ten chars long in the txt files. All the txt
> files have UNIX eol's. There is enough storage, enough RAM, enough
> CPU. I'm even trying this with the root user. The txt files are about
> ~60 000 000 lines.. not a big number... a reboot didn't help.
>
> Any ideas how can I use the sort command to actually sort? Please
> help! Thanks, btw, this happens on other UNIX OS too, lol... why do
> we have the sort command if it doesn't work?

Hi,

have you tried the option '-H'? The manpage suggested this for files > 60MB.

Regards,
Andi
Re: I found a sort bug! - How to sort big files?
o.m.g. It works. Why doesn't sort use this by default on files larger
than 60 MByte? Thanks!

-------- Original Message --------
From: Andreas Zeilmeier andreas.zeilme...@mailbox.org
Apparently from: owner-misc+m147...@openbsd.org
To: misc@openbsd.org
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 13:16:05 +0100

On 03/14/15 12:49, sort problem wrote:
> [...]

Hi,

have you tried the option '-H'? The manpage suggested this for files > 60MB.

Regards,
Andi