Re: Sorting Large File (Code/Performance)

2008-02-02 Thread Albert van der Horst
In article [EMAIL PROTECTED], [EMAIL PROTECTED] wrote: Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of stated 1.6B). Also: What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be,

Re: Sorting Large File (Code/Performance)

2008-02-02 Thread Albert van der Horst
In article [EMAIL PROTECTED], John Nagle [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: Thanks to all who replied. It's very appreciated. Yes, I had to double check line counts and the number of lines is ~16 million (instead of stated 1.6B). OK, that's not bad at all. You have a

Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Stefan Behnel
Gabriel Genellina wrote: use the Windows sort command. It has been there since MS-DOS ages, there is no need to download and install other packages, and the documentation at http://technet.microsoft.com/en-us/library/bb491004.aspx says: Limits on file size: The sort command has no limit

Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Grant Edwards
On 2008-01-27, Stefan Behnel [EMAIL PROTECTED] wrote: Gabriel Genellina wrote: use the Windows sort command. It has been there since MS-DOS ages, there is no need to download and install other packages, and the documentation at http://technet.microsoft.com/en-us/library/bb491004.aspx says:

Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Marc 'BlackJack' Rintsch
On Sun, 27 Jan 2008 10:00:45 +, Grant Edwards wrote: On 2008-01-27, Stefan Behnel [EMAIL PROTECTED] wrote: Gabriel Genellina wrote: use the Windows sort command. It has been there since MS-DOS ages, there is no need to download and install other packages, and the documentation at

Re: Sorting Large File (Code/Performance)

2008-01-26 Thread Gabriel Genellina
On Fri, 25 Jan 2008 17:50:17 -0200, Paul Rubin http://phr.cx@NOSPAM.invalid wrote: Nicko [EMAIL PROTECTED] writes: # The next line is order O(n) in the number of chunks (line, fileindex) = min(mergechunks) You should use the heapq module to make this operation O(log n) instead.

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Nicko
On Jan 24, 9:26 pm, [EMAIL PROTECTED] wrote: If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath. I am limited with resources. Unfortunately. As long as you have at least as much disc space spare as you need to hold a copy of the file then this is
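
A minimal sketch of the disk-based approach hinted at above, not Nicko's actual code (his message is cut off here). It assumes a UTF-8 file, sorts on the first two characters only, and needs roughly one extra copy of the file in temporary disk space. The names external_sort, sort_key and lines_per_chunk are made up for illustration.

```python
import heapq
import os
import tempfile
from itertools import islice


def sort_key(line):
    # Compare only the first two characters, as requested in the thread.
    return line[:2]


def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
    """Sort a UTF-8 text file by sort_key using sorted runs on disk."""
    run_paths = []
    with open(in_path, encoding="utf-8") as src:
        while True:
            # Read a chunk small enough to sort in memory.
            chunk = list(islice(src, lines_per_chunk))
            if not chunk:
                break
            # Only the very last line of the file can lack a newline.
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(suffix=".run", text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(chunk)
            run_paths.append(path)

    # Merge the sorted runs lazily, one line per run in memory at a time.
    runs = [open(p, encoding="utf-8") for p in run_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.writelines(heapq.merge(*runs, key=sort_key))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

With about 16 million lines and the default chunk size this would create around 16 temporary run files; lines_per_chunk can be lowered if memory is tight.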

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Asim
On Jan 24, 4:26 pm, [EMAIL PROTECTED] wrote: Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of stated 1.6B). Also: What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be, ???

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Asim
On Jan 25, 9:23 am, Asim [EMAIL PROTECTED] wrote: On Jan 24, 4:26 pm, [EMAIL PROTECTED] wrote: Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of stated 1.6B). Also: What is a Unicode text

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Paul Rubin
Nicko [EMAIL PROTECTED] writes: # The next line is order O(n) in the number of chunks (line, fileindex) = min(mergechunks) You should use the heapq module to make this operation O(log n) instead. -- http://mail.python.org/mailman/listinfo/python-list
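
To illustrate the suggestion, here is a small sketch of a merge step that keeps the current head line of each chunk in a heap, so picking the next line costs O(log n) in the number of chunks rather than the O(n) scan done by min(mergechunks). It is not Nicko's code; merge_sorted_chunks is a hypothetical name, and the chunks are assumed to be already-sorted iterables of lines.

```python
import heapq


def merge_sorted_chunks(chunks):
    """Yield lines from several already-sorted iterables in sorted order."""
    iterators = [iter(c) for c in chunks]
    heap = []
    for fileindex, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, fileindex))
    heapq.heapify(heap)

    while heap:
        # Pop the smallest head line: O(log n) instead of an O(n) min().
        line, fileindex = heapq.heappop(heap)
        yield line
        # Refill the heap from the chunk that line came from.
        following = next(iterators[fileindex], None)
        if following is not None:
            heapq.heappush(heap, (following, fileindex))
```

The standard library's heapq.merge does the same job in one call; the explicit version is shown only to make the heap operations visible.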

Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters. I'd greatly appreciate it if someone could post sample code that can help me do this. Also, any ideas on approximately how long the sort process is going to take (XP, Dual Core

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
[EMAIL PROTECTED] writes: I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters. I'd greatly appreciate it if someone could post sample code that can help me do this. Use the unix sort command: sort inputfile -o outputfile I think

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
[EMAIL PROTECTED] wrote: Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters. Given those numbers, the average number of characters per line is less than 2. Please check. John

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Machin
On Jan 25, 6:18 am, [EMAIL PROTECTED] wrote: Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters. If you mean 1.6 American billion i.e. 1.6 * 1000 ** 3 lines, and 2 * 1024 ** 3 bytes of data, that's 1.34 bytes per line. If
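
The sanity check above is easy to reproduce. The snippet below redoes the arithmetic for the originally stated line count and for the corrected figure of about 16 million lines given elsewhere in the thread; bytes_per_line is just an illustrative helper name.

```python
def bytes_per_line(total_bytes, line_count):
    """Average number of bytes per line, including line terminators."""
    return total_bytes / line_count


file_size = 2 * 1024 ** 3          # ~2GB, as stated in the original post

stated = bytes_per_line(file_size, 1.6 * 1000 ** 3)   # 1.6 billion lines
corrected = bytes_per_line(file_size, 16 * 1000 ** 2)  # ~16 million lines

print(round(stated, 2))     # 1.34
print(round(corrected, 2))  # 134.22
```

An average of 1.34 bytes per line cannot hold two sort characters plus a line terminator, which is why the line count was questioned.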

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of stated 1.6B). Also: What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this: The file is UTF-8 Do

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Stefan Behnel
[EMAIL PROTECTED] wrote: What are you going to do with it after it's sorted? I need to isolate all lines that start with two characters (zz to be particular) Isolate as in extract? Remove the rest? Then why don't you extract the lines first, without sorting the file? (or sort it afterwards if
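
If the goal really is only to pull out the lines starting with zz, a single pass over the file is enough and no sort is needed. A minimal sketch, assuming the file is UTF-8 as stated in the thread; extract_prefixed_lines and its parameters are hypothetical names.

```python
def extract_prefixed_lines(in_path, out_path, prefix="zz"):
    """Copy lines that start with prefix to out_path; return how many matched."""
    count = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # Iterating over the file reads one line at a time, so memory use
        # stays small regardless of file size.
        for line in src:
            if line.startswith(prefix):
                dst.write(line)
                count += 1
    return count
```

The extracted lines can then be sorted separately if needed; that subset is usually far smaller than the original file.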

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Stefan Behnel
Stefan Behnel wrote: [EMAIL PROTECTED] wrote: What are you going to do with it after it's sorted? I need to isolate all lines that start with two characters (zz to be particular) Isolate as in extract? Remove the rest? Then why don't you extract the lines first, without sorting the file?

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Machin
On Jan 25, 8:26 am, [EMAIL PROTECTED] wrote: I need to isolate all lines that start with two characters (zz to be particular) What does isolate mean to you? What does this have to do with sorting? What do you actually want to do with (a) the lines starting with zz (b) the other lines? What

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Martin Marcher
On Thursday 24 January 2008 20:56 John Nagle wrote: [EMAIL PROTECTED] wrote: Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters. Given those numbers, the average number of characters per line is less than 2.

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
John Nagle [EMAIL PROTECTED] writes: - Get enough memory to do the sort with an in-memory sort, like UNIX sort or Python's sort function. Unix sort does external sorting when needed.
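
For the in-memory option using Python's sort function, a sketch might look like the following. It assumes a UTF-8 file and that the whole file fits in RAM; note that a list of Python strings needs noticeably more memory than the file occupies on disk. sort_file_in_memory is an illustrative name, not something from the thread.

```python
def sort_file_in_memory(in_path, out_path):
    """Load every line, sort by the first two characters, write the result."""
    with open(in_path, encoding="utf-8") as src:
        lines = src.readlines()
    # Make sure the last line ends with a newline before it can move.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    # list.sort is stable, so lines with equal keys keep their file order.
    lines.sort(key=lambda line: line[:2])
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
    return len(lines)
```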

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
[EMAIL PROTECTED] wrote: Thanks to all who replied. It's very appreciated. Yes, I had to double check line counts and the number of lines is ~16 million (instead of stated 1.6B). OK, that's not bad at all. You have a few options: - Get enough memory to do the sort with an

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
Paul Rubin wrote: John Nagle [EMAIL PROTECTED] writes: - Get enough memory to do the sort with an in-memory sort, like UNIX sort or Python's sort function. Unix sort does external sorting when needed. Ah, someone finally put that in. Good. I hadn't looked at sort's manual page

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
John Nagle [EMAIL PROTECTED] writes: Unix sort does external sorting when needed. Ah, someone finally put that in. Good. I hadn't looked at sort's manual page in many years. Huh? It has been like that from the beginning. It HAD to be. Unix was originally written on a PDP-11. The GNU