Re: Splitting up large dirty file

2018-05-21 Thread Jon Degenhardt via Digitalmars-d-learn
On Monday, 21 May 2018 at 15:00:09 UTC, Dennis wrote: I want to be convinced that Range programming works like a charm, but the procedural approaches remain more flexible (and faster too) it seems. Thanks for the example. On Monday, 21 May 2018 at 22:11:42 UTC, Dennis wrote: In this case I

Re: Splitting up large dirty file

2018-05-21 Thread Dennis via Digitalmars-d-learn
On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote: On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote: drop is range-based, so if you give it a string, it's going to decode because of the whole auto-decoding mess with std.range.primitives.front and popFront.
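A minimal sketch of one way to sidestep the decoding described here, assuming the goal is only to skip a fixed number of bytes: `std.string.representation` exposes the string as `immutable(ubyte)[]`, so `drop` operates on raw code units and never decodes. The helper name `dropBytes` is hypothetical and not part of Phobos.

```d
import std.range : drop;
import std.string : representation;

/// Skips the first `n` bytes of `s` without decoding UTF-8.
/// `dropBytes` is a hypothetical helper name, not a Phobos function.
immutable(ubyte)[] dropBytes(string s, size_t n)
{
    // representation reinterprets the string as immutable(ubyte)[],
    // so the range primitives see raw bytes instead of decoded dchars.
    return s.representation.drop(n);
}

void main()
{
    // 0xFF can never appear in valid UTF-8, so this input is "dirty".
    immutable(ubyte)[] raw = [0xFF, 0x61, 0x62, 0x63];
    string dirty = cast(string) raw;

    // Dropping two bytes leaves "bc"; nothing is decoded along the way.
    assert(dropBytes(dirty, 2) == "bc".representation);
}
```

Note that this counts bytes, not characters, which may or may not be what a given use of `drop` intends.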

Re: Splitting up large dirty file

2018-05-21 Thread Jonathan M Davis via Digitalmars-d-learn
On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote: > On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote: > > It's unfortunate that Phobos tells you 'there's problems with > > the encoding' without providing any means to fix it or even > > diagnose it. > > I have to take

Re: Splitting up large dirty file

2018-05-21 Thread Dennis via Digitalmars-d-learn
On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote: It's unfortunate that Phobos tells you 'there's problems with the encoding' without providing any means to fix it or even diagnose it. I have to take that back since I found out about std.encoding which has functions like `sanitize`,
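As a rough illustration of the `std.encoding` functions mentioned here, the sketch below checks a string with `isValid` and repairs it with `sanitize` only when needed. The wrapper name `toValidUtf8` and the sample bytes are assumptions made for the example.

```d
import std.encoding : isValid, sanitize;

/// Returns `s` unchanged when it is valid UTF-8, otherwise a sanitized copy.
/// `toValidUtf8` is a hypothetical wrapper, not a Phobos function.
string toValidUtf8(string s)
{
    return isValid(s) ? s : sanitize(s);
}

void main()
{
    // Hypothetical dirty input: 'a', a stray 0xFF byte, 'b'.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    string dirty = cast(string) raw;

    assert(!isValid(dirty));
    assert(isValid(toValidUtf8(dirty)));
}
```

Because `sanitize` works on a whole in-memory string, a file too large for memory would presumably need to be processed in pieces, for example line by line.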

Re: Splitting up large dirty file

2018-05-18 Thread Kagamin via Digitalmars-d-learn
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote: ``` auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File; auto outputFile = new File("output.txt"); foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line); ``` Do it old

Re: Splitting up large dirty file

2018-05-17 Thread Jonathan M Davis via Digitalmars-d-learn
On Thursday, May 17, 2018 21:10:35 Dennis via Digitalmars-d-learn wrote: > On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote: > > For various reasons, that doesn't always hold true like it > > should, but pretty much all of Phobos is written with that > > assumption and will

Re: Splitting up large dirty file

2018-05-17 Thread Jon Degenhardt via Digitalmars-d-learn
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote: On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote: If you write it in the style of my earlier example and use counters and if-tests it will work. byLine by itself won't try to interpret the characters (won't auto-decode
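A sketch of what the counter-and-if-test style could look like, not the code from the earlier example in the thread. It assumes the file should be cut into chunks of a fixed number of lines; the function name, its parameters, and the output file naming are all hypothetical. Lines are copied with `rawWrite`, so the bytes are never validated or decoded.

```d
import std.exception : enforce;
import std.format : format;
import std.stdio : File, KeepTerminator, stderr;

/// Copies `inputPath` into numbered chunk files holding at most
/// `linesPerChunk` lines each and returns the number of chunks written.
/// The name, parameters, and output naming scheme are hypothetical.
size_t splitByLines(string inputPath, string outputPrefix, size_t linesPerChunk)
{
    enforce(linesPerChunk > 0, "linesPerChunk must be positive");

    size_t chunkCount = 0;
    size_t linesInChunk = 0;
    File output;

    foreach (line; File(inputPath, "rb").byLine(KeepTerminator.yes))
    {
        // Open a new chunk file at the start and whenever the current one is full.
        if (chunkCount == 0 || linesInChunk == linesPerChunk)
        {
            output = File(format("%s%s.txt", outputPrefix, chunkCount), "wb");
            ++chunkCount;
            linesInChunk = 0;
        }
        // rawWrite copies the bytes verbatim, with no UTF-8 validation.
        output.rawWrite(line);
        ++linesInChunk;
    }
    return chunkCount;
}

void main(string[] args)
{
    if (args.length != 2)
    {
        stderr.writeln("Pass one filename as argument");
        return;
    }
    splitByLines(args[1], "chunk", 1_000_000);
}
```

Only one line is held in memory at a time, so the total file size should not matter.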

Re: Splitting up large dirty file

2018-05-17 Thread ag0aep6g via Digitalmars-d-learn
On 05/17/2018 11:40 PM, Neia Neutuladh wrote: 0b1100_0000 through 0b1111_1110 is the start of a multibyte character Nitpick: It only goes up to 0b1111_0100. The highest code point is U+10FFFF. There are no sequences with more than four bytes.
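The lead-byte ranges can be sketched as a small classifier. This is an illustration based on the UTF-8 definition, in which code points stop at U+10FFFF and the highest valid lead byte is therefore 0xF4 (0b1111_0100); the function name `sequenceLength` is hypothetical.

```d
/// Number of bytes in the UTF-8 sequence announced by lead byte `b`,
/// or 0 if `b` cannot start a valid sequence.
/// `sequenceLength` is a hypothetical helper, not a Phobos function.
size_t sequenceLength(ubyte b)
{
    if (b <= 0x7F) return 1;               // 0b0xxx_xxxx: ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;  // 0b110x_xxxx (0xC0 and 0xC1 would be overlong)
    if (b >= 0xE0 && b <= 0xEF) return 3;  // 0b1110_xxxx
    if (b >= 0xF0 && b <= 0xF4) return 4;  // 0b1111_0xxx, capped at 0b1111_0100
    return 0;                              // continuation byte or never valid
}

void main()
{
    assert(sequenceLength(0x41) == 1); // 'A'
    assert(sequenceLength(0xC3) == 2); // lead byte of U+00E9
    assert(sequenceLength(0xE2) == 3); // lead byte of U+20AC
    assert(sequenceLength(0xF4) == 4); // highest valid lead byte
    assert(sequenceLength(0xF5) == 0); // would encode beyond U+10FFFF
    assert(sequenceLength(0x80) == 0); // continuation byte
}
```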

Re: Splitting up large dirty file

2018-05-17 Thread Neia Neutuladh via Digitalmars-d-learn
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote: I have a file with two problems: - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read) Memory mapping should work. That's in core.sys.posix.sys.mman for Posix
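For a portable illustration of the memory-mapping idea, the sketch below uses Phobos's `std.mmfile.MmFile` rather than the `core.sys.posix.sys.mman` bindings named here; that substitution is a choice made for this example. The function `countNewlines` is hypothetical and simply scans the mapped bytes.

```d
import std.mmfile : MmFile;
import std.stdio : stderr, writeln;

/// Counts newline bytes in `path` through a read-only memory mapping.
/// `countNewlines` is a hypothetical helper for this illustration.
size_t countNewlines(string path)
{
    // MmFile(path) maps the whole file read-only; `scope` unmaps it on exit.
    scope mmf = new MmFile(path);
    auto bytes = cast(const(ubyte)[]) mmf[];

    size_t count = 0;
    foreach (b; bytes)
        if (b == '\n')
            ++count;
    return count;
}

void main(string[] args)
{
    if (args.length != 2)
    {
        stderr.writeln("Pass one filename as argument");
        return;
    }
    writeln(countNewlines(args[1]));
}
```

The mapping still needs enough contiguous address space, so a 1.5 GB file may fail to map in a 32-bit process.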

Re: Splitting up large dirty file

2018-05-17 Thread Dennis via Digitalmars-d-learn
On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote: For various reasons, that doesn't always hold true like it should, but pretty much all of Phobos is written with that assumption and will generally throw an exception if it isn't. It's unfortunate that Phobos tells you

Re: Splitting up large dirty file

2018-05-17 Thread Dennis via Digitalmars-d-learn
On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote: If you write it in the style of my earlier example and use counters and if-tests it will work. byLine by itself won't try to interpret the characters (won't auto-decode them), so it won't trigger an exception if there are invalid

Re: Splitting up large dirty file

2018-05-16 Thread Jon Degenhardt via Digitalmars-d-learn
On Wednesday, 16 May 2018 at 07:06:45 UTC, Dennis wrote: On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote: Can you show the program you are using that throws when using byLine? Here's a version that only outputs the first chunk: ``` import std.stdio; import std.range; import

Re: Splitting up large dirty file

2018-05-16 Thread Jonathan M Davis via Digitalmars-d-learn
On Wednesday, May 16, 2018 08:57:10 Dennis via Digitalmars-d-learn wrote: > I thought it wouldn't be hard to crudely split this file using > D's range functions and basic string manipulation, but the > combination of being too large for a string and having invalid > encoding seems to defeat most

Re: Splitting up large dirty file

2018-05-16 Thread Dennis via Digitalmars-d-learn
On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote: What is the purpose of `.drop(4)`? I'm pretty sure this is the reason of the exception. The file in question is a .json database dump with an array "rows" of 10 million 8-line objects. The newlines in the string fields are escaped, but

Re: Splitting up large dirty file

2018-05-16 Thread drug via Digitalmars-d-learn
16.05.2018 10:06, Dennis wrote: Here's a version that only outputs the first chunk: ``` import std.stdio; import std.range; import std.algorithm; import std.file; import std.exception; void main(string[] args) { enforce(args.length == 2, "Pass one filename as argument"); auto

Re: Splitting up large dirty file

2018-05-16 Thread Dennis via Digitalmars-d-learn
On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote: Can you show the program you are using that throws when using byLine? Here's a version that only outputs the first chunk: ``` import std.stdio; import std.range; import std.algorithm; import std.file; import std.exception; void

Re: Splitting up large dirty file

2018-05-15 Thread Jon Degenhardt via Digitalmars-d-learn
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote: I have a file with two problems: - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read) - It is dirty (contains invalid Unicode characters, null bytes in the

Re: Splitting up large dirty file

2018-05-15 Thread Jonathan M Davis via Digitalmars-d-learn
On Tuesday, May 15, 2018 20:36:21 Dennis via Digitalmars-d-learn wrote: > I have a file with two problems: > - It's too big to fit in memory (apparently, I thought 1.5 Gb > would fit but I get an out of memory error when using > std.file.read) > - It is dirty (contains invalid Unicode characters,

Re: Splitting up large dirty file

2018-05-15 Thread Steven Schveighoffer via Digitalmars-d-learn
On 5/15/18 4:36 PM, Dennis wrote: I have a file with two problems: - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read) - It is dirty (contains invalid Unicode characters, null bytes in the middle of lines) I want

Splitting up large dirty file

2018-05-15 Thread Dennis via Digitalmars-d-learn
I have a file with two problems: - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read) - It is dirty (contains invalid Unicode characters, null bytes in the middle of lines) I want to write a program that splits it