Re: Splitting up large dirty file

2018-05-21 Thread Jon Degenhardt via Digitalmars-d-learn

On Monday, 21 May 2018 at 15:00:09 UTC, Dennis wrote:
I want to be convinced that Range programming works like a 
charm, but the procedural approaches remain more flexible (and 
faster too) it seems. Thanks for the example.



On Monday, 21 May 2018 at 22:11:42 UTC, Dennis wrote:
In this case I used drop to drop lines, not characters. The 
exception was thrown by the joiner it turns out.

 ...
From the benchmarking I did, I found that ranges are easily an 
order of magnitude slower even with compiler optimizations:


My general experience is that range programming works quite well. 
It's especially useful when used to do lazy processing and as a 
result minimize memory allocations. I've gotten quite good 
performance with these techniques (see my DConf talk slides: 
https://dconf.org/2018/talks/degenhardt.html).


Your benchmarks are not against the file-split case, but if you 
benchmarked that you might see it as slow as well. In that case 
you may be hitting specific areas where there are opportunities 
for performance improvement in the standard library. One is that 
joiner is slow (PR: https://github.com/dlang/phobos/pull/6492). 
Another is that the write[fln] routines are much faster when 
operating on a single large object than on many small objects. 
E.g. it's faster to call write[fln] with an array of 100 
characters than: (a) calling it 100 times with one character; (b) 
calling it once, with 100 characters as individual arguments 
(template form); (c) calling it once with a range of 100 
characters, each processed one at a time.
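
For illustration, a minimal sketch of those call patterns (buffer 
name and size are illustrative, not from the thread):

```
import std.stdio;
import std.utf : byCodeUnit;

void main()
{
    char[100] buf = 'x';

    write(buf[]);                  // fast: one call, one large object

    foreach (c; buf[])             // (a) 100 calls, one character each
        write(c);

    write(buf[0], buf[1], buf[2]); // (b) one call, individual arguments
                                   //     (first three shown)

    write(buf[].byCodeUnit);       // (c) one call, a range consumed one
                                   //     element at a time
}
```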


When joiner is used as in your example, you hit not only the 
joiner performance issue, but also the write[fln] issue. This is 
due to something that may not be obvious at first: when joiner is 
used to concatenate arrays or ranges, it flattens the 
arrays/ranges into a single range of elements. So, rather than 
writing a line at a time, your example is effectively passing a 
character at a time to write[fln].
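
A tiny demonstration of the flattening (illustrative, not 
Dennis's code):

```
import std.algorithm : joiner;
import std.stdio : write;

void main()
{
    auto lines = ["line one\n", "line two\n"];

    // joiner yields a single range of characters, not a range of lines:
    auto flat = lines.joiner;   // 'l', 'i', 'n', 'e', ' ', 'o', ...

    // So this hands write() one element at a time:
    write(flat);

    // Whereas this writes a whole line per call:
    foreach (line; lines) write(line);
}
```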


So, in the file-split case, using byLine in an imperative fashion 
as in my example has the effect of passing a full line at a time 
to write[fln], rather than individual characters. Mine will be 
faster, but not because it's imperative: the same thing could be 
achieved with ranges.


Regarding the benchmark programs you showed: this is very 
interesting, and certainly worth looking into further. One thing 
I wonder is whether the performance penalty is due to a lack of 
inlining caused by crossing library boundaries; the imperative 
versions don't cross these boundaries. If you're willing, you 
could try adding LDC's LTO options and see what happens. There 
are instructions in the release notes for LDC 1.9.0 
(https://github.com/ldc-developers/ldc/releases). Make sure 
you use the form that includes druntime and Phobos.
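
If I'm reading those release notes right, the invocation is 
roughly the following (exact flags may vary by platform and LDC 
version, so treat this as a sketch; -flto=full is the other 
variant):

```
ldc2 -O3 -release -flto=thin -defaultlib=phobos2-ldc-lto,druntime-ldc-lto app.d
```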


--Jon


Re: Splitting up large dirty file

2018-05-21 Thread Dennis via Digitalmars-d-learn

On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn 
wrote:
drop is range-based, so if you give it a string, it's going to 
decode because of the whole auto-decoding mess with 
std.range.primitives.front and popFront.


In this case I used drop to drop lines, not characters. The 
exception was thrown by the joiner it turns out.


On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:

I find Exceptions in range code hard to interpret.


Well, if you just look at the stack trace, it should tell you. 
I don't see why ranges would be any worse than any other code 
except for maybe the fact that it's typical to chain a lot of 
calls, and you frequently end up with wrapper types in the 
stack trace that you're not necessarily familiar with.


Exactly that: stack trace full of weird mangled names of template 
functions, lambdas etc. And because of lazy evaluation and chains 
of range functions, the line number doesn't easily show who the 
culprit is.


On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
In many cases, ranges will be pretty much the same as writing 
loops, and in others, the abstraction is worth the cost.


From the benchmarking I did, I found that ranges are easily an 
order of magnitude slower even with compiler optimizations:


https://run.dlang.io/gist/5f243ca5ba80d958c0bc16d5b73f2934?compiler=ldc=-O3%20-release

```
LDC -O3 -release
             Range      Procedural
Stringtest:  ["267ns",  "11ns"]
Numbertest:  ["393ns",  "153ns"]

DMD -O -inline -release
             Range      Procedural
Stringtest:  ["329ns",  "8ns"]
Numbertest:  ["1237ns", "282ns"]
```

This first range test is an opcode scanner I wrote for an 
assembler. The range code is very nice and it works, but it 
needlessly allocates a new string. So I switched to a procedural 
version, which runs (and compiles) faster. This procedural 
version did have some bugs initially though.


The second test is a simple number calculation. I thought the 
range code would inline to roughly the same code as the 
procedural version, so it could be optimized the same, but a 
factor-2 gap remains. I don't know where the difficulty is, but I 
did notice that switching the maximum number from int to enum 
makes the procedural version take 0 ns (calculated at compile 
time), while LDC can't deduce the outcome in the range version 
(which still runs for >300 ns).
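
A hypothetical reconstruction of that pattern (the actual 
benchmark is in the gist above; this only shows where the 
enum/int difference enters):

```
import std.algorithm : filter, sum;
import std.range : iota;

enum max = 100_000;   // with `int max = 100_000;` neither version folds

int procedural()
{
    int total;
    foreach (i; 0 .. max)        // LDC folds this to a constant
        if (i % 3 == 0) total += i;
    return total;
}

int ranges()
{
    // equivalent pipeline; per the measurements above, LDC doesn't fold it
    return iota(max).filter!(i => i % 3 == 0).sum;
}
```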


Re: Splitting up large dirty file

2018-05-21 Thread Jonathan M Davis via Digitalmars-d-learn
On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote:
> On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
> > It's unfortunate that Phobos tells you 'there's problems with
> > the encoding' without providing any means to fix it or even
> > diagnose it.
>
> I have to take that back since I found out about std.encoding
> which has functions like `sanitize`, but also `transcode`. (My
> file turned out to actually be encoded with ANSI / Windows-1252,
> not UTF-8)
> Documentation is scarce however, and it requires strings instead
> of forward ranges.
>
> @Jon Degenhardt
>
> > Instead of:
> >  auto outputFile = new File("output.txt");
> >
> > try:
> > auto outputFile = File("output.txt", "w");
>
> Wow I really butchered that code. So it is the `drop(4)` that
> triggers the UTFException?

drop is range-based, so if you give it a string, it's going to decode
because of the whole auto-decoding mess with std.range.primitives.front and
popFront. If you can't have auto-decoding, you either have to be dealing
with functions that you know avoid it, or you need to use something like
std.string.representation or std.utf.byCodeUnit to get around the
auto-decoding. If you're dealing with invalid Unicode, you basically have to
either convert it all up front or treat it like binary data, or Phobos is
going to try to decode it as Unicode and throw a UTFException.
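
For instance, a minimal sketch of the two workarounds (the input
is illustrative):

```
import std.range : drop;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    string s = "h\xFFello";              // invalid UTF-8 in the middle

    // s.drop(2) would auto-decode and throw a UTFException here.

    auto a = s.byCodeUnit.drop(2);       // code units, no decoding
    auto b = s.representation.drop(2);   // immutable(ubyte)[], no decoding
}
```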

> I find Exceptions in range code hard to interpret.

Well, if you just look at the stack trace, it should tell you. I don't see
why ranges would be any worse than any other code except for maybe the fact
that it's typical to chain a lot of calls, and you frequently end up with
wrapper types in the stack trace that you're not necessarily familiar with.
The big problem here really is that all you're really being told is that
your string has invalid Unicode in it somewhere and the chain of function
calls that resulted in std.utf.decode being called on your invalid Unicode.
But even if you weren't dealing with ranges, if you passed invalid Unicode
to something completely string-based which did decoding, you'd run into
pretty much the same problem. The data is being used outside of its original
context where you could easily figure out what it relates to, so it's going
to be a problem by its very nature. The only real solution there is to
control the decoding yourself, and even then, it's easy to be in a position
where it's hard to figure out where in the data the bad data is unless
you've done something like keep track of exactly what index you're at, which
really doesn't work well once you're dealing with slicing data.

> @Kagamin
>
> > Do it old school?
>
> I want to be convinced that Range programming works like a charm,
> but the procedural approaches remain more flexible (and faster
> too) it seems. Thanks for the example.

The whole auto-decoding mess makes things worse than they should be, but if
you find procedural examples more flexible, then I would guess that that
would be simply a matter of getting more experience with ranges. Ranges are
far more composable in terms of how they're used, which tends to inherently
make them more flexible. However, it does result in code that's a mixture of
functional and procedural programming, which can be quite a shift for some
folks. So, there's no question that it takes some getting used to, but D
does allow for the more classic approaches, and ranges are not always the
best approach.

As for performance, that depends on the code and the compiler. It wouldn't
surprise me if dmd didn't optimize out the range stuff as much as it really
should, but it's my understanding that ldc typically manages to generate
code where the range abstraction didn't cost you anything. If there's an
issue, I think that it's frequently an algorithmic one or the fact that some
range-processing has a tendency to process the same data multiple times,
because that's the easiest, most abstract way to go about it and works in
general but isn't always the best solution.

For instance, because of how the range API works, when using splitter, if
you iterate through the entire range, you pretty much have to iterate
through it twice, because it does look-ahead to find the delimiter and then
returns you a slice up to that point, after which, you process that chunk of
the data to do whatever it is you want to do with each split piece. At a
conceptual level, what you're doing with your code with splitter is then
really clean and easy to write, and often, it should be plenty efficient,
but it does require going over the data twice, whereas if you looped over
the data yourself, looking for each delimiter, you'd only need to iterate
over it once. So, in cases like that, I'd fully expect the abstraction to
cost you, though whether it costs enough to matter depends on what you're
doing.
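
To make the two-pass point concrete, a sketch (illustrative, not from the
thread):

```
import std.algorithm : splitter;

// splitter walks each piece once to find the delimiter, and the loop
// body walks the same piece again:
ulong sumBytesSplitter(const(ubyte)[] data)
{
    ulong total;
    foreach (piece; data.splitter(cast(ubyte) '\n'))
        foreach (b; piece) total += b;   // second pass over the piece
    return total;
}

// A hand-rolled loop combines the delimiter test and the processing
// into a single pass:
ulong sumBytesOnePass(const(ubyte)[] data)
{
    ulong total;
    foreach (b; data)
        if (b != '\n') total += b;
    return total;
}
```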

As is the case when dealing with most abstractions, I think that it's mostly
a matter of using it where it makes sense.

Re: Splitting up large dirty file

2018-05-21 Thread Dennis via Digitalmars-d-learn

On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
It's unfortunate that Phobos tells you 'there's problems with 
the encoding' without providing any means to fix it or even 
diagnose it.


I have to take that back since I found out about std.encoding 
which has functions like `sanitize`, but also `transcode`. (My 
file turned out to actually be encoded with ANSI / Windows-1252, 
not UTF-8)
Documentation is scarce however, and it requires strings instead 
of forward ranges.
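
A rough sketch of the string-based route (assuming std.encoding's 
Windows1252String, and that the data fits in memory, which this 
dump doesn't, so it would have to be applied per chunk):

```
import std.encoding : sanitize, transcode, Windows1252String;
import std.file : read;

void main()
{
    // Treat the raw bytes as Windows-1252 and transcode to UTF-8:
    auto raw = cast(Windows1252String) read("input.txt");
    string utf8;
    transcode(raw, utf8);

    // Or, for data that is supposed to be UTF-8, replace invalid
    // sequences with U+FFFD:
    auto cleaned = sanitize(cast(string) read("input.txt"));
}
```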


@Jon Degenhardt

Instead of:

 auto outputFile = new File("output.txt");

try:

auto outputFile = File("output.txt", "w");


Wow I really butchered that code. So it is the `drop(4)` that 
triggers the UTFException? I find Exceptions in range code hard 
to interpret.


@Kagamin

Do it old school?


I want to be convinced that Range programming works like a charm, 
but the procedural approaches remain more flexible (and faster 
too) it seems. Thanks for the example.





Re: Splitting up large dirty file

2018-05-18 Thread Kagamin via Digitalmars-d-learn

On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:

```
auto inputStream = (args.length < 2 || args[1] == "-") ? 
stdin : args[1].File;

auto outputFile = new File("output.txt");
foreach (line; inputStream.byLine(KeepTerminator.yes)) 
outputFile.write(line);

```


Do it old school?
---
import std.algorithm : countUntil;
import std.stdio;

void main(string[] args)
{
  auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
  int line;
  auto outputFile = File("output.txt", "wb");
  foreach (chunk; inputStream.byChunk(4<<10))
  {
    auto rem = chunk;
    while (rem.length)
    {
      // countUntil returns -1 when no newline (byte 10) remains
      auto i = rem.countUntil(10);
      auto len = i + 1;
      if (i < 0) len = rem.length; else line++;
      outputFile.rawWrite(rem[0..len]);
      rem = rem[len..$];
    }
  }
}
---


Re: Splitting up large dirty file

2018-05-17 Thread Jonathan M Davis via Digitalmars-d-learn
On Thursday, May 17, 2018 21:10:35 Dennis via Digitalmars-d-learn wrote:
> On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> > For various reasons, that doesn't always hold true like it
> > should, but pretty much all of Phobos is written with that
> > assumption and will generally throw an exception if it isn't.
>
> It's unfortunate that Phobos tells you 'there's problems with the
> encoding' without providing any means to fix it or even diagnose
> it. The UTFException doesn't contain what the character in
> question was. You just have to abort whatever you were trying to
> do.

UTFException has a sequence member and a len member (which appear to be
public but undocumented) which should contain the invalid sequence of code
units. In general though, exceptions aren't a great way to deal with this
problem. I think that you either want to be calling decode manually (in
which case, you have direct access to where the invalid Unicode is and have
the freedom to deal with it however is appropriate), or using the Unicode
replacement character would be better (which std.utf.decode supports, but
it's not what's used by default). Really, what's biting you here is the
auto-decoding. With Phobos, you have to fight to keep it from happening by
doing stuff like special-casing your code for strings or using
std.string.representation or std.utf.byCodeUnit.
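
A sketch of both routes (the bad input is illustrative; the sequence/len
usage follows the members described above):

```
import std.utf : decode, UseReplacementDchar, UTFException;

void main()
{
    string s = "ab\xFFcd";
    size_t i;

    try
    {
        while (i < s.length)
            decode(s, i);                  // throws on the bad byte
    }
    catch (UTFException e)
    {
        auto bad = e.sequence[0 .. e.len]; // the offending code units
    }

    // Or substitute U+FFFD instead of throwing:
    i = 0;
    while (i < s.length)
    {
        dchar c = decode!(UseReplacementDchar.yes)(s, i);
        // c is U+FFFD wherever the input was invalid
    }
}
```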

In principle, the way that Unicode would ideally be handled would be to
validate all character data when it enters the program (doing whatever is
appropriate with invalid Unicode at that point), and then the rest of the
program then either is always dealing with valid Unicode, or it's dealing
with integral values that it doesn't treat as Unicode (e.g. ubyte[]). But
the way that Phobos is written, it ends up decoding and validating all over
the place.

> On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> > If you're ever dealing with a different encoding (or with
> > invalid Unicode), you really need to use integral types like
> > ubyte
>
> I tried something like byChunk(4096).joiner.splitter(cast(ubyte)
> '\n') but it turns out splitter wants at least a forward range,
> even when the separator is a single element.

Actually, I'm pretty sure that splitter currently requires a random-access
range (even though it should theoretically work with a forward range). I
don't think that it can be made to work with an input range though, given how
the range API works - or at least, if it were made to work with one, you'd
have to deal with the fact that popping front on the splitter range would
invalidate anything that had been returned from front. And it would be
difficult to implement it @safely if what gets returned by front is not
completely independent of the splitter range (which means that it needs
save). Basic input ranges in general tend to be extremely limited in what
they can do, which can get really annoying when you deal with stuff like
files or sockets where making it a forward range likely means either reading
it all into memory or having buffers that potentially have to be dup-ed by
each call to save.

- Jonathan M Davis



Re: Splitting up large dirty file

2018-05-17 Thread Jon Degenhardt via Digitalmars-d-learn

On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:

On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
If you write it in the style of my earlier example and use 
counters and if-tests it will work. byLine by itself won't try 
to interpret the characters (won't auto-decode them), so it 
won't trigger an exception if there are invalid utf-8 
characters.


When printing to stdout it seems to skip any validation, but 
writing to a file does give an exception:


```
auto inputStream = (args.length < 2 || args[1] == "-") ? 
stdin : args[1].File;

auto outputFile = new File("output.txt");
foreach (line; inputStream.byLine(KeepTerminator.yes)) 
outputFile.write(line);

```
std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877):
  (No error)

According to the documentation, byLine can throw a 
UTFException, so relying on the fact that it doesn't in some 
cases doesn't seem like a good idea.


Instead of:

 auto outputFile = new File("output.txt");

try:

auto outputFile = File("output.txt", "w");

That works for me. The second arg ("w") opens the file for write. 
When I omit it, I also get an exception, as the default open mode 
is for read:


 * If file does not exist:  Cannot open file `output.txt' in mode 
`rb' (No such file or directory)

 * If file does exist:   (Bad file descriptor)

The second error presumably occurs when writing.

As an aside - I agree with one of your bigger picture 
observations: It would be preferable to have more control over 
utf-8 error handling behavior at the application level.


Re: Splitting up large dirty file

2018-05-17 Thread ag0aep6g via Digitalmars-d-learn

On 05/17/2018 11:40 PM, Neia Neutuladh wrote:
0b1100_0000 through 0b1111_1110 is the start of a 
multibyte character


Nitpick: It only goes up to 0b1111_0100. The highest code point is 
U+10FFFF. There are no sequences with more than four bytes.


Re: Splitting up large dirty file

2018-05-17 Thread Neia Neutuladh via Digitalmars-d-learn

On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:

I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb 
would fit but I get an out of memory error when using 
std.file.read)


Memory mapping should work. That's in core.sys.posix.sys.mman for 
Posix systems, and Windows has some equivalent probably. (But 
nobody uses Windows, right?)


- It is dirty (contains invalid Unicode characters, null bytes 
in the middle of lines)


std.algorithm should generally work with sequences of anything, 
not just strings. So memory map, cast to ubyte[], and deal with 
it that way?
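
Something along these lines, presumably (std.mmfile wraps the 
platform primitives on both Posix and Windows; file name 
illustrative):

```
import std.algorithm : count, splitter;
import std.mmfile : MmFile;

void main()
{
    auto mmf = new MmFile("input.txt");
    auto data = cast(ubyte[]) mmf[];

    // ubyte elements: no auto-decoding, no UTFException.
    auto lineCount = data.count(cast(ubyte) '\n');
    auto lines = data.splitter(cast(ubyte) '\n');
}
```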


- When you convert chunks to arrays, you have the risk of a 
split being in the middle of a character with multiple code 
units


It's straightforward to scan for the start of a Unicode 
character; you just skip past characters where the highest bit is 
set and the next-highest is not. (0b1100_0000 through 0b1111_1110 
is the start of a multibyte character; 0b0000_0000 through 
0b0111_1111 is a single-byte character.)
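
A small helper capturing that rule (illustrative sketch):

```
// A code unit starts a character unless it's a continuation byte
// (highest bit set, next-highest clear), i.e. 0b10xx_xxxx.
bool isCharStart(ubyte b)
{
    return (b & 0b1100_0000) != 0b1000_0000;
}

// Scan forward from a chunk boundary to the next character start:
size_t nextCharStart(const(ubyte)[] data, size_t i)
{
    while (i < data.length && !isCharStart(data[i]))
        i++;
    return i;
}
```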


That said, you seem to only need to split based on a newline 
character, so you might be able to ignore this entirely, even if 
you go by chunks.


Re: Splitting up large dirty file

2018-05-17 Thread Dennis via Digitalmars-d-learn

On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
For various reasons, that doesn't always hold true like it 
should, but pretty much all of Phobos is written with that 
assumption and will generally throw an exception if it isn't.


It's unfortunate that Phobos tells you 'there's problems with the 
encoding' without providing any means to fix it or even diagnose 
it. The UTFException doesn't contain what the character in 
question was. You just have to abort whatever you were trying to 
do.


On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
If you're ever dealing with a different encoding (or with 
invalid Unicode), you really need to use integral types like 
ubyte


I tried something like byChunk(4096).joiner.splitter(cast(ubyte) 
'\n') but it turns out splitter wants at least a forward range, 
even when the separator is a single element.


Re: Splitting up large dirty file

2018-05-17 Thread Dennis via Digitalmars-d-learn

On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
If you write it in the style of my earlier example and use 
counters and if-tests it will work. byLine by itself won't try 
to interpret the characters (won't auto-decode them), so it 
won't trigger an exception if there are invalid utf-8 
characters.


When printing to stdout it seems to skip any validation, but 
writing to a file does give an exception:


```
auto inputStream = (args.length < 2 || args[1] == "-") ? 
stdin : args[1].File;

auto outputFile = new File("output.txt");
foreach (line; inputStream.byLine(KeepTerminator.yes)) 
outputFile.write(line);

```
std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877):
  (No error)

According to the documentation, byLine can throw a UTFException, 
so relying on the fact that it doesn't in some cases doesn't seem 
like a good idea.


Re: Splitting up large dirty file

2018-05-16 Thread Jon Degenhardt via Digitalmars-d-learn

On Wednesday, 16 May 2018 at 07:06:45 UTC, Dennis wrote:

On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
Can you show the program you are using that throws when using 
byLine?


Here's a version that only outputs the first chunk:
```
import std.stdio;
import std.range;
import std.algorithm;
import std.file;
import std.exception;

void main(string[] args) {
    enforce(args.length == 2, "Pass one filename as argument");
    auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);

    new File("output.txt", "w").write(lineChunks.front.joiner);
}
```


If you write it in the style of my earlier example and use 
counters and if-tests it will work. byLine by itself won't try to 
interpret the characters (won't auto-decode them), so it won't 
trigger an exception if there are invalid utf-8 characters.




Re: Splitting up large dirty file

2018-05-16 Thread Jonathan M Davis via Digitalmars-d-learn
On Wednesday, May 16, 2018 08:57:10 Dennis via Digitalmars-d-learn wrote:
> I thought it wouldn't be hard to crudely split this file using
> D's range functions and basic string manipulation, but the
> combination of being too large for a string and having invalid
> encoding seems to defeat most simple solutions.

D is designed with the idea that a string is valid UTF-8, a wstring is valid
UTF-16, and dstring is valid UTF-32. For various reasons, that doesn't
always hold true like it should, but pretty much all of Phobos is written
with that assumption and will generally throw an exception if it isn't. If
you're ever dealing with a different encoding (or with invalid Unicode), you
really need to use integral types like ubyte (e.g. by using
std.string.representation or by reading the data in as ubytes rather than as
a string) and not try to use character types like char or string. If you try
to use char or string with invalid UTF-8 without having it throw any
exceptions, you're pretty much guaranteed to fail.

- Jonathan M Davis



Re: Splitting up large dirty file

2018-05-16 Thread Dennis via Digitalmars-d-learn

On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote:
What is the purpose of `.drop(4)`? I'm pretty sure this is the 
reason for the exception.


The file in question is a .json database dump with an array 
"rows" of 10 million 8-line objects. The newlines in the string 
fields are escaped, but they still contain other invalid 
characters which makes std.json reject it.


The first 4 lines of the file are basically "header" and the last 
2 lines are a closing ] and }, so I want to split every 4 + 
8*(10_000_000/amountOfFiles) lines and also remove the trailing 
comma, add brackets, drop the last 2 lines, etc.


I thought it wouldn't be hard to crudely split this file using 
D's range functions and basic string manipulation, but the 
combination of being too large for a string and having invalid 
encoding seems to defeat most simple solutions. For now I decided 
to use Git Bash and do:

```
tail -n8002 inputfile.json | split -l 800 - outputfile
```
And now I have files that do fit in memory. I'm still interested 
in complete D solutions though, thanks for the iopipe and memory 
mapped file suggestions Steven and Jonathan. I will check those 
out.


Re: Splitting up large dirty file

2018-05-16 Thread drug via Digitalmars-d-learn

16.05.2018 10:06, Dennis wrote:


Here's a version that only outputs the first chunk:
```
import std.stdio;
import std.range;
import std.algorithm;
import std.file;
import std.exception;

void main(string[] args) {
    enforce(args.length == 2, "Pass one filename as argument");
    auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);

    new File("output.txt", "w").write(lineChunks.front.joiner);
}
```

dmd splitFile -g
./splitFile.exe UTF-8-test.txt

std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): 
Invalid UTF-8 sequence (at index 4)


0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, 
char[]).decodeImpl(ref char[], ref uint) at 
C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
0x00403677 in pure @trusted dchar std.utf.decode!(0, char[]).decode(ref 
char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
0x00403575 in pure @property @safe dchar 
std.range.primitives.front!(char).front(char[]) at 
C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
0x0040566D in pure @property dchar 
std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, 
char).ByLineImpl).Chunks.Chunk).joiner(std.range
.Chunks!(std.stdio.File.ByLineImpl!(char, 
char).ByLineImpl).Chunks.Chunk).Result.front() at 
C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)


What is the purpose of `.drop(4)`? I'm pretty sure this is the reason for 
the exception.


Re: Splitting up large dirty file

2018-05-16 Thread Dennis via Digitalmars-d-learn

On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
Can you show the program you are using that throws when using 
byLine?


Here's a version that only outputs the first chunk:
```
import std.stdio;
import std.range;
import std.algorithm;
import std.file;
import std.exception;

void main(string[] args) {
    enforce(args.length == 2, "Pass one filename as argument");
    auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);

    new File("output.txt", "w").write(lineChunks.front.joiner);
}
```

dmd splitFile -g
./splitFile.exe UTF-8-test.txt

std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): 
Invalid UTF-8 sequence (at index 4)

0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, 
char[]).decodeImpl(ref char[], ref uint) at 
C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
0x00403677 in pure @trusted dchar std.utf.decode!(0, 
char[]).decode(ref char[], ref uint) at 
C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
0x00403575 in pure @property @safe dchar 
std.range.primitives.front!(char).front(char[]) at 
C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
0x0040566D in pure @property dchar 
std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).joiner(std.range
.Chunks!(std.stdio.File.ByLineImpl!(char, 
char).ByLineImpl).Chunks.Chunk).Result.front() at 
C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)


Re: Splitting up large dirty file

2018-05-15 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:

I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb 
would fit but I get an out of memory error when using 
std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes 
in the middle of lines)


I want to write a program that splits it up into multiple 
files, with the splits happening every n lines. I keep 
encountering roadblocks though:


- You can't give Yes.useReplacementChar to `byLine`, and 
`byLine` (or `readln`) throws an Exception upon encountering an 
invalid character.


Can you show the program you are using that throws when using 
byLine? I tried a very simple program that reads and outputs 
line-by-line, then fed it a file that contained invalid utf-8. I 
did not see an exception. The invalid utf-8 was created by taking 
part of this file: 
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a 
commonly used file with utf-8 edge cases), plus adding a number 
of random hex characters, including null. I don't see exceptions 
thrown.


The program I used:

int main(string[] args)
{
    import std.stdio;
    import std.conv : to;
    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-")
            ? stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes))
            write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}





Re: Splitting up large dirty file

2018-05-15 Thread Jonathan M Davis via Digitalmars-d-learn
On Tuesday, May 15, 2018 20:36:21 Dennis via Digitalmars-d-learn wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb
> would fit but I get an out of memory error when using
> std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes in
> the middle of lines)
>
> I want to write a program that splits it up into multiple files,
> with the splits happening every n lines. I keep encountering
> roadblocks though:
>
> - You can't give Yes.useReplacementChar to `byLine`, and `byLine`
> (or `readln`) throws an Exception upon encountering an invalid
> character.
> - decodeFront doesn't work on inputRanges like
> `byChunk(4096).joiner`
> - std.algorithm.splitter doesn't work on inputRanges either
> - When you convert chunks to arrays, you have the risk of a split
> being in the middle of a character with multiple code units
>
> Is there a simple way to do this?

If you're on a *nix system, and you're simply looking for a solution to
split files and don't necessarily care about writing one, I'd suggest trying
the split utility:

https://linux.die.net/man/1/split

If I had to write it in D, I'd probably just use std.mmfile and operate on
the file as a dynamic array of ubytes, since if what you care about is '\n',
that can easily be searched for without needing any decoding, and using mmap
avoids having to chunk anything.
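
For illustration, a sketch along those lines using std.mmfile (split size
and file names are hypothetical):

```
import std.conv : to;
import std.mmfile : MmFile;
import std.stdio : File;

void main()
{
    enum linesPerFile = 1_000_000;   // illustrative split size

    auto mmf = new MmFile("input.txt");
    auto data = cast(ubyte[]) mmf[];

    size_t fileNum, start, lines;
    foreach (i, b; data)
    {
        if (b != '\n') continue;
        if (++lines == linesPerFile)
        {
            File("output" ~ fileNum.to!string ~ ".txt", "wb")
                .rawWrite(data[start .. i + 1]);
            start = i + 1;
            lines = 0;
            fileNum++;
        }
    }
    if (start < data.length)   // whatever remains after the last split
        File("output" ~ fileNum.to!string ~ ".txt", "wb")
            .rawWrite(data[start .. $]);
}
```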

- Jonathan M Davis



Re: Splitting up large dirty file

2018-05-15 Thread Steven Schveighoffer via Digitalmars-d-learn

On 5/15/18 4:36 PM, Dennis wrote:

I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb would fit 
but I get an out of memory error when using std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in the 
middle of lines)


I want to write a program that splits it up into multiple files, with 
the splits happening every n lines. I keep encountering roadblocks though:


- You can't give Yes.useReplacementChar to `byLine`, and `byLine` (or 
`readln`) throws an Exception upon encountering an invalid character.

- decodeFront doesn't work on inputRanges like `byChunk(4096).joiner`
- std.algorithm.splitter doesn't work on inputRanges either
- When you convert chunks to arrays, you have the risk of a split being 
in the middle of a character with multiple code units


Is there a simple way to do this?



Using iopipe, you can split on N lines (iopipe doesn't autodecode when 
searching for newlines), or split on a pre-determined chunk size (and 
ensure you don't split a code point).


Splitting on N lines:

import iopipe.bufpipe;
import iopipe.textpipe;

auto infile = openDev("filename").bufd.assumeText.byLine;

foreach(i; 0 .. N) infile.extend(0); // ensure N lines in the buffer

Splitting on pre-determined chunk size

auto infile = openDev("filename")
.bufd!(ubyte, chunkSize) // use chunkSize as minimum read size
.assumeText // it's text, not ubyte
.ensureDecodeable; // do not end in the middle of a codepoint

The output isn't as straightforward. Ideally you would want to simply 
create an output pipe that split into multiple files, and process the 
whole thing at once. I haven't created such a thing yet though (will add 
an enhancement request to do so).


Easiest thing to do is to write the entire window of the input pipe into 
an output pipe, or cast it back to ubyte[] and write directly to an 
output device.


e.g.:

auto infile = ... // one of the above ideas
   .encodeText; // convert to ubyte

auto outfile = openDev("outputFilename1", "w");
outfile.write(infile.window);
outfile.close;
infile.release(infile.window.length); // flush the input buffer
... // refill the buffer using the chosen technique above.

-Steve