[xz-devel] List is now public

2011-01-27 Thread Lasse Collin
I added this to mail-archive.com. I will update the home page in a few 
hours once I see that this message is visible on mail-archive.com.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.0.1

2011-01-29 Thread Lasse Collin
XZ Utils 5.0.1 is available at <http://tukaani.org/xz/>. It fixes a few 
minor bugs. Here is an extract from the NEWS file:

  * xz --force now (de)compresses files that have setuid, setgid,
or sticky bit set and files that have multiple hard links.
The man page had it documented this way already, but the code
had a bug.

  * gzip and bzip2 support in xzdiff was fixed.

  * Portability fixes

  * Minor fix to Czech translation

(As written on <http://tukaani.org/xz/lists.html>, I will send release 
announcements to xz-devel also in the future, so there's no need to 
subscribe to both xz-devel and xz-announce.)

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Detecting .lzma-compressed files

2011-03-17 Thread Lasse Collin
On 2011-03-17 Mark wrote:
> What is the best way to detect data which was compressed using
> lzma_alone (i.e. .lzma files)?

There is no easy answer. It depends on whether you want to detect only the 
typical .lzma files (over 99.9 % of .lzma files) or also the uncommon ones. 
The typical .lzma files have been created with LZMA Utils 4.32.x (any 
compression settings), XZ Utils (with the most common settings), or LZMA 
SDK (default settings). LZMA SDK and LZMA Utils can decode the uncommon 
.lzma files too, but some of the uncommon files cannot be decompressed 
with XZ Utils.

> I'm developing a patch for the star archiver to support xz-compressed
> files. While detection of an XZ-format file is easy enough, .lzma
> doesn't seem to be. (This is so star invokes the correct program to
> decompress the data.)

With GNU tar I used a patch that checked that the first three bytes are 
0x5D 0x00 0x00 ("]\0\0"). It caught all typical .lzma files and didn't 
conflict with other compressors. It did have a false positive if the 
first file inside the .tar was named "]", so I don't know if this 
solution is acceptable to you.

A more complex hack with fewer false positives but also some false 
negatives is used in XZ Utils:

  - The first byte must be in the range [0x00, 0xE0]. In most
files it is 0x5D (']').

  - The next four bytes are read as an unsigned 32-bit little
endian integer. This indicates the dictionary size. In
typical files it is 2^n or 2^n + 2^(n-1). XZ Utils accepts
only these sizes and UINT32_MAX. The .lzma format allows
other sizes too, though, and LZMA Utils 4.32.x and LZMA SDK
accept any dictionary size.

  - The next eight bytes are read as an unsigned 64-bit little
endian integer. This indicates the uncompressed size of
the file. It should be either UINT64_MAX (meaning that the
size is unknown) or the actual size in bytes. XZ Utils rejects
files having a known size greater than 2^38 bytes (256 GiB).

Parts of is_format_lzma() in src/xz/coder.c in the XZ Utils source tree 
might be useful to you. Reading doc/lzma-file-format.txt might help a 
little too.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Detecting .lzma-compressed files

2011-03-17 Thread Lasse Collin
On 2011-03-17 ma...@clara.co.uk wrote:
> Lasse Collin wrote:
> > A more complex hack with fewer false positives but also some false
> > negatives is used in XZ Utils:
> > ...
> >   - The next eight bytes are read as unsigned 64-bit little
> > endian integer. This indicates the uncompressed size of
> > the file. It should be either UINT64_MAX (meaning that the
> > size is unknown) or some size as bytes. XZ Utils rejects
> > files having a known size greater than 2^38 bytes (256 GiB).
> 
> Can xz be forced to work with a file whose uncompressed size field is
> larger than 256GiB? That does seem a bit small these days; it's
> conceivable that some users might be compressing files that large.

Currently no. I will reconsider if it turns out to be a real-world 
problem for someone.

Note that the limit applies only to .lzma files with a known uncompressed 
size. Files created in a pipe have an unknown uncompressed size, and 
.lzma files created with XZ Utils always record the size as unknown 
(it keeps the code simpler). Most new files will use .xz instead of 
.lzma, so the limitations of the .lzma support in XZ Utils don't matter 
much, I hope.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ in Java

2011-03-30 Thread Lasse Collin
There is now something for decompressing .xz files in Java:

http://tukaani.org/xz/java.html

It currently lacks the BCJ filters but otherwise supports everything 
from the .xz specification. It hasn't been tested much, but at least it 
behaves correctly with the test files from XZ Utils.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.0.2

2011-04-01 Thread Lasse Collin
XZ Utils 5.0.2 is available at <http://tukaani.org/xz/>. It fixes a few 
minor bugs. Here is an extract from the NEWS file:

  * LZMA2 decompressor now correctly accepts LZMA2 streams with no
uncompressed data. Previously it considered them corrupt. The
bug can affect applications that use raw LZMA2 streams. It is
very unlikely to affect .xz files because no compressor creates
.xz files with empty LZMA2 streams. (Empty .xz files are a
different thing than empty LZMA2 streams.)

  * "xz --suffix=.foo filename.foo" now refuses to compress the
file because it already has the suffix .foo. This was already
documented on the man page, but the code lacked the check.

  * "xzgrep -l foo bar.xz" works now.

  * Polish translation was added.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.1.1alpha

2011-04-12 Thread Lasse Collin
XZ Utils 5.1.1alpha is available at <http://tukaani.org/xz/>. Here is an 
extract from the NEWS file:

  * All fixes from 5.0.2

  * liblzma fixes that will also be included in 5.0.3:

  - A memory leak was fixed.

  - lzma_stream_buffer_encode() no longer creates an empty .xz
Block if encoding an empty buffer. Such an empty Block with
LZMA2 data would trigger a bug in 5.0.1 and older (see the
first bullet point in 5.0.2 notes). When releasing 5.0.2,
I thought that no encoder creates this kind of file, but
I was wrong.

  - Validate function arguments better in a few functions. Most
importantly, specifying an unsupported integrity check to
lzma_stream_buffer_encode() no longer creates a corrupt .xz
file. Probably no application tries to do that, so this
shouldn't be a big problem in practice.

  - Document that lzma_block_buffer_encode(),
lzma_easy_buffer_encode(), lzma_stream_encoder(), and
lzma_stream_buffer_encode() may return LZMA_UNSUPPORTED_CHECK.

  - The return values of the _memusage() functions are now
documented better.

  * Support for multithreaded compression was added using the simplest
method, which splits the input data into blocks and compresses
them independently. Other methods will be added in the future.
The current method has room for improvement, e.g. it is possible
to reduce the memory usage.

  * Added the options --single-stream and --block-size=SIZE to xz.

  * xzdiff and xzgrep now support .lzo files if lzop is installed.
The .tzo suffix is also recognized as a shorthand for .tar.lzo.

  * Support for short 8.3 filenames under DOS was added to xz. It is
experimental and may change before it gets into a stable release.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] strerror-like functionality in liblzma

2011-05-17 Thread Lasse Collin
Implementing a function to convert lzma_ret to string is tricky, because 
the same return values have slightly different meanings when returned by 
different functions. This is a design mistake in the API, but it cannot 
be fixed without breaking the API, which I don't want to do.

One possibility would be to provide a few strerror-like functions that 
could be used with return values of different functions. This doesn't 
sound nice though.

Letting liblzma construct the error message when the error occurs allows 
more detailed error messages than what one could get by converting 
lzma_ret to a string. E.g. when LZMA_OPTIONS_ERROR is returned, the 
error message could include what compression option was the problem.

Functions that work on lzma_stream could store the message in the 
lzma_stream structure. This is what zlib does. liblzma has many 
functions that don't use lzma_stream, so this isn't a solution for those 
functions.

A thread-local variable to store an error message would work with all 
functions and also in threaded programs. Would this be OK? Does someone 
have alternative ideas?

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] strerror-like functionality in liblzma

2011-05-17 Thread Lasse Collin
On 2011-05-17 Thorsten Glaser wrote:
> Lasse Collin dixit:
> >A thread-local variable to store an error message would work with
> >all
> 
> This wouldn’t be portable at all.

To be more exact, I meant a function that would return a pointer to a 
thread-specific char array. POSIX has pthread_key_create() and 
pthread_once(), which can be used to implement this. I think those are 
fairly portable. I'm aware that Windows might give some gray hair, but I 
won't worry about that too much.
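A minimal sketch of that pattern (the function name and the fixed 512-byte buffer are made up for illustration; this is not actual liblzma API):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One process-wide key, created exactly once via pthread_once(). */
static pthread_key_t errmsg_key;
static pthread_once_t errmsg_once = PTHREAD_ONCE_INIT;

/* Destructor registered with the key; the pthreads library calls it
 * automatically when each thread exits. */
static void
errmsg_free(void *buf)
{
    free(buf);
}

static void
errmsg_init(void)
{
    pthread_key_create(&errmsg_key, errmsg_free);
}

/* Return a pointer to this thread's private message buffer,
 * allocating it on first use. */
char *
my_errmsg_buffer(void)
{
    pthread_once(&errmsg_once, errmsg_init);
    char *buf = pthread_getspecific(errmsg_key);
    if (buf == NULL) {
        buf = calloc(1, 512);
        pthread_setspecific(errmsg_key, buf);
    }
    return buf;
}
```

Each thread then formats its own error text into my_errmsg_buffer() without any locking, since every thread sees a different buffer.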

Maybe there is a problem if liblzma is loaded with dlopen() and later 
unloaded with dlclose(). It could leak the memory allocated for the 
thread-specific data and leak the resources associated with a 
pthread_key_t. glibc supports destructor functions that are called 
before dlclose() returns. I think that would prevent this issue, but 
such destructors aren't portable.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] strerror-like functionality in liblzma

2011-05-17 Thread Lasse Collin
On 2011-05-17 Thorsten Glaser wrote:
> Lasse Collin dixit:
> >To be more exact, I meant a function that would return a pointer to
> >thread-specific char array. POSIX has pthread_key_create() and
> 
> Oh sure. Let’s just force all xz users to link in libpthread…

It already does that unless you pass --disable-threads to configure when 
compiling XZ Utils. 5.1.1alpha supports threaded compression, so most 
people won't want to disable threading support.

It's the dlopen/dlclose situation that worries me. GNU and Solaris call 
functions registered with atexit() when a library is unloaded, but that 
trick isn't supported e.g. on BSDs. GCC's __attribute__((destructor)) 
seems to work on a few other systems too, but it requires that the 
compiler supports GNU C extensions, which isn't an acceptable 
requirement in this case.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] strerror-like functionality in liblzma

2011-05-19 Thread Lasse Collin
On 2011-05-19 Jonathan Nieder wrote:
> Lasse Collin wrote:
> > It already does that unless you pass --disable-threads to configure
> > when compiling XZ Utils.
> 
> It seems like a sane worry.  If someone uses --disable-threads, does
> that mean that person won't need a thread-safe way to get error
> messages?

--disable-threads already means that liblzma might become thread-unsafe. 
This is documented in --help and in INSTALL. Currently the thread-unsafe 
situation occurs only if --enable-small is also used.

> (One reasonable answer might be "yes, such a person can
> read the documentation and figure out what happened from the error
> numbers, and at least they won't be worse off than they started."  If
> it proves to be annoying, it's possible to introduce _r variants that
> return the error message through a parameter later.)

It's possible that I will do something like this as the only method. 
With functions that use lzma_stream, the message can be stored there. 
For some other functions, a method to pass a pointer to a buffer to hold 
the message may be needed.

> > It's the dlopen/dlclose situation that worries me. GNU and Solaris
> > call functions registered with atexit() when a library is
> > unloaded, but that trick isn't supported e.g. on BSDs. GCC's
> > __attribute__((destructor)) seems to work on a few other systems
> > too, but it requires that the compiler supports GNU C extensions,
> > which isn't an acceptable requirement in this case.
> 
> C1X has[1] a _Thread_local keyword that might work well.  So in a
> decade or so a person will be able to write
> 
>   _Thread_local const char *lzma_error_message;
> 
> and rely on compilers setting up the appropriate constructors and
> destructors behind the scenes.  Today, GCC has[2] __thread and
> Microsoft C has[3] __declspec(thread).
> 
> I haven't played around with it much, but maybe that can help.

It can help once the new standard has been out for a few years. Before 
that, I cannot rely on GNU C extensions. Currently the code can be 
compiled with several compilers, and that's how I want it to stay in the 
future too. The current portable method for thread-specific data is 
pthread_key_create().

My current understanding of the interactions of pthread_key_create() and 
dlclose():

  - With C++ I could use a global object whose destructor is run
when the library is unloaded with dlclose(). The destructor
would free the thread-specific data and call pthread_key_delete().

  - Non-portable operating system, C compiler, or linker extensions
would make it possible to have a destructor function that is
called when the library is unloaded.

  - I could require that developers call some initialization and
destruction functions when they start and stop using the
library. This would be annoying, and it's easy to forget to
call the destructor.

  - If the library is never unloaded with dlclose(), then there
is no problem with pthread_key_create(). This isn't an
acceptable limitation for liblzma.

In short, if one wants to use only C code and functionality provided by 
POSIX.1-2008, it's not possible to use thread-specific data in a shared 
library without restricting or complicating the use of that library. 
I'll be happy if someone shows that I'm wrong.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [RFC/PATCH] using versioned symbols in liblzma

2011-05-19 Thread Lasse Collin
On 2011-05-19 Jonathan Nieder wrote:
> Well, that would be unpleasant.  Consider a program foo that links to
> both libkdecore5 and libdw1.  The installed version of libdw1 has
> been rebuilt against liblzma6, while the local copy of libkdecore5
> is still linked against liblzma5.  What happens?

If those two libraries exchange pointers to liblzma structures, things 
go wrong even with symbol versions, right? Most libraries don't do that, 
but I suppose you still need to carefully track which libraries do.

> -liblzma_la_LDFLAGS = -no-undefined -version-info 5:99:0
> +liblzma_la_LDFLAGS = -no-undefined -version-info 5:99:0 \
> +   -Wl,--version-script=$(top_srcdir)/src/liblzma/Versions

This option is specific to GNU ld, so it must not be used 
unconditionally. zlib enables symbol versioning if uname -s matches any 
of these:

Linux* | linux* | GNU | GNU/* | *BSD | DragonFly

zlib doesn't use Autoconf, so those need to be converted to the format 
used by Autoconf, although it's not clear to me yet if symbol versioning 
is wanted on all these systems in upstream liblzma. E.g. FreeBSD ships 
xz in the base system with its own symbol versioning file. On the other 
hand, maybe FreeBSD's map file could be used elsewhere too.

http://svnweb.freebsd.org/base/head/lib/liblzma/

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: [RFC/PATCH] using versioned symbols in liblzma

2011-05-19 Thread Lasse Collin
On 2011-05-19 Jonathan Nieder wrote:
> Sadly the symbol versioning mechanism doesn't seem to be documented
> nicely in the style of a manpage anywhere.

Thanks for the links. I'm fine with Texinfo myself. :-)

> Short-term question: would you mind if Debian carries this patch for
> the time being?  In particular, do the version node names
[...]
> seem reasonable to standardize on (in environments that will be using
> symbol versions)?

I'm not sure about the names yet. See the FreeBSD example in another 
email.

> The main unfortunate effect is warnings when running binaries linked
> against the versioned symbols in an environment not providing them.

Those are annoying, but I guess it's not a big deal in this case.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [RFC/PATCH] using versioned symbols in liblzma

2011-05-19 Thread Lasse Collin
On 2011-05-19 Jonathan Nieder wrote:
> >> -liblzma_la_LDFLAGS = -no-undefined -version-info 5:99:0
> >> +liblzma_la_LDFLAGS = -no-undefined -version-info 5:99:0 \
> >> +   -Wl,--version-script=$(top_srcdir)/src/liblzma/Versions
> > 
> > This option is specific to GNU ld, so it must not be used
> > unconditionally. zlib enables symbol versioning if uname -s matches
> > any of these:
> > Linux* | linux* | GNU | GNU/* | *BSD | DragonFly
> > 
> > zlib doesn't use Autoconf so those need to be converted to the
> > format used by Autoconf, although it's not clear to me yet if
> > symbol versioning is wanted on all these systems in upstream
> > liblzma.
> 
> Would it make sense to add an autoconf test and to use
> --version-script by default on platforms that support it?  Then
> users and packagers could pass --disable-symbol-versioning to
> configure when appropriate.

I don't know. If GNU ld and some other linker are both available and 
only the GNU version supports symbol versions, does it make sense that 
the choice of linker affects whether versioning is used? Would that be a 
mess? Also, the other linker might support versioning too but use a 
different command-line option (e.g. Solaris ld uses -M).

> > E.g. FreeBSDs ships
> > xz in the base system with its symbol versioning file. On the other
> > hand, maybe FreeBSD's map file could be used elsewhere too.
> > 
> > http://svnweb.freebsd.org/base/head/lib/liblzma/
> 
> Ah, thanks for the pointer.  I'll use FreeBSD's version names for the
> public symbols (so, XZ_5.0 and XZ_5.1).

I guess it is mostly OK. I think at least alpha versions should be e.g. 
XZ_5.1.1alpha, because I won't try to keep those symbols stable.

> Another question: what can I assume with regard to ABI stability of
> development versions?  For example, is every symbol that appears in a
> beta part of the ABI, while symbols in alphas are subject to change,

That's how I hope it will go, but if there is clear need, I will change 
things even in beta.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.0.3

2011-05-21 Thread Lasse Collin
XZ Utils 5.0.3 is available at <http://tukaani.org/xz/>. Here is an 
extract from the NEWS file:

  * liblzma fixes:

  - A memory leak was fixed.

  - lzma_stream_buffer_encode() no longer creates an empty .xz
Block if encoding an empty buffer. Such an empty Block with
LZMA2 data would trigger a bug in 5.0.1 and older (see the
first bullet point in 5.0.2 notes). When releasing 5.0.2,
I thought that no encoder creates this kind of file, but
I was wrong.

  - Validate function arguments better in a few functions. Most
importantly, specifying an unsupported integrity check to
lzma_stream_buffer_encode() no longer creates a corrupt .xz
file. Probably no application tries to do that, so this
shouldn't be a big problem in practice.

  - Document that lzma_block_buffer_encode(),
lzma_easy_buffer_encode(), lzma_stream_encoder(), and
lzma_stream_buffer_encode() may return LZMA_UNSUPPORTED_CHECK.

  - The return values of the _memusage() functions are now
documented better.

  * Fix command name detection in xzgrep. xzegrep and xzfgrep now
correctly use egrep and fgrep instead of grep.

  * French translation was added.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [RFC/PATCH] using versioned symbols in liblzma

2011-05-22 Thread Lasse Collin
On 2011-05-21 Jonathan Nieder wrote:
> All else being equal, I'd prefer to allow testers to try
> 
>  1. update liblzma
>  2. update xz
> 
> without breaking xz in the window between steps 1 and 2, except in
> the obvious case when a function that turned out to be a bad idea
> was changed or removed.

It will work between stable releases, but I wouldn't like to make such a 
promise for alpha releases. Maybe I could promise it for beta versions, 
but I'm not sure yet.

Since it is likely that in some alpha-to-alpha upgrades your wish 
wouldn't work, I think it is simpler and safer to just assume that new 
things in development releases aren't stable. So with non-stable 
releases, keep xz and liblzma always in sync.

I'm not sure if distros should ship alpha or beta versions of *shared* 
liblzma at all.

> That would mean that even after the alpha
> is over, lzma_stream_encoder_mt would stay as
> 
>   lzma_stream_encoder_mt@XZ_5.1.1alpha

This made me think: what should I do when I extend old functions, e.g. 
by adding support for a new flag? It doesn't affect backward 
compatibility, but it means that new applications that use the new 
functionality won't work with older liblzma versions. Should this kind 
of extension be visible in symbol versions too, or is incrementing the 
minor soname enough?

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [RFC/PATCH] using versioned symbols in liblzma

2011-05-23 Thread Lasse Collin
On 2011-05-23 Jonathan Nieder wrote:
> Lasse Collin wrote:
> > This made me think, what should I do when I extend old functions
> > e.g. by adding support for a new flag? It doesn't affect backward
> > compatibility, but it means that new applications that use the new
> > functionality won't work with older liblzma versions. Should this
> > kind of extensions be visible in symbol versions too, or is
> > incrementing the minor soname enough?
> 
> The old version of the function would return LZMA_OPTIONS_ERROR,
> right?  So just incrementing the minor library version seems safe
> enough. :)

Right.

> If on the other hand you want to make running a new program with the
> old library into a hard error, then the only ways I know are ugly.

I suppose it's not required. On the other hand, I have understood that 
some systems give a warning or an error at program startup if the 
program was linked against a newer minor soname than what is currently 
installed.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] strerror-like functionality in liblzma

2011-05-28 Thread Lasse Collin
On 2011-05-28 Guillem Jover wrote:
> Adding the _r (or _e for error or whatever) counterparts seems like
> the most portable solution (with the C/POSIX restrictions you
> mention), at the cost of API bloat (as we discussed on IRC), and
> probably more code churn? The normal functions could then be made to
> be tiny shims just passing NULL as the error argument.

Yes, it needs more changes to the code than using a thread-local 
variable.

> Passing just a pointer to a buffer might be problematic due to the
> size being unknown to the caller, so ideally it should be a pointer
> to pointer to buffer, and the function allocating the message or
> assigning from a static string table, or a pointer to an int (or
> some other integral type) to just assign an extended lzma_ret code.

Defining a big enough maximum size for the message should be enough, 
e.g. 512 bytes. A caller can allocate such a buffer on the stack. I 
don't want to allocate memory dynamically because the allocation can 
fail and the memory would need to be freed too.

Using a custom string allows more specific messages, e.g. including what 
value was seen in the file vs. what was expected. It is also safer when 
liblzma is loaded with dlopen() because there is no static string that 
could disappear.

> I'd add the TLS storage class specifier as an option, as it seems to
> be supported by quite a few compilers; it obviously depends on which
> ones you want to support currently.

So far my goal has been to support anything that supports C99 well 
enough, sometimes using a build system other than Autotools. In 
practice this has excluded GCC 2 and Microsoft's compilers.

It is very annoying if liblzma provides a function that is available 
only on some platforms. So if TLS is used, it needs to be supported 
almost everywhere to be acceptable for XZ Utils.

> It seems (from [0] and [1]) at least these compilers support
> something like __thread or __declspec(thread):
> 
>   * Borland C++ Builder
>   * Digital Mars C/C++
>   * GNU C/C++
>   * HP Tru64 UNIX C/C++
>   * IBM XL C/C++
>   * Intel C/C++
>   * Sun Studio C/C++
>   * Visual C++

That covers quite a lot, but I'm not confident that it is enough. One 
must keep in mind that compiler support isn't enough: operating system 
support is needed too. So having GCC >= 3.3 available doesn't imply 
that TLS is supported.

> >   - If the library is never unloaded with dlclose(), then there
> > is no problem with pthread_key_create(). This isn't an
> > acceptable limitation for liblzma.
> 
> Well, pthread_key_create() allows one to provide a destructor
> function, so as long as that function is not part of liblzma
> (free(3)), then it can do proper cleanup once the thread terminates,
> regardless of liblzma having been unloaded.

The pthread_key_t is in the memory of liblzma and thus may be gone when 
liblzma is unloaded. A pointer to the destructor function might be stored 
in the key. The key would also be leaked, which can be a real-world 
problem if the library is loaded and unloaded multiple times, because 
the system might support only a limited number of keys per process:

https://svn.boost.org/trac/boost/ticket/4639

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [RFC/PATCH] using versioned symbols in liblzma

2011-05-28 Thread Lasse Collin
I have added symbol versioning to liblzma. Please check that it looks 
sane. It is enabled by default on GNU-based systems and FreeBSD.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Straightforward memory-to-memory compression&decompress in C?

2011-06-26 Thread Lasse Collin
On 2011-06-24 Dan Stromberg wrote:
> I'm looking for some example code (C preferred, something else if
> need be) that will:
> 
> 1) Demonstrate using liblzma (or whatever library xz-utils produces),
> but producing output in the xz format, not the lzma format.
> 
> 2) Demonstrate using liblzma (or whatever) for memory-to-memory
> compression, and memory-to-memory decompression (I'm only
> compressing smallish chunks, and wish to do my own I/O so I can
> sidestep the buffer cache)

There are two example programs in doc/examples directory in XZ Utils 
source. They use multi-call mode to compress big files in a pipe. Data 
is passed to and from liblzma via buffers.

If you want a single-call interface (one function to encode or decode a 
buffer holding a complete .xz file), see lzma_easy_buffer_encode, 
lzma_stream_buffer_encode, and lzma_stream_buffer_decode in 
src/liblzma/api/lzma/container.h (or /usr/include/lzma/container.h).

Note that the above functions work on .xz files, not .lzma files, even 
though the names might suggest otherwise. The names are what they are 
for historical reasons: originally .xz was supposed to be .lzma, 
replacing the old .lzma format.

It would be nice to have more tutorial programs for different use cases, 
but so far I haven't written anything like that.

> BTW, does the underlying library API (need to) change much? 

liblzma API is stable. Things will be added, but old things won't change 
in incompatible ways.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Straightforward memory-to-memory compression&decompress in C?

2011-07-07 Thread Lasse Collin
(No need to CC anyone, since everyone who can post to the list is a 
subscriber. Majordomo doesn't prevent delivery of duplicate emails when 
someone uses CC.)

On 2011-07-06 Dan Stromberg wrote:
> Is it safe to assume that lzma_stream_buffer_decode is the way to go
> for decompression, irrespective of whether one has used
> lzma_easy_buffer_encode or lzma_stream_buffer_encode to create the
> compressed input?

Yes. "easy" refers to the way the compression options are set. It 
doesn't affect the file format so there's no need to have an "easy" 
function for decompression.

> > It would be nice to have more tutorial programs for different use
> > cases, but so far I haven't written anything like that.
> 
> Maybe we can kill two birds with one stone here - since I'm
> prototyping in C, would you find it more useful if it were done as
> an example program, or as a unit test?

Example programs with good comments for every step can be used as 
tutorials.

The test suite is currently poor, so improving it would be welcome. 
But I don't think that improving the test suite helps improve the 
documentation.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH] “xzdiff a.xz b.xz” exit status should reflect whether the files differ

2011-07-28 Thread Lasse Collin
On 2011-07-26 Jonathan Nieder wrote:
> xzdiff was clobbering the exit status from diff in a case statement
> used to analyze the exit statuses from "xz" when its operands were
> two compressed files.  Save and restore diff's exit status to fix
> this.

The fix looks OK. The test suite addition needs minor changes.

> +temporaries="tmp_preimage.xz tmp_samepostimage.xz
> tmp_otherpostimage.xz"
> +rm -f $temporaries
> +trap "rm -f $temporaries" 0

I'm not sure how well "trap" behaves with ancient shells. You can use
the included test files instead of temp files:

"$srcdir/files/good-1-check-crc32.xz"
"$srcdir/files/good-1-check-crc64.xz"
"$srcdir/files/good-1-lzma2-1.xz"

> +PATH=$(pwd)/../src/xz:$PATH

Ancient pre-POSIX /bin/sh implementations don't support $(pwd) so
it's better to use `pwd` here. The same shells need a separate "export
PATH" to update the environment.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH] “xzdiff a.xz b.xz” exit status should reflect whether the files differ

2011-07-31 Thread Lasse Collin
On 2011-07-29 Jonathan Nieder wrote:
> +sh "$XZDIFF" "$preimage" "$samepostimage" >/dev/null

I missed this during the first round. It's not necessarily sh that ends
up as @POSIX_SHELL@ into the scripts, so it's possible that this will
use a different shell to run xzdiff than normal use of xzdiff would.

Ancient pre-POSIX /bin/sh doesn't run xzdiff and other scripts
correctly. That's why there's @POSIX_SHELL@ which gets replaced by
configure. (The test suite doesn't rely on @POSIX_SHELL@ so the test
scripts themselves still need to work with old shells.)

Solaris 10 is an example with a problematic /bin/sh. However, there's a
better sh in the PATH first. So maybe it isn't a problem in practice;
I didn't test now.

Even though the above might be a problem, I have committed the patches.
Thank you.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH] “xzdiff a.xz b.xz” exit status should reflect whether the files differ

2011-08-03 Thread Lasse Collin
On 2011-07-31 Jonathan Nieder wrote:
> Maybe it could make sense to
> teach the Makefile instead of the configure script to generate the
> scripts so they could be marked executable at build time and then used
> directly.

Maybe. The current way was used because it was the laziest and had a low
risk of new build system bugs.

I haven't merged your patch into v5.0 yet, because I haven't decided if
the test script is safe enough there. It would be annoying if someone
got a test failure in a stable release just because the test script uses
a wrong shell.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: [PATCH] “xzdiff a.xz b.xz” exit status should reflect whether the files differ

2011-08-06 Thread Lasse Collin
On 2011-08-03 Jonathan Nieder wrote:
> Makes sense.  Just for kicks, here's a try based on advice from
> <http://www.gnu.org/s/hello/manual/automake/Scripts.html>.  It
> probably makes more sense to use the "AC_CONFIG_FILE([src/my_script],
> [chmod +x src/my_script])" approach.

I used the AC_CONFIG_FILE approach. I guess I had missed that part from
the manual earlier. Thanks.

> > I haven't merged your patch to v5.0 yet. I haven't decided if the
> > test script is safe enough there yet. It would be annoying if
> > someone got a test failure in a stable release just because the
> > test script uses a wrong shell.
> 
> No need to hurry.  I don't mind if you merge the fix without the
> test. :)

Maybe it is OK now unless the chmod causes trouble on some not-so-POSIX
systems. :-)

I think 5.0.4 should be released before the end of this month.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: [PATCH] “xzdiff a.xz b.xz” exit status should reflect whether the files differ

2011-08-12 Thread Lasse Collin
On 2011-08-07 Jonathan Nieder wrote:
> On a completely unrelated note, I finally found time to start reading
> the xz-java implementation, and it's been very pleasant.  Thanks for
> writing it. :)

Thanks! I will need to do a little more Java coding, so new development
on XZ Utils will unfortunately need to wait a little more. The good news
is that the lessons learned while working on the Java code should help
with XZ Utils.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ for Java 0.4

2011-08-19 Thread Lasse Collin
XZ for Java 0.4 is now available:

http://tukaani.org/xz/java.html

There are some minor fixes to the old code. Support for random access
decompression is a new feature. Threading is missing but it's easier to
add it to this code than to liblzma in XZ Utils.

This is probably the last release before 1.0. I don't have anything
planned before 1.0 but maybe someone finds something that could be
improved. :-)

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Integration with TrueZIP

2011-09-12 Thread Lasse Collin
On 2011-09-12 Christian Schlichtherle wrote:
> I just found your website and would be interested to write a driver
> module for TAR.XZ files for TrueZIP (http://truezip.java.net). I
> wonder if anyone has already done this because I do not want to
> reinvent the wheel.

Probably not.

> I had a brief look at the code and I noticed that the 0.4
> distribution of XZ for Java contains more classes than what the
> online Javadoc has emitted. Is this required?

Classes that aren't part of the public API aren't documented in the API
docs. Non-public classes are needed by the public classes.

It tries to be a complete and pedantic implementation, not a
size-optimized implementation. So it's a bit bloated.

> For integration with TrueZIP, I would like to add XZ for Java to the
> Maven Central directory. Would anybody mind if I do this for you?

It would be nice if you could do it. Having the code in a Maven
repository would be useful also for Apache Commons Compress integration.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Integration with TrueZIP

2011-09-12 Thread Lasse Collin
On 2011-09-12 Christian Schlichtherle wrote:
> > It would be nice if you could do it. Having the code in a Maven
> > repository would be useful also for Apache Commons Compress
> > integration.
> 
> No problem. I could write a pom.xml for you. I could then either
> upload the generated artifact to Maven Central or you could do it. If
> you want me to do it, as a side effect I would take "ownership" with
> the groupId at oss.sonatype.org. If that's not what you want, you
> would need to sign up for an account at oss.sonatype.org and deploy
> the artifacts yourself. Up to you, of course.

Let's start with pom.xml. I'll decide later if I want to upload it
myself. Thanks.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Use of XZ/LZMA compression in the ZIP file format

2011-09-21 Thread Lasse Collin
On 2011-09-14 Christian Schlichtherle wrote:
> I am looking into integrating LZMA compression to my TrueZIP Driver
> Zip
> <http://truezip.java.net/truezip-driver/truezip-driver-zip/index.html>
> as explained in the ZIP File Format Specification
> <http://www.pkware.com/documents/casestudies/APPNOTE.TXT> . Now I
> wonder what I need to do to restrict the compression to LZMA-only,
> not LZMA2 or XZ, or if I should not restrict it at all because
> supporting method 14 (LZMA) may imply supporting LZMA2 or XZ, too.

Internally LZMA2 uses the same code as the original LZMA, but it needs
some work to adapt the LZMA2 Java code to do LZMA streams:

  - See the comment about RangeEncoder.cacheSize in the code.

  - You will need to modify RangeEncoder.shiftLow so that it writes
directly to an output stream (instead of to a buffer).

  - You need to add a function to write LZMA end of stream marker:
rc.encodeBit(isMatch[state.get()], posState, 1);
rc.encodeBit(isRep, state.get(), 0);
encodeMatch(0xFFFFFFFF, MATCH_LEN_MIN, posState);

  - A little code is needed to glue the components together.
LZMA2OutputStream does this for LZMA2. Note that LZMA
doesn't support flush() like LZMA2 does.

LZMA might be needed for Apache Commons Compress. I don't know yet.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Use of XZ/LZMA compression in the ZIP file format

2011-09-21 Thread Lasse Collin
On 2011-09-14 Jonathan Nieder wrote:
>  - The version information header refers to the version of Igor
>Pavlov's LZMA SDK used to compress (one byte major, one byte
>minor).  The LZMA SDK never used versions in the range [5, 8], so
>maybe some lie like "5.0" would be appropriate. ;-)

Maybe it would be better to fake a low SDK version. Maybe decompressors
check it and reject too big versions as unsupported. I'm guessing here,
I have no idea about the real-world implementations.

>  - I am not sure if the constraints on compression parameters
>mentioned at [1] would ever trip in decompressing ZIP files.
>Probably not in practice.  The spec doesn't mention them, alas.

The Java version doesn't support LZMA1. If it is adapted to support it,
there's no similar limit of lc + lp <= 4 as there is in liblzma
because in Java the arrays are allocated one by one.
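The lc + lp <= 4 limit in liblzma can be observed from Python's stdlib lzma module, which wraps liblzma (a sketch; the specific filter values are only illustrations):

```python
import lzma

# liblzma rejects LZMA2 options where lc + lp > 4; the Java
# implementation has no such limit because its probability arrays
# are allocated one by one instead of as a single block.
filters_ok = [{"id": lzma.FILTER_LZMA2, "preset": 6, "lc": 3, "lp": 1}]   # lc+lp = 4
filters_bad = [{"id": lzma.FILTER_LZMA2, "preset": 6, "lc": 3, "lp": 2}]  # lc+lp = 5

data = b"hello, world" * 10
assert lzma.decompress(lzma.compress(data, filters=filters_ok)) == data

try:
    lzma.compress(data, filters=filters_bad)
    rejected = False
except lzma.LZMAError:
    rejected = True
assert rejected  # liblzma reports an options error for lc + lp > 4
```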

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ for Java 1.0

2011-10-29 Thread Lasse Collin
XZ for Java 1.0 was released earlier this week:

http://tukaani.org/xz/java.html

The code is available also in Maven Central. The actual Java code
is identical to the version 0.4, but I made a new release to make
it clear that the code and API should now be stable.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Is the xz format stable?

2011-11-06 Thread Lasse Collin
On 2011-11-06 Tom Trauth wrote:
> I am trying to submit a patch to an open source project to add xz
> support to it, but before accepting it the maintainer wants me to get
> a promise from the xz developers that the xz format is now stable and
> will have no backwardly incompatible modifications in the future.

It is stable in the sense that new tools will always be able to
decompress old .xz files that have been created with a stable release
of XZ Utils.
It is possible and even somewhat likely that new features will be added
in the future which old programs won't support.

Compare to the .zip format. It has got support for new compression
methods and other features over the years, including LZMA support.
When maximum portability is needed, people stick to the Deflate
algorithm which all non-ancient .zip implementations support.

> But he
> apparently had a bad experience with the lzma format changing its
> format several times and therefore does not trust xz.

The old .lzma format hasn't changed since it was introduced in LZMA
SDK and also used by LZMA Utils. There were development versions of
the .xz format that also used the .lzma suffix, but no one has claimed
that those alpha versions would be stable. If someone has thought the
development versions were stable, it has been a major misunderstanding.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Is the xz format stable?

2011-11-06 Thread Lasse Collin
On 2011-11-06 Tom Trauth wrote:
> Given an xz file,
> is there a way to determine which version of the xz format it uses?
> Something like:
> 
> xz-get-version foo.xz --> foo.xz uses XZ format version 1.0.4

Right now there is no way to get a version number of the format.

I could make xz -lvv show the oldest XZ Utils version that will
decompress the file. It can only work for files that are supported by
the xz tool, so it's not possible to make an old xz tool display how
much newer an xz is required for a given file; the old tool could only
tell that it doesn't support the file. I don't know if this could be
good enough for you.

To understand the reason for the above, it's good to understand how
incompatible additions may happen:

(1) A new filter/method ID may be added into the official .xz format
specification. Old tools will show that there is an unsupported
filter ID and cannot decompress such files (will display an error).

(2) Third-party developers may use custom filter IDs which aren't in
the official specification and aren't supported by XZ Utils. If
they don't deviate from the .xz specification in any other way,
this is OK. Old tools cannot distinguish this situation from (1).

(3) A new .xz format specification may add new features to the
container format. The old tools will detect such files as
unsupported (they won't claim them to be corrupt). With old tools,
the difference to (1) and (2) is that the old tools won't be able
to list even the filter IDs.

If incompatible additions are made, the xz tool won't use them by
default. Maybe they might become a default after several years have
passed and old xz versions aren't common anymore. But it won't be done
easily, because it would make people angry if the default settings
created files that many wouldn't be able to decompress without extra
work.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Is the xz format stable?

2011-11-06 Thread Lasse Collin
On 2011-11-06 Tom Trauth wrote:
> This is what the maintainer wants, in his own words:
> We need a way to verify that a specific named version of xz, which
> might be older than the version someone has installed, will be able
> to uncompress a particular file.

When compressing with a future version of xz, don't enable incompatible
features. If you already have a .xz file and want to find out if an old
xz will support it too, there's no simple way right now, but I think I
will make xz -lvv show the minimum required XZ Utils version.

> His main fear is that someone will create an xz file with a version
> of xz which is newer than the version of xz that his project
> supports, and then his project will not be able to read that file.

Such a situation will probably be possible in the future if someone
enables an incompatible feature when compressing.

A comparable situation is technically possible even now because it is
possible to have a .xz file that requires 4 GiB of memory to
decompress, which is too much for many systems. No one creates such
files in practice though. :-)
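The memory-requirement side of this can be demonstrated with Python's stdlib lzma module (a liblzma wrapper); the 16 MiB dictionary and 1 MiB limit below are arbitrary illustration values:

```python
import lzma

# Compress with a deliberately large dictionary; a decoder must
# allocate at least a dictionary-sized buffer for this file.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 6,
            "dict_size": 16 * 1024 * 1024}]
blob = lzma.compress(b"example payload " * 64, filters=filters)

# A decoder with a too-low memory usage limit refuses the file...
try:
    lzma.decompress(blob, memlimit=1024 * 1024)
    limited = False
except lzma.LZMAError:
    limited = True
assert limited

# ...while a decoder without a limit handles it normally.
assert lzma.decompress(blob) == b"example payload " * 64
```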

If one wants to be extra safe, one could define what features and
memory usage are allowed to guarantee compatibility. This is already
required when creating files for XZ Embedded, which doesn't support
all .xz features. XZ Embedded is used e.g. in Linux as an option to
compress the kernel and initramfs and for Squashfs images.

> However, your last paragraph implies xz, including future versions,
> will create a file that is unreadable by current versions of xz
> unless special parameters are used, because to do otherwise would
> anger a lot of people.  Is that true?  If so, I think it will help
> allay his fears.

Assuming that you meant "will not create", yes, it is true with one
possible exception: If a new very nice but incompatible feature is added
in 2012, I might consider making that a default in 2018-2020. At that
point the old versions should have pretty much vanished. Even then it will
be possible to create files that are compatible with XZ Utils 5.0.0 by
using extra options.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Is the xz format stable?

2011-11-07 Thread Lasse Collin
On 2011-11-06 Tom Trauth wrote:
> From an e-mail from the maintainer, it looks like your proposal to
> add the minimum required version of xz-utils to the output of "xz
> -lvv" will be enough to meet his requirements.  He wants to be able
> to check which version of xz he needs, and this will enable him to do
> it.  Any idea which future version of xz-utils may contain this
> enhancement?

The feature is now available in the git repository. It will be in
5.1.2alpha, but I don't know when it will be released. It won't be in
5.0.x because I won't add any new features into a stable branch.

The info is also in xz -lvv --robot output so it should be easy to
parse. The idea of --robot is to make parsing simple and stable across
xz versions.

I didn't update the man page yet.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz startup time for small files

2011-11-28 Thread Lasse Collin
On 2011-11-28 Stefan Westerfeld wrote:
> Now the problem is that for those files I cannot predict the size.
> Often they will be quite small, but they also could be 100 MB in size
> or more. So I use xz -9 to get the best compression.
> 
> The problem is now that xz takes a lot of time to start:
> 
> stefan@ubuntu:/tmp$ time echo "foo" | xz -9 >/dev/null
> 
> real    0m0.155s
> user    0m0.052s
> sys     0m0.096s

The match finder hash table has to be initialized. It cannot be avoided.
The bigger the dictionary, the bigger the hash table. It's about 64 MiB
when using 64 MiB dictionary (xz -9). With 8 MiB dictionary (xz -6)
it's about 16 MiB. So at a lower setting the initialization is faster.

xz allocates much more memory for other things. Most of that memory
isn't initialized beforehand. Uninitialized memory doesn't cause a
significant speed penalty because many kernels don't physically back
large allocations until the memory is actually used.
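The setup cost is visible even through Python's stdlib lzma binding, which wraps liblzma. A rough sketch (timings vary by machine, so only correctness is asserted):

```python
import lzma
import time

payload = b"foo\n"  # tiny input: almost all the time goes to encoder setup

t0 = time.perf_counter()
out_low = lzma.compress(payload, preset=0)   # 256 KiB dictionary
t1 = time.perf_counter()
out_mid = lzma.compress(payload, preset=6)   # 8 MiB dictionary
t2 = time.perf_counter()

# Both outputs decompress to the same data; only the one-time
# initialization work (hash table size) differs between the presets.
assert lzma.decompress(out_low) == payload
assert lzma.decompress(out_mid) == payload
print(f"preset 0: {t1 - t0:.4f} s, preset 6: {t2 - t1:.4f} s")
```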

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] Memory usage limits again

2011-11-28 Thread Lasse Collin
Using the XZ_DEFAULTS environment variable to set default memory usage
limits isn't liked by everyone who wants to enable limits:

  - If you login interactively with ssh, the shell startup scripts are
executed and XZ_DEFAULTS will be set. But if ssh is used to run a
remote command (e.g. "ssh myhost myprogram"), the startup scripts
aren't read and XZ_DEFAULTS won't be there.

  - /etc/profile or equivalent usually isn't executed by initscripts
when starting daemons. Some daemons use xz.

  - People don't want to pollute the environment with variables that
affect only one program.

Having a configuration file would fix the above problems, but XZ Utils
is already an over-engineered pig, so I'm not so eager to add config
file support.

I have thought about adding configure options that would allow setting
default limits for compression and decompression. Someone may think
that it can confuse things even more, but on the other hand some people
already patch xz to have a limit for compression by default.


I haven't thought much about memory usage limits with threading, but
below are some preliminary thoughts.

With compression, -T0 in 5.1.1alpha sets the number of threads to match
the number of CPU cores. If no memory usage limit has been set, it may
end up using more memory than there is RAM. Pushing the system to swap
with threading is silly, because the point of threading in xz is to
make it faster. So it might make sense to have some kind of default
soft limit that is used to limit the number of threads when automatic
number of threads is requested.

With threaded decompression (not implemented yet) and no memory usage
limit, the worst case is that xz will try to read the whole input file
to memory, which is silly. So it probably will need some sort of soft
default limit to keep the maximum memory usage sane. The definition of
sane is unclear though. It's not necessarily the same as for
compression.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz startup time for small files

2011-11-28 Thread Lasse Collin
On 2011-11-28 Stefan Westerfeld wrote:
> Just a thought: could performance be improved if xz requested the
> memory via mmap(), like
> 
>   char *buffer = (char *) mmap (NULL, 64 * 1024 * 1024,
> PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> I wrote a little test program which seems to indicate that mmap() is
> much faster for getting zero initialized memory than malloc() +
> memset(). But thats for the case where the application does not
> access the memory. For xz the question is how much of the memory will
> be accessed, and how much not having to zero-initialize the memory
> will save.

With tiny input the memory won't be accessed much. With BT4 match
finder, it's one read and one write per uncompressed input byte. Each
read and write is a 32-bit integer. Since it's a hash table, it's
random access. There are actually three hash tables in BT4, which are
allocated at the same time, but the other two tables are small.

If you do a few thousand random 32-bit reads and writes, the mmap
method can still be faster, but the difference isn't as huge as your
test makes it look.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz startup time for small files

2011-11-28 Thread Lasse Collin
On 2011-11-28 Thorsten Glaser wrote:
> Stefan Westerfeld dixit:
> 
> >Just a thought: could performance be improved if xz requested the
> >memory via mmap(), like
> 
> No, because any self-respecting modern malloc(3) implementation
> uses mmap(2) internally, see omalloc for example. (That’s Otto
> Moerbeek’s last one, found e.g. in OpenBSD.)

The point was that with mmap you are guaranteed that the memory is
already zeroed (or will be zeroed when kernel does the physical
allocation). With malloc the contents of the memory are undefined.
There's also calloc, but with a quick and inaccurate test with glibc,
it doesn't seem faster than malloc+memset.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz startup time for small files

2011-11-28 Thread Lasse Collin
On 2011-11-28 Thorsten Glaser wrote:
> Lasse Collin dixit:
> If xz does indeed know it needs a zero’d allocation and
> can express that in page sizes (pretty non-portable),
> _and_ has fallback code for mmap-less architectures (e.g.
> several POSIX-for-Windows systems or ancient OSes) then
> sure. But I’d say, leave malloc speedups to the OS. Or
> the porter; they should know what they do.

I'm not interested in playing with mmap in liblzma.

> (calloc is indeed faster than malloc+memset here for
> large allocations. About 1750 vs. 20 milliseconds.)

Add a few thousand random reads and writes, which liblzma will do even
with small files. Maybe the calloc is so much faster because it just
mmaps memory and doesn't touch it, so the kernel doesn't physically
allocate and initialize it either.

I know that using calloc is the right way to get zeroed allocation. In
liblzma I have allocations and initializations separated, because it
allows reusing the existing allocations when (de)compressing many
streams. I could still use calloc and skip memset as a special case,
but currently I think it's not worth it at all.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz startup time for small files

2011-11-29 Thread Lasse Collin
On 2011-11-29 Stefan Westerfeld wrote:
> Of course it's your code base, and you can use mmap() or not; there
> are some performance gains, which can be bought with additional
> code complexity.

Your mmap test isn't very realistic because it doesn't touch the
allocated memory at all. Below is a modified version of your test
program. The constant 5000 simulates the input file size in bytes. The
variable x is there just in case, to ensure that the memory reads
aren't optimized away.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int
main (int argc, char **argv)
{
  unsigned int i;
  unsigned int x;
  unsigned int *buffer;
  assert (argc == 2);
  if (strcmp (argv[1], "malloc") == 0)
{
  buffer = malloc (64 * 1024 * 1024);
  memset (buffer, 0, 64 * 1024 * 1024);
}
  else
{
  buffer = mmap (NULL, 64 * 1024 * 1024,
 PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS,
 -1, 0);
}

  for (i = 0; i < 5000; ++i)
{
  x = buffer[rand() % (16 * 1024 * 1024)];
  buffer[rand() % (16 * 1024 * 1024)] = x + 1;
}

  return x & 1;
}

The mmap version will still be faster, but the difference isn't so
enormous anymore. If the mmap in the test program is replaced with
calloc, it's as slow as malloc+memset on GNU/Linux x86-64, but it may
very well be as fast as mmap on some other OS.

Creating a separate xz process for every file wastes even more time
than the hash table initialization. I tested with this on tmpfs:

TIMEFORMAT=%3R # To get times in seconds with bash
mkdir test
cd test
for I in {0..4999} ; do printf '%300s\n' $I > $I ; done

for OPT in -1e -6 -9 ; do
echo
echo $OPT
echo Separate processes:
time for I in * ; do xz -k $OPT $I ; done
rm *.xz
echo Single process:
time xz -k $OPT *
rm *.xz
done

My results (times are in seconds):

                     -1e    -6     -9
Separate processes:  39.3   49.8   146
Single process:       2.6   14.7    57

So even at -9, using a single xz process would help much more than
optimizing the hash table initialization. One way to do this could be
to use --files or --files0 option with xz. You would leave xz running
and give it new filenames via stdin.

It's good to note that combining the single-process approach and mmap
is not a good idea. If you compress multiple files, the memory won't be
reallocated for every file. Using memset to reset the old allocation is
faster than munmap+mmap as long as the memory will also be accessed at
least a little like it will be in xz.

It would be possible to use a small hash table at first, and switch to a
bigger one if the input size exceeds a predefined value. This would
probably have some speed penalty too, and the code wouldn't look so fun
either.

> But I think maybe it's better to take a step back and see what I was
> trying to do in the first place: compressing files which vary in
> size. From the documentation I've found that using levels bigger than
> -7, -8 and -9 doesn't change anything if the file is small enough. So
> I can do this:
> 
> def xz_level_for_file (filename):
>   size = os.path.getsize (filename)
>   if (size <= 8 * 1024 * 1024):
> return "-6"
>   if (size <= 16 * 1024 * 1024):
> return "-7"
>   if (size <= 32 * 1024 * 1024):
> return "-8"
>   return "-9"
> 
> in my code before calling xz. This will get around initializing the
> 64M of memory for small files, and results in quite a bit of a
> performance improvement in my test (probably even more than using
> mmap).
> 
> It would still be cool if xz could do this automatically (or do it
> with a special option) so that not every xz user needs to adapt the
> compression settings according to the file size. Basically, it could
> detect the file size and adjust the compression level downwards if
> that will not produce worse results.

Adding an option to do this shouldn't be too hard. I added it to my
to-do list.

At one point it was considered to enable such a feature by default. I'm
not sure if it is good as a default, because then compressing the same
data from stdin will produce different output than when the input size
is known. Usually this doesn't matter, but sometimes it does, so if it
is a default, there probably needs to be an option to disable it.
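The size-based preset choice quoted above can be sketched with Python's stdlib lzma module (xz presets 6-9 differ mainly in dictionary size: 8, 16, 32 and 64 MiB, and a dictionary larger than the input cannot improve the ratio):

```python
import lzma

def preset_for_size(size):
    # Mirrors the heuristic from the message: pick the smallest preset
    # whose dictionary already covers the whole input.
    if size <= 8 * 1024 * 1024:
        return 6
    if size <= 16 * 1024 * 1024:
        return 7
    if size <= 32 * 1024 * 1024:
        return 8
    return 9

data = b"small file contents\n" * 100
preset = preset_for_size(len(data))
assert preset == 6  # a 2000-byte input fits the 8 MiB dictionary easily

compressed = lzma.compress(data, preset=preset)
assert lzma.decompress(compressed) == data
```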

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Memory usage limits again

2011-12-01 Thread Lasse Collin
On 2011-11-29 Jonathan Nieder wrote:
> Lasse Collin wrote:
> 
> > With compression, -T0 in 5.1.1alpha sets the number of threads to
> > match the number of CPU cores. If no memory usage limit has been
> > set, it may end up using more memory than there is RAM. Pushing the
> > system to swap with threading is silly, because the point of
> > threading in xz is to make it faster. So it might make sense to
> > have some kind of default soft limit that is used to limit the
> > number of threads when automatic number of threads is requested.
> 
> How about something like this patch, to start?  With it applied, I am
> happy using
> 
>   XZ_DEFAULTS='--no-adjust --threads=0 --memlimit=1080MiB

After the patch, --no-adjust doesn't prevent auto-adjusting the number
of threads. It would only prevent auto-adjustments of LZMA2 dictionary
size. I'm not sure if I like this or not.

I guess that your idea is to use --no-adjust to catch situations where
the specified settings are too high even for single-threaded operation,
and use more than one thread when the memory limit allows.

I don't know if someone would want to use --no-adjust to prevent xz
from scaling down the number of threads. Maybe your use case is more
likely.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] lzma orig file and decompressed file are different.

2011-12-12 Thread Lasse Collin
On 2011-12-09 stompdagg...@yahoo.com wrote:
> I'm writing a c++ program that takes a sql text file compressed in
> lzma (created by running lzma -9 ) and unlzma it. but for some
> reason the result file is different, see here:
> http://paste.pocoo.org/show/518724/ one of the lines is cut and a
> char has been added to some other ones. the relevant code can be
> found here: http://paste.pocoo.org/show/518725/

Maybe the problem happens if the output buffer is filled partially and
the input buffer becomes empty. After getting more input, the contents
of the partially filled output buffer are lost.

There is also a problem that the code doesn't check that the decoding
ends with LZMA_STREAM_END. This means that you won't detect if the file
is truncated.

The code may read past the end of the output buffer (dDataArr)
when writing the data to the file on line 38. It could be better to
omit the memset on line 27 and use mystream.write instead of operator<<.
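The first pitfall can be sketched with Python's stdlib lzma module (a liblzma wrapper; the 16-byte input buffer is only an illustration): write out whatever the decoder produced after every call, before refilling the input buffer.

```python
import io
import lzma

blob = lzma.compress(b"line of text\n" * 200)
src = io.BytesIO(blob)
dst = io.BytesIO()

d = lzma.LZMADecompressor()
while not d.eof:
    piece = src.read(16)       # small input buffer, refilled each round
    out = d.decompress(piece)  # may be empty or only a partial chunk
    dst.write(out)             # flush it now -- losing this step is the
                               # bug described above

assert dst.getvalue() == b"line of text\n" * 200
assert d.eof  # end of stream reached, so the input was not truncated
```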

> OT: the list's registering procedure is very unclear. it would be
> wise to make the return mails clearer, as it took me a few mails to
> get registered and understand that I'm registered. I'm registered to
> a few other lists so I can say it can be done.

I'm sorry to hear that. I cannot affect how the actual subscribing is
done over email. My hosting provider has Majordomo and that's what I
have to use (or move the list elsewhere). I can improve the
instructions on tukaani.org web site if you can explain what should be
clarified. Thanks.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] lzma orig file and decompressed file are different.

2011-12-13 Thread Lasse Collin
On 2011-12-12 stompdagg...@yahoo.com wrote:
> in regards to the registration, the main issue is that the return
> mails are not clear. for example, when I sent the confirmation, the
> answer was not clear about whether it was successful, so I had to
> resend it; only then did I scan the return mail and find out that
> the original answer had confirmed my registration.

Unfortunately I cannot change it without moving the list elsewhere and
to another mailing list software.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] lzma orig file and decompressed file are different.

2011-12-15 Thread Lasse Collin
On 2011-12-13 stompdagg...@yahoo.com wrote:
> back to topic, I've taken the pipe decompress example and started
> modifying it, when I got to read from file, decompress and write file
> using c functions it worked but when I changed it to c++ stream
> handling see here: http://dpaste.com/673199/ the output file is
> identical to the original but I get error code 10.
> 
> how is that possible?

I'm not sure. Using in.readsome looks suspicious. You may want to use
in.read and in.gcount instead.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] lzma orig file and decompressed file are different.

2011-12-16 Thread Lasse Collin
On 2011-12-16 stompdagg...@yahoo.com wrote:
> yes! problem solved, the code can be viewed at
> http://gitorious.org/open-source-soccer-manager/ossm/blobs/master/src/Utilities/Utils-General.cpp#line240

Good that it works. There's still a bug that it doesn't detect
truncated files. You need to check that lzma_code has returned
LZMA_STREAM_END to know that the end of the file was reached
successfully.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] lzma orig file and decompressed file are different.

2011-12-18 Thread Lasse Collin
On 2011-12-16 stompdagg...@yahoo.com wrote:
> in that case, what action should I take?

Simply check that the last call to lzma_code has returned
LZMA_STREAM_END. If it returned LZMA_OK, the decoder didn't decode the
last bytes of the file and thus the file was truncated or otherwise
corrupt.
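In Python's stdlib lzma module (a liblzma wrapper), the equivalent of checking for LZMA_STREAM_END is checking the decompressor's eof flag; truncating a stream raises no error, so only the missing end-of-stream state reveals the problem:

```python
import lzma

blob = lzma.compress(b"example data " * 100)

# Complete stream: the decoder reaches the end-of-stream state.
d = lzma.LZMADecompressor()
full = d.decompress(blob)
assert d.eof                 # like lzma_code() returning LZMA_STREAM_END
assert full == b"example data " * 100

# Truncated stream: every byte is consumed without any error being
# raised, exactly the silent-failure mode described in the message.
d = lzma.LZMADecompressor()
d.decompress(blob[:-20])
assert not d.eof             # like lzma_code() returning only LZMA_OK
```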

I see you copied the bug from xz_pipe_decomp.c. I'm sorry about that. I
should have reviewed the example programs more carefully before
accepting them.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Memory requirement for linux kernel compression/decompression

2012-02-21 Thread Lasse Collin
On 2012-02-21 Gilles Espinasse wrote:
> On 2.6.32, only the lzma option is available and by default the
> kernel uses lzma -9, which requires much more memory than needed. As
> one compiler of my distrib reported a compilation breakage on a 512
> MB VM during kernel compression, I started hacking
> scripts/Makefile.lib, removed -9 and added -vv. I then played with
> the information displayed during compression to adjust the xz memory
> requirement.

lzma -9 from LZMA Utils uses 32 MiB dictionary and requires 311 MiB of
memory. xz -9 uses 64 MiB dictionary and requires 674 MiB of memory.
The lzma emulation in xz uses the same presets as xz, so lzma -9 from
XZ Utils needs 674 MiB of memory. So the emulation isn't very good,
although by default both XZ Utils and LZMA Utils use an 8 MiB dictionary.

Using a dictionary bigger than the uncompressed file is a waste of
memory. So if the kernel image is small, switching to a much smaller
dictionary doesn't affect compression ratio.

> Should not a patch be pushed on LKLM to at least remove the -9 part?

I don't know. If -9 is removed, then a kernel bigger than 8 MiB may
compress worse than it does now.

The -9 was probably put there before XZ Utils had taken over from LZMA
Utils, so the memory usage was much lower then. Using a high setting is
fine from the decompression point of view, because in the specific case of
kernel decompression the dictionary size doesn't affect the
decompressor memory usage. So from that point of view it is fine to use
a high setting "just in case".

scripts/xz_wrap.sh uses 32 MiB dictionary (370 MiB memory) to compress
a kernel image with xz. Maybe that would work on 512 MiB VMs but it can
still be a bit annoying on them.

An alternative to local patching is to set a memory usage limit for xz
when compiling the kernel:

$ XZ_OPT=-M100MiB make

xz will then scale down the dictionary size. It does it also when
emulating lzma.

> Secondly, could I trust the decompression memory requirement
> displayed by xz?

It can be trusted but:

  - It's rounded *up* to the nearest MiB, so it's not very precise
when memory requirements are low. This could be fixed since a
more accurate number is known internally already.

  - The number assumes that the decompressor needs to allocate
a separate dictionary buffer. This isn't always the case.
Linux kernel decompression doesn't need a dictionary buffer
but initramfs and initrd decompression does.

> Is the kernel decompressor really requiring the same
> memory size that xz displays during compression?

No. Kernel decompression with an XZ-compressed kernel requires about
30 KiB of memory. The dictionary size doesn't matter because the output
buffer is used as the dictionary buffer. This is done even when a BCJ
filter is used.

I think with an LZMA-compressed kernel the memory usage is very similar
to XZ.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Oddities with --lzma2 options

2012-03-08 Thread Lasse Collin
On 2012-03-05 Gilles Espinasse wrote:
> I find strange here that with a dictionary size even a bit bigger
> than with bare -8e the compressed file is a bit bigger.

This can sometimes happen, but it shouldn't be the common case. A bigger
dictionary might allow encoding some sections of the file better, but
it can cause the internal state to be less optimal for other sections
of the file. So with bad luck a bigger dictionary gives a tiny bit
bigger output.

> Trying to add -e doesn't change the result (and time to compress) when
> nice=273 depth=512 are set.

If you specify something like

xz --lzma2 -e

the -e option is ignored. The -e only affects the presets -0 ... -9. If
you want to take -8e as the starting point and then adjust the
dictionary size, use this:

xz --lzma2=preset=8e,dict=${DICT}KiB
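For instance (values illustrative), this starts from the 8e settings and shrinks only the dictionary, whereas a bare "xz --lzma2 -e" would silently drop the -e:

```shell
# preset=8e sets all LZMA2 options to the -8e values; dict= then
# overrides just the dictionary size.
cd "$(mktemp -d)"
head -c 262144 /dev/zero > data.bin
xz -kc --lzma2=preset=8e,dict=1MiB data.bin > data.xz
xz -t data.xz   # integrity test passes
```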

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] The right data for Embedded XZ?

2012-03-31 Thread Lasse Collin
On 2012-03-28 Mike Melanson wrote:
> *However!* Studying the source code in that directory demonstrated
> what was wrong in my own sample app-- I need to call xz_crc32_init()
> before the other functions. I see that's mentioned near the end of
> xz.h; perhaps it warrants an earlier mention.

XZ Embedded has been written primarily for the Linux kernel and there
you don't need xz_crc32_init() except in decompress_unxz.c. So in the
Linux context it's better to keep the xz_crc32_init() docs at the end of
xz.h. I'm sorry that you didn't notice this or the existence of
xzminidec earlier. I added a reference to xzminidec.c to README.

> Anyway, I got past that problem. It should be noted that
> "--check=crc32" really is necessary for compressed data if Embedded
> XZ will be chewing on it.

Yes. This is mentioned in linux/Documentation/xz.txt. The existence
of this file is pointed out in the README.

> The library returns XZ_OPTIONS_ERROR otherwise, but only on the first
> call. If you call xz_dec_run() again, decoding will proceed fine (and
> accurately). I figured this out when I made a mistake in my decode
> loop and didn't terminate on error.

Calling xz_dec_run() again after XZ_OPTIONS_ERROR leads to undefined
behavior, in the sense that I haven't thought about what will happen. liblzma
would keep returning the same error code if one calls lzma_code() again
after an error, but in XZ Embedded I skipped such things to make the
code smaller.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] The right data for Embedded XZ?

2012-03-31 Thread Lasse Collin
On 2012-03-29 Thorsten Glaser wrote:
> Mike Melanson dixit:
> 
> >gcc -std=gnu89 -I../linux/include/linux -I. -DXZ_DEC_X86
>  ^^
> You probably want -std=gnu99 here.

gnu89 should work (gnu99 should work too). The Linux kernel is compiled
with gnu89 so XZ Embedded needs to conform to that too.

> >-DXZ_DEC_IA64 -DXZ_DEC_ARM -DXZ_DEC_ARMTHUMB -DXZ_DEC_SPARC
> >-DXZ_DEC_ANY_CHECK -ggdb3 -O2 -pedantic -Wall -Wextra -c -o
> >boottest.o boottest.c
> >In file included from ../linux/lib/decompress_unxz.c:235:0,
> > from boottest.c:22:
> >../linux/lib/xz/xz_dec_lzma2.c: In function ‘xz_dec_lzma2_run’:
> >/usr/include/bits/string3.h:56:1: sorry, unimplemented: inlining
> >failed in call to ‘memmove’: redefined extern inline functions are
> >not considered for inlining
> 
> Yes well, that will of course break.

It works for me but I'm not sure why. In boottest.c I want to use the
memmove() and other functions from decompress_unxz.c instead of libc.
Maybe it would be enough to avoid <string.h> in boottest.c and replace
strcmp() with something else.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] when a stable version with multithreaded compression support will be available?

2012-05-29 Thread Lasse Collin
On 2012-05-29 valentin wrote:
> I'm working on yoctoproject (www.yoctoproject.org) and xz is used
> pretty much by the build system. I would like to upgrade the xz
> package to the development version (5.1.1alpha) for its support for
> multithreaded compression to speed up some tasks. I know this version
> is not stable and I'm asking if someone knows when the stable version
> will be available?

I don't know. A few days ago I started working on getting 5.0.4 ready
to be released. Before this it has been quiet for several months.

5.1.1alpha has a few annoying bugs that have been fixed later in the
git repository. I might release 5.1.2alpha around the same time with
5.0.4. There are no known critical bugs (e.g. data corruption) but
getting the code to beta or stable will still take more work.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] keep and hard links

2012-06-11 Thread Lasse Collin
On 2012-06-10 Ariel wrote:
> xz won't compress a file if it has hard links, even if --keep is 
> specified.
> 
> I think this should be changed since if the file is not being deleted
> hard links don't matter.

You aren't the first one requesting this change. I'm not sure if it is
safe to change it. In theory someone could rely on the current feature
although that doesn't sound so likely.

Currently --force does what you request, but --force also makes xz
overwrite existing files, which you might not want. If overwriting
isn't an issue, then --keep --force does exactly what you want.
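A quick illustration of the current behavior (file names are hypothetical):

```shell
# With only -k, xz skips a file that has multiple hard links;
# adding -f makes it compress the file while still keeping it.
cd "$(mktemp -d)"
echo hello > f
ln f f2            # f now has two hard links
xz -k f || true    # refused: "Input file has more than one hard link"
test ! -e f.xz
xz -kf f           # --keep --force compresses it anyway
test -e f && test -e f.xz
```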

If --keep is modified, I think it should also allow (de)compression of
symlinks and setuid, setgid, and sticky files. This way it would match
what --force does.

I would like to hear what people think about this.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] keep and hard links

2012-06-14 Thread Lasse Collin
On 2012-06-11 Christoph Biedl wrote:
> A while ago I considered asking for a few --force-* options that allow
> finer control about what things are acceptable that usually are not.
> Their names would be something like --force-overwrite,
> --force-symlink, --force-links and so on. That would allow overriding
> xz's sane defaults in certain aspects only without using --force and
> doing something potentially really harmful and undesired.

It's not hard to add these, but I'm unsure how useful these would be in
practice. Maybe you had some specific use case in mind.

Anyway, special --force-foo switches aren't the answer to the original
question of what --keep should do. If you need it often, typing such
long options on the command line isn't practical when "xz -k foo" or
"xz -kf foo" could be enough.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] Example program bug fix and new example programs

2012-06-14 Thread Lasse Collin
A bug was fixed in doc/examples/xz_pipe_decomp.c. It didn't detect
truncated files.

I recently wrote new example programs that have more comments. I moved
xz_pipe_comp.c and xz_pipe_decomp.c to doc/examples_old so that new
example programs can be put into doc/examples. I will keep the old
programs for now in examples_old. If someone has copied the structure
from xz_pipe_decomp.c he can then see how to easily fix the bug.

I would like to get feedback about the new example programs. They are
now in the master branch in the git repository. Gitweb:


http://git.tukaani.org/?p=xz.git;a=commit;h=3a0c5378abefaf86aa39a62a7c9682bdb21568a1

I would like to include them into 5.0.4, so if there's something wrong
with them, I would like to hear about it soon.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.0.4

2012-06-22 Thread Lasse Collin
XZ Utils 5.0.4 is available at <http://tukaani.org/xz/>. Here is an 
extract from the NEWS file:

  * liblzma:

  - Fix lzma_index_init(). It could crash if memory allocation
failed.

  - Fix the possibility of an incorrect LZMA_BUF_ERROR when a BCJ
filter is used and the application only provides exactly as
much output space as is the uncompressed size of the file.

  - Fix a bug in doc/examples_old/xz_pipe_decompress.c. It didn't
check if the last call to lzma_code() really returned
LZMA_STREAM_END, which made the program think that truncated
files are valid.

  - New example programs in doc/examples (old programs are now in
doc/examples_old). These have more comments and more detailed
error handling.

  * Fix "xz -lvv foo.xz". It could crash on some corrupted files.

  * Fix output of "xz --robot -lv" and "xz --robot -lvv" which
incorrectly printed the filename also in the "foo (x/x)" format.

  * Fix exit status of "xzdiff foo.xz bar.xz".

  * Fix exit status of "xzgrep foo binary_file".

  * Fix portability to EBCDIC systems.

  * Fix a configure issue on AIX with the XL C compiler. See INSTALL
for details.

  * Update French, German, Italian, and Polish translations.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] next development release

2012-06-28 Thread Lasse Collin
On 2012-06-28 Denis Excoffier wrote:
> Is a xz-5.1.2alpha (or xz-5.1.1beta) release planned soon?

5.1.2alpha will be released soon.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.1.2alpha

2012-07-04 Thread Lasse Collin
XZ Utils 5.1.2alpha is available at <http://tukaani.org/xz/>. Here is
an extract from the NEWS file:

  * All fixes from 5.0.3 and 5.0.4

  * liblzma:

  - Fixed a deadlock and an invalid free() in the threaded
encoder.

  - Added support for symbol versioning. It is enabled by default
on GNU/Linux, other GNU-based systems, and FreeBSD.

  - Use SHA-256 implementation from the operating system if one is
available in libc, libmd, or libutil. liblzma won't use e.g.
OpenSSL or libgcrypt to avoid introducing new dependencies.

  - Fixed liblzma.pc for static linking.

  - Fixed a few portability bugs.

  * xz --decompress --single-stream now fixes the input position after
successful decompression. Now the following works:

echo foo | xz > foo.xz
echo bar | xz >> foo.xz
( xz -dc --single-stream ; xz -dc --single-stream ) < foo.xz

Note that it doesn't work if the input is not seekable
or if there is Stream Padding between the concatenated
.xz Streams.

  * xz -lvv now shows the minimum xz version that is required to
decompress the file. Currently it is 5.0.0 for all supported .xz
files, except that files with empty LZMA2 streams require 5.0.2.

  * Added an *incomplete* implementation of --block-list=SIZES to xz.
It only works correctly in single-threaded mode and when
--block-size isn't used at the same time. --block-list allows
specifying the sizes of Blocks which can be useful e.g. when
creating files for random-access reading.
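A minimal sketch of --block-list in single-threaded mode (sizes are made up; this requires a build that has the option):

```shell
# Each listed size caps the uncompressed data per Block; input left
# over after the list goes into further Block(s). The result is an
# ordinary .xz file that decompresses normally.
cd "$(mktemp -d)"
head -c 200000 /dev/urandom > data.bin
xz -kc --block-list=65536,65536,65536 data.bin > data.xz
xz -dc data.xz | cmp - data.bin
```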

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ for Java 1.1

2012-07-04 Thread Lasse Collin
XZ for Java 1.1 is available at <http://tukaani.org/xz/java.html> and
in the Maven Central (groupId = org.tukaani, artifactId = xz). Here is
an extract from the NEWS file:

  * The depthLimit argument in the LZMA2Options constructor is
no longer ignored.

  * LZMA2Options() can no longer throw UnsupportedOptionsException.

  * Fix bugs in the preset dictionary support in the LZMA2 encoder.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Wrong content of sources JAR on Maven Central

2012-07-16 Thread Lasse Collin
Sorry about the slow reply.

On 2012-07-08 Christian Schlichtherle wrote:
> The sources JAR on Maven Central seems to contain a copy of the source
> repository so that you could rebuild XZ 1.1 from it. However, this is
> not what should be in there. A sources JAR is not meant to be used for
> rebuilding the release. Instead, it should exactly match the
> directory tree of the classes JAR so that tools like an IDE can look
> up the sources by substituting the .class suffix with .java. So the
> sources JAR should just contain a top directory with the name org
> which contains the rest of the package structure for XZ 1.1

OK, thanks for reporting the bug. I didn't know this and I still don't
know where it is documented. I have hopefully fixed it in the git
repository now. I assume that -sources.jar doesn't require any manifest.

> If you want to make the source code available for rebuilding, then
> this is better done by providing an online source code repository
> (it's Git for XZ, isn't it) with a special tag for the release, say
> "xz-1.1".

Right. There is a source .zip on tukaani.org and releases have been
tagged in the git repository.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] cache-aware match finder for blocks of 2**17 bytes

2012-10-11 Thread Lasse Collin
On 2012-10-09 John Reiser wrote:
> I'm interested in speeding up compression for mksquashfs, which uses
> independent blocks of input length 2**17 bytes.  I have in mind a
> specialized match finder which would take advantage of the small
> fixed block size, and tailor its memory usage to the common L2 cache
> size of 256KB.  Is anyone else looking into this?

I'm not aware of anyone working on something like this.

I think one needs to modify more than the match finder to fit all data
structures into 256 KiB. For example, the dictionary buffer has some
fixed extra size to prevent too frequent memmove calls. Even then
256 KiB might be hard to achieve without affecting compression ratio
much. You may need to use mode=fast since mode=normal uses slightly more
memory. Maybe that is OK for you since you are looking for fast
compression anyway.
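Until such a match finder exists, the existing options can at least shrink the encoder's footprint for 128 KiB blocks (the settings below are illustrative, not a tuned recommendation):

```shell
# dict=128KiB matches the block size; mode=fast and a hash-chain
# match finder (hc3) keep the encoder's data structures small.
cd "$(mktemp -d)"
head -c 131072 /dev/urandom > block.bin
xz -kc --lzma2=dict=128KiB,mode=fast,mf=hc3,nice=32 block.bin > block.xz
xz -dc block.xz | cmp - block.bin
```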

If you have trouble reading the code, see also XZ for Java, which I
think is currently the most readable version. (I'm not suggesting that
you should use the Java code, I just mean that it might help
understanding liblzma.) liblzma should be made more readable too.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] random-access reading and the "--block-size" option

2012-11-16 Thread Lasse Collin
On 2012-11-14 Jack Duston wrote:
> Given the time involved in compressing and the quantity of data, I am
> hesitant to use the 5.1.2alpha code.
> I am paying attention when your web page says it should be considered
> unstable!

It is good to be cautious with unstable releases.

There are people using 5.1.2alpha and I haven't got bug reports. I'm
not aware of any data corruption bugs. So in this particular case I
think it's not too dangerous to use the development version. You get
threading in addition to the --block-size option.

Another option is to use for example pixz:

https://github.com/vasi/pixz

XZ Utils 5.1.2alpha and pixz both can create a single .xz stream that
contains many blocks, and thus make random-access reading possible. I
haven't actually used pixz myself so I cannot say anything else about it.

> I see in the Release Notes that the "--block-size" option was added
> to the April, 2011 alpha release, and we are fast approaching 2013.
> I don't know how complex the code change is, or if it goes against
> your release policy, but would you consider back-porting the
> "--block-size" option to a 5.0.5 Stable Release?  I surely can't be
> the only one who would love to make use of the option.

I don't like to add any new features in a stable branch. I'm doing this
to (hopefully) make it easier for downstream distributions to include
bug fixes in stable distributions where the distro maintainers want
only bug/security fixes to minimize the risk of new bugs.

Adding --block-size isn't a huge patch, but in this particular case I
think it should be safe to try 5.1.2alpha.

> My ultimate end is to incorporate the XZ library or Embedded into our 
> application to search and read the compressed files directly.
> In any case, thanks again for all the work you've put into xz, I will
> be compressing with your utility either way.

Random access can be done with liblzma, but the provided APIs are too
low level to make it nice to use. See src/xz/list.c in XZ Utils for what
kind of things you need to do. There is an old plan to have a file I/O
library that makes things easy for the most common use cases. I even
started writing it long ago but didn't get very far.

There is random access support in XZ for Java, but I guess it doesn't
help you.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH/RFC] xzless: Make "less -V" parsing more robust

2012-11-21 Thread Lasse Collin
On 2012-11-19 Jonathan Nieder wrote:
> In v4.999.9beta~30 (xzless: Support compressed standard input,
> 2009-08-09), xzless learned to parse ‘less -V’ output to figure out
> whether less is new enough to handle $LESSOPEN settings starting
> with “|-”.  That worked well for a while, but the version string from
> ‘less’ versions 448 (June, 2012) is misparsed, producing a warning:
[...]

Thanks for the patch. I have committed it as is.

> The implementation uses "awk" for simplicity.  Hopefully that’s
> portable enough.

I guess it is.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH] Add manifest attributes required by OSGi

2013-01-07 Thread Lasse Collin
On 2013-01-04 Mikolaj Izdebski wrote:
> For xz-java to be usable as an OSGi bundle certain attributes
> required by the OSGi specification need to be present in the
> manifest.

K. Daniel visited #tukaani on Thursday but I got online five minutes too
late so I couldn't reply. Then I got an email from Stefan Bodewig
(Apache Commons Compress developer) on the same day about the same
subject. The next days I was busy, so I'm sorry that I didn't reply
earlier.

> The above patch was applied[1] in Fedora GNU/Linux distribution and it
> was tested by Fedora developers. It would be nice if OSGi manifests
> were included in upstream xz-java too.

In addition to Bundle-SymbolicName, Bundle-Version, and Export-Package,
Stefan Bodewig suggested adding Bundle-ManifestVersion, Bundle-Name,
and Bundle-DocURL. I don't have much clue about any OSGi stuff, but I
checked the OSGi wiki and these sounded reasonable.

> [1]
> http://pkgs.fedoraproject.org/cgit/xz-java.git/commit/?id=cd63efa72e4150b6303f995e829b31465bfcd6e9

Note that the committed patch hardcodes the bundle version number
while the patch you included in your email doesn't. On the other hand,
I think it is OK to hardcode "org.tukaani.xz" in build.xml.

I committed a patch:


http://git.tukaani.org/?p=xz-java.git;a=commitdiff;h=101303b7e10e9618a83a06b1bfccd0b87e33acd6

Please let me know if you think it has a problem. Maybe I should make a
new release some day to include this and xz-x.x-sources.jar fixes even
though no actual Java code has been changed after 1.1.

Thanks!

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH] Add manifest attributes required by OSGi

2013-01-15 Thread Lasse Collin
I got an email about a small optimization and I want to include it in
the next release. I will make a new release once the reporter has
confirmed that the patch makes a difference. If anyone is interested, I
committed the patch already:

http://git.tukaani.org/?p=xz-java.git;a=commitdiff;h=ec224582e44776a53874346dfe703dbbfcc6bd15

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ for Java 1.2

2013-01-29 Thread Lasse Collin
XZ for Java 1.2 is available at <http://tukaani.org/xz/java.html> and
in the Maven Central (groupId = org.tukaani, artifactId = xz). Here is
an extract from the NEWS file:

  * Use fields instead of reallocating frequently-needed temporary
objects in the LZMA encoder.

  * Fix the contents of xz-${version}-sources.jar.

  * Add OSGi attributes to xz.jar.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH] xzless: There is no need to call awk for this.

2013-03-05 Thread Lasse Collin
Thanks. Committed.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: cache-aware match finder for blocks of 2**17 bytes

2013-03-25 Thread Lasse Collin
On 2013-03-18 John Reiser wrote:
> I've got my specialized match-finder working (minimize cache misses
> while processing blocks of size 2**17.)  Now I find that it is slow,
> mainly because it finds *all* matches, even those that "obviously"
> are not good candidates for encoding.
> 
> It seems to me that there are two missing parameters to the match
> finder:
> 
> 1) the current four offsets which have very low encoding cost (many
> bits less than other nearby offsets)

As the first step you could add some hack to pass that information to
the match finder. In the fast mode (lzma_encoder_optimum_fast.c) this
should be relatively straightforward. In the normal mode (_normal.c)
one needs to modify the code more to update the reps in the opts array
earlier, but the code is ugly. It should be cleaned up some day (e.g. in
XZ for Java the equivalent code is much nicer). So for now it may be
better to test with the fast mode code only.

> 2) for each of the four match lengths 2,3,4, and 5: the maximum offset
>that yields a savings when encoded, but ignoring the possibility of
>special savings due to using one of the four most-recent offsets.
>For instance, gzip won't even consider any offset greater than 4096
>("TOO_FAR") for gzip's minimum match length of 3.

I don't have a complete answer for that at the moment. In fast mode,
distances longer than 128 bytes are ignored when length is 2 bytes. In
other cases it's more complicated.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: cache-aware match finder for blocks of 2**17 bytes

2013-04-04 Thread Lasse Collin
I'm sorry that I'm so slow at replying.

On 2013-03-29 John Reiser wrote:
> Below is a summary of encoding costs that I gleaned by inspecting the
> code. Notable:
>   A match of length 3 is no shorter than 3 literal bytes when
> 64K<=offset. A match of length 1 at rep0==offset is an important
> special case.

Your calculations look correct to me. They are useful as is in the fast
mode, but in the normal mode it's not so simple. There one shouldn't
make such simplified assumptions about the costs.

> LZMA encoding costs (in bits) before RangeEncoder (after
> RangeDecoder.) RangeEncoder often reduces the cost in bits, but it
> depends on history and is difficult to compute.  [On average is it a
> small constant factor?]

A key thing in the normal mode (compared to the fast mode) is that
the algorithm takes into account the costs after the range encoding (the
code uses the term "price"). Up to 4 KiB of uncompressed data is
analyzed at a time. The cheapest combination of LZMA symbols to
represent the analyzed range as a whole is chosen. To speed it up, some
things are cached in lookup tables that are updated only now and then.
This means that the calculation isn't always done with the exact prices
as the real prices drift away from the cached values.

The price of a symbol depends on the alignment in the uncompressed data
via pos_state (position bits (pb) setting). Alignment also affects
literal encoding via literal position bits (lp), but that is usually
zero to indicate one-byte alignment.

The price of a symbol also depends on the previous two or three
LZMA symbols(*) via the "state" variable. For example, if the situation
"the previous symbols were a normal match and a literal, and the current
position % pos_state == 3" has occurred several times earlier and in
most cases the next symbol has been for example a repeated match,
encoding a repeated match in such a situation has become a little bit
cheaper than it would be in the base state. This may make the encoder
choose, for example, a shorter repeated match over a longer normal match
in the same situation in the future. (Not a great example but you get
the idea, I hope.)

(*) lzma_common.h seems to talk about events, but later I've switched
to using the term "LZMA symbol" or plain "symbol" to mean a
literal, normal match, or repeated match. There are variables
named "symbol" too, but I don't speak about those in this email.
Some code cleanup would be good to do. :-|

If you want to read the normal mode code, see LZMAEncoderNormal.java
and other files in XZ for Java. Those are way more readable than the
equivalent code currently in XZ Utils even if you had never seen Java
code before.

Your original question was about what kind of "obviously bad" matches a
match finder could throw away. The results you calculated might be
useful for the fast mode, but using those for the normal mode may harm
the compression ratio. Maybe with some safety margins it would work
well, but that is purely a guess. You could try different values or you
could even analyze what kind of symbol combinations the encoder
currently creates.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xzgrep and '-h' option

2013-04-05 Thread Lasse Collin
On 2013-04-03 Pavel Raiskup wrote:
> Hi all, would you please consider the following patch?  It is adding
> support for the '-h' grep option into xzgrep also.  The author is Jeff
> Bastian.

Thanks. Committed.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH] doc: man and --help/--long-help sync

2013-04-15 Thread Lasse Collin
On 2013-04-09 Pavel Raiskup wrote:
> * src/xz/message.c (message_help): Cover --uncompress, --to-stdout and
> mention that --memory is alias for --memlimit.

No thanks for these reasons:

  - Those spellings probably shouldn't have been supported in the
first place.

  - The --help text should be as short as reasonably possible.
Documenting --uncompress and --to-stdout would add two new lines.

  - Listing just one spelling in the easiest-to-find location
hopefully encourages people to use only that spelling.

  - People should be able to find the alternative spellings from
the manual if they find a script that uses those options and
cannot otherwise guess the meaning of those options.

> * src/xz/xz.1: Mention obsoleted --memory option.

If --memory is marked as an old/alternative spelling, then --to-stdout
and --uncompress should be too. I'm not sure if such a change would
clarify things much.

> * src/xzdec/xzdec.c: Mention --to-stdout and --uncompress option in
> help, better describe why options are ignored and move all ignored
> options to the end of whole list.

I like to keep the order similar to xz --help instead of putting the
ignored options to the end. Now I see that the order didn't match
xz --help or xzdec's man page, so I've fixed it.

I added descriptions but left them in parentheses even if it looks a
bit silly. This way I find it easier to distinguish the ignored vs.
non-ignored options when skimming the list.

Thanks!

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ for Java 1.3

2013-05-12 Thread Lasse Collin
XZ for Java 1.3 is available at <http://tukaani.org/xz/java.html> and
in the Maven Central (groupId = org.tukaani, artifactId = xz). Here is
an extract from the NEWS file:

  * Fix a data corruption bug when flushing the LZMA2 encoder or
when using a preset dictionary.

  * Make information about the XZ Block positions and sizes available
in SeekableXZInputStream by adding the following public functions:
  - int getStreamCount()
  - int getBlockCount()
  - long getBlockPos(int blockNumber)
  - long getBlockSize(int blockNumber)
  - long getBlockCompPos(int blockNumber)
  - long getBlockCompSize(int blockNumber)
  - int getBlockCheckType(int blockNumber)
  - int getBlockNumber(long pos)
  - void seekToBlock(int blockNumber)

  * Minor improvements to javadoc comments were made.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Random access to xz files

2013-06-23 Thread Lasse Collin
Note that read() may return less than "size" without hitting
end of file or error. I don't know if Linux makes extra guarantees
over POSIX when reading from a regular file, but even if it does, I
still wouldn't rely on it.

After those small things I think it should have a good chance to work
once you add code to decompress the requested part of the block into
xzfile_pread(). While that is some work still, don't get discouraged
now: you have the messiest parts mostly done already. Obviously what
you are doing should have been abstracted into a nice file I/O library
long ago but so far that doesn't exist.

> is this stuff documented anywhere?

The documentation is poor. The API headers have reference-like docs but
so far there only are example programs for the most basic compression
and decompression, so there are no examples about random access. (I
don't count list.c in xz sources as an example program.)

The liblzma APIs for random access are low level and thus require
a lot of code to use. One also needs to understand the .xz file format
structure. A reason for such low-level APIs is that liblzma takes its
input and gives its output via buffers provided by the application.
Callback functions or file I/O functions aren't used.

My idea was and still is to have a separate file I/O library that would
handle not only .xz files but also uncompressed, .gz, and .bz2 files.
There is some old pre-pre-alpha code in libxzfile.git on
git.tukaani.org, but in its current state it's not interesting since
it's so incomplete and there's almost no compression related code yet.

It is a bit backwards that right now, compared to XZ Utils, XZ for Java
has much cleaner code, better docs, and an *easy*-to-use random-access
decompressor class. On the other hand XZ for Java works on streams
instead of passing *both* input and output via caller-provided buffers.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Random access to xz files

2013-06-26 Thread Lasse Collin
On 2013-06-24 Richard W.M. Jones wrote:
> I have now completed a simple random-access XZ NBD server plugin:
> 
> https://github.com/libguestfs/nbdkit/tree/master/plugins/xz
> 
> which may be of interest.  It works enough that I can read out some
> xz-compressed Windows guest disks, which is a fairly good test.

Nice that you got it working.

Below are a few more things that I noticed that you may or may not find
interesting.

"xz --block-size=SIZE" is available only in 5.1.x branch which is still
officially in alpha stage. I don't know if the required xz version is
worth mentioning in the docs. A modified version of 5.1.x is shipped in
Debian, and some other distros (Fedora, perhaps) may ship 5.1.x too. In
practice it seems to work quite OK since I haven't got bug reports.

With "xz --block-size=SIZE" SIZE can be e.g. 16MiB which is easier to
type than 16777216. Most options in xz that take numbers accept such
suffixes. It is documented on the man page but it's not mentioned again
for each option (maybe it should be?).
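For example (this assumes runs with the same xz build, which produces deterministic output for identical settings):

```shell
# 16KiB and 16384 denote the same number of bytes, so both runs
# should produce byte-identical .xz output.
cd "$(mktemp -d)"
head -c 65536 /dev/urandom > d.bin
xz -kc --block-size=16KiB d.bin > a.xz
xz -kc --block-size=16384 d.bin > b.xz
cmp a.xz b.xz
```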

I noticed that both list.c and your code lack a check for the
lzma_stream_flags.version field (see <lzma/stream_flags.h>). It doesn't
affect anything for now since the version is always 0. But it quite
probably won't always be zero in the future (e.g. if metadata support
is added to .xz) and then the current code may misbehave instead of
giving a clear error message. Here's what I added to list.c:


http://git.tukaani.org/?p=xz.git;a=commitdiff;h=84d2da6c9dc252f441deb7626c2522202b005d4d

xzfile.c lines 467-488:

  - Changing the action argument to LZMA_FINISH from LZMA_RUN when no
more input is coming is fine but when decompressing blocks it is
not required. It is fine to only use LZMA_RUN if you want.

  - After successful decoding, lzma_code() must have returned
LZMA_STREAM_END. On line 488 the code accepts also LZMA_OK, which
looks suspicious. In practice there is no bug because the
"while" condition on line 486 ensures that the value cannot be
LZMA_OK, making the check for LZMA_OK on the line 488 a no-op.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz-utils streaming patch

2013-06-27 Thread Lasse Collin
On 2013-06-26 Alexander Clouter wrote:
> Attached is a patch that enables 'streaming' support for xz output,
> in short LZMA_SYNC_FLUSH is called every X milliseconds.

I like the idea.

The patch uses LZMA_SYNC_FLUSH after every X milliseconds even if all
read() calls are able to fill the buffer without blocking. A possible
alternative could be to flush when at least X milliseconds have passed
and read() gives EAGAIN. That is, don't flush as long as input is
coming faster than xz can compress it. I don't know if this is a good or
bad idea. It might mean much higher latency especially in threaded mode
(which doesn't support LZMA_SYNC_FLUSH yet).

A few other thoughts:

The timeout must be disabled when --list (MODE_LIST) is used.

gettimeofday() shouldn't fail as long as the first argument is sane and
the second argument is NULL, so there's no need to test the return
value (I hope).

It could be good to use clock_gettime(CLOCK_MONOTONIC, ...) when it is
available. It makes a difference if the system time jumps for some
reason. The threading code in liblzma uses it already so it's not a new
dependency. Currently message.c uses gettimeofday() and that could use
clock_gettime() too.

If select() gives EINTR, there should be a test for user_abort.
Otherwise if there is no input, xz won't react to SIGINT until the
timeout has expired.

I noticed that there is a race condition in signal handling in the
existing xz code. If e.g. SIGINT is sent after the value of user_abort
has been checked but before a blocking read() or write(), the read/write
will block and another signal is needed to make xz notice that
user_abort has been set. This affects the same code as your patch so I
think this should be fixed first.

Could signals be a good way to set a flag for when to flush? It would
allow triggering flushing from another process. xz already supports
SIGUSR1/SIGINFO to show progress info if --verbose wasn't used.

A possible problem is how to raise such signals within xz.
timer_create() and friends look nice but after checking a few OSes I
think they aren't portable enough. setitimer() could be more portable
but in practice it would mean using SIGALRM. Currently xz uses alarm()
for the progress indicator. Creating a thread solely for sending timer
signals should work, but I'm not sure I like that. Maybe just polling
the time like your patch does is the way to go.

> The patch is for 5.0.0 (what is currently in Debian
> 'oldstable/squeeze') but if the community likes the look of the
> patch, I can roll a version for whatever is at the HEAD of the git
> tree.

It won't apply directly because there's new code that uses
LZMA_FULL_FLUSH. But let's not worry about it until I have fixed the
race condition with signals and user_abort. It may need select() or
poll(), which may then be used to implement flushing too.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.0.5

2013-06-30 Thread Lasse Collin
XZ Utils 5.0.5 is available at <http://tukaani.org/xz/>. Here is an 
extract from the NEWS file:

  * lzmadec and liblzma's lzma_alone_decoder(): Support decompressing
.lzma files that have less common settings in the headers
(dictionary size other than 2^n or 2^n + 2^(n-1), or uncompressed
size greater than 256 GiB). The limitations existed to avoid false
positives when detecting .lzma files. The lc + lp <= 4 limitation
still remains since liblzma's LZMA decoder has that limitation.

NOTE: xz's .lzma support or liblzma's lzma_auto_decoder() are NOT
affected by this change. They still consider uncommon .lzma
headers as not being in the .lzma format. Changing this would
give way too many false positives.

  * xz:

  - Interaction of preset and custom filter chain options was
made less illogical. This affects only certain less typical
use cases, so few people are expected to notice this change.

Now when a custom filter chain option (e.g. --lzma2) is
specified, all preset options (-0 ... -9, -e) earlier on
the command line are completely forgotten. Similarly, when
a preset option is specified, all custom filter chain options
earlier on the command line are completely forgotten.

Example 1: "xz -9 --lzma2=preset=5 -e" is equivalent to "xz
-e" which is equivalent to "xz -6e". Earlier -e didn't put xz
back into preset mode and thus the example command was
equivalent to "xz --lzma2=preset=5".

Example 2: "xz -9e --lzma2=preset=5 -7" is equivalent to
"xz -7". Earlier a custom filter chain option didn't make
xz forget the -e option so the example was equivalent to
"xz -7e".

  - Fixes and improvements to error handling.

  - Various fixes to the man page.

  * xzless: Fixed to work with "less" versions 448 and later.

  * xzgrep: Made -h an alias for --no-filename.

  * Include the previously missing debug/translation.bash which can
be useful for translators.

  * Include a build script for Mac OS X. This has been in the Git
repository since 2010 but due to a mistake in Makefile.am the
script hasn't been included in a release tarball before.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz-utils streaming patch

2013-07-04 Thread Lasse Collin
On 2013-06-26 Alexander Clouter wrote:
> Attached is a patch that enables 'streaming' support for xz output,
> in short LZMA_SYNC_FLUSH is called every X milliseconds.

There is now this kind of feature in the git repository that can be
tested. I named the option --flush-timeout=TIMEOUT, where the timeout
is in milliseconds.

In contrast to your patch, the committed code calls read() as long as
read() can fill the buffer completely. poll() is only called when read()
would block and only then is the flush-timeout checked. Thus, the
system time isn't polled with clock_gettime() or gettimeofday() on
every call to io_read() when the flush-timeout is active.

Neither --long-help nor the man page has been updated yet. It is
possible that this feature isn't in its final form yet.

> We find it
> helpful so that we can effectively do:
> 
> tail -f foobar.log.xz | nc w.x.y.z 1234
> 
> 
> Meanwhile foobar.log.xz is effectively being generated with:
> 
> tail -f foobar.log | xz -c --select-timeout 500 > foobar.log.xz
> 
> 
> This means the receiver then gets something that is decodeable in X
> milliseconds rather than having to wait for a whole block to be
> generated and flushed, which might be a considerable time if whatever
> is writing to foobar.log is low volume (100 bytes per second for
> example).

For now, xz cannot be used on the decompression side because xz does
too much buffering. The same applies to XZ for Java unless one reads
one byte at a time.

xz should naturally be usable for the decompression side too. I haven't
decided yet how to fix it (e.g. require an option or perhaps always
disable buffering).

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ for Java 1.4

2013-09-22 Thread Lasse Collin
XZ for Java 1.4 is available at <http://tukaani.org/xz/java.html> and
in the Maven Central (groupId = org.tukaani, artifactId = xz). Here is
an extract from the NEWS file:

  * Add LZMAInputStream for decoding .lzma files and raw LZMA streams.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.1.3alpha

2013-10-26 Thread Lasse Collin
XZ Utils 5.1.3alpha is available at <http://tukaani.org/xz/>. Here is
an extract from the NEWS file:

  * All fixes from 5.0.5

  * liblzma:

  - Fixed a deadlock in the threaded encoder.

  - Made the uses of lzma_allocator const correct.

  - Added lzma_block_uncomp_encode() to create uncompressed
.xz Blocks using LZMA2 uncompressed chunks.

  - Added support for native threads on Windows and the ability
to detect the number of CPU cores.

  * xz:

  - Fixed a race condition in the signal handling. It was
possible that e.g. the first SIGINT didn't make xz exit
if reading or writing blocked and one had bad luck. The fix
is non-trivial, so as of this writing it is unknown whether
it will be backported to the v5.0 branch.

  - Made the progress indicator work correctly in threaded mode.

  - Threaded encoder now works together with --block-list=SIZES.

  - Added preliminary support for --flush-timeout=TIMEOUT.
It can be useful for (somewhat) real-time streaming. For
now the decompression side has to be done with something
other than the xz tool due to how xz does buffering, but
this should be fixed.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Parallel xzcat

2013-10-26 Thread Lasse Collin
On 2013-10-21 Richard W.M. Jones wrote:
> Here is a parallel implementation of xzcat:
> 
> http://git.annexia.org/?p=pxzcat.git;a=tree
> 
> Some test results:
> 
>   4 cores:  xzcat: 23.8 s  pxzcat: 8.1 s   speed up: 2.9
>   8 cores:  xzcat: 26.8 s  pxzcat: 10.5 s  speed up: 2.55
> 
> I just wrote this as a quick hack in a couple of hours, so while it
> may be of interest it's not a long term solution.  (It would be better
> to get the xzcat -T flag working).

Sounds nice!

Threaded decoding should be included in liblzma, but it will need to
wait past 5.2.0. In liblzma it will work for streamed decompression,
but it also means using quite a bit of memory.

> (2) I have not tested it with multi-stream files, but it should work
> with them.

I tested two-stream files both without and with stream padding, and
neither worked with pxzcat. Commands to create the files:

echo foobar | xz --block-size=3 > test1.xz
echo bazqux | xz --block-size=4 >> test1.xz

echo foobar | xz --block-size=3 > test2.xz
dd if=/dev/zero bs=100 count=1 >> test2.xz
echo bazqux | xz --block-size=4 >> test2.xz

I didn't investigate why it doesn't work, sorry.

> Notes on performance:
> 
> - Scalability is not too bad on my laptop (4 core machine above) but
> much worse on a theoretically higher performing machine with SSDs (8
> core machine above).  I don't really understand why that is.

A few wild guesses:

  - Eight cores or threads (hyperthreading)?

  - If all cores share the same L3 cache and memory controller, maybe
memory access becomes a bottleneck.

  - Maybe scattered I/O has something to do with it. Testing with the
write calls commented out might give some hints.

> - For reasons I don't understand, both regular xzcat and pxzcat cause
> the output file to be flushed to disk after the program exits.  This
> causes any program which consumes the output of the file to slow down.

I have no idea. I see you committed something after your email that
seems to be related to this. From a quick reading I don't understand
it well; it seems to be working around some issue with ftruncate() on
ext4.

xz doesn't use ftruncate() though so if xz has a problem, it cannot be
ftruncate(). If sparseness is the problem, test --no-sparse, although
with very sparse files it creates a different performance problem, of
course.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Creating an archive without timestamps

2013-11-10 Thread Lasse Collin
On 2013-11-10 Ernestas Lukoševičius wrote:
> How do I create an XZ compressed archive, which could be compared by
> md5?
> 
> Right now, running "tar cJfp" creates a tarball, gives it a timestamp
> and the rest is history, because the timestamp is always different
> and I cannot compare such an archive... Is it possible to avoid it?
> 
> gzip has -n, what about XZ and its implementation on Tar?

GNU gzip has -n because by default gzip saves timestamp and other
metadata to the .gz header. xz doesn't do such things. In fact xz
doesn't even support metadata for now (it probably will in the future,
but it won't use it by default).

Probably your tar implementation creates a different .tar file on each
run. E.g. GNU tar 1.27 seems to do this if using --format=pax. With
--format=ustar the output doesn't vary.

It is also good to keep in mind that future xz versions might create
different output with the same command line options e.g. if the
compression engine is updated.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Parallel xzcat

2013-11-10 Thread Lasse Collin
On 2013-10-27 Richard W.M. Jones wrote:
> On Sat, Oct 26, 2013 at 08:06:45PM +0300, Lasse Collin wrote:
> > > - For reasons I don't understand, both regular xzcat and pxzcat
> > > cause the output file to be flushed to disk after the program
> > > exits.  This causes any program which consumes the output of the
> > > file to slow down.
> > 
> > I have no idea. I see you committed something that seems to be
> > related to this after your email. With a quick reading I don't
> > understand it well, it seems to be working around some issue with
> > ftruncate() with ext4.
> 
> That's right.  It turned out to be a misfeature in ext4: if you
> truncate a file from a non-zero size down to a zero size, ext4 flags
> the file and flushes it on close.  ("Truncate" includes both O_TRUNC
> and ftruncate).  This can be disabled with the noauto_da_alloc mount
> option, but of course that is not the default.

If I remember correctly, that "misfeature" was added because there are
too many programs that write config files by overwriting the old files
with O_TRUNC. By flushing quickly ext4 tries to avoid zero-length
config files after a crash or a power failure.

With xz there's nothing I can do to work around the flushes (and I'm
not sure I would want to do anything about this even if I could). The
truncation is done by the shell when one redirects the output from xz
to a file, so to avoid the flushes one would need to patch the shell.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz: Make --block-list and --block-size work together in

2013-11-10 Thread Lasse Collin
On 2013-11-02 James M Leddy wrote:
> This makes --block-list and --block-size work together in
> single-thread mode, as per the FIXME

Thanks and sorry for a slow reply. The patch looks very good. I will
commit it in a day or two.

> I've verified this works by testing with --block-size=3000 
> --block-list=1024,2048,4096 as well as stepping through the block
> decoder in the debugger.

xz -lv or xz -lvv is useful for checking block sizes.

> For some reason, the single threaded mode still yields smaller
> files. I'm looking into that.

It is because in single-threaded mode the encoder doesn't store the
block size information in the block headers. In multi-threaded mode
each encoded block is fully buffered in RAM, and as the last step the
block header is written. Currently there is no option to disable this.
The header info will allow streamed multi-threaded decompression some
day.

Single-threaded mode doesn't buffer the blocks so it cannot write the
block header after finishing a block. I think this difference will
remain in 5.2.0.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz: Make --block-list and --block-size work together in

2013-11-12 Thread Lasse Collin
On 2013-11-02 James M Leddy wrote:
> This makes --block-list and --block-size work together in
> single-thread mode, as per the FIXME

I have committed the patch. I made a few minor edits but hopefully I
didn't break anything. Thanks again.

The man page was updated too. I added a note about the difference in
output in single-threaded vs. multi-threaded mode to both --block-size
and --block-list.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] LZMA documentation

2013-12-17 Thread Lasse Collin
On 2013-12-16 Kevin Ingwersen wrote:
> But then I was kinda surprised to not find any LZMA documentation,
> although a lzma.h file is installed into the system’s default include
> path.
> 
> I also couldn’t find any link on the official xz utils site. So if
> anyone could link me to the correct place with the documentation,
> that’d be nice.

The API docs are in the header files. See at least these:

$prefix/include/lzma/base.h
$prefix/include/lzma/container.h

Those docs alone aren't so nice for learning the basics. There are a
few example programs (more would be needed, though) in
$prefix/share/doc/xz/examples which serve as a kind of tutorial. It
helps a little if you are already familiar with zlib's API.

If you cannot find the example programs on your system, download the
source package. The examples are in doc/examples and the API headers
with the API docs in src/liblzma/api.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Inserting Compressed Data Into Compressed File

2014-02-03 Thread Lasse Collin
On 2014-02-02 Brandon Fergerson wrote:
> With that being said I've been trying to find out how to write 
> compressed data to a compressed file. My game's map structure is 
> formatted in a way such that a collection of tiles is grouped into a 
> block. So each XZ Block contains a certain amount of tiles. What I
> would like to do is when a player changes tiles of the map the map
> would find the XZ Block that the modified tile is in and rewrite that
> entire compressed block but with the new compressed data.

The new compressed XZ Block might be bigger than the old one because
the new data might be less compressible. Then the new XZ Block cannot
fit in the place of the old one and one has to rewrite the rest of the
file to make space for the new bigger XZ Block. This is slow if the file
is big and it's not really random access in practice.

> Is there any reason a SeekableXZOutputStream was not made or needed?

XZ isn't suitable for random access writing. The data is required to be
in sequential order and decompressible in streamed mode. Good random
access writing would probably allow the data to be in non-sequential
order and thus break the streamability requirement.

> So my question is how possible is this and hard would it be? Are
> there indexes of the locations of the XZ Blocks somewhere that would
> have to be updated?

The XZ Index is near the end of the file. It stores the compressed and
uncompressed sizes of the XZ Blocks, but this doesn't help you for the
reasons explained above.

The simplest solution for you could be to write each group of tiles
into a separate compressed file. Then you can overwrite the file when
the tiles in it have been updated. A downside of this is that you may
end up creating even thousands of files and, depending on the file
system, things can slow down.

There are several ways to keep the tiles in a single file. For example,
you could put a fixed-size index to the beginning of the file and the
compressed tile groups after the index. The index would be small enough
to keep in RAM when the game is running. To update a tile group, find a
big enough unused hole (you need a way to track unused space) to store
the new data or if there is no big enough hole, append the data at the
end of the file.

There may be better ways to do it, and I wouldn't be surprised if
there were good libraries for this kind of thing (I don't know of any,
but I haven't searched either). You don't need a full file system;
something relatively simple should be enough.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Inserting Compressed Data Into Compressed File

2014-02-04 Thread Lasse Collin
On 2014-02-03 Brandon Fergerson wrote:
> I realize the new XZ Block would be bigger which is why I'm looking
> to insert the data instead of overwrite it. I imagined something
> like: find block, insert new compressed data over block (while
> pushing down the blocks below), and then update indices to reflect
> changes. Or would this not be efficient?

Unfortunately most file systems don't support inserting new data in the
middle of a file (not even as multiples of the file system block size)
so "pushing down" means rewriting everything after that file offset. If
the file is big, it gets slow. If the file is small, the question
doesn't matter much since you could rewrite the whole file anyway.

> I suppose my problem is that I don't really understand what it is I'm 
> looking for. I knew I wanted the maps to be compressed and I knew I 
> wanted them to support random access. What else should I be looking
> for?

I fully agree with Alexander's advice. First write something simple that
works even if it wastes a few hundred megabytes of disk space per map.
Make the map available with getTile(int x, int y) or something like
that. When it and other major parts of the game work, you can replace
the map implementation without affecting the rest of the game code.

From your first post I guess you are using Java, and with my
limited Java experience I cannot say if there's a good enough mmap()
equivalent, but it should be very easy for you to write a class that
writes the whole map into a single uncompressed file, each tile taking
a fixed amount of space.

Once you have that working, you could try a multi-file approach and
write for example 16x16 tiles per file, again each tile taking a fixed
amount of space. Naturally you need to name each file so that you know
which file contains which tiles. A new empty map can consist of no
files: any missing file is considered to be full of empty tiles. That
way small maps don't take much space. Later you can add compression
easily by compressing each file separately.

Don't worry about a single-file compressed map format for now.

Feel free to ask if you got more questions, but I hope you can at least
get started now.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [java] assert in SimpleInputStream's constructor

2014-02-28 Thread Lasse Collin
On 2014-02-28 Stefan Bodewig wrote:
> and SimpleInputStream does
> 
> SimpleInputStream(InputStream in, SimpleFilter simpleFilter) {
> ...
> assert simpleFilter == null;
> 
> which is obviously wrong.  I think != is intended (and matches the
> comment right in front of the assert).

Thanks! The fix is now in the git repository.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Solaris packages (done) and C99 code removal

2014-03-07 Thread Lasse Collin
On 2014-03-02 Mark Ashley wrote:
> I've compiled up xz 5.0.4 on the following machines:

Was there any reason to avoid 5.0.5?

> The Solaris 7 was more problematic, the C99 support is very minimal
> in Sun Studio 8.

I haven't tried myself but at least the Sun Studio 8 manual lists quite
a bit of C99 support (see the last link; I included the link chain
because the last page doesn't mention the Sun Studio version):

http://docs.oracle.com/cd/E19059-01/stud.8/index.html
-> Sun Studio 8: C User's Guide
http://docs.oracle.com/cd/E19059-01/stud.8/817-5064/index.html
-> D. Supported Features of C99
http://docs.oracle.com/cd/E19059-01/stud.8/817-5064/c99.app.html

According to the manual there shouldn't be too much trouble with the
compiler; at least one shouldn't need to restrict the code to C89. The
C library is another question; for example, snprintf() in Solaris 7 is
pre-C99, I think, but let's focus on the compiler first.

Seems that Sun Studio 8 should be in C99 mode by default. Just in case,
you could try to force it to C99 mode:

./configure CC="cc -xc99"

If configure still fails, try what the section 4.1 in INSTALL suggests:

./configure CC="cc -xc99" ac_cv_prog_cc_c99=

Or without -xc99:

./configure CC=cc ac_cv_prog_cc_c99=

Maybe you already tried all these and they didn't help. In that case
I'd like to know a little more about the problem. If the Sun Studio 8
manual is simply wrong about C99 support, that alone is useful
information.

If Sun Studio 8 really cannot be made to work, a recent enough GCC
should be available for Solaris 7. I don't know if that is an acceptable
solution to you.

> I took out the C99 specific code in the xz source
> tree, making it C89 friendly (and thus portable to a lot more
> compilers - you should do this to the main code base IMHO). See the
> attached diff. I didn't do this in the test/* files.

So far the list of non-C99 compilers consists of GCC 2.95.3 (released
in 2001) and Microsoft Visual C. The ancient GCC naturally won't get
any C99 support, but MSVC 2013 is getting close. Unfortunately with
MSVC there's still one really stupid MSVC bug that prevents it from
compiling liblzma from XZ Utils' git repository.

Of course there might be other compilers that people would like to use
to compile XZ Utils but which don't support enough C99. However, I
don't remember hearing C99 complaints about compilers other than the
two I mentioned.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ for Java 1.5

2014-03-08 Thread Lasse Collin
XZ for Java 1.5 is available at <http://tukaani.org/xz/java.html> and
in the Maven Central (groupId = org.tukaani, artifactId = xz). Here is
an extract from the NEWS file:

* Fix a wrong assertion in BCJ decoders.

* Use a field instead of reallocating a temporary one-byte buffer
  in read() and write() implementations in several classes.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] marking version 5.1 stable?

2014-05-25 Thread Lasse Collin
On 2014-05-23 Pavel Raiskup wrote:
> Looking at the http://tukaani.org/xz/ page for some time, I am curious
> whether we could "stabilize" the version 5.1.  Almost all
> distributions are shipping alpha/beta versions of xz* packages which
> is probably not what especially library users want.

Yes, the current situation isn't good.

> What are plans on this topic?  I checked the TODO file and didn't find
> what exactly we need to fix to mark xz 5.1 stable.

For the past year or more, the plan has been to just get 5.2.0 out.
It's so horribly late that I don't plan to do anything except fix bugs
and possibly do some simple enhancements. The rest must wait past 5.2.0.

Somehow months just pass and I get little done (with xz or anything
else). Anyway, here are some things that I plan to do before 5.2.0:

  * Skim through some of the new code in case I can spot problems that
should be fixed before 5.2.0.

  * Ensure that the new APIs look OK for long-term support (I like to
keep API & ABI stable).

  * Check that the new xz features are correctly documented on
the xz man page.

  * Once I'm sure I won't change any message strings, I need to ask
for updated translations from the translators.

The test suite is very poor, but considering how few bug reports
I've got about the alpha versions, I guess the important features
work well enough. The liblzma API is another question though: I guess
no one has used the threaded encoder API yet, because in 5.1.3alpha
the preset support was completely broken and I only noticed it when
writing an example program for it.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xzgrep should success if at least one file matches

2014-06-11 Thread Lasse Collin
On 2014-06-11 Pavel Raiskup wrote:
> Hi, in RHBZ, there was a reported problem with xzgrep: we should exit 0
> when at least one file contains a matching string.  Grep behaves
> similarly.
> 
> Original bugreport:
> https://bugzilla.redhat.com/show_bug.cgi?id=1108085

Thanks. I fixed a typo in a comment in xzgrep (>=2 instead of >2) and
simplified the test quite a bit. The original test didn't work for
out-of-tree builds and there was a typo (exho). The new test doesn't
test as much, but I didn't quickly see a good fix for that.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xzgrep should success if at least one file matches

2014-06-11 Thread Lasse Collin
On 2014-06-11 Pavel Raiskup wrote:
> Btw., I am just curious, what is the reason for '(exit X)' statements
> in the test_scripts.sh file?  Apart from that it sets the
> "last-command" exit status -- '$?', which is an empty operation for us
> anyway, I don't see reason.  I followed that style but I doubt that
> it is necessary.

The Autoconf manual has a few examples where such a construct is
needed to work around differences and bugs in shells:

info --i exit autoconf
info --i trap autoconf
info '(autoconf)Shell Functions'

However, it sounds like the situations mentioned in the manual don't
apply here.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xzgrep should success if at least one file matches

2014-06-13 Thread Lasse Collin
On 2014-06-12 Pavel Raiskup wrote:
> Just a note - I wanted the exact output to be
> compared (not just check the exit value), that is the reason why I
> test the '-h/-H/-l' options because that could reveal similar bugs
> like that which was fixed e.g. by commits bd5002f5 or 40277998.  But
> that really needs not-so naive testsuite.  Would you be interested in
> autotest solution?

Maybe in the future, but maybe not before 5.2.0. I should get familiar
with the new things too so that I know what I'm maintaining.

I committed something simple to get the result you wanted, I hope. I
used cmp -s (it's in SUSv2) instead of diff -u because diff -u isn't in
POSIX before POSIX.1-2008. Maybe I should have used plain diff, but
it's not much extra to type if the test happens to fail, which should
be rare.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Disabling CRC/SHA-256 checks on decompression

2014-08-03 Thread Lasse Collin
On 2014-07-31 Florian Weimer wrote:
> Would it be possible to add a flag to disable these checks during 
> decompression?

I think so. I will look at this relatively soon since it shouldn't be
hard to implement.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Disabling CRC/SHA-256 checks on decompression

2014-08-05 Thread Lasse Collin
On 2014-07-31 Florian Weimer wrote:
> Would it be possible to add a flag to disable these checks during 
> decompression?

This feature is available in xz.git now.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Disabling CRC/SHA-256 checks on decompression

2014-08-08 Thread Lasse Collin
On 2014-08-05 Florian Weimer wrote:
> Could you add something similar to the xz-java as well?

Probably. I try to look at it next week.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Disabling CRC/SHA-256 checks on decompression

2014-08-14 Thread Lasse Collin
On 2014-08-05 Florian Weimer wrote:
> Could you add something similar to the xz-java as well?

Done. I don't have any plans about a new release of XZ for Java yet, but
if one is needed for this feature, let me know and I'll do it next week.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



[xz-devel] XZ Utils 5.0.6 and 5.1.4beta

2014-09-14 Thread Lasse Collin
XZ Utils 5.0.6 and 5.1.4beta are available at <http://tukaani.org/xz/>.
Here is an extract from the NEWS file:

5.0.6 (2014-09-14)

* xzgrep now exits with status 0 if at least one file matched.

* A few minor portability and build system fixes

5.1.4beta (2014-09-14)

* All fixes from 5.0.6

* liblzma: Fixed the use of presets in threaded encoder
  initialization.

* xz --block-list and --block-size can now be used together
  in single-threaded mode. Previously the combination only
  worked in multi-threaded mode.

* Added support for LZMA_IGNORE_CHECK to liblzma and made it
  available in xz as --ignore-check.

* liblzma speed optimizations:

- Initialization of a new LZMA1 or LZMA2 encoder has been
  optimized. (The speed of reinitializing an already-allocated
  encoder isn't affected.) This helps when compressing many
  small buffers with lzma_stream_buffer_encode() and other
  similar situations where an already-allocated encoder state
  isn't reused. This speed-up is visible in xz too if one
  compresses many small files one at a time instead of running
  xz once and giving all files as command-line arguments.

- Buffer comparisons are now much faster when unaligned access
  is allowed (configured with --enable-unaligned-access). This
  speeds up encoding significantly. There is arch-specific code
  for 32-bit and 64-bit x86 (32-bit needs SSE2 for the best
  results and there's no run-time CPU detection for now).
  For other archs there is only generic code which probably
  isn't as optimal as arch-specific solutions could be.

- A few speed optimizations were made to the SHA-256 code.
  (Note that the builtin SHA-256 code isn't used on all
  operating systems.)

* liblzma can now be built with MSVC 2013 update 2 or later
  using windows/config.h.

* Vietnamese translation was added.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode


