emergency file server cleanup

Antonio Diaz Diaz antonio at gnu.org
Mon Sep 29 19:20:19 UTC 2014


Hello Alexandre,

Alexandre Oliva wrote:
>> Why? Lzip can compress more than xz with a bit of tuning via --options.
> 
> Maybe it can, but when I compared the sizes of the files to decide which
> one to keep, .xz files were consistently (if slightly) smaller than .lz
> ones.

I guess you mainly mean tarballs, because lzip compresses patches and 
xdeltas better than xz, sometimes even when xz is given the --extreme 
option. Updating lzip to 1.16 gives even better results (sizes in bytes):

    98923 2014-09-29 03:49 linux-libre-3.17-rc7-gnu.xdelta.lz (1.16)
    99065 2014-09-29 03:49 linux-libre-3.17-rc7-gnu.xdelta.lz
    99096 2014-09-29 03:49 linux-libre-3.17-rc7-gnu.xdelta.xz

  7268517 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.lz (1.16)
  7284746 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.lz
  7272508 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.xz (-9e)
  7344044 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.xz (-9)

    81530 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.lz (1.16)
    81638 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.lz
    81724 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.xz (-9e)
    82104 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.xz (-9)
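
To put the listing above in relative terms, here is a small Python sketch 
that computes how much smaller each lzip 1.16 file is than the smallest xz 
file listed for the same input. The sizes are copied from the listing; the 
names `sizes` and `saving_percent` are only illustrative.

```python
# Sizes in bytes, copied from the listing above:
# name -> (lzip 1.16 size, smallest xz size listed for the same file)
sizes = {
    "linux-libre-3.17-rc7-gnu.xdelta": (98923, 99096),
    "patch-3.16-gnu-3.17-rc7-gnu": (7268517, 7272508),
    "patch-3.17-rc6-gnu-3.17-rc7-gnu": (81530, 81724),
}


def saving_percent(lz_size, xz_size):
    """Percentage by which the .lz file is smaller than the .xz file."""
    return 100.0 * (xz_size - lz_size) / xz_size


savings = {name: saving_percent(lz, xz) for name, (lz, xz) in sizes.items()}

for name, pct in savings.items():
    print(f"{name}: lzip 1.16 is {pct:.3f}% smaller")
```

The differences are fractions of a percent in every case, which matches 
the "consistently (if slightly)" scale of difference discussed above.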


> Maybe I'm not using the best options to compress tarballs, vcdiffs and
> xdeltas with lzip.  Suggestions are certainly welcome.

Vcdiff is already a compressed format. I guess the best option is not to 
compress it again and just distribute one plain .vcdiff file per 
release. You save about 66% in size and the (re)compressing time.

About tarballs: when LZMA-utils was renamed to XZ-utils, its developers 
changed the name of the "algorithm" to LZMA2 and at the same time 
increased the dictionary size of option -9 from 32 MiB to 64 MiB, 
misleading users into thinking that the increase in compression ratio 
was due to the new "algorithm". (BTW, LZMA2 is not an algorithm, but 
a container format.)

As you can see near the end of the lzip benchmark[1], passing to lzip 
the arguments equivalent to those of "xz -9" (or to xz the arguments 
equivalent to those of "lzip -9") will usually make lzip compress more 
than xz. But I do not recommend doing it: with plain "-9" on both 
compressors, lzip usually compresses large files about as much as xz 
while using half the RAM to compress and half the RAM to decompress.
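
The effect of dictionary size on its own can be seen with a minimal 
sketch. It uses Python's standard lzma module, which writes xz files 
(the standard library has no lzip support), and keeps everything fixed 
except the LZMA2 dictionary size. The input is synthetic and chosen only 
to make the effect obvious; it is an assumption of this example, not a 
benchmark, and `xz_size` is a hypothetical helper name.

```python
import lzma
import os


def xz_size(data, dict_size):
    """Compressed size of data as .xz with an explicit LZMA2 dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 9, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# Synthetic input: 512 KiB of random bytes followed by an exact copy.
# The copy can only be encoded as matches if the dictionary reaches back
# at least 512 KiB.
block = os.urandom(512 * 1024)
data = block + block

small = xz_size(data, 64 * 1024)      # dictionary too small to see the copy
large = xz_size(data, 1024 * 1024)    # dictionary covers the whole input

print("64 KiB dictionary:", small, "bytes")
print("1 MiB dictionary: ", large, "bytes")
```

Same format, same coder, different dictionary size: the larger dictionary 
gives a much smaller file on this input. That is why a fair comparison 
between two compressors has to use equivalent dictionary sizes.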

In the case of small files the difference in the memory required to 
decompress is even larger. Valgrind's massif tool finds that lzip uses 
443,384 bytes to decompress 'patch-3.17-rc6-gnu-3.17-rc7-gnu', while xz 
uses 67,154,552 bytes, about 150 times as much.
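
For readers who want to check what a decompressor has to allocate, here 
is a hedged sketch that reads the dictionary size recorded in a .lz 
header. It assumes the header layout documented in the lzip manual (magic 
bytes "LZIP", a version byte, then one coded dictionary size byte); that 
layout is not stated in this message, and both function names are 
hypothetical.

```python
def lzip_dictionary_size(coded):
    """Decode the coded dictionary size byte of a lzip header.

    Assumed layout (from the lzip manual): bits 4-0 hold log2 of a base
    size, bits 7-5 hold how many sixteenths of the base to subtract.
    """
    base = 1 << (coded & 0x1F)
    sixteenths = coded >> 5
    return base - sixteenths * (base // 16)


def dictionary_size_of(path):
    """Read the dictionary size from the first member header of a .lz file."""
    with open(path, "rb") as f:
        header = f.read(6)
    if len(header) < 6 or header[:4] != b"LZIP":
        raise ValueError("not a lzip file")
    return lzip_dictionary_size(header[5])
```

Calling dictionary_size_of() on a small .lz file such as the patch above 
would show the dictionary size its decompressor needs; the value for that 
particular file is not given here.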

In the lzip benchmark you can also see that each and every one of the 43 
xz tarballs distributed on ftp.gnu.org was compressed better by lzip.

[1] http://www.nongnu.org/lzip/lzip_benchmark.txt


>> Lzip was designed for long-term archiving, having a
>> tool to recover corrupt files.
> 
> I very much doubt it could recover corrupt files to the point that the
> original signature would match, because that would require a lot of
> redundancy to be added, which is the opposite of what a compressor is
> supposed to do.  And if the original signature doesn't match, I wouldn't
> trust the result, especially given that we have alternate paths to
> obtain the tarballs.

Lziprecover is so awesome that people can't believe it. :-) Most think 
it is just like bzip2recover.

Lziprecover can perfectly repair most files with a single-byte error in 
them, without needing any extra redundancy at all. The repaired file 
will be identical, bit for bit, to the original.

Just get one linux-libre tarball and modify the value of a byte (near 
the beginning for a quick test). For example, I modified the byte at 
offset 1000 in 'linux-libre-3.12.5-gnu.tar.lz' and lziprecover repaired 
it in 12 seconds:

e871ba7561ed4833e9349f40d2975f53  linux-libre-3.12.5-gnu.tar.lz
e871ba7561ed4833e9349f40d2975f53  linux-libre-3.12.5-gnu1k.tar_fixed.lz
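
Why no added redundancy is needed can be illustrated with a toy 
brute-force search: the integrity check already stored in the compressed 
file is enough to recognise the one correct value of a damaged byte. The 
sketch below is not lziprecover's code or algorithm (how lziprecover 
searches is not described here); it uses xz streams through Python's 
standard lzma module because the standard library cannot read lzip 
files. The function names, the demo text and the damaged offset are all 
assumptions of this example.

```python
import lzma


def is_intact(blob):
    """True if blob is one complete xz stream whose integrity checks pass."""
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    try:
        decoder.decompress(blob)
    except lzma.LZMAError:
        return False
    return decoder.eof and not decoder.unused_data


def repair_single_byte(blob):
    """Try every value at every offset; return the first intact candidate."""
    if is_intact(blob):
        return blob
    candidate = bytearray(blob)
    for offset in range(len(blob)):
        original = candidate[offset]
        for value in range(256):
            if value == original:
                continue
            candidate[offset] = value
            if is_intact(bytes(candidate)):
                return bytes(candidate)
        candidate[offset] = original      # restore before moving on
    return None                           # no single-byte fix was found


# Demo data: a short text compressed to xz, then damaged at offset 40.
payload = b"Data compression should not be seen as a popularity contest. " * 4
good = lzma.compress(payload, format=lzma.FORMAT_XZ)
damaged = bytearray(good)
damaged[40] ^= 0xFF                       # flip every bit of one byte
fixed = repair_single_byte(bytes(damaged))
print("repaired file identical to original:", fixed == good)
```

A search like this one gets slower the farther the damaged byte is from 
the start of the file, since it tries 255 values at every earlier offset 
first.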


One byte may seem small, but most file corruptions not produced by I/O 
errors affect just one byte, or even one bit, of the file. Also, unlike 
magnetic media, where errors usually affect a whole sector, solid-state 
devices tend to produce single-byte errors, making lzip the perfect 
format for data stored on such devices.

Even if the repair capability of lziprecover is not needed for 
linux-libre files, it may save the irreplaceable data of many users, 
which they would lose if they used bzip2 or xz.

As the author of GNU ddrescue I know about the tragedy of losing data 
and how to increase the probability of recovering it. If I have spent 6 
years developing a whole family of tools around a compression format you 
can be sure that it is the best for users. If it weren't, I would just 
have continued developing my projects and using the best format for my 
tarballs. Data compression should not be seen as a popularity contest, 
but as a service to humankind.

Be the change you wish to see in the world. Drop xz tarballs altogether. ;-)


Best regards,
Antonio.


