The
SATOCONOR.COM
J.G.
van der Galiën ‘State-of-the-Art Compressors as Tools for True Entropy
Estimations’ 4.4. (2005)
State-of-the-Art Compressors as Tools for True Entropy
Estimations
Theory of information
compared to the practice of compression
By Johan G. van der Galiën
For comments e-mail: galien8@zonnet.nl
Version 1.1. April 6, 2006
(version 1.0. from August 27, 2005)
Download the ENTnORD1-programs, 5 in one ZIP file.
Abstract:
Two different files (9,539 bytes and 4,121,418 both
mainly English text) were subjected to PASCAL programs to calculate the first,
second, third and fourth order entropies. Three methods were used to
extrapolate to the (near) infinite order or in other words true entropy (H).
The 9,539 bytes file gives 0.5 < H < 0.6 bits per byte for the source,
which corresponds to a Shannon Limit of Compression (SLC) of 93% - 94%.
A State-of-the-Art compressor gave 69.68% compression
for this file (PAQ6V2 option –3). The 4,121,418 bytes file gave 0.8 < H <
1.0 bits per byte, which corresponds to 88% < SLC < 90%.
The best compressor found gives a literature value of
89.54% for this file (PASQDA 4.1. option –6e). So in case of the larger file
one can only say that the best compressors are at or near the
But what is also clear from these results is that a
Universal Compression Algorithm that always reaches the SLC has a long way to
go.
Compressing near the Shannon Limit by State-of-the-Art
compressors make it possible to do a good estimation of the True Entropy when
the compressed files are truly (analytical pure) random according to very
critical Randomness Test Suites like DIEHARD. It is demonstrated in this paper
for DNA sequences from Takifugu rubripes and Homo sapiens.
1. Introduction
Entropy is a concept that comes from Thermodynamics. In Physics entropy has the dimension of energy of a statistical ensemble divided by its temperature. An ensemble is a dynamical system with a finite number of energy states that all are occupied to some extent (ergodic). The Second Law of Thermodynamics states that the entropy of a system can only increase (arrow of time). The formula for thermo dynamical entropy in Joules per degree Kelvin is S = klog(W) where k is the Boltzmann constant and W is the number of possible molecular configurations. This formula is equivalent to the conceptual definition of information (negentropy). So information (I) and the entropies that will be calculated for information sources (H in bits per symbol) in this paper formally differ only in sign. (I give the formula’s here below.) But most of the times the sign is neglected and H = I.1 That the entropy can only increase with time is of course also true for an (near) infinite source like the English language, where in the future new words will be added, increasing the information and the entropy of the source.
The Shannon Limit of Compression Limit (SLC) is related to true entropy (H) of a source by dividing the entropy in bits per symbol by the number of bits that can code for all the possibilities of the alphabet.2 The entropy also gives the average number of bits that are needed to code per symbol by means of a compression table, for a maximal compressed source (SLC). To make an estimation of the entropy, in order to calculate SLC, one can calculate it in several orders and then do an extrapolation, of the curve by means of a graph or a non-linear regression fit, to the (near) infinite order (H). To illustrate this I give hereby the formula’s to calculate the different orders of entropy as estimations of H.
Zero Order: H0 = log2(m), m = size of the alphabet, for ASCI symbols, one symbol in a byte, m = 256 (log2(256) = 8 bits per byte) (1.1.)
First Order: H1 = - Σ p(xn)log2(p(xn)), n = 1 to 256 for bits per byte, p(xn) = (number of xn occurrences) / (number of bytes * 256) (1.2.)
Second Order H2 = - 1/2 Σ Σ p(xn1,xn2)log2(p(xn1,xn2), n1, n2 = 1 to 256 for a bits per byte
True Entropy (Near) Infinite Order H = lim q → ∞ - 1/q Σ Σ … Σ p(xn1, xn2, … xnq)log2(p(xn1, xn2, … xnq)), n1, n2, … nq = 1 to 256 for bits per byte (1.3.)
For example these numbers become for DNA sequences:
Zero Order: 1.1. m = 4, for DNA alphabet A,C, G and T symbols, one symbol per bit pair, (log2(4) = 2 bits per base)
First Order: 1.2. n = 1 to 4 for bits per base, p(xn) = (number of A (n = 1) or C (n = 2) or G (n = 3) or T (n = 4) occurrences) / (total number of bases)
Second Order 1.3. n1, n2 = 1 to 4 for a bits per base
True Entropy (Near) Infinite Order H = lim q → ∞ - 1/q Σ Σ … Σ p(xn1, xn2, … xnq)log2(p(xn1, xn2, … xnq)), n1, n2, … nq = 1 to 4 for bits per base (1.4.)
2. Results
In order to estimate the (near) infinite order entropy of the INSTALL.TXT (9,539 bytes) and VCFIU.HLP (4,121,418 bytes) files the first, second, third and fourth order entropies were calculated with the PASCAL programs ENT1ORD1, ENT2ORD1, ENT3ORD1 and ENT4ORD1. (These programs can be downloaded through the hyperlink above the abstract.) The results of these calculations can be found in Table 1 and 2.
|
Order |
Entropy in bits per byte no overlap |
Entropy in bits per byte with overlap |
|
First |
5.20700 |
5.20700 |
|
Second |
4.58893 |
2,54551 |
|
Third |
3.89330 |
0,550086 |
|
Fourth |
3.35755 |
3.39403 |
Table 1: Entropies of the first up to the fourth order of VCFIU.HLP. This is done in two ways with overlap and without overlap.
The results from Table 1 need an illustration. No overlap means that ...ABCDEFGHI... is divided in parts. For third order entropy this would be P(ABC), P(DEF), P(GHI) etc. These are independent triplets. With overlap means that P(ABC), P(BCD), P(CDE) are dependent triplets. The ENTnORD1 programs need to be adapted for with overlap. In the WHILE loop there must be an IF NOT EOF(sourcefile) THEN SEEK(sourcefile, bytecounter – 3); and in the formula for the entropy the division (FILESIZE(sourcefile) DIV n) must be (FILESIZE(sourcefile) DIV n) * n.
|
Order of entropy |
Value in bits per byte |
|
First order |
4.744 |
|
Second order |
3.899 |
|
Third order |
3.074 |
|
Fourth order |
2.415 |
Table 2: Entropies of the first up to the fourth order of INSTALL.TXT.
It is very time consuming to calculate entropies with an order > 4. I programmed ENT5ORD1 and the INSTALL.TXT would take 12 days to calculate the fifth order entropy. My estimation is that the larger VCFIU.HLP takes 14 years. That higher entropies take an increasing amount of time comes from the fact that because my computer has an available memory limitation of 150 Mb. The data structure of the needed ARRAY[0..255, 0..255, 0..255, 0..255, 0..255] OF CARDINAL takes 2565 * 4 = 12 Tb. So ENT5ORD1 needs to be adapted to use less memory but longer execution time.
For finding the asymptote of the hyperbolic function of entropy versus order I used the methods of extrapolation of a graph drawn by hand with a curve ruler (it can do hyperbolic functions). This method is of course not very precise, so there is only one significant digit and a range in the results of Table 3.
|
File |
Asymptote range in bits per byte |
|
VCFIU.HLP |
0.7 < H < 1.0 |
|
INSTALL.TXT |
0.5 < H < 0.6 |
Table 3: Results of finding the asymptote by means of a graph and a curve ruler.
There is very good software package, free downloadable!, from the National Institute for Standardisation and Technology (NIST).3 It is called DATAPLOT. It can do all kind of statistical tests and modelling. It can also do non-linear regression fits. I did this fit with the assumption that Hn = (a / n2) + (b / n) + H. This because for the INSTALL.TXT file Hnn2 versus Order is a straight line (this results in: Hnn2 = a + nb + Hn2 where Hn2 is the smallest term) with a good Linear Regression Coefficient of 0.99979.4 The results of this fit performed on VCFIU.HLP and INSTALL.HLP is given in Table 4.
|
Parameter |
VCFIU.HLP |
INSTALL.TXT |
||
|
|
Value |
σ |
Value |
Σ |
|
a |
-5.0 |
0.5 |
-5.8 |
0.8 |
|
b |
8.7 |
0.6 |
10 |
1 |
|
H |
1.5 |
0.2 |
0.2 |
0.2 |
Table 4: The results from the non-linear regression fit with DATAPLOT. (σ = Standard Deviation)4
The results of some popular compression algorithms performed on the files are given in Table 5.
|
Compression algorithm |
VCFIU.HLP |
INSTALL.TXT |
|
PAQ6V2 –3 |
87.1370% (lit: 89.5%) |
69.68% |
|
WINZIP |
79.0773% (lit: 80.7%) |
60.81% |
|
GZIP –9 (1.2.4) |
79.6068% (lit: 79.7%) |
61.90% |
Table 5: Some compression results for popular algorithms. Literature values from reference5.
3. Discussion
From Table 2 can be seen that the entropies with overlap cannot be trusted. The entropy initially goes down and the goes up again with increasing order. This is also in conformation with a posting in the comp.compression news groep where is stated that the sequences should be independent.6
From Table 1 and 2 one can see that the trend is that entropy asymptotically decreases with increasing order. This is in conformation with theory that says that entropy can only decrease when the tested sequence of n symbols grows (n the order of the entropy). The true entropy is found when n goes to (near) infinity. This can only be approximated of course with an infinite source and not with finite files. With an infinite source the limit where n goes to (near) infinity goes asymptotically to the true entropy. I use (near) infinity and approximated here because when n = infinity then P(x1, x2, ….. xn) = 1, log2(P(x1, x2, ... xn)) = 0, Hn = 0.
Because the true entropy of the English language lies between 0.6 < HEnglish < 1.3 bits per byte.2 And the first order entropy HEnglish = 4.1 bits per ASCI symbol (= bits per byte). An approximation of the true entropy of VCFIU.HLP is 5.207 * 0.6 / 4.1 < HVCFIU.HLP < 5.207 * 1.3 / 4.1 → 0.8 < HVCFIU.HLP < 1.7 and for the true entropy of INSTALL.TXT 0.7 < HINSTALL.TXT < 1.5. I think these approximations are fair because both files contain mainly English text. They confirm the results of Table 3 more or less in the way that the true entropy can be smaller than 1.0 bits per byte. The first order entropies of VCFIU.HLP and INSTALL.TXT are about 5.2 and 4.7 respectively, which also comes close to the 4.1 bits per byte of the first order approximation of the English language.
In Table 5 there is a standard deviation mentioned for each parameter. With these values one can calculate the range of 95.5% probability of the normal distribution (± 2σ).7 These are the ranges mentioned in Table 6.
There is a good internet site for comparing compression algoritms.5 On this site there are results for 114 compressors done on the VCFIU.HLP file. The range is 71.5% – 89.5%. The results of Table 5 for the VCFIU.HLP file fall in this range. But there is some difference between the exact figure with this literature source because there can be difference in the used option and version of the compressor. The literature source uses the best compression option, for PAQ6V2 I could not do so because I do not have enough memory, I only have 128 Mb and PAQ6V2 –9 requires 1.6 Gb. The latest versions where not available to me. That the results for VCFIU.HLP and INSTALL.TXT differ so much in Table 4 comes from the fact that larger files give better compression. The compression table which is also stored in the compressed file, one needs this data to decompress the file, and this becomes relatively smaller when the source file gets larger.
4. Conclusions
|
VCFIU.HLP |
|
|
Estimate by English Language |
0.8 < H < 1.7 bits per byte |
|
|
79% < SLC < 90% (84%) |
|
|
|
|
Estimate by Graph |
0.7 < H < 1.0 bits per byte |
|
|
88% < SLC < 91% (90%) |
|
|
|
|
Estimate by DATAPLOT |
1.1 < H < 1.7 bits per byte |
|
|
79% < SLC < 86% (83%) |
|
INSTALL.TXT |
|
|
Estimate by English Language |
0.7 < H < 1.5 bits per byte |
|
|
81% < SLC < 91% |
|
|
|
|
Estimate by Graph |
0.5 < H < 0.6 bits per byte |
|
|
93% < SLC < 94% |
|
|
|
|
Estimate by DATAPLOT |
0 < H < 0.6 bits per byte |
|
|
93% < SLC < 100% |
|
Compression data VCFIU.HLP |
|
|
Maximal compression found |
87.1370% |
|
Maximal compression literature5 |
89.5% |
|
Compression data INSTALL.TXT |
|
|
Maximal compression found |
69.68% |
Table 6: Summary of the results of the calculated SLC from the measurements and the reference data.
As I made clear in the introduction it is impossible to do an exact measurement of H. Even in theory this is only possible on infinite sources, which do not exist in practice, and not on finite samples of the source (files). That’s why there are ranges in Table 6. Nevertheless one can draw conclusions about how effective State of the Art compressors are.
From Table 6 one
can see that the H of VCFIU.HLP must lie somewhere between 0.8 and 1.0 bits per
byte, because this is the overlapping range of the English language and graph
estimates. This leads to a SLC between 88% and 90%. The compression measured by
myself (87.1370%) falls outside this range, but the absolute maximum literature
value for PAQ6V2 of 89.5% falls in this range.5 So one cannot
exclude that PAQ6V2 can indeed compress at the SLC for this file. Better
insight would be given when a grid computer calculates a more precise value for
the true entropy of the VCFIU.HLP file. But for now: State of the Art
compressors are at or near the
The values for
Graph and DATAPLOT do overlap for INSTALL.TXT. This is a strong argument to say
that H lies somewhere between 0.5 – 0.6 bits per byte (93% < SLC < 94%).
The fact that the DATAPOINT range also overlaps this is another strong argument
that it is correct. A confirmation also comes from the fact that the lower
border of HEnglish is 0.6 bits per byte and INSTALL.TXT is a pure
English text file. The compression found by experiment, value 69.68%, is much
lower than the estimated SLC in this case. One thing is also clear that the
entropy of INSTALL.TXT is lower than that of VCFIU.HLP although the found
compressions are also lower. But as I said in the Discussion this maybe due
from the fact that the compression table that also stored in the compressed
file is relatively bigger. This brings me to the following: As a kind of
experiment one can program a compressor with the best algorithm available and
let him store the compression table in a separate file. Or the compression
table is part of the compressor / decompressor, such a program can of course
only process one particular file. But both ideas are based on the same
principle. And maybe that’s what the Information Theory is trying to say: This
is the only way to reach the
Since State-of-the-Art compressors are that good, so close to the SLC, it is possible to use them for True (Near Infinite Order) Entropy estimations of for example DNA sequences. I tried this for fugu_rubripes.FUGU2.nov.dna.scaffold.fa, homo_sapiens.NCB135.jul.dna.chromosome.Y.fa and homo_sapiens.NCB135.jul.dna.chromosome.2.fa downloaded from the greatest DNA database of the internet called Ensembl.8 Since these files store only one base per byte, because they are database flatfiles which contain also yet unknown bases designated as N (aNy base) and database delimiters after every 60 bases segments, I had to delete the N’s and the delimiters and put four bases in a byte by means of a PASCAL program. The resulting files FUGUCompress.dat, HSYCompress.dat and HS2Compress.dat where compressed even further by the record holder5 PASQDA 4.1. (option –6e) to FUGURandom.dat, HSYRandom.dat and HS2Random.dat.
|
Item |
FuguRandom.dat |
HSYRandom.dat |
HS2Random.dat |
|
Size source file |
78,868,306 bytes |
5,607,324 bytes |
59,375,844 bytes |
|
Size random file |
75,819,935 bytes |
4,666,028 bytes |
52,729,742 bytes |
|
Compress ratio |
0.9613 |
0.8321 |
0.8881 |
|
True Entropy source |
1.923 bits per base |
1.664 bits per base |
1.776 bits per base |
|
DIEHARD p’s = 1 or 0 out of |
0 out of 229 |
0 out of 119 |
0 out of 229 |
|
DIEHARD KS-test |
0.081867 |
0.991629 |
0.135248 |
|
Tests passed out of |
33 out of 35 |
28 out of 30 |
34 out of 35 |
Table 7: The results from compressing the four bases per byte files from the Scaffold and Chromosomes.
Randomness tests where conducted from the suites: ENT, RanTests, RaBenZi1 and DIEHARD v0.2 beta. If PASQDA compresses them near the SLC than the files should pass most of the randomness tests and then the compression ratio is a good indication of the True Entropy in bits per bit, which can be recalculated to bits per base by multiplication by two. I think this is fair because in the literature True Entropy in bits per ASCII symbol = bits per byte of the English Language is recalculated to the compression ratio or in other words True Entropy in bits per bit by dividing by eight.2 Results are given in Table 7.
Since the DIEHARD is passed the compressed files can be regarded as analytical pure, or in other words true, random. Than the True Entropy estimations from Table 7 must also be very close. Actually for a precise value of the True Entropy of the source file the compressed file must have a True Entropy value of exact 1 bits per bit, in other words all possible bit patterns are equally likely to occur, but first of all the number of zero and one bits must be exactly the same. Also formally can True Entropy only be calculated or measured by my method for an infinite source. The "True" Entropy estimation of is of the FuguCompress.dat file is near the 315,473,224th order, so not nearly infinite order! It is only a finite sample of the infinite source of DNA sequences called the evolution.
-o0o-
References & Notes:
1) Daugman J.G. ‘Information theory and coding’ Computer
Science Tripos Part II, Michaelmas Term 12 Lectures
http://www.cl.cam.ac.uk/Teaching/2003/InfoTheory/Notes.pdf
2) Meinsma G. ‘Data compression & information theory’
http://wwwhome.math.utwente.nl/~meinsmag/shannon.pdf
3) Anonymous ‘Dataplot’ NIST
http://www.itl.nist.gov/div898/software/dataplot.html/
4) The (0,8) datapoint was not used because then the
hyperbolic fit program of DATAPLOT has to divide by zero, which gives an error
message.
5) Anonymous ‘Help file compression test’ (2003 - 2004)
http://www.maximumcompression.com/data/hlp.php
6) Tate S.R.‘Re: Entropy’ (1992-07-22)
7) Hays W.H. ‘Statistics’ 5ed
8) Anonymous ‘Browse a genome’ e! Ensembl