Entropy and information theory¶

Based on [Li18], in information theory the shannon entropy \(H\) is an optimal lower bound on the expected number of bits needed to represent symbols \(a_i\) in a data set \(S\), and it is calculated as:

\[H=-\sum_{i_1}^n p(a_i)\log_{2}{p(a_i)}\]

Where \(p(a_i)\) is the probability of symbol \(a_i\) to appear in the data set \(S\) and the quantity \(\log_{2}{p(a_i)}\) can be interpreted as the amount of information that symbol \(a_i\) posess. Analyzing the expression, it is possible to observe that \(H\) is maximized if all symbols in the data set \(S\) have the same probability to appear, and minimized if the probaility of a single event is 1, while the rest is zero. It is therefore possible to observe that the entropy is also a measure of the variance or uniformity of the data set \(S\).

A family of compression algorithms exist that try to represent data sets in such a way that the information is stored at the minimun size bounded by \(H\), however it is clear that if the variance in the data set is high, such that \(H\) is high itself, then the size of data set \(S\) after being optimally written to the disk might still be large.

Steps for data compression in turbulent fields.¶

Given that turbulence is a multi-scale and seemlingly random phenomenon, data sets that describe it tend to have quite a large entropy. Making lossless compression ineficient. It is for this reason that in order to achieve high compression ratios, one alternative is to first include steps that reduce the entropy, generally converting the compression into a lossy process.

Lossy data truncation¶

There are many ways to perform lossy compression, however, transfromation-based aproaches are among the most used. JPEG [Wallace92] for image compression, for example, uses this aproach. Given that the data in Nek5000 is distributed in a mesh with certain characteristics, the aproach taken in this library was to truncate the data in the legendre space while controlling the error. This is beneficial given that the user has total control in the fidelity of the data after it is stored. We proceed to expand these ideas in the following sections.

Legendre Polynomials¶

Spectral element solvers such as Nek5000, Neko [Jansson21] and NekRS [Fischer22] discretize the domains with high order finite elements, each of them containing a number of Gauss-Lobatto-Legendre collocation points that allow them to perform numerical differentiation and integration thanks to numerical quadrature. A good characteristic of these points is that a set of polynomials known as the Legendre polynomials form an orthogonal basis \(\phi\) around them, therefore the value \(u\) of a known function at point \(x\) can be represented as a sum of the values of the orthogonal basis at different orders scaled by a set of Legendre coefficients \(\hat{u}\) as follow:

\[u(x)=\sum_{i=0}^{N}\hat{u}_i\phi_i(x)\]

The polynomials of order 0 and 1 are \(\phi_0(x)=1\) and \(\phi_1(x)=x\), respectively, and the rest are obtained with the recursive formula:

\[(i+1)\phi_{i+1}(x)=(2i+1)\phi_{i}(x)-(i)\phi_{i-1}(x), \quad i=1,..N,\]

For convenience we also define now the integration weight matrix produced by the GLL quadrature:

\[W_{ij}=\frac{2}{N(N+1)}\frac{1}{\phi^2_N(x_i)}\delta{ij}.\]

Although this equation works for the value in a single point, in numerical methods we are often interested in a set of them. Therefore, if we define \(\textbf{u}=[u_0, u_1, ... u_{N-1}]\), it is possible to transform the entire array by using a transformation matrix as follows:

\[\textbf{u}=T\hat{\textbf{u}},\]

where \(T_{ij}^T=f_j\phi_{i}(x_j)\) and \(f_j\) are scaling factors that make \(T\) orthogonal with respect to the integration weights such that \(T^TWT=I\) and are defined as:

\[\begin{split}f_j= \begin{cases} \sqrt{(2(j+1)-1)/2} & 0\leq j\leq N-1 \\ \sqrt{(N-1)/2} & j=N \end{cases}\end{split}\]

Based on the previous statements, it is straight forward to realize that a physical signal \(u\) has an equivalent representation in the Legendre space \(\hat{u}=T^{-1}u\), the main difference however, is that not all entries in \(\hat{u}\) have the same relevance in the physical representation of the signal. Therefore, discarding or approximating some of the Legendre coefficients will provide fairly accurate physical reconstruction of the signal as long as some key elements are maintained. It is precisely this characteristic that makes this sort of approach a relevant case study for compression algorithms.

Down-sampling¶

In signal processing, often times the high frequencies can not be perceived by humans or are not relevant for the signal representation in physical space. Turbulence is not an exception to this characteristic, as it is a process with high dissipation primarily at the smallest eddies, i.e, the higher frequencies. For this reason a natural choice of information to discard is is precisely that contained in higher frequencies.

As mentioned before, in Legendre space, the information about the signal is contained mainly in the coefficients \(\hat{u}\). The process of down-sampling consist of filtering a number of the coefficients that represent the importance of the high order Legendre polynomials (the higher frequencies) and transforming back such that an operation as the following is performed:

\[\tilde{u}=T^{-1}FTu,\]

Where \(F\) is a filtering operator that consist of a “broken identity matrix”. The previous equation sumarizes 3 main steps:

Transform the signal into the Legendre space: \(\hat{u}=Tu\).
Filter the signal: \(\tilde{\hat{u}}=F\hat{u}\).
Transform the down-sampled coefficients back to physical space: \(\tilde{u}=T^{-1}\tilde{\hat{u}}\).

It is important to note 2 main points: first, that \(\tilde{u} \neq u\) unless \(F=I\), so down-sampling undoubtedly brings with it some errors and second, that the down-sampled coefficients contain less information than the original ones since only \(k\) coefficients are kept, while the rest are set to 0 trhough the filtering matrix \(F\). The content of \(\tilde{\hat{u}}\) will be an array with entries arrayed in a similar fashion as shown follow:

\[\begin{split}\tilde{\hat{u}}= \begin{bmatrix} \hat{u}_0 \\ \hat{u}_1 \\ ...\\ \hat{u}_{k-1} \\ \hat{u}_k \\ 0 \\ .. \\ 0 \\ \end{bmatrix}.\end{split}\]

It is possible to see that since \(\tilde{\hat{u}}\) has many zeros in it, it is much easier to represent it in memory than the reconstructed signal \(\tilde{u}\) (has less variance, i.e., less entropy) and through lossless compression algorithms that include run-length encoding or Huffman encoding, the array in disk will only occupy the size that is needed for the \(k\) entries kept while the extra zeros will represent only a minor storage overhead.

Discreet Legendre truncation¶

[Otero18] recognized the use of down-sampling but extended it for turbulence in spectral element methods. They realized that although higher frequencies should have the lower energies (lower magnitude coefficients), numerically in the Legendre space, this is not the case, particularly when considering 3D simulations with the data structures used in spectral element solvers.

Therefore they introduced extra steps to the algorithm that ensure that it is not the higher frequencies that are filtered out, instead it is the frequencies that posses the lower energies. The principles are still the same as in down-sampling but instead of immediately filtering out the coefficients, first a sorting operation is performed. The algorithm is the following:

Transform the signal into the Legendre space: \(\hat{u}=Tu\).
Sort the coefficients \(\hat{u}^{s}=sort(\hat{u})\)
Filter the sorted signal: \(\tilde{\hat{u}}^{s}=F\hat{u}^{s}\).
Unsort the signal: \(\tilde{\hat{u}}=unsort(\tilde{\hat{u}}^{s})\).
Transform the truncated coefficients back to physical space: \(\tilde{u}=T^{-1}\tilde{\hat{u}}\).

For this case, the array \(\tilde{\hat{u}}\) will not possess all the zeros grouped at the end as would be the case in down-sampling, however with the correct encoding technique, the storage overhead is negligible. This method has the advantage that if the same number \(k\) of frequencies are kept, lower errors will be obtained due to the fact that in this method the most influential frequencies are stored.

Additionally to this procedure, [Otero18] also introduced a way to calculate the reconstruction error at run-time without the need to transforms back to physical space by taking advantage of: 1. how the L2 norm of a physical variable is calculated and 2. the Legendre quadrature for integration, as follows:

\[\lVert u\rVert_{2}=\sqrt{\int_{\Omega}u^2 d\Omega}=\sqrt{u^TWu}=\sqrt{(T\hat{u})^TW(T\hat{u})}=\sqrt{\hat{u}^TW\hat{u}}.\]

This provided an efficient way to control error and compress only up to a user predefined threshold \(\epsilon\) of the error.

The DLT method was initially proposed in a theoretical manner but has been since extended by to include run-time encoding and produce a fully functioning lossy compression algorithm that in fact matches well with the theoretical expectations one would have of such a lossy compression.

In more recent develoments, we have further extended the code to also control the Peak Signal To Noise ratio:

\[PSNR=20 \log{\frac{max(u)-min(u)}{\lVert u\rVert_{2}}}\]

Controling this error as well, allows the procedure to verify in a more localized manner where more compression is allowed, in such a way that in regions with high fluctuations, where the spectrum is flatter, less error is allowed to mantain fidelity over compression.

Lossless data encoding¶

After the entropy of the data set has been reduced by a lossy truncation step. It can be losslessly compressed without introducing new errors by a multitude of schemes that serve this purpose.

For this step, we use the library ADIOS2 [Godoy20] in order to have a data management framework that would allow us to transfer data in an efficient way, apart from seamlessly integrating many lossless compression algorithms, including bzip2, which we decided to use. In principle, any other form of lossless compression would serve the same purpose.

The input for this step is a velocity or pressure array that has been truncated, i.e., its entropy has been reduced. Then the bzip algorithm applies several layers of compression techniques in the following order:

Run-length encoding. Where portions of data are represented as a single data value and a count instead of the original, for example representing a string of “5555222” as “4532”. This is particularly useful for the truncated velocity fields as many entries has been set to zero.
Burrows–Wheeler transform. Which is a reversible transformation that rearranges the entries of a character string into runs of repeated characters. For example, a string "*BANANA|" will be rearranged as “BNN*AA|A”, where more identical characters can be seen next to each other.
Move-to-front transform. The main idea of the algorithm is that each symbol is replaced by its index in a list of most used symbols, if a symbol appears in sequence it would be replaced by zeros while if it does not appear frequently is replaced by a large number. Applying the algorithm to the string “bananaaa” having as a list of symbols the letters in the alphabet, will produce the following sequence: “1,1,13,1,1,1,0,0” .
Run-length encoding. once again.
Huffman coding. The algorithm produces a code table for encoding the source symbol. The table is obtained by calculating the probability (or frequency of occurrence) for each value in the source to occur. Symbols with high probability will be represented with codes that use less bits that those used for less common symbols.