6 Further modifications to make short lab algorithms practical
6.1 Encode length
In this modification, an Elias Gamma coding algorithm was added to the coding algorithm to indicate input string length in contextual_arithm.py
. This removes the dependency of the original algorithm on source string length. The implementation of the arithmetic decoding algorithm requires a parameter indicating the length of the source string to work as expected. This is because the encoder’s precision is based on inwardly rounded integer values and therefore the intervals will be larger than ideal and there will be extra bits in the output. The input length can be used for arithmetic decoder to decide where to terminate.
An Elias Gamma encoder works as the following: for any integer, \(z \geq 1\)
- Let \(N = \lfloor \log2z \rfloor\), encode a binary string with N zeros followed by a 1.
- convert \(z - 2^N\) to binary with length \(N\)
- Append string in 2 to the previous binary string in 1. This Elias Gamma encoder can then be combined with the source string code.
For decoding,
- Count the number of zeros at the beginning of the string will give us \(N\)
- Convert the next \(N+1\) bits from binary number to the original base, which gives \(z\). This works because we consider the extra 1 in the encoding step as a part of the binary number, which is equivalent to add \(2^N\) to the binary number in step 2 of the encoding process. The source string code is after \(2N+1 digits\).
Alternatively, it is also possible to use a termination code with an assigned probability in the probability dictionary to indicate where the file ends.
6.2 Distribution
The prior distribution in the implemented adaptive encoder is an uniform distribution. However, it may be possible to obtain distribution closer to the average distribution of the English text and this may have a better performance than the uniform distribution prior on the first few letters [8].
Besides, the above algorithms all assume that the source distribution fits every local distribution of the file, but this may not always be the case. In the prediction by partial matching algorithm, an escape key will be inside the prior probability model and will be sent if the recent observed probability distribution is not similar to the estimated probabiltiy distribution for all source string processed.
Furthermore, when implementing the adaptive encoder, we assumed the files will only have 128 ASCII symbols, but the assumption that encoder knows the alphabet of the source may not always hold. It is possible to have a literal key in the prior probability model and the encoder can send literal code followed by the unicode key of the source string to inform the decoder to add this unicode to its probability dictionary.