MaxEnt applies even if you don't start with a uniform prior

E.T. Jaynes has a very nice derivation in his book “Probability Theory: The Logic of Science” (Chapter 9), which allows one to answer the following question:

After rolling a die many times, it transpires that the average roll is \(4\). What probability distribution should one assign to such a die?

It turns out that if you start from a state of ignorance, i.e. all rolls are independent and each number is equally likely, you ought to maximise the information entropy, \(-\sum_j p_j \log p_j\), subject to the following constraints:

\[\sum_{j=1}^{6} p_j = 1, \qquad \sum_{j=1}^{6} j \, p_j = 4,\]

to arrive at the answer: approximately \((0.10, 0.12, 0.15, 0.17, 0.21, 0.25)\).
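To make the calculation concrete, here is a minimal sketch of how one might solve this numerically. It assumes the standard Lagrange-multiplier form of the solution, \(p_j \propto e^{\lambda j}\), and finds \(\lambda\) by bisection; the function name `maxent_die` and the bracketing interval are my own choices for illustration, not something taken from Jaynes.

```python
import math

def maxent_die(faces, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Max-entropy distribution over `faces` with a prescribed mean.

    Assumes the solution has the form p_j proportional to exp(lam * j)
    and finds lam by bisection (the mean is increasing in lam).
    """
    def dist(lam):
        # Shift exponents by their maximum to avoid overflow.
        shift = max(lam * f for f in faces)
        weights = [math.exp(lam * f - shift) for f in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(f * p for f, p in zip(faces, dist(lam)))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

probs = maxent_die([1, 2, 3, 4, 5, 6], 4.0)
print([round(p, 3) for p in probs])
```

The printed probabilities increase geometrically from face 1 to face 6, which is what one would expect when the observed average sits above the fair-die value of 3.5.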

It is not immediately clear that this principle still applies when you do not start from ignorance. However, in some situations it is easy to incorporate your prior distribution into the MaxEnt algorithm.

For example, start with a 3-sided die with a prior \((2/6, 1/6, 3/6)\). This is the same as starting with a fair 6-sided die, i.e. a prior \((1/6, 1/6, 1/6, 1/6, 1/6, 1/6)\), where rolls \(1, 2\) map to \(1\), roll \(3\) maps to \(2\), and rolls \(4, 5, 6\) map to \(3\).
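The expansion trick works for any prior with rational probabilities. A small sketch of it, assuming Python 3.9+ (for `math.lcm`) and a hypothetical helper name `expand_prior` of my own:

```python
from fractions import Fraction
from math import lcm

def expand_prior(prior):
    """Map a rational prior over outcomes 1..n to the faces of a fair die.

    Returns a list whose i-th entry is the outcome shown by face i.
    The number of faces is the least common denominator of the prior.
    """
    denom = lcm(*(p.denominator for p in prior))
    faces = []
    for outcome, p in enumerate(prior, start=1):
        # An outcome with probability p gets p * denom of the equally likely faces.
        faces.extend([outcome] * int(p * denom))
    return faces

prior = [Fraction(2, 6), Fraction(1, 6), Fraction(3, 6)]
print(expand_prior(prior))  # [1, 1, 2, 3, 3, 3]
```

An irrational prior cannot be expanded this way, which is one reason the trick only covers some situations.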

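So if you start with a prior \((2/6, 1/6, 3/6)\) and then learn that the true average is \(1.5\), you can apply the MaxEnt algorithm to the six face probabilities \(f_1, \dots, f_6\) under the constraints

\[\sum_{i=1}^{6} f_i = 1\]

and

\[1 \cdot (f_1 + f_2) + 2 \cdot f_3 + 3 \cdot (f_4 + f_5 + f_6) = 1.5,\]

and it will give you the posterior distribution over the six faces, which you can collect back to \((p_1, p_2, p_3) = (f_1 + f_2, f_3, f_4 + f_5 + f_6)\). So the answer is approximately \((0.68, 0.14, 0.18)\).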
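The 3-sided example can be checked with a short sketch along the following lines. It assumes the usual exponential form of the MaxEnt solution, \(f_i \propto e^{\lambda v_i}\), where \(v_i\) is the value that face \(i\) maps to, and solves for \(\lambda\) by bisection; the helper name `maxent_fixed_mean` is hypothetical.

```python
import math

def maxent_fixed_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Max-entropy distribution over faces, where face i shows values[i].

    Assumes f_i proportional to exp(lam * values[i]); lam is found by
    bisection so that the mean of the shown values hits target_mean.
    """
    def dist(lam):
        shift = max(lam * v for v in values)  # guard against overflow
        weights = [math.exp(lam * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(v * f for v, f in zip(values, dist(lam)))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

# Faces 1, 2 show 1; face 3 shows 2; faces 4, 5, 6 show 3.
values = [1, 1, 2, 3, 3, 3]
f = maxent_fixed_mean(values, 1.5)

# Collect the six face probabilities back into the three outcomes.
p = [sum(fi for fi, v in zip(f, values) if v == k) for k in (1, 2, 3)]
print([round(x, 2) for x in p])  # [0.68, 0.14, 0.18]
```

Since faces that show the same value receive the same probability, collecting them amounts to weighting each outcome by its prior: \(p_k \propto q_k e^{\lambda k}\) with \(q = (2/6, 1/6, 3/6)\).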

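This gave me a better understanding of why Shannon’s theorem¹ stipulates the following desired property of the entropy \(H\), the composition (grouping) rule, shown here in Shannon’s own example:

\[H\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right) = H\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2} H\left(\tfrac{2}{3}, \tfrac{1}{3}\right)\]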

Edit: read more about why the above property is desired here.
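Assuming the property in question is Shannon's composition (grouping) rule, it can be checked numerically on the die from this post: the entropy of the fair 6-sided die should equal the entropy of the grouped 3-sided prior plus the prior-weighted entropies within each group. This is only a sketch, and the function name `entropy` is my own.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

fair_die = [1 / 6] * 6
groups = [2 / 6, 1 / 6, 3 / 6]  # the 3-sided prior from the text

# Entropy of the faces inside each group, given that the group occurred.
within = [entropy([1 / 2] * 2), entropy([1.0]), entropy([1 / 3] * 3)]

lhs = entropy(fair_die)
rhs = entropy(groups) + sum(w * h for w, h in zip(groups, within))
print(math.isclose(lhs, rhs))  # True
```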

  1. Shannon’s theorem states the desired criteria for a to-be-constructed entropy function and then shows that \(-\sum_i p_i \log p_i\) is, up to a positive constant factor, the only function that satisfies them.