Artiom Fiodorov (Tom)

October, 2015

- A policeman spots an armed person near the scene of a robbery. He quickly concludes that the man is guilty. By what reasoning process?

- The policeman's conclusion could not have been a logical deduction
- Deductive reasoning consists of the following strong syllogisms:
- if \(A\) is true, then \(B\) is true \begin{align*} \frac{\text{$A$ is true}}{\text{therefore, $B$ is true}} \end{align*}
- \begin{align*} \frac{\text{$B$ is false}}{\text{therefore, $A$ is false}} \end{align*}

The reasoning of our policeman consists of the following weak syllogisms:

- if \(A\) is true, then \(B\) is true \begin{align*} \frac{\text{$A$ is false}}{\text{therefore, $B$ becomes less plausible}} \end{align*}
- \begin{align*} \frac{\text{$B$ is true}}{\text{therefore, $A$ becomes more plausible}} \end{align*}

Strong syllogisms can be chained together without any loss of certainty

Weak syllogisms have wider applicability

Most of the reasoning people do consists of weak syllogisms.

- Question: can we quantify this everyday reasoning?

Answer: yes, using probability theory.
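As a taste of the quantification, a minimal sketch (with made-up numbers, not from the talk) of the policeman's weak syllogism in Bayes' form:

```python
# Toy Bayes computation showing "B is true, therefore A becomes more
# plausible" quantitatively.  A = "the man is guilty",
# B = "the man is armed and at the scene"; assume A implies B.
p_A = 0.001             # prior plausibility of A (made-up number)
p_B_given_A = 1.0       # A implies B, the major premise of the syllogism
p_B_given_notA = 0.01   # B is rare among the innocent (made-up number)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B

print(p_A, p_A_given_B)  # observing B raises the plausibility of A
```

With these numbers the plausibility of guilt rises from 0.001 to roughly 0.09: more plausible, but far from certain, exactly as the weak syllogism says.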

- Cox's theorem (1946) states that any system for plausibility reasoning that satisfies certain commonsense requirements is isomorphic to probability theory.

- Cox's theorem relies on a few assumptions. Whilst those assumptions do not necessarily describe how humans actually think, they may well be something that rational people would want to adopt into their way of thinking.

- Goal 1: show that probability theory can be regarded as an extension of logic
- Goal 2: relate this notion of probability to other approaches to interpreting probability
- Goal 3: show applications of probability theory used as logic

- Proposition: an unambiguous statement that is either true or false
- Compound proposition: constructed from other propositions using *negation* (\(\neg\)), *and* (\(\wedge\)), *or* (\(\vee\)), *implies* (\(\implies\)), or *equivalence* (\(\iff\))
- Atomic proposition: one that cannot be decomposed further

A state of information \(X\) summarises the information we have about some set of atomic propositions \(A\), called the basis of \(X\), and their relationships to each other. The domain of \(X\) is the logical closure of \(A\).

- Write \((A | X)\) for the plausibility we assign to \(A\) given the information in \(X\).
- \(A, X\) is the state of information obtained from \(X\) by adding the additional information that \(A\) is true.

**(R1)** \((A | X)\) is a single real number. There exists a single real number \(T\) such that \((A | X) \leq T\) for every \(X\) and \(A\).

- Higher numbers represent higher degrees of plausibility. \(+\infty\) can be avoided by composing with \(f(x) = \arctan(x)\)

\(X\) is *consistent* if there is no proposition \(A\) such that \((A | X) = T\) and \((\neg A | X) = T\).

**(R2)** Plausibility assignments are compatible with propositional calculus:

- if \(A\) is equivalent to \(A'\) then \((A | X) = (A' | X)\)
- if \(A\) is a tautology then \((A | X) = T\)
- \((A | B, C, X) = (A | (B \wedge C), X)\)
- if \(X\) is consistent and \((\neg A | X) < T\), then \(A, X\) is also consistent

**(R3)** There exists a nonincreasing function \(S_0\) such that \((\neg A | X) = S_0(A | X)\) for all \(A\) and consistent \(X\).

- If \((A | X)\) and \((\neg A | X)\) could vary independently of each other, we would have a two-dimensional theory: two numbers would be needed to characterise our uncertainty about \(A\)

Proposition: let \(F := S_0(T)\). Then \(F \leq (A | X) \leq T\) for all \(A\) and consistent \(X\).

**(R4)** (Universality) There exists a nonempty set of real numbers \(P_0\) with the following two properties:

- \(P_0\) is a dense subset of \((F, T)\)
- For every \(y_1, y_2, y_3 \in P_0\) there exists some consistent \(X\) with a basis of at least three atomic propositions \((A_1, A_2, A_3)\), such that \((A_1 | X) = y_1\), \((A_2 | A_1, X) = y_2\) and \((A_3 | A_1, A_2, X) = y_3\).

**(R5)** (Conjunction) There exists a continuous function \(C : [F, T]^2 \to [F, T]\) such that \((A \wedge B | X) = C((A | B, X), (B | X))\) for any \(A, B\) and consistent \(X\).

- The definition of \(C\) could be narrowed down to just 4 possibilities consistent with the previous axioms

Heuristic: for \(A \wedge B\) to be true, \(B\) has to be true, so \((B | X)\) is needed. If \(B\) is false then \(A \wedge B\) is false regardless of \(A\), so \((A | X)\) is not needed once \((A | B, X)\) and \((B | X)\) are known.

Theorem: there exists a continuous, strictly increasing function \(p\) such that, for every \(A\), \(B\) and consistent \(X\),

- \(p(A | X) = 0\) iff \(A\) is known to be false given the information in \(X\).
- \(p(A | X) = 1\) iff \(A\) is known to be true given the information in \(X\).
- \(0 \leq p(A | X) \leq 1\).
- \(p(A\wedge B | X) = p(A | X)p(B | A, X)\).
- \(p(\neg A | X) = 1 - p(A | X)\) if \(X\) is consistent.
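These derived rules can be sanity-checked on a toy model (my own numbers, not from the talk): a joint distribution over two atomic propositions.

```python
# Toy joint distribution over two atomic propositions A and B
# (hypothetical weights summing to 1).
joint = {(True, True): 0.2, (True, False): 0.1,
         (False, True): 0.4, (False, False): 0.3}

def p(pred):
    """Plausibility of a compound proposition under the toy distribution."""
    return sum(w for (a, b), w in joint.items() if pred(a, b))

p_A = p(lambda a, b: a)
p_AB = p(lambda a, b: a and b)
p_B_given_A = p_AB / p_A        # conditioning on A

# Product rule: p(A and B | X) = p(A | X) p(B | A, X)
print(p_AB, p_A * p_B_given_A)
# Negation rule: p(not A | X) = 1 - p(A | X)
print(p(lambda a, b: not a), 1 - p_A)
```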

Kolmogorov's probability is defined on a sample space \(\Omega\) with an event space \(F\) which forms a \(\sigma\)-algebra:

- \(P(E) \in \mathbb{R}, P(E) \geq 0 \quad \forall E \in F\)
- \(P(\Omega) = 1\)
- \(P \left( \bigcup_{i = 1}^{\infty} E_i \right) = \sum_{i = 1}^{\infty} P(E_i)\) for any countable sequence of pairwise disjoint events

Example: coin toss. Then \(\Omega = \{H, T\}\), \(F = \{ \emptyset, \{H\}, \{T\}, \{H, T\} \}\). For a fair coin the measure \(P\) is just \(P(\{H\}) = P(\{T\}) = 1/2\), \(P(\emptyset) = 0\), \(P(\{H, T\}) = 1\).

- The closure of propositions under (AND, NOT) is remarkably similar to the definition of a \(\sigma\)-algebra. However, care must be taken with countably infinite applications of AND; these can be dealt with by taking a well-behaved limit.

- Not all statements can be meaningfully decomposed into a sum of disjoint primitive events, for example "it will rain tomorrow".

- The principle of assigning probabilities by logical analysis of incomplete information is not present at all in the Kolmogorov system.

To relate frequencies to plausibilities we will use the MaxEnt principle

- Imagine rolling a die some large number of times and observing that the average of all rolls is \(4\). What probability distribution should one assign to such a die?
- Assigning \(1/6\) to each outcome is ruled out, as the average would then be \(3.5\).

Solution: out of all probability distributions that average to \(4\), pick the one which maximises the *information entropy* \(\sum_{j = 1}^{6} -p_j \log p_j\).

- The answer is: \(0.103, 0.123, 0.146, 0.174, 0.207, 0.247\).
- But why is this the right thing to do?
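The maximisation itself is a one-constraint Lagrange-multiplier problem; a stdlib-only sketch (the exponential form \(p_j \propto e^{\lambda j}\) of the solution is standard):

```python
import math

def dist(lam):
    """Maximum-entropy form p_j ∝ exp(lam * j) over outcomes j = 1..6."""
    w = [math.exp(lam * j) for j in range(1, 7)]
    Z = sum(w)
    return [x / Z for x in w]

def mean(p):
    return sum(j * pj for j, pj in zip(range(1, 7), p))

# Bisect on the Lagrange multiplier until the mean-4 constraint is met;
# the mean is strictly increasing in lam.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(dist(mid)) < 4.0:
        lo = mid
    else:
        hi = mid

p = dist((lo + hi) / 2)
print([f"{x:.3f}" for x in p])  # approximately 0.103, 0.123, 0.146, 0.174, 0.207, 0.247
```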

Start with \(n\) independent trials, each with the same \(m\) possible outcomes.

Each sample is then just a string of length \(n\):

\[ 1, m-1, m-2, 5, 6, 7, 3, 1, 2, 3, 5, 7\]

So the sample space is \(S^n = \{1, 2, \dots, m\}^n\), with \(|S^n| = m^n\).

Start with "ignorance knowledge" \(I_0\) i.e.

\[ P(A | I_0) = \frac{M(n, A)}{|S^n|} \,,\]

where the multiplicity of \(A\), \(M(n, A)\), is just the number of distinct strings in \(S^n\) such that \(A\) is true.

Denote by \(n_1, n_2, \dots, n_m\) the number of times the trial came up with result \(1, 2, \dots, m\) respectively.

If the string is \(1, 2, 1, 2, 2, 2, 3\), then \((n_1, n_2, n_3) = (2, 4, 1)\).

Suppose \(A\) depends only on the counts, and let \(R\) be the set of tuples \((n_1, n_2, \dots, n_m)\) for which \(A\) is true. Then

\[ M(n, A) = \sum_{(n_1, \dots, n_m) \in R} \frac{n!}{n_1!n_2!\dots n_m!} \]

Example: rolling a die \(20\) times, with \(A\) = "the average roll is 4". Pick all \((n_1, n_2, n_3, n_4, n_5, n_6)\) with \(\sum_j n_j = 20\) and \(\sum_j j \, n_j = 4 \cdot 20 = 80\); each such tuple contributes a multinomial coefficient to the multiplicity of \(A\):

\begin{align*} M(20, A) = \frac{20!}{1!1!1!12!4!1!} + \frac{20!}{1!1!1!13!2!2!} + \frac{20!}{1!1!2!10!5!1!} + \cdots \end{align*}

There are \(283\) terms in the summation.
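The count and the multiplicity in this die example can be checked by brute force; a sketch using only the standard library:

```python
from math import comb, factorial

def compositions(total, parts):
    """All tuples of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def multinomial(ns):
    out = factorial(sum(ns))
    for k in ns:
        out //= factorial(k)
    return out

# Count tuples (n_1, ..., n_6) with sum 20 and average roll exactly 4.
terms = [c for c in compositions(20, 6)
         if sum((i + 1) * c[i] for i in range(6)) == 80]

M = sum(multinomial(c) for c in terms)      # multiplicity M(20, A)
W_max = max(multinomial(c) for c in terms)  # largest single term

print(len(terms))                           # the slide reports 283 terms
print(W_max <= M <= W_max * comb(25, 5))    # prints True: the sandwich bound
```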

Let \(W_{\max} = \max_{R} \frac{n!}{n_1! n_2! \dots n_m!}\). Then

\[ W_{\max} \leq M(n, A) \leq W_{\max} \cdot \frac{(n + m - 1)!}{n! (m - 1)!} \]

It can be seen that \(\text{\# of terms} \sim n^{m-1} / (m-1)!\), so

\[ \frac{1}{n} \log M(n, A) - \frac{1}{n} \log W_{\max} \to 0 \text{ as } n \to \infty\]

Introduce the frequency distribution \(f_j = n_j / n\). If the \(f_j\) tend to constants as \(n \to \infty\), Stirling's approximation gives

\[ \frac{1}{n} \log M(n, A) \to H := - \sum_{j = 1}^{m} f_j \log f_j \,.\]

So the multiplicity can be found by determining the frequency distribution \(\{ f_j \}\) which maximises entropy subject to \(R\).
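A numeric check of this limit on a toy constraint of my own (not the talk's): three outcomes \(\{1, 2, 3\}\) with average exactly \(2\), where the maximising frequency distribution is uniform and \(H = \log 3\).

```python
import math

def log_multiplicity(n):
    """log M(n, A) for m = 3 outcomes under the toy constraint n_1 = n_3,
    i.e. the observed average of the values {1, 2, 3} is exactly 2."""
    M = sum(math.factorial(n) // (math.factorial(k) ** 2 * math.factorial(n - 2 * k))
            for k in range(n // 2 + 1))
    return math.log(M)

# The entropy-maximising frequency distribution here is uniform, H = log 3.
H = math.log(3)
for n in (30, 300):
    print(n, log_multiplicity(n) / n, H)  # the gap shrinks as n grows
```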

We can further show that for a linear constraint \(A\) of the form \(\sum_{j=1}^{m} g_j n_j = \text{const}\),

\[ P(\text{trial}_i = j | A, n, I_0) = \frac{M(n - 1, A - g_j)}{M(n, A)} \approx f_j \,. \]

(Trick) Set \(g_1 = \pi, g_2 = e\). If \(A(n_1, n_2) = 3 \pi + 5 e\) is true, then \((n_1, n_2) = (3, 5)\). It can be shown that

\[ P(\text{trial}_i = j | \{n_j\}, n, I_0) = \frac{n_j}{n} \]

- So even starting with the ignorant state of information \(I_0\), we nevertheless recover the standard results.
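The last identity can be verified exhaustively in a tiny toy case (my numbers, not the talk's):

```python
from itertools import product
from fractions import Fraction

# Toy case: n = 6 trials, m = 3 outcomes, observed counts
# n_1 = 2, n_2 = 3, n_3 = 1 (all numbers made up for illustration).
n, counts = 6, {1: 2, 2: 3, 3: 1}

# All strings in S^n consistent with the observed counts.
strings = [s for s in product([1, 2, 3], repeat=n)
           if all(s.count(j) == c for j, c in counts.items())]

# Under I_0 every consistent string is equally plausible, so the
# probability that trial 1 gave outcome j is a straight count.
for j in (1, 2, 3):
    lhs = Fraction(sum(s[0] == j for s in strings), len(strings))
    print(j, lhs, Fraction(counts[j], n))  # the two columns agree: n_j / n
```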

- Enumerate the energy states of the two systems, \(\{E_i^A\}\) and \(\{E_j^B\}\), into a single interleaved list: odd indices for \(A\), even indices for \(B\)

- No heat exchange: a) \(\sum_{i \text{ odd}} P_i E_i = \bar{E}_A\), b) \(\sum_{i \text{ even}} P_{i} E_i = \bar{E}_B\), and c) \(\sum_{i \text{ odd}} P_i = \sum_{i \text{ even}} P_i = 1/2\).

- With heat exchange, a single restriction: d) \(\sum_i P_{i} E_i = \bar{E}_A + \bar{E}_B\).

- Restrictions a) & b) & c) \(\Rightarrow\) d), but d) \(\nRightarrow\) a) & b) & c). So the maximum entropy is bigger under d) alone.
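A toy numerical illustration (all numbers mine, not from the talk): with two two-level subsystems, the entropy attainable under the single constraint d) exceeds that under a), b), c).

```python
import math

# Toy system: subsystem A has levels {0, 1} with mean energy 0.4,
# subsystem B has levels {0, 2} with mean energy 0.6 (made-up numbers).
E = [0.0, 1.0, 0.0, 2.0]   # levels of A, then levels of B
E_total = 0.4 + 0.6

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

# No heat exchange: constraints a), b), c) pin the distribution down completely.
p_abc = [0.1, 0.4, 0.2, 0.3]

# With heat exchange: maximise entropy under d) alone.  The solution has the
# exponential form p_i ∝ exp(lam * E_i); solve for lam by bisection
# (the mean energy is strictly increasing in lam).
def maxent(lam):
    w = [math.exp(lam * e) for e in E]
    Z = sum(w)
    return [x / Z for x in w]

lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2
    if sum(p * e for p, e in zip(maxent(mid), E)) < E_total:
        lo = mid
    else:
        hi = mid
p_d = maxent((lo + hi) / 2)

print(entropy(p_abc), entropy(p_d))  # the weaker constraint d) allows higher entropy
```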

Let \(X_1, X_2, \dots, X_n\) be independent, identically distributed random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then

\[ \frac{X_1 + X_2 + \dots + X_n - n \mu}{\sigma \sqrt{n}} \to \mathcal{N}(0, 1)\]

Why convergence? Why Gaussian?

It turns out that convolution is "forgetful": forming \(X + Y\) washes out the details of the individual distributions, while means and variances simply add. And the Gaussian is the MaxEnt distribution with prescribed mean and variance.
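A quick simulation sketch of the theorem (uniform summands, standardised by \(\sigma\sqrt{n}\); parameters are my own):

```python
import math
import random
import statistics

random.seed(0)

# Standardised sums of i.i.d. Uniform(0, 1) variables (mu = 1/2, sigma^2 = 1/12).
n, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)
z = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * math.sqrt(n))
     for _ in range(trials)]

print(statistics.mean(z), statistics.stdev(z))  # close to 0 and 1
print(sum(abs(x) < 1 for x in z) / trials)      # close to 0.683, as for N(0, 1)
```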

Ask everyone in China about the height of the Emperor.

- Everyone's error is no more than \(\pm 1\) meter.
- There are \(10^9\) inhabitants.
- \(1/\sqrt{10^9} \approx 3 \times 10^{-5}\) accuracy by averaging everyone's guess?

What assumption of the CLT is broken?
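The broken assumption is independence: the individual errors share a common systematic component (nobody has actually measured the Emperor), and no amount of averaging removes it. A simulation sketch with made-up numbers:

```python
import random

random.seed(1)

# Scaled-down toy simulation.  Everyone's individual error is small, but
# all guesses lean on the same rumour, i.e. share a systematic bias.
N = 100_000
true_height = 1.70
shared_bias = 0.30
guesses = [true_height + shared_bias + random.uniform(-0.05, 0.05)
           for _ in range(N)]

estimate = sum(guesses) / N
print(estimate - true_height)  # stays near the shared bias, nowhere near 3e-5
```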

Logical independence: \(P(A|BC) = P(A|C)\), i.e. knowledge that \(B\) is true does not affect the probability we assign to \(A\).

Causal independence: no physical cause.

Note: neither implies the other.
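A standard illustration (not from the slides): drawing without replacement, the colour of the second draw changes the plausibility of the first, yet it cannot physically cause it — logical dependence without causal dependence.

```python
from itertools import permutations

# Urn with 3 red and 2 white balls, two draws without replacement.
# Balls 0-2 are red; all ordered pairs are equally likely.
red = {0, 1, 2}
draws = list(permutations(range(5), 2))

def p(pred):
    return sum(pred(d) for d in draws) / len(draws)

p_first_red = p(lambda d: d[0] in red)
p_first_red_given_second_red = (p(lambda d: d[0] in red and d[1] in red)
                                / p(lambda d: d[1] in red))

# Knowing the second draw shifts the plausibility of the first, even though
# the second draw happens later and cannot physically influence the first.
print(p_first_red, p_first_red_given_second_red)  # 0.6 vs 0.5
```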

- Bayesian methods unify statistical inference, probability theory, and information theory under one mathematical framework.

"Information theory must precede probability theory and not be based on it." -- Kolmogorov

- E.T. Jaynes derived procedures for multiple-hypothesis testing, parameter estimation, significance testing and many more directly from Cox's theorem.

- Such derivations often yield a new and deeper understanding of the statistical tools.

- Objection: scientists shouldn't feed their prejudices into the result.

- Counterargument 1: probability is subjectively objective. Two Bayesians starting with the same state of information must arrive at the same conclusion; otherwise one of them is violating one of Cox's axioms.

- Counterargument 2: Data can't speak for itself.

Consider two possible worlds:

- World 1: there are two million birds; \(100\) are crows, all black.
- World 2: there are two million birds; \(200{,}000\) are black crows and \(1{,}800{,}000\) are white crows.

Then observing a black crow is evidence against the hypothesis that all crows are black: a randomly encountered bird is far more likely to be a black crow in World 2.

The same data can be evidence for or against the same hypothesis, depending on the prior information.
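The crow example can be made quantitative with a Bayes factor (numbers from the slide):

```python
# Likelihood of observing "a randomly sampled bird is a black crow"
# under each world.
birds = 2_000_000
p_obs_world1 = 100 / birds        # World 1: only 100 crows, all black
p_obs_world2 = 200_000 / birds    # World 2: 200,000 black crows

# Bayes factor for World 1 ("all crows are black") against World 2.
bayes_factor = p_obs_world1 / p_obs_world2
print(bayes_factor)  # ≈ 0.0005: strong evidence against World 1
```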

- (Jaynes) Divergence of views: two people exposed to the same large body of data need not come to agree.

- (Jaynes) You can't prove yourself right in plausibility reasoning; you can only make predictions. If the predictions are correct, you learn nothing new!

Many of our applications lie outside the scope of conventional probability theory as currently taught. But we think that the results will speak for themselves, and that something like the theory expounded here will become the conventional probability theory of the future.

-- E.T. Jaynes. Probability: The Logic of Science

A scientist who has learned how to use probability theory directly as extended logic has a great advantage in power and versatility over one who has learned only a collection of unrelated ad hoc devices. As the complexity of our problems increases, so does this relative advantage. Therefore we think that, in the future, workers in all the quantitative sciences will be obliged, as a matter of practical necessity, to use probability theory in the manner expounded here.

-- E.T. Jaynes. Probability: The Logic of Science