Artiom Fiodorov (Tom)

October, 2015

- A policeman spots an armed person next to a robbery. He quickly concludes that the man is guilty. By what reasoning process?

- The policeman's conclusion could not have been a logical deduction
- Deductive reasoning consists of the following strong syllogisms:
- if \(A\) is true, then \(B\) is true \begin{align*} \frac{\text{$A$ is true}}{\text{therefore, $B$ is true}} \end{align*}
- \begin{align*} \frac{\text{$B$ is false}}{\text{therefore, $A$ is false}} \end{align*}

The reasoning of our policeman consists of the following weak syllogisms:

- if \(A\) is true, then \(B\) is true \begin{align*} \frac{\text{$A$ is false}}{\text{therefore, $B$ becomes less plausible}} \end{align*}
- \begin{align*} \frac{\text{$B$ is true}}{\text{therefore, $A$ becomes more plausible}} \end{align*}

Strong syllogisms can be chained together without any loss of certainty

Weak syllogisms have wider applicability

Most of the reasoning people do consists of weak syllogisms.

- Question: can we quantify this everyday reasoning?

Answer: yes, using probability theory.

"Probability theory is nothing but common sense reduced to calculation." -- Laplace, 1819

- Cox's theorem (1946) states that any system for plausibility reasoning that satisfies certain commonsense requirements is isomorphic to probability theory.

- Cox's theorem relies on a few assumptions. Whilst those assumptions do not necessarily describe how humans always think, they might well be something that rational people would want to adopt into their way of thinking.

- Goal 1: show that probability theory can be regarded as an extension of logic
- Goal 2: relate this notion of probability to other approaches to interpreting probability
- Goal 3: show applications of probability theory used as logic

- Degrees of plausibility are represented by real numbers.

- Correspondence with common sense:

- (2a) If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
- (2b) The robot always takes into account all of the evidence it has relevant to a question. It does not arbitrarily ignore some of the information, basing its conclusions only on what remains.
- (2c) The robot always represents equivalent states of knowledge by equivalent plausibility assignments. That is, if in two problems the robot's state of knowledge is the same (except perhaps for the labelling of the propositions), then it must assign the same plausibilities in both.

There exists a continuous, monotonically decreasing function \(S\) such that

\[ (\neg A|B) = S(A|B) \]

There exists a continuous function \(F\) such that

\[(A \wedge B|C) = F[(B|C), (A| B, C)]\]

Heuristic: for \(A \wedge B\) to be true, \(B\) has to be true, so \((B | C)\) is needed. If \(B\) is true, it remains to decide whether \(A\) is true in the environment where \(B\) holds, so \((A | B, C)\) is needed. If \(B\) is false, then \(A \wedge B\) is false regardless of \(A\), so \((A | C)\) is not needed once \((A | B, C)\) and \((B | C)\) are known.

There exists a continuous, strictly increasing function \(p\) such that, for every \(A, B\) and some background information \(X\),

- \(p(A | X) = 0\) iff \(A\) is known to be false given the information in \(X\).
- \(p(A | X) = 1\) iff \(A\) is known to be true given the information in \(X\).
- \(0 \leq p(A | X) \leq 1\).
- \(p(A\wedge B | X) = p(A | X)p(B | A, X)\).
- \(p(\neg A | X) = 1 - p(A | X)\).
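Bayes' rule follows directly from the product rule, and it quantifies the policeman's weak syllogism: observing \(B\) makes \(A\) more plausible whenever \(B\) is more likely when \(A\) is true than when it is false. A minimal numeric sketch, with entirely hypothetical plausibility values:

```python
# A: "the man is the robber"; B: "the man is armed, next to the robbery".
# The numbers below are made up purely for illustration.

def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Posterior p(A | B) obtained from the product and sum rules."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

prior = 0.01  # p(A | X): a low base rate of being the robber
posterior = bayes(prior, p_b_given_a=0.95, p_b_given_not_a=0.001)
print(round(posterior, 3))
```

With these made-up numbers, a single highly diagnostic observation lifts a 1% prior above 90%; making \(B\) less diagnostic (raising \(p(B \mid \neg A, X)\)) weakens the update accordingly, exactly as the weak syllogism suggests.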

The measure-theoretic (Kolmogorov) approach opts out of interpretation: probability is any measure with certain properties.

The principles for assigning probabilities by logical analysis of incomplete information are not present at all in the Kolmogorov system.

- Statistics:

- Bayesian: what we derived here. Probability = plausibility of a statement
- Frequentist: probability = long-run frequency of an event

Conduct \(n\) independent trials, where each trial has \(m\) outcomes.

Start with the ignorance prior \(I_0\): every outcome of every trial is equally likely.

It can be shown that

\[ P(\text{trial}_i = j | \{n_j\}, n, I_0) = \frac{n_j}{n} \]

where \(\frac{n_j}{n}\) is just the observed frequency of an outcome \(j\).
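This identity can be checked by brute force for small \(n\): under \(I_0\), every ordering consistent with the counts \(\{n_j\}\) is equally likely, so the fraction of orderings in which trial \(i\) shows outcome \(j\) is exactly \(n_j / n\). A small sketch with made-up counts:

```python
from itertools import permutations
from fractions import Fraction

counts = {1: 3, 2: 1, 3: 2}              # hypothetical counts n_j
outcomes = [j for j, n_j in counts.items() for _ in range(n_j)]
n = len(outcomes)                        # n = 6 trials

orderings = set(permutations(outcomes))  # all distinct orderings, equally likely
for i in range(n):                       # check every trial position i
    for j, n_j in counts.items():
        hits = sum(1 for seq in orderings if seq[i] == j)
        assert Fraction(hits, len(orderings)) == Fraction(n_j, n)
print(len(orderings))                    # 6!/(3! 1! 2!) = 60 orderings
```

The assertion holds at every position \(i\) by symmetry, which is the content of the predictive rule above.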

Throw a die \(n\) times. Average is \(4\). What is the probability distribution of such a die as \(n \to \infty\)?

Let us first calculate the following:

\[ P(\text{Average is } 4 | I_0) \]

\[ P(\text{Average is } 4 | I_0) = \text{Multiplicity}(\text{Average is } 4) / 6^n \]

Fix \(n = 20\).

\begin{align*} \text{Multiplicity}(\text{Average is } 4) = \frac{20!}{1!1!1!12!4!1!} &+ \frac{20!}{1!1!1!13!2!2!} \\ &+ \frac{20!}{1!1!2!10!5!1!} + \cdots \end{align*}

There are \(283\) terms in the summation.
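The terms of this sum can be enumerated programmatically: each corresponds to a tuple of counts \((n_1, \dots, n_6)\) with \(\sum_j n_j = 20\) and \(\sum_j j\, n_j = 80\). A sketch, written for clarity rather than speed:

```python
from math import factorial, prod

n, target = 20, 80  # 20 throws whose faces sum to 80 (average 4)

# All count vectors (n_1, ..., n_6) consistent with the constraints.
terms = [
    (n1, n2, n3, n4, n5, n6)
    for n1 in range(n + 1)
    for n2 in range(n + 1 - n1)
    for n3 in range(n + 1 - n1 - n2)
    for n4 in range(n + 1 - n1 - n2 - n3)
    for n5 in range(n + 1 - n1 - n2 - n3 - n4)
    for n6 in [n - n1 - n2 - n3 - n4 - n5]
    if n1 + 2*n2 + 3*n3 + 4*n4 + 5*n5 + 6*n6 == target
]

# Each term contributes a multinomial coefficient to the multiplicity.
multiplicity = sum(factorial(n) // prod(map(factorial, t)) for t in terms)
print(len(terms), multiplicity)
```

Dividing `multiplicity` by \(6^{20}\) then gives \(P(\text{Average is } 4 \mid I_0)\) for \(n = 20\).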

\[ \frac{1}{n} \log(\text{Multiplicity}) = \frac{1}{n} \log \max_{\substack{\sum_j n_j = n \\ \sum_j j \, n_j = 4n}} \frac{n!}{n_1! n_2! \dots n_6!} + o(1) \,. \]

Now take \(n \to \infty\) under \(n_j / n \to f_j\) to see that

\[ P(\text{Average is } 4 | I_0) \approx \frac{e^{n \sum_j - f_j \log f_j}}{6^n}, \]

for the \(f_j\)'s that maximise \(\sum_{j = 1}^{6} - f_j \log f_j\) subject to \(\sum_j f_j = 1\) and \(\sum_j j f_j = 4\).

Out of all probability distributions that average to \(4\), pick the one which maximises the *information entropy* \(\sum_{j = 1}^{6} -p_j \log p_j\).

The answer is: \(0.11, 0.12, 0.14, 0.17, 0.21, 0.25\).
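The maximising distribution can be computed numerically. Under a Lagrange-multiplier ansatz the solution takes the exponential form \(p_j \propto x^j\), leaving a one-dimensional problem of choosing \(x\) to match the mean, which bisection solves; this is a standard technique, sketched here under those assumptions rather than taken from the source:

```python
# Maximum-entropy distribution on faces {1, ..., 6} with a given mean,
# using the tilted-family ansatz p_j = x**j / Z and bisection on x.
def dice_maxent(mean, faces=6, tol=1e-12):
    def probs(x):
        w = [x**j for j in range(1, faces + 1)]
        z = sum(w)
        return [wj / z for wj in w]

    lo, hi = 1e-9, 1e9                # bracket: mean -> 1 as x -> 0, -> 6 as x -> inf
    while hi - lo > tol * max(1.0, lo):
        x = (lo * hi) ** 0.5          # geometric bisection over the bracket
        m = sum(j * pj for j, pj in enumerate(probs(x), start=1))
        if m < mean:
            lo = x
        else:
            hi = x
    return probs((lo * hi) ** 0.5)

p = dice_maxent(4.0)
print([round(pj, 3) for pj in p])
```

Since the required mean \(4\) exceeds the fair-die mean \(3.5\), the solver settles on \(x > 1\), tilting mass toward the high faces, so the resulting probabilities increase from face \(1\) to face \(6\).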

- Such an interpretation of probability unifies statistical inference, probability theory, and information theory under one mathematical framework.

"Information theory must precede probability theory and not be based on it." -- Kolmogorov

- E.T. Jaynes derived procedures for multiple hypothesis testing, parameter estimation, significance testing, and much more directly from Cox's theorem.

- Such derivations often yield a new and deeper understanding of the statistical tools.

"In the future, workers in all the quantitative sciences will be obliged, as a matter of practical necessity, to use probability theory in the manner expounded here."

-- E.T. Jaynes, Probability Theory: The Logic of Science