15-08-2013, 07:26 PM
(This post was last modified: 15-08-2013, 07:50 PM by Albert Rossi.)
This is all interesting, and I don't want to throw a pedantic wrench into the discussion, but for the sake of accuracy, I believe the phrase quoted above is slightly misleading.
What is inversely related to probability in information theory is what is called the "surprisal" or surprise value of an event. In the formula developed by Shannon, this is given by -log(p), where p is the probability of the outcome.
The information content of a probability space (random variable) is actually the probabilistically weighted sum of all the surprisal values. Because this formula resembles Boltzmann's formula for thermodynamic entropy, Shannon called the quantity entropy as well. It turns out, from that formula, that maximum entropy, meaning the greatest uncertainty, and hence the system which carries the greatest information content (or which requires the greatest amount of information to resolve that uncertainty; the concepts are analogous), is the one in which all outcomes are equally probable: that is, the system is totally random.
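If it helps, here is a minimal Python sketch of those two definitions (my own illustration, not anything taken from Shannon's paper): the surprisal of a single outcome, and entropy as the probability-weighted sum of the surprisals. A uniform distribution maximizes it.
[code]
import math

def surprisal(p):
    """Surprise value of one outcome with probability p, in bits: -log2(p)."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy: the probability-weighted sum of the surprisals."""
    return sum(p * surprisal(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin                 -> 1.0 bit
print(entropy([0.25] * 4))   # 4 equally likely outcomes -> 2.0 bits
print(entropy([0.9, 0.1]))   # biased coin               -> ~0.469 bits (less uncertain)
[/code]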
This is directly related to the representation of such a sequence of events. Shannon used "bit strings" (hence the choice of log2). The maximum length of that string is equal to the maximum entropy achievable for that set of events. For instance, if you wanted to represent a single coin toss, the most you would need, if the coin is fair, is (a string of) 1 bit (e.g., on = heads, off = tails); entropy in this case = 1. If the coin is not fair (heads more probable and tails less, or vice versa), the outcome becomes more predictable, and its entropy is less than 1. In the limit, entropy = 0, i.e., total predictability.
See the illustration of this from http://en.wikipedia.org/wiki/Entropy_(in...on_theory) (good discussion there):
[Attached image: plot of coin-flip entropy H(X) versus Pr(X=1)]
"Entropy H(X) (i.e. the expected surprisal) of a coin flip, measured in bits, graphed versus the fairness of the coin Pr(X=1), where X=1 represents a result of heads. Note that the maximum of the graph depends on the distribution. Here, at most 1 bit is required to communicate the outcome of a fair coin flip (2 possible values), but the result of a fair die (6 possible values) would require at most log[SUB]2[/SUB]6 bits."
In a maximal-entropy string, every bit is necessary. If the entropy/information were not maximal, you could find an encoding that eliminates bits; compression algorithms are based on this concept.
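As a quick demonstration of that point (my own example, using Python's standard zlib module): a highly repetitive, low-entropy string compresses down to almost nothing, while near-maximal-entropy random bytes barely compress at all.
[code]
import os
import zlib

predictable = b"heads" * 2000               # very repetitive, low entropy (10,000 bytes)
random_data = os.urandom(len(predictable))  # near-maximal entropy, same length

print(len(predictable), len(zlib.compress(predictable)))  # 10000 -> a few dozen bytes
print(len(random_data), len(zlib.compress(random_data)))  # 10000 -> ~10000 bytes (no gain)
[/code]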
So, in terms of information, it's not really rarity that counts; it's randomness.

