
**Mathematics**

The field of information theory studies the quantification of information in signals. In the context of machine learning, some of these concepts are used to characterize or compare probability distributions. The ability to quantify information is also used in the decision tree algorithm, to select the variables associated with the maximum information gain. The concepts of entropy and cross-entropy are also important in machine learning because they lead to a widely used loss function in classification tasks: the cross-entropy loss or log loss.

**Shannon Information**

**Intuition**

The first step to understanding information theory is to consider the concept of quantity of information associated with a random variable. In information theory, this quantity of information is denoted I and is called the Shannon information, information content, self-information, or surprisal. The main idea is that likely events convey less information than unlikely events (which are thus more surprising). For instance, if a friend from Los Angeles, California tells you: “It is sunny today”, this is less informative than if she tells you: “It is raining today”. For this reason, it can be helpful to think of the Shannon information as the amount of surprise associated with an outcome. You’ll also see in this section why it is also a quantity of information, and why likely events are associated with less information.


**Units Of Information**

Common units to quantify information are the nat and the bit. These units are based on logarithm functions. The word nat, short for “natural unit of information”, is based on the natural logarithm, while the bit, short for “binary digit”, is based on base-two logarithms. The bit is thus a rescaled version of the nat. The following sections will mainly use the bit and base-two logarithms in formulas, but replacing them with the natural logarithm would just change the unit from bits to nats.
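As a quick sketch of the bit/nat relationship (the example probability of 1/8 is illustrative, chosen because it gives a whole number of bits):

```python
import math

# Information of an event with probability 1/8, in both units.
p = 1 / 8

info_bits = -math.log2(p)  # base-two logarithm -> bits
info_nats = -math.log(p)   # natural logarithm  -> nats

# A nat is a rescaled bit: dividing nats by ln(2) recovers bits.
print(info_bits)                              # 3.0
print(round(info_nats / math.log(2), 10))     # 3.0
```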

Bits represent variables that can take two different states (0 or 1). For instance, 1 bit is needed to encode the outcome of a coin flip. If you flip two coins, you’ll need at least two bits to encode the result: for instance, 00 for HH, 01 for HT, 10 for TH, and 11 for TT. You could use other codes, such as 0 for HH, 100 for HT, 101 for TH, and 111 for TT. However, this code uses a larger number of bits on average (considering that the probability distribution of the four events is uniform, as you’ll see).
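You can check this average code length directly. Here is a small sketch (the variable names are illustrative), assuming the four outcomes are equally likely:

```python
# Average code length (in bits) of the two codes above, assuming the
# four outcomes HH, HT, TH, TT are equally likely (probability 1/4 each).
uniform_p = 1 / 4

code_a = {"HH": "00", "HT": "01", "TH": "10", "TT": "11"}
code_b = {"HH": "0", "HT": "100", "TH": "101", "TT": "111"}

avg_a = sum(uniform_p * len(word) for word in code_a.values())
avg_b = sum(uniform_p * len(word) for word in code_b.values())

print(avg_a)  # 2.0 bits per outcome
print(avg_b)  # 2.5 bits per outcome: longer on average
```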

Let’s take an example to see what a bit describes. Erica sends you a message containing the result of three coin flips, encoding ‘heads’ as 0 and ‘tails’ as 1. There are 8 possible sequences, such as 001, 101, etc. When you receive a message of one bit, it divides your uncertainty by a factor of 2. For instance, if the first bit tells you that the first flip was ‘heads’, the remaining possible sequences are 000, 001, 010, and 011. There are only 4 possible sequences instead of 8. Similarly, receiving a message of two bits divides your uncertainty by a factor of 2² = 4; a message of three bits, by a factor of 2³ = 8, and so on.
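The uncertainty-halving idea can be sketched by enumerating the candidate sequences and filtering them as bits arrive:

```python
from itertools import product

# All 2**3 = 8 possible three-flip sequences, encoded as strings of 0s and 1s.
sequences = ["".join(bits) for bits in product("01", repeat=3)]
print(len(sequences))  # 8

# Receiving the first bit ("0" = the first flip was heads) halves the
# candidate set: each bit divides the uncertainty by a factor of 2.
after_one_bit = [s for s in sequences if s.startswith("0")]
print(after_one_bit)   # ['000', '001', '010', '011']

after_two_bits = [s for s in sequences if s.startswith("00")]
print(len(after_two_bits))  # 2: uncertainty divided by 2**2
```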

Note that we are talking about “useful information”: a redundant message may use the same number of bits while conveying less information.

**Example**

Let’s say that you want to transmit the result of a sequence of eight tosses. You’ll allocate one bit per toss, so you need eight bits to encode the sequence. The sequence might be, for instance, “00110110”, corresponding to HHTTHTTH (four “heads” and four “tails”).

However, let’s say that the coin is biased: the chance of getting “tails” is only 1/8. You can find a better way to encode the sequence. One option is to encode the positions of the “tails” outcomes: each position takes more than one bit, but “tails” occurs only for a small proportion of the trials. With this strategy, you allocate more bits to rare outcomes.
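Here is a simplified sketch of this position-encoding idea (illustrative only: a real code would also need a few extra bits to delimit the list of positions). With 8 tosses, each position fits in 3 bits:

```python
# Hypothetical sketch: encode a sequence of 8 tosses by listing the
# positions of 'tails' (T), each position needing 3 bits (indices 0-7).
sequence = "HHHHHTHH"  # biased coin: 'tails' appears rarely

tail_positions = [i for i, toss in enumerate(sequence) if toss == "T"]
print(tail_positions)  # [5]

naive_bits = len(sequence)            # 8 bits: one bit per toss
index_bits = 3 * len(tail_positions)  # 3 bits per 'tails' position
print(naive_bits, index_bits)         # 8 3
```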

This example illustrates that more predictable information can be compressed: a sequence from a biased coin can be encoded with a smaller amount of information than one from a fair coin. This means that the Shannon information depends on the probability of the event.

**Mathematical Description**

Shannon information encodes this idea and converts the probability that an event will occur into the associated quantity of information. Its characteristics are that, as you saw, likely events are less informative than unlikely events and also that information from different events is additive (if the events are independent).

Mathematically, the function I(x) gives the information of the event X = x: it takes the outcome as input and returns the corresponding quantity of information. It is a monotonically decreasing function of the probability (that is, a function that never increases when the probability increases). The Shannon information is defined as:

$$
I(x) = -\log_2 P(x)
$$

The result is a lower bound on the number of bits, that is, the minimum number of bits needed to encode a sequence with an optimal encoding.
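A minimal sketch of this formula in code (the function name is illustrative), using the probabilities from the coin examples above:

```python
import math

def shannon_information(p):
    """Shannon information I(x) = -log2(P(x)), in bits."""
    return -math.log2(p)

# Likely events carry less information than unlikely ones.
print(shannon_information(7 / 8))  # ~0.19 bits: 'heads' with the biased coin
print(shannon_information(1 / 8))  # 3.0 bits: the surprising 'tails'
print(shannon_information(1 / 2))  # 1.0 bit: a fair coin flip
```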

The logarithm of a product is equal to the sum of the logarithms of its factors: $\log(x \cdot y) = \log(x) + \log(y)$. This property is useful to encode the additive property of the Shannon information. The probability of occurrence of two independent events is the product of their individual probabilities (as you saw in Essential Math for Data Science):

$$
P(x, y) = P(x) \, P(y)
$$

$$
I(x, y) = -\log_2 \big( P(x) \, P(y) \big) = -\log_2 P(x) - \log_2 P(y) = I(x) + I(y)
$$

This means that the information corresponding to the joint probability P(x, y) equals the information corresponding to P(x) added to the information corresponding to P(y): the information of independent events adds together.
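You can verify this additivity numerically. A small sketch (the coin-and-die pairing is an illustrative choice of independent events):

```python
import math

def info_bits(p):
    """Shannon information in bits: I = -log2(p)."""
    return -math.log2(p)

# Two independent events: a fair coin landing heads (p = 1/2) and a fair
# die rolling a six (p = 1/6). Their joint probability is the product.
p_heads, p_six = 1 / 2, 1 / 6
p_joint = p_heads * p_six

# The information of the joint event is the sum of the individual ones
# (up to floating-point error).
print(info_bits(p_joint))                     # ~3.585 bits
print(info_bits(p_heads) + info_bits(p_six))  # same value
```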