## That *y* can be estimated as a linear function of *x* does not imply that *x* can also be estimated as a linear function of *y*

Consider two real-valued variables *x* and *y*, for example, the height of a father and the height of his son. The central problem of regression analysis in statistics is to guess *y* by knowing *x*, e.g., to guess the height of the son based on the height of his father¹.

The idea in *linear* regression is to use a linear function of *x* as a guess for *y*. Formally, this means taking *ŷ(x) = α₁x + α₀* as our guess and finding *α₀* and *α₁* by minimizing the mean squared error between *y* and *ŷ*. Now, let’s assume that we use a huge dataset and find the best possible values of *α₀* and *α₁*, so we know how to compute the best estimate of *y* based on *x*. How can we use these best values of *α₀* and *α₁* to find a guess *x̂(y)* about *x* based on *y*? For example, if we always knew the best guess about the son’s height based on his father’s, then what would be our guess about the father’s height based on his son’s?
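As a concrete illustration (a minimal sketch using NumPy; the data and variable names are mine, not from the article), the mean-squared-error-optimal *α₀* and *α₁* can be found in closed form:

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares estimates of (a0, a1) minimizing mean((y - (a1*x + a0))**2)."""
    a1, a0 = np.polyfit(x, y, deg=1)  # polyfit returns the highest-degree coefficient first
    return a0, a1

# Hypothetical toy data: the son is always exactly 5% taller than his father
x = np.array([160.0, 170.0, 180.0, 190.0])
y = 1.05 * x
a0, a1 = fit_linear(x, y)  # recovers a1 ≈ 1.05, a0 ≈ 0
```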

Such questions are special cases of “How can we use *ŷ(x)* to find *x̂(y)*?” Even though it may sound trivial, this question turns out to be surprisingly difficult to answer. In this article, I study the link between *ŷ(x)* and *x̂(y)* in both deterministic and probabilistic settings and show that our intuition for how *ŷ(x)* and *x̂(y)* relate to each other in deterministic settings cannot be generalized to probabilistic settings.

## Deterministic settings

By deterministic settings, I mean situations where (i) there is no randomness and (ii) each value of *x* always corresponds to the same value of *y*. Formally, in these settings, I write *y = f(x)* for some function *f: R → R*. In such cases where *x* determines *y* with complete certainty (i.e., no randomness or noise), the best choice of *ŷ(x)* is *f(x)* itself. For example, if the height of a son is always 1.05 times his father’s height (let’s ignore the impossibility of the example for now!), then our best guess about the son’s height is to multiply the father’s height by 1.05.

If *f* is an invertible function, then the best choice of *x̂(y)* is the inverse of *f*. In the example above, this means that the best guess about the height of a father is always the height of his son divided by 1.05. Hence, the link between *ŷ(x)* and *x̂(y)* in deterministic cases is straightforward and can be reduced to finding the function *f* and its inverse.

## Probabilistic settings

In probabilistic settings, *x* and *y* are samples of random variables *X* and *Y*. In such cases where a single value of *x* can correspond to several values of *y*, the best choice for *ŷ(x)* (in order to minimize the mean squared error) is the conditional expectation *E[Y|X=x]* — see footnote². In application-friendly words, this means that if you train a very expressive neural network to predict *y* given *x* (with a sufficiently big dataset), then your network would converge to *E[Y|X=x]*.

Similarly, the best choice for *x̂(y)* is *E[X|Y=y]* — if you train your very expressive network to predict *x* given *y*, then it converges, in principle, to *E[X|Y=y]*. Hence, the question of how *ŷ(x)* relates to *x̂(y)* in probabilistic settings can be rephrased as how the conditional expectations *E[Y|X=x]* and *E[X|Y=y]* relate to each other.
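This convergence can be checked empirically without any neural network (a sketch on assumed toy data of my own choosing): averaging *y* over samples whose *x* falls in a narrow bin approximates *E[Y|X=x]*:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)
y = 0.5 * x + rng.standard_normal(n)  # toy model where E[Y | X = x] = 0.5 * x

# Empirical conditional expectation at x ≈ 1: average y over a narrow bin of x
mask = np.abs(x - 1.0) < 0.05
cond_mean = y[mask].mean()  # approaches 0.5 * 1.0 = 0.5 as n grows
```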

## The goal of this article

To simplify the problem, I focus on **linear** relationships, i.e., cases where *ŷ(x)* is linear in *x*. A linear deterministic relationship has a linear inverse, meaning that *y = αx* (for some *α ≠ 0*) implies that *x = βy* with *β = 1/α* — see footnote³. The probabilistic linear relationship analogous to the deterministic relationship *y = αx* is
*Y = αX + Z* (**Equation 1**)
where *Z* is an additional random variable, often called ‘noise’ or ‘error term’, whose conditional average is assumed to be zero, i.e., *E[Z|X=x] = 0* for all *x*; note that we do not always assume that *Z* is independent of *X*. Using **Equation 1**, the conditional expectation of *Y* given *X=x* is (see footnote⁴)
*ŷ(x) = E[Y|X=x] = αx + E[Z|X=x] = αx.* (**Equation 2**)
**Equation 2** states that the conditional expectation *ŷ(x)* is linear in *x*, so it can be seen as the probabilistic twin of the linear deterministic relationship *y = αx*.

In the rest of this article, I ask two questions:

- Does **Equation 2** imply that *x̂(y) := E[X|Y=y] = βy* for some *β ≠ 0*? In other words, does the linear relationship in **Equation 2** have a linear inverse?
- If it is indeed the case that *x̂(y) = βy*, then can we write *β = 1/α* as in the deterministic case?

I use two counterexamples to show that, as counter-intuitive as it may sound, the answer to both questions is negative!

As the first example, let me consider the most typical setup of linear regression problems, summarized in the following three assumptions (in addition to **Equation 1**; see **Figure 1A** for visualization):

- Error term *Z* is independent of *X*.
- *X* has a Gaussian distribution with mean zero and variance 1.
- *Z* has a Gaussian distribution with mean zero and variance *σ²*.

It is straightforward to show, after a few lines of algebra, that these assumptions imply that *Y* has a Gaussian distribution with mean zero and variance *α² + σ²*. Moreover, the assumptions imply that *X* and *Y* are jointly Gaussian with mean zero and covariance matrix equal to
*Σ = [[1, α], [α, α² + σ²]].*
Since we have the full joint distribution of *X* and *Y*, we can derive their conditional expectations (see footnote⁵):
*ŷ(x) = E[Y|X=x] = αx* and *x̂(y) = E[X|Y=y] = (α / (α² + σ²)) y =: βy.*
Hence, given the assumptions of our first example, **Equation 2** has a linear inverse of the form *x̂(y) = βy*, but *β* is not equal to its deterministic twin *1/α* — unless we have *σ = 0*, which **is** equivalent to the deterministic case!
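A quick Monte Carlo check (my own sketch, not from the article) confirms the mismatch. With *α = 2* and *σ = 1*, the formula gives *β = α/(α² + σ²) = 0.4*, not *1/α = 0.5*:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 2.0, 1.0
n = 1_000_000
x = rng.standard_normal(n)                       # X ~ N(0, 1)
y = alpha * x + sigma * rng.standard_normal(n)   # Equation 1 with independent Gaussian noise

# All variables are zero-mean, so the least-squares slopes need no intercept:
slope_y_on_x = (x @ y) / (x @ x)  # estimate of alpha                       -> ≈ 2.0
slope_x_on_y = (x @ y) / (y @ y)  # estimate of alpha / (alpha**2 + sigma**2) -> ≈ 0.4
```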

This result shows that our intuitions about deterministic linear relationships cannot be generalized to probabilistic linear relationships. To more clearly see the true insanity of what this result implies, let us first consider *α = 0.5* in a deterministic setting (*σ = 0*; blue curves in **Figure 2A and 2B**):
*ŷ(x) = 0.5x* and *x̂(y) = 2y.*
This means that, given a value of *x*, the value of *y* is half of *x*, and, given a value of *y*, the value of *x* is twice *y*, which appears to be intuitive. Importantly, for positive heights, we always have *y < x*. Now, let us again consider *α = 0.5* but this time with *σ² = 3/4* (red curves in **Figure 2A and 2B**). This choice of noise variance implies that *β = α/(α² + σ²) = 0.5/(0.25 + 0.75) = 0.5 = α*, resulting in
*ŷ(x) = 0.5x* and *x̂(y) = 0.5y.*
This means that, given a value of *x*, our estimate of *y* is half of *x*, yet, given a value of *y*, our estimate of *x* is also half of *y*! Strangely, for positive values, we always have *x̂(y) < y* **and** *ŷ(x) < x* — which would be impossible if the relationship were deterministic. To see where the intuition breaks, note that **Equation 1** can be rewritten as
*X = Y/α − Z/α.* (**Equation 3**)
However, this only implies that (as opposed to **Equation 2**)
*x̂(y) = E[X|Y=y] = y/α − E[Z|Y=y]/α.* (**Equation 4**)
The twist is that, while we have *E[Z|X=x] = 0* by design, we cannot say anything about *E[Z|Y=y]* and its dependence on *y*! In other words, what makes *x̂(y)* different from *y/α* is that the observation *y* also carries information about the error *Z*: e.g., if we observe a very large value of *y*, then, with high probability, the error *Z* is also large, which should be taken into account when estimating *X*.
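The twist can be made concrete with a simulation (again my own sketch). With *α = 0.5* and *σ² = 3/4*, regressing the realized noise *z* on *y* yields slope *σ²/(α² + σ²) = 0.75* — so *E[Z|Y=y]* is far from zero — while both cross-regressions have slope 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma2 = 0.5, 0.75
n = 1_000_000
x = rng.standard_normal(n)
z = np.sqrt(sigma2) * rng.standard_normal(n)
y = alpha * x + z

# All variables are zero-mean, so the least-squares slopes need no intercept:
slope_y_on_x = (x @ y) / (x @ x)  # ≈ alpha = 0.5
slope_x_on_y = (x @ y) / (y @ y)  # ≈ alpha / (alpha**2 + sigma2) = 0.5 as well!
slope_z_on_y = (z @ y) / (y @ y)  # ≈ sigma2 / (alpha**2 + sigma2) = 0.75, not zero
```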

This is the simple explanation for seemingly contradictory statements like ‘tall fathers have sons who are (on average) tall but not as tall as themselves, and, at the same time, tall sons have fathers who are (on average) tall but not as tall as their sons’!

**To conclude**, our example 1 shows that even if the probabilistic linear relationship *ŷ(x) = αx* has a linear inverse of the form *x̂(y) = βy*, the slope *β* is **not** necessarily equal to its deterministic twin *1/α*.

Having an inverse of the form *x̂(y) = βy* is only possible if *E[Z|Y=y]* in **Equation 4** is also a linear function of *y*. In the second example, I make a small modification to example 1 in order to break this condition!

In particular, I assume that the variance of the error term *Z* depends on the random variable *X *— as opposed to assumption 1 in example 1. Formally, I assume (in addition to **Equation 1**; see **Figure 1B **for visualization):

- *X* has a Gaussian distribution with mean zero and variance 1 (same as assumption 2 in example 1).
- Given *X = x*, the error *Z* has a Gaussian distribution with mean zero and variance *σ²(x) = 0.01 + 1/(1 + 2x²)*.

These assumptions effectively mean that, given *X=x*, the random variable *Y* has a Gaussian distribution with mean *αx* and variance *0.01 + 1/(1 + 2x²)* (see **Figure 1B**). As opposed to example 1, where the joint distribution of *X* and *Y* was Gaussian, the joint distribution of *X* and *Y* in example 2 does not have an elegant form (see **Figure 1C**). However, we can still use Bayes’ rule and find the relatively ugly conditional density of *X=x* given *Y=y* (see **Figure 3** for some examples evaluated numerically):
*p(x|y) = 𝒩(y; αx, 0.01 + 1/(1 + 2x²)) 𝒩(x; 0, 1) / ∫ 𝒩(y; αx′, 0.01 + 1/(1 + 2x′²)) 𝒩(x′; 0, 1) dx′,*

where curly *N* denotes the probability density of the Gaussian distribution.

We can then use numerical methods and evaluate the conditional expectation
*x̂(y) = E[X|Y=y] = ∫ x p(x|y) dx*
for a given *y* and *α*. **Figure 2C** shows *x̂(y)* as a function of *y* for *α = 0.5*. As counter-intuitive as it may sound, the inverse relationship is highly nonlinear — as a result of the *x*-dependent error variance shown in **Figure 1B**. This shows that the fact that *y* can be estimated well as a linear function of *x* does not imply that *x* can also be estimated well as a linear function of *y*. This is because *E[Z|Y=y] *in **Equation 4 **can have any strange functional dependence on *y* when we go beyond standard assumptions similar to those in example 1.
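The numerical evaluation behind **Figure 2C** can be reproduced with a short script (my own sketch of the computation; the grid limits are an assumption): the posterior mean is a ratio of two integrals over *x*, approximated here by sums on a fine uniform grid, on which the normalization constants and the grid spacing cancel:

```python
import numpy as np

def x_hat(y, alpha=0.5):
    """Numerically evaluate E[X | Y = y] for example 2."""
    xs = np.linspace(-10.0, 10.0, 20001)        # integration grid (assumed wide enough)
    var_z = 0.01 + 1.0 / (1.0 + 2.0 * xs**2)    # x-dependent noise variance
    # Unnormalized posterior: N(y; alpha*x, var_z(x)) * N(x; 0, 1)
    post = np.exp(-(y - alpha * xs)**2 / (2 * var_z)) / np.sqrt(var_z) * np.exp(-xs**2 / 2)
    return (xs * post).sum() / post.sum()

# The inverse is visibly nonlinear: doubling y does not double the estimate
print(x_hat(1.0), x_hat(2.0))
```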

**To conclude**, our example 2 shows that the probabilistic linear relationship *ŷ(x) = αx* does **not** necessarily have a linear inverse of the form *x̂(y) = βy*. Importantly, the inverse relationship between *x̂(y)* and *y* depends on the characteristics of the error term *Z*.

This post originally appeared on TechToday.