# Excess Over Mean

Given cross-sectional data on the returns of many different stocks, each observation is labeled according to whether, or by how much, its return exceeds the mean return. A common practice is to label observations by whether the return is positive or negative. However, this may produce unbalanced classes: during market booms the probability of a positive return is much higher, and during market crashes it is much lower (Coqueret and Guida, 2020). Labeling relative to a benchmark such as the mean market return alleviates this issue.

A dataframe containing forward returns is calculated from close prices. The mean return of all stocks at time \(t\) in the dataframe is used to represent the market return, and excess returns are calculated by subtracting the mean return from each stock’s return over the time period \(t\). The numerical returns can then be used as-is (for regression analysis), or can be relabeled to represent their sign (for classification analysis).
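The computation described above can be sketched in a few lines of pandas. This is a minimal illustration, not the library's implementation: the tickers and prices are made up, and the variable names (`forward_returns`, `market_return`, `excess`, `labels`) are assumptions chosen for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical close prices for three tickers (illustrative values only).
prices = pd.DataFrame(
    {
        "AAA": [100.0, 102.0, 101.0, 105.0],
        "BBB": [50.0, 49.0, 51.0, 51.5],
        "CCC": [20.0, 20.5, 20.4, 21.0],
    },
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# Forward simple returns: the row at time t holds the return from t to t+1.
# The last row has no following price, so it is NaN.
forward_returns = prices.shift(-1) / prices - 1

# The cross-sectional mean at each time t stands in for the market return.
market_return = forward_returns.mean(axis=1)

# Numerical labels: each stock's return minus the mean return at the same t.
excess = forward_returns.sub(market_return, axis=0)

# Categorical labels: the sign of the excess return (-1, 0, or 1; NaN stays NaN).
labels = np.sign(excess)
```

Because the mean is subtracted row by row, the numerical labels in each row sum to zero (up to floating-point error), which is what keeps the sign-based classes from drifting with the overall market direction.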

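At time \(t\), let \(r_{t,i}\) be the return of stock \(i\) out of \(n\) stocks, and let \(\mu_t\) be the mean return across all stocks:

\[\mu_t = \frac{1}{n} \sum_{i=1}^{n} r_{t,i}\]

The numerical label is the excess return:

\[L(r_{t,i}) = r_{t,i} - \mu_t\]

If categorical rather than numerical labels are desired:

\[L(r_{t,i}) = \operatorname{sign}(r_{t,i} - \mu_t)\]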

If desired, the user can specify a resampling period to apply to the price data prior to calculating returns. The user can also lag the returns to make them forward-looking.
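As a rough sketch of how resampling and lagging could fit together, the helper below takes the last price in each resampling period, computes period-over-period returns, and optionally shifts them back one step so that each row describes the return over the following period. The function name `excess_over_mean_sketch` and its parameters are hypothetical and are not taken from the library; the prices are synthetic.

```python
import numpy as np
import pandas as pd


def excess_over_mean_sketch(prices, resample_by=None, lag=True):
    """Hypothetical helper illustrating resampling and lagging (not the library API)."""
    if resample_by is not None:
        # Keep the last observed close in each resampling period (e.g. "W" for weekly).
        prices = prices.resample(resample_by).last()

    # Simple return from the previous period to the current one.
    returns = prices / prices.shift(1) - 1

    if lag:
        # Shift back one step so the row at t holds the return from t to t+1.
        returns = returns.shift(-1)

    # Subtract the cross-sectional mean at each time step.
    return returns.sub(returns.mean(axis=1), axis=0)


# Synthetic daily prices on 15 business days (illustrative values only).
index = pd.date_range("2020-01-01", periods=15, freq="B")
steps = np.arange(15)
prices = pd.DataFrame(
    {
        "AAA": 100.0 + steps,
        "BBB": 50.0 + 0.5 * steps**2,
        "CCC": 20.0 - 0.1 * steps,
    },
    index=index,
)

weekly_forward = excess_over_mean_sketch(prices, resample_by="W", lag=True)
weekly_backward = excess_over_mean_sketch(prices, resample_by="W", lag=False)
```

With `lag=True` the final row is NaN because no later price exists; with `lag=False` the first row is NaN instead, and each label describes the period that has just ended rather than the one ahead.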

The following shows the distribution of numerical excess-over-mean labels for a set of 20 stocks between January 2019 and May 2020.

Note

**Underlying Literature**

The following sources elaborate extensively on the topic:

Machine Learning for Factor Investing, Chapter 5.5.1

*by* Coqueret, G. and Guida, T.

## Implementation

## Research Notebook

The following research notebooks can be used to better understand labeling excess over mean.