Sample Uniqueness

Let’s look at an example of 3 samples: A, B, C.

Imagine that:

  • A was generated at \(t_1\) and triggered on \(t_8\)

  • B was generated at \(t_3\) and triggered on \(t_6\)

  • C was generated at \(t_7\) and triggered on \(t_9\)

In this case, A used information about returns on \([t_1, t_8]\) to generate its label, and this interval overlaps with \([t_3, t_6]\), which was used by B. C used returns on \([t_7, t_9]\), which does not overlap with B’s interval and shares only \([t_7, t_8]\) with A. This leads us to the concept of concurrency.

We say that labels \(y_i\) and \(y_j\) are concurrent at \(t\) if they are both a function of at least one common return \(r_{t-1,t}\).
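The definition can be made concrete by counting, for each time step, how many labels are active. The sketch below is illustrative only and is not the mlfinlab implementation: it assumes an integer time grid \(t_1, \dots, t_9\), treats every time step in a label’s closed interval as belonging to that label, and uses hypothetical names (`spans`, `indicator`, `concurrency`).

```python
import pandas as pd

# Label spans (generation time, trigger time) from the A, B, C example.
# The integer grid 1..9 stands in for t_1..t_9 (an assumption of this sketch).
spans = {"A": (1, 8), "B": (3, 6), "C": (7, 9)}
time_index = range(1, 10)

# Indicator matrix: 1 where time t falls inside the label's closed interval.
indicator = pd.DataFrame(
    {
        name: [int(start <= t <= end) for t in time_index]
        for name, (start, end) in spans.items()
    },
    index=time_index,
)

# Concurrency at t: the number of labels that are active at that time step.
concurrency = indicator.sum(axis=1)

print(concurrency.tolist())  # [1, 1, 2, 2, 2, 2, 2, 2, 1]
```

A concurrency of 2 on \(t_3\) through \(t_6\) reflects the overlap of A and B, and on \(t_7\) and \(t_8\) the overlap of A and C.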

In terms of concurrency, label C is the most ‘pure’, as it shares the least information with other labels, while A is the ‘dirtiest’, as it uses information from both B and C. By measuring average label uniqueness, you can quantify how ‘pure’ your dataset is with respect to label concurrency. Average label uniqueness can be computed with the get_av_uniqueness_from_triple_barrier function from the mlfinlab package.

This function is the orchestrator that derives average sample uniqueness from a dataset labeled by the triple barrier method.
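Independently of the mlfinlab function, average uniqueness for the A, B, C example can be worked out by hand. The following is a minimal sketch, not the library’s code: it assumes an integer time grid \(t_1, \dots, t_9\), defines the uniqueness of a label at time \(t\) as one divided by the number of labels active at \(t\), and averages that value over the label’s lifespan. All variable names are hypothetical.

```python
import pandas as pd

# Label spans (generation time, trigger time) from the A, B, C example.
spans = {"A": (1, 8), "B": (3, 6), "C": (7, 9)}
time_index = range(1, 10)

# Indicator matrix: 1 where time t falls inside the label's closed interval.
indicator = pd.DataFrame(
    {
        name: [int(start <= t <= end) for t in time_index]
        for name, (start, end) in spans.items()
    },
    index=time_index,
)

# Number of concurrent labels at each time step (never zero in this example).
concurrency = indicator.sum(axis=1)

# Uniqueness of each label at each time step: 1 / concurrency while active, else 0.
uniqueness = indicator.div(concurrency, axis=0)

# Average uniqueness: mean of the uniqueness values over each label's lifespan.
avg_uniqueness = uniqueness.sum() / indicator.sum()

print(avg_uniqueness.round(3).to_dict())  # {'A': 0.625, 'B': 0.5, 'C': 0.667}
```

Under this averaged measure C scores highest, since \(t_9\) belongs to it alone. B scores lowest because its whole span is shared with A, even though A is the label that overlaps with the most other labels.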


Code implementation demo


An example of calculating average uniqueness, given that we already have our barrier events, can be seen below:

Code example demo

We would like to build our model in such a way that it takes into account label concurrency (overlapping samples). In order to do that we need to look at the bootstrapping algorithm of a Random Forest.

Let’s move on to the next section on Sequential Bootstrapping.

Research Notebook

The following research notebook can be used to better understand the previously discussed sampling method.

Notebook demo