Sample Uniqueness
Let’s look at an example of 3 samples: A, B, C.
Imagine that:
A was generated at \(t_1\) and triggered at \(t_8\)
B was generated at \(t_3\) and triggered at \(t_6\)
C was generated at \(t_7\) and triggered at \(t_9\)
In this case, A used returns information on \([t_1, t_8]\) to generate its label and end time, and this interval overlaps with \([t_3, t_6]\), which was used by B. C shares no returns information with B and overlaps with A only on \([t_7, t_8]\). Here we would like to introduce the concept of concurrency.
We say that labels \(y_i\) and \(y_j\) are concurrent at \(t\) if both are a function of at least one common return \(r_{t-1,t}\), the return between bars \(t-1\) and \(t\).
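The definition can be checked on the A, B, C example with a few lines of Python. This is a minimal sketch, not mlfinlab code: it assumes each label is active on every bar of its closed interval (both ends included), and the names `spans`, `indicator` and `are_concurrent` are made up for illustration.

```python
import numpy as np

# Label spans from the example as (start, end) bar numbers.
# Assumption: both ends are inclusive.
spans = {"A": (1, 8), "B": (3, 6), "C": (7, 9)}
bars = np.arange(1, 10)  # bars t_1 .. t_9

# indicator[name][k] is True when the label is active at bar t_{k+1}.
indicator = {
    name: (bars >= start) & (bars <= end)
    for name, (start, end) in spans.items()
}

# Concurrency at each bar: the number of labels active on that bar.
concurrency = sum(ind.astype(int) for ind in indicator.values())


def are_concurrent(a, b):
    """Return True if labels a and b are active on at least one common bar."""
    return bool(np.any(indicator[a] & indicator[b]))


print(dict(zip(bars.tolist(), concurrency.tolist())))
print(are_concurrent("A", "B"), are_concurrent("A", "C"), are_concurrent("B", "C"))
```

Under these assumptions A is concurrent with both B and C, while B and C never share a bar.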
In terms of concurrency, label C is the most ‘pure’ because it shares the least information with other labels (it overlaps only with A, on \([t_7, t_8]\)), while A is the ‘dirtiest’ because it shares information with both B and C. By measuring average label uniqueness you can quantify how ‘pure’ your dataset is with respect to label concurrency. Average label uniqueness can be computed with the get_av_uniqueness_from_triple_barrier function from the mlfinlab package.
This function is the orchestrator that derives average sample uniqueness from a dataset labeled by the triple barrier method.
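For reference, average uniqueness is usually written as follows. The notation is an assumption introduced here, not taken from the text above: \(1_{t,i}\) equals 1 when label \(i\) is active at bar \(t\) and 0 otherwise, \(I\) is the number of labels, and \(T\) is the number of bars.

\[
c_t = \sum_{i=1}^{I} 1_{t,i}, \qquad
u_{t,i} = \frac{1_{t,i}}{c_t}, \qquad
\bar{u}_i = \frac{\sum_{t=1}^{T} u_{t,i}}{\sum_{t=1}^{T} 1_{t,i}}
\]

Applied to the A, B, C example, and assuming each label is active on every bar of its closed interval, the concurrency \(c_t\) is 1 at \(t_1, t_2, t_9\) and 2 elsewhere, which gives \(\bar{u}_A = 5/8\), \(\bar{u}_B = 1/2\) and \(\bar{u}_C = 2/3\).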
Implementation
Example
An example of calculating average uniqueness, given that we already have our barrier events, can be seen below:
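As an illustration of the quantity being computed, here is a minimal plain-pandas sketch. It does not call mlfinlab and is not its implementation. The layout of `events` (label start in the index, label end in a `t1` column), the daily timestamps, and the helper name `average_uniqueness` are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical daily bars standing in for t_1 .. t_9.
bars = pd.date_range("2020-01-01", periods=9, freq="D")

# Hypothetical barrier events for A, B, C:
# index = bar where the label starts, "t1" = bar where it ends.
events = pd.DataFrame(
    {"t1": [bars[7], bars[5], bars[8]]},
    index=pd.DatetimeIndex([bars[0], bars[2], bars[6]]),
)


def average_uniqueness(events, bar_index):
    """Average uniqueness per label, treating [start, t1] as a closed interval."""
    # Count how many labels are active on each bar.
    concurrency = pd.Series(0, index=bar_index)
    for start, end in events["t1"].items():
        concurrency.loc[start:end] += 1

    # Uniqueness on a bar is 1 / concurrency; average it over the label's life.
    result = pd.Series(index=events.index, dtype=float)
    for start, end in events["t1"].items():
        result.loc[start] = (1.0 / concurrency.loc[start:end]).mean()
    return result


avg_u = average_uniqueness(events, bars)
print(avg_u)
```

With these inputs the sketch gives 0.625 for A, 0.5 for B and about 0.667 for C, so C is the most unique label and B, which lies entirely inside A, is the least.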
We would like to build our model in such a way that it takes into account label concurrency (overlapping samples). In order to do that we need to look at the bootstrapping algorithm of a Random Forest.
Let’s move on to the next section, on Sequential Bootstrapping.
Research Notebook
The following research notebook can be used to better understand the previously discussed sampling method.