Note

The following implementation and documentation closely follow the work of Gautier Marti, CorrGAN: Sampling Realistic Financial Correlation Matrices using Generative Adversarial Networks, and the work of Donnat, P., Marti, G. and Very, P., Toward a generic representation of random variables for machine learning.

Data Verification




Synthetic data must be verified to confirm whether it shares the relevant properties of the original data. Being able to examine and validate synthetically generated data is critical to building more accurate systems. Without verification, we would be working with data that might have no significance in the real world. We present several methods to examine the properties of this type of data.

Note

Underlying Literature

The following sources elaborate extensively on the topic:

Stylized Facts of Correlation Matrices

Following the work of Gautier Marti in CorrGAN, we provide a function to plot and verify that a synthetic matrix exhibits the 6 stylized facts of empirical correlation matrices.

The stylized facts are:

  1. Distribution of pairwise correlations is significantly shifted to the positive.

  2. Eigenvalues follow the Marchenko-Pastur distribution, except for a very large first eigenvalue (the market).

  3. Eigenvalues follow the Marchenko-Pastur distribution, except for a couple of other large eigenvalues (industries).

  4. Perron-Frobenius property (first eigenvector has positive entries).

  5. Hierarchical structure of correlations.

  6. Scale-free property of the corresponding Minimum Spanning Tree (MST).
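Some of these facts can be checked numerically before plotting anything. The following is a minimal sketch, not the library's implementation: the helper name `check_stylized_facts` and its `n_obs` parameter are hypothetical, and it only covers facts 1, 2 and 4 (and counts eigenvalues above the Marchenko-Pastur upper edge as a rough proxy for fact 3). It assumes the correlation matrix was estimated from `n_obs` observations of unit-variance series.

```python
import numpy as np


def check_stylized_facts(corr, n_obs):
    """Diagnostics for stylized facts 1, 2 and 4 (hypothetical helper).

    corr  : (n, n) correlation matrix.
    n_obs : number of observations assumed to underlie the estimate.
    """
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]

    # Fact 1: off-diagonal correlations should be shifted to the positive.
    pairwise = corr[np.triu_indices(n, k=1)]

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(corr)

    # Facts 2-3: upper edge of the Marchenko-Pastur law (unit variance).
    mp_upper = (1.0 + np.sqrt(n / n_obs)) ** 2

    # Fact 4: eigenvectors are defined up to sign, so orient the first one
    # before testing whether all of its entries are positive.
    first_vec = eigvecs[:, -1]
    if first_vec.sum() < 0:
        first_vec = -first_vec

    return {
        "mean_pairwise_corr": float(pairwise.mean()),
        "positive_shift": bool(pairwise.mean() > 0),
        "largest_eigenvalue": float(eigvals[-1]),
        "mp_upper_edge": float(mp_upper),
        "n_eigs_above_mp": int((eigvals > mp_upper).sum()),
        "perron_frobenius": bool(np.all(first_vec > 0)),
    }


# Usage on a toy one-factor ("market") model; the data are simulated.
rng = np.random.default_rng(0)
n_obs, n_assets = 500, 20
market = rng.normal(size=(n_obs, 1))
returns = 0.7 * market + rng.normal(size=(n_obs, n_assets))
facts = check_stylized_facts(np.corrcoef(returns, rowvar=False), n_obs)
print(facts)
```

Facts 5 and 6 need a hierarchical clustering and a minimum spanning tree respectively, so they are left out of this sketch.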


Implementation

Code implementation demo

Example

Code example demo

Time Series Codependence Visualization

Note

The correlated random walks time series generation and GNPR codependence measure approaches are fully explored in our Correlated Random Walks and Codependence by Marti sections.

Following the work of Donnat, Marti, and Very (2016), we provide a method to plot the GNPR codependence matrix and visualize the different underlying distributions these time series may have. GNPR was shown to detect all underlying distributions more accurately than other methods, such as L2 distance, correlation distance, and GPR.
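To make the idea concrete, here is a rough sketch of a GNPR-style distance, assuming the form given by Donnat, Marti, and Very: a weighted combination (weight `theta`) of a rank-based dependence term, 3 E[|F_X(X) - F_Y(Y)|^2], and a Hellinger-type term comparing the marginal distributions. The function names, the histogram estimator, and the `n_bins` parameter are assumptions made for illustration, not the library's API. The inputs are assumed to be equal-length samples of i.i.d. increments (for example returns).

```python
import numpy as np


def gnpr_distance(x, y, theta=0.5, n_bins=50):
    """Sketch of a GNPR-style distance between two samples (hypothetical)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Dependence term: empirical CDF values obtained from ranks.
    u = (np.argsort(np.argsort(x)) + 1) / n
    v = (np.argsort(np.argsort(y)) + 1) / n
    d1_sq = 3.0 * np.mean((u - v) ** 2)

    # Distribution term: squared Hellinger distance between histograms
    # built on a common set of bins.
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=n_bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=n_bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    d0_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

    return float(np.sqrt(theta * d1_sq + (1.0 - theta) * d0_sq))


def gnpr_matrix(series, theta=0.5, n_bins=50):
    """Pairwise distance matrix for an array of shape (n_series, n_obs)."""
    series = np.asarray(series, dtype=float)
    k = series.shape[0]
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            out[i, j] = out[j, i] = gnpr_distance(
                series[i], series[j], theta=theta, n_bins=n_bins
            )
    return out


# Usage: y is a monotone transform of x, so the ranks agree (same
# dependence) while the marginal distributions differ.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = x ** 3
print(gnpr_distance(x, y, theta=1.0))  # dependence information only
print(gnpr_distance(x, y, theta=0.0))  # distribution information only
```

Setting `theta=1` keeps only the dependence information and `theta=0` keeps only the distribution information, which is why intermediate values can separate both correlation clusters and distribution clusters.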


Implementation

Code implementation demo

Example

Mix Time Series Example

(Left) GPR codependence matrix. Only 5 correlation clusters are seen, with no indication of a global embedded distribution. (Right) GNPR codependence matrix. All 5 correlation clusters and 2 distribution clusters can be seen, as well as the global embedded distribution.

Code example demo

Optimal Hierarchical Clustering

This function plots a matrix reordered by optimal leaf hierarchical clustering, as shown in Marti, G. (2020), TF 2.0 DCGAN for 100x100 financial correlation matrices. The matrix is first arranged by hierarchical clustering, and the leaves are then ordered to maximize the sum of the similarities between adjacent leaves.
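The reordering step can be sketched with SciPy's `linkage`, `optimal_leaf_ordering`, and `leaves_list`. This is an illustrative sketch rather than the library's function: the helper name `optimal_leaf_order`, the choice of `1 - corr` as the distance, and the default Ward linkage are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage, optimal_leaf_ordering
from scipy.spatial.distance import squareform


def optimal_leaf_order(corr, method="ward"):
    """Reorder a correlation matrix by optimal leaf ordering (hypothetical)."""
    corr = np.asarray(corr, dtype=float)

    # Assumed distance: 1 - correlation, with an exactly zero diagonal.
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)

    # Build the tree, then flip subtrees so that adjacent leaves are as
    # close (i.e. as similar) as possible.
    tree = linkage(condensed, method=method)
    tree = optimal_leaf_ordering(tree, condensed)
    order = leaves_list(tree)

    return corr[np.ix_(order, order)], order


# Usage on a toy two-block matrix whose rows are interleaved.
labels = np.array([0, 1, 0, 1, 0, 1])
toy = np.where(labels[:, None] == labels[None, :], 0.8, 0.1)
np.fill_diagonal(toy, 1.0)
reordered, order = optimal_leaf_order(toy)
print(order, labels[order])
```

The reordered matrix can then be passed to a heatmap routine such as matplotlib's `imshow` to make the block structure visible.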


Implementation

Code implementation demo

Example

Optimal Clustering.

(Left) HCBM matrix. (Right) Optimal Clustering of the HCBM matrix.

Code example demo
