Computing Very Large Correlation Matrices in Parallel


Note

Please acknowledge the authors and cite the use of this software when results are used in publications or published elsewhere. Various citation formats are available here: https://aip.scitation.org/action/showCitFormats?type=show&doi=10.1063%2F1.4952963

For your convenience, you can find the BibTeX entry below:

@Article{traxl-2016-deep,
    author      = {Dominik Traxl AND Niklas Boers AND J\"urgen Kurths},
    title       = {Deep Graphs - A general framework to represent and analyze
                   heterogeneous complex systems across scales},
    journal     = {Chaos},
    year        = {2016},
    volume      = {26},
    number      = {6},
    eid         = {065303},
    doi         = {http://dx.doi.org/10.1063/1.4952963},
    eprinttype  = {arxiv},
    eprintclass = {physics.data-an, cs.SI, physics.ao-ph, physics.soc-ph},
    eprint      = {http://arxiv.org/abs/1604.00971v1},
    version     = {1},
    date        = {2016-04-04},
    url         = {http://arxiv.org/abs/1604.00971v1}
}

In this short tutorial, we’ll demonstrate how DeepGraph can be used to efficiently compute very large correlation matrices in parallel, with full control over RAM usage.

Assume you have a set of n_samples samples, each consisting of n_features features, and you want to compute the Pearson correlation coefficients between all pairs of features (for Spearman’s rank correlation coefficients, see the note box below). If your data is small enough, you may use scipy.stats.pearsonr or numpy.corrcoef, but for large data, neither of these methods is feasible: SciPy’s pearsonr would be very slow, since you’d have to compute the pair-wise correlations in a double loop, and NumPy’s corrcoef would most likely blow your RAM.
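For small data, the dense alternatives mentioned above look like this (a minimal sketch with a small random matrix, just for illustration; X_small is not part of the tutorial’s workflow):

import numpy as np
from scipy import stats

# small example: 10 features, 100 samples
X_small = np.random.rand(10, 100)

# full (10, 10) correlation matrix in one shot
corr_matrix = np.corrcoef(X_small)

# or one pair at a time (this is what becomes slow in a double loop)
r, p = stats.pearsonr(X_small[0], X_small[1])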

Using DeepGraph’s create_edges method, you can compute all pair-wise correlations efficiently. In this tutorial, the data is stored on disk and only the relevant subset of features for each iteration is loaded into memory by the computing nodes. Parallelization is achieved with Python’s standard library multiprocessing, but it should be straightforward to adapt the code to other parallelization libraries. It is equally straightforward to modify the code to compute other correlation, distance or similarity measures between a set of features (see the hypothetical connector sketch after the parallel computation code below).

First of all, we need to import some packages

# data i/o
import os

# compute in parallel
from multiprocessing import Pool

# the usual
import numpy as np
import pandas as pd

import deepgraph as dg

Let’s create a set of variables and store them as a 2D matrix X (shape=(n_features, n_samples)) on disk. To speed up the computation of the correlation coefficients later on, we whiten each variable.

# create observations
from numpy.random import RandomState
prng = RandomState(0)
n_features = int(5e3)
n_samples = int(1e2)
X = prng.randint(100, size=(n_features, n_samples)).astype(np.float64)

# uncomment the next line to compute ranked variables for Spearman's correlation coefficients
# X = X.argsort(axis=1).argsort(axis=1)

# whiten variables for fast parallel computation later on
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# save in binary format
np.save('samples', X)
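As a quick sanity check (not part of the original workflow): for whitened variables, the Pearson correlation coefficient reduces to a dot product divided by n_samples, which is exactly what the connector function below exploits.

# for whitened variables, the dot product divided by n_samples
# reproduces the Pearson correlation coefficient
r_dot = X[0] @ X[1] / n_samples
r_ref = np.corrcoef(X[0], X[1])[0, 1]
assert np.allclose(r_dot, r_ref)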

Note

On the computation of Spearman’s rank correlation coefficients: since the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables, it suffices to uncomment the indicated line in the code block above to compute Spearman’s rank correlation coefficients in the following.
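Keep in mind that the double argsort above assigns distinct ranks to tied values. If ties should receive average ranks (the standard convention for Spearman’s rho), you could rank with SciPy instead; a sketch that would replace the commented line above:

from scipy import stats

# average ranks for ties (standard Spearman convention), computed row-wise
X = np.apply_along_axis(stats.rankdata, 1, X)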

Now we can compute the pair-wise correlations using DeepGraph’s create_edges method. Note that the node table v only stores references to the mem-mapped array containing the samples.

# parameters (change these to control RAM usage)
step_size = 1e5
n_processes = 100

# load samples as memory-map
X = np.load('samples.npy', mmap_mode='r')

# create node table that stores references to the mem-mapped samples
v = pd.DataFrame({'index': range(X.shape[0])})

# connector function to compute pairwise pearson correlations
def corr(index_s, index_t):
    features_s = X[index_s]
    features_t = X[index_t]
    corr = np.einsum('ij,ij->i', features_s, features_t) / n_samples
    return corr

# index array for parallelization
pos_array = np.array(np.linspace(0, n_features*(n_features-1)//2, n_processes), dtype=int)

# parallel computation
def create_ei(i):

    from_pos = pos_array[i]
    to_pos = pos_array[i+1]

    # initiate DeepGraph
    g = dg.DeepGraph(v)

    # create edges
    g.create_edges(connectors=corr, step_size=step_size,
                   from_pos=from_pos, to_pos=to_pos)

    # store edge table
    g.e.to_pickle('tmp/correlations/{}.pickle'.format(str(i).zfill(3)))

# computation
if __name__ == '__main__':
    os.makedirs("tmp/correlations", exist_ok=True)
    indices = np.arange(0, n_processes - 1)
    p = Pool()
    for _ in p.imap_unordered(create_ei, indices):
        pass
    p.close()
    p.join()
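As mentioned at the beginning, other measures only require a different connector function. For instance, a hypothetical connector for pairwise Euclidean distances (dist is an assumed name, following the same pattern as corr above) could look like this:

# hypothetical connector: pairwise Euclidean distances instead of correlations
def dist(index_s, index_t):
    features_s = X[index_s]
    features_t = X[index_t]
    dist = np.sqrt(((features_s - features_t) ** 2).sum(axis=1))
    return dist

# passed to create_edges just like corr, e.g.
# g.create_edges(connectors=dist, step_size=step_size)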

Let’s collect the computed correlation values and store them in an HDF5 file.

# store correlation values
files = os.listdir('tmp/correlations/')
files.sort()
store = pd.HDFStore('e.h5', mode='w')
for f in files:
    et = pd.read_pickle('tmp/correlations/{}'.format(f))
    store.append('e', et, format='t', data_columns=True, index=False)
store.close()
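Since the edge table was appended in table format with data_columns=True, you can also query the store selectively instead of loading all pairs at once. For example (a sketch; the threshold 0.2 is arbitrary):

# load only the strongly correlated pairs instead of the full table
e_strong = pd.read_hdf('e.h5', 'e', where='corr > 0.2')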

Let’s have a quick look at the correlations.

# load correlation table
e = pd.read_hdf('e.h5')
print(e)
               corr
s    t
0    1    -0.006066
     2     0.094063
     3    -0.025529
     4     0.074080
     5     0.035490
     6     0.005221
     7     0.032064
     8     0.000378
     9    -0.049318
     10   -0.084853
     11    0.026407
     12    0.028543
     13   -0.013347
     14   -0.180113
     15    0.151164
     16   -0.094398
     17   -0.124582
     18   -0.000781
     19   -0.044138
     20   -0.193609
     21    0.003877
     22    0.048305
     23    0.006477
     24   -0.021291
     25   -0.070756
     26   -0.014906
     27   -0.197605
     28   -0.103509
     29    0.071503
     30    0.120718
...             ...
4991 4998 -0.012007
     4999 -0.252836
4992 4993  0.202024
     4994 -0.046088
     4995 -0.028314
     4996 -0.052319
     4997 -0.010797
     4998 -0.025321
     4999 -0.093721
4993 4994 -0.027568
     4995  0.045602
     4996 -0.102075
     4997  0.035370
     4998 -0.069946
     4999 -0.031208
4994 4995  0.108063
     4996  0.144441
     4997  0.078353
     4998 -0.024799
     4999 -0.026432
4995 4996 -0.019991
     4997 -0.178458
     4998 -0.162406
     4999  0.102835
4996 4997  0.115812
     4998 -0.061167
     4999  0.018606
4997 4998 -0.151932
     4999 -0.271358
4998 4999  0.106453

[12497500 rows x 1 columns]

And finally, let’s see where most of the computation time is spent.

g = dg.DeepGraph(v)
p = %prun -r g.create_edges(connectors=corr, step_size=step_size)
p.print_stats(20)
         244867 function calls (239629 primitive calls) in 14.193 seconds

   Ordered by: internal time
   List reduced from 541 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      250    9.355    0.037    9.361    0.037 memmap.py:334(__getitem__)
      125    1.584    0.013    1.584    0.013 {built-in method numpy.core.multiarray.c_einsum}
      125    1.012    0.008   12.013    0.096 deepgraph.py:4558(map)
        2    0.581    0.290    0.581    0.290 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects}
        1    0.301    0.301    0.414    0.414 multi.py:795(_engine)
        5    0.157    0.031    0.157    0.031 {built-in method numpy.core.multiarray.concatenate}
      250    0.157    0.001    0.170    0.001 internals.py:5017(_stack_arrays)
        2    0.105    0.053    0.105    0.053 {pandas._libs.algos.take_1d_int64_int64}
      889    0.094    0.000    0.094    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      125    0.089    0.001   12.489    0.100 deepgraph.py:5294(_select_and_return)
      125    0.074    0.001    0.074    0.001 {deepgraph._triu_indices._reduce_triu_indices}
      125    0.066    0.001    0.066    0.001 {built-in method deepgraph._triu_indices._triu_indices}
        4    0.038    0.009    0.038    0.009 {built-in method pandas._libs.algos.ensure_int16}
      125    0.033    0.000   10.979    0.088 <ipython-input-3-26c4f59cd911>:12(corr)
        2    0.028    0.014    0.028    0.014 function_base.py:4703(delete)
        1    0.027    0.027   14.163   14.163 deepgraph.py:4788(_matrix_iterator)
        1    0.027    0.027    0.113    0.113 multi.py:56(_codes_to_ints)
45771/45222    0.020    0.000    0.043    0.000 {built-in method builtins.isinstance}
        1    0.019    0.019   14.193   14.193 deepgraph.py:183(create_edges)
        2    0.012    0.006    0.700    0.350 algorithms.py:576(factorize)

As you can see, most of the time is spent retrieving the requested features from the memory-mapped array in the corr function, followed by the computation of the correlation values themselves (the einsum call).
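Most of this overhead stems from the memory-mapped access. If the sample matrix fits comfortably into RAM, you could load it fully instead of memory-mapping it, which removes this overhead at the cost of higher memory consumption (a one-line change):

# load the samples fully into memory instead of memory-mapping them
X = np.load('samples.npy')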