.. _tutorial_pairwise_correlations: Computing Very Large Correlation Matrices in Parallel ===================================================== [:download:`ipython notebook `] [:download:`python script `] .. note:: Please acknowledge the authors and cite the use of this software when results are used in publications or published elsewhere. Various citation formats are available here: https://dx.doi.org/10.1063/1.4952963 For your convenience, you can find the BibTex entry below: :: @Article{traxl-2016-deep, author = {Dominik Traxl AND Niklas Boers AND J\"urgen Kurths}, title = {Deep Graphs - A general framework to represent and analyze heterogeneous complex systems across scales}, journal = {Chaos}, year = {2016}, volume = {26}, number = {6}, eid = {065303}, doi = {http://dx.doi.org/10.1063/1.4952963}, eprinttype = {arxiv}, eprintclass = {physics.data-an, cs.SI, physics.ao-ph, physics.soc-ph}, eprint = {http://arxiv.org/abs/1604.00971v1}, version = {1}, date = {2016-04-04}, url = {http://arxiv.org/abs/1604.00971v1} } In this short tutorial, we'll demonstrate how DeepGraph can be used to efficiently compute very large correlation matrices in parallel, with full control over RAM usage. Assume you have a set of ``n_samples`` samples, each comprised of ``n_features`` features and you want to compute the `Pearson correlation coefficients `_ between all pairs of features (for the `Spearman's rank correlation coefficients `_, see the *Note*-box below). If your data is small enough, you may use `scipy.stats.pearsonr `_ or `numpy.corrcoef `_, but for large data, neither of these methods is feasible. Scipy's pearsonr would be very slow, since you'd have to compute pair-wise correlations in a double loop, and numpy's corrcoef would most likely blow your RAM. Using DeepGraph's :py:meth:`create_edges <.create_edges>` method, you can compute all pair-wise correlations efficiently. In this tutorial, the data is stored on disc and only the relevant subset of features for each iteration will be loaded into memory by the computing nodes. Parallelization is achieved by using python's standard library `multiprocessing `_, but it should be straight-forward to modify the code to accommodate other parallelization libraries. It should also be straight-forward to modify the code in order to compute other correlation/distance/similarity-measures between a set of features. First of all, we need to import some packages .. code:: python # data i/o import os # compute in parallel from multiprocessing import Pool # the usual import numpy as np import pandas as pd import deepgraph as dg Let's create a set of variables and store it as a 2d-matrix ``X`` (``shape=(n_features, n_samples)``) on disc. To speed up the computation of the correlation coefficients later on, we whiten each variable. .. code:: python # create observations from numpy.random import RandomState prng = RandomState(0) n_features = int(5e3) n_samples = int(1e2) X = prng.randint(100, size=(n_features, n_samples)).astype(np.float64) # uncomment the next line to compute ranked variables for Spearman's correlation coefficients # X = X.argsort(axis=1).argsort(axis=1) # whiten variables for fast parallel computation later on X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True) # save in binary format np.save('samples', X) .. note:: On the computation of the `Spearman's rank correlation coefficients `_: Since the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables, it suffices to uncomment the indicated line in the above code-block in order to compute the Spearman's rank correlation coefficients in the following. Now we can compute the pair-wise correlations using DeepGraph's :py:meth:`create_edges <.create_edges>` method. Note that the node table :py:attr:`v <.DeepGraph.v>` only stores references to the mem-mapped array containing the samples. .. code:: python # parameters (change these to control RAM usage) step_size = 1e5 n_processes = 100 # load samples as memory-map X = np.load('samples.npy', mmap_mode='r') # create node table that stores references to the mem-mapped samples v = pd.DataFrame({'index': range(X.shape[0])}) # connector function to compute pairwise pearson correlations def corr(index_s, index_t): features_s = X[index_s] features_t = X[index_t] corr = np.einsum('ij,ij->i', features_s, features_t) / n_samples return corr # index array for parallelization pos_array = np.array(np.linspace(0, n_features*(n_features-1)//2, n_processes), dtype=int) # parallel computation def create_ei(i): from_pos = pos_array[i] to_pos = pos_array[i+1] # initiate DeepGraph g = dg.DeepGraph(v) # create edges g.create_edges(connectors=corr, step_size=step_size, from_pos=from_pos, to_pos=to_pos) # store edge table g.e.to_pickle('tmp/correlations/{}.pickle'.format(str(i).zfill(3))) # computation if __name__ == '__main__': os.makedirs("tmp/correlations", exist_ok=True) indices = np.arange(0, n_processes - 1) p = Pool() for _ in p.imap_unordered(create_ei, indices): pass Let's collect the computed correlation values and store them in an hdf file. .. code:: python # store correlation values files = os.listdir('tmp/correlations/') files.sort() store = pd.HDFStore('e.h5', mode='w') for f in files: et = pd.read_pickle('tmp/correlations/{}'.format(f)) store.append('e', et, format='t', data_columns=True, index=False) store.close() Let's have a quick look at the correlations. .. code:: python # load correlation table e = pd.read_hdf('e.h5') print(e) .. parsed-literal:: corr s t 0 1 -0.006066 2 0.094063 3 -0.025529 4 0.074080 5 0.035490 6 0.005221 7 0.032064 8 0.000378 9 -0.049318 10 -0.084853 11 0.026407 12 0.028543 13 -0.013347 14 -0.180113 15 0.151164 16 -0.094398 17 -0.124582 18 -0.000781 19 -0.044138 20 -0.193609 21 0.003877 22 0.048305 23 0.006477 24 -0.021291 25 -0.070756 26 -0.014906 27 -0.197605 28 -0.103509 29 0.071503 30 0.120718 ... ... 4991 4998 -0.012007 4999 -0.252836 4992 4993 0.202024 4994 -0.046088 4995 -0.028314 4996 -0.052319 4997 -0.010797 4998 -0.025321 4999 -0.093721 4993 4994 -0.027568 4995 0.045602 4996 -0.102075 4997 0.035370 4998 -0.069946 4999 -0.031208 4994 4995 0.108063 4996 0.144441 4997 0.078353 4998 -0.024799 4999 -0.026432 4995 4996 -0.019991 4997 -0.178458 4998 -0.162406 4999 0.102835 4996 4997 0.115812 4998 -0.061167 4999 0.018606 4997 4998 -0.151932 4999 -0.271358 4998 4999 0.106453 [12497500 rows x 1 columns] And finally, let's see where most of the computation time is spent. .. code:: python g = dg.DeepGraph(v) p = %prun -r g.create_edges(connectors=corr, step_size=step_size) .. code:: python p.print_stats(20) .. parsed-literal:: 244867 function calls (239629 primitive calls) in 14.193 seconds Ordered by: internal time List reduced from 541 to 20 due to restriction <20> ncalls tottime percall cumtime percall filename:lineno(function) 250 9.355 0.037 9.361 0.037 memmap.py:334(__getitem__) 125 1.584 0.013 1.584 0.013 {built-in method numpy.core.multiarray.c_einsum} 125 1.012 0.008 12.013 0.096 deepgraph.py:4558(map) 2 0.581 0.290 0.581 0.290 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects} 1 0.301 0.301 0.414 0.414 multi.py:795(_engine) 5 0.157 0.031 0.157 0.031 {built-in method numpy.core.multiarray.concatenate} 250 0.157 0.001 0.170 0.001 internals.py:5017(_stack_arrays) 2 0.105 0.053 0.105 0.053 {pandas._libs.algos.take_1d_int64_int64} 889 0.094 0.000 0.094 0.000 {method 'reduce' of 'numpy.ufunc' objects} 125 0.089 0.001 12.489 0.100 deepgraph.py:5294(_select_and_return) 125 0.074 0.001 0.074 0.001 {deepgraph._triu_indices._reduce_triu_indices} 125 0.066 0.001 0.066 0.001 {built-in method deepgraph._triu_indices._triu_indices} 4 0.038 0.009 0.038 0.009 {built-in method pandas._libs.algos.ensure_int16} 125 0.033 0.000 10.979 0.088 :12(corr) 2 0.028 0.014 0.028 0.014 function_base.py:4703(delete) 1 0.027 0.027 14.163 14.163 deepgraph.py:4788(_matrix_iterator) 1 0.027 0.027 0.113 0.113 multi.py:56(_codes_to_ints) 45771/45222 0.020 0.000 0.043 0.000 {built-in method builtins.isinstance} 1 0.019 0.019 14.193 14.193 deepgraph.py:183(create_edges) 2 0.012 0.006 0.700 0.350 algorithms.py:576(factorize) As you can see, most of the time is spent by getting the requested features in the corr-function, followed by computing the correlation values themselves.