DeepGraph - Analyze Data with Pandas-based Networks

Release:0.2.3
Date:Jun 14, 2021
Code:GitHub

Anaconda Version Anaconda Downloads Documentation PyPi

Contents

What is DeepGraph

DeepGraph is an open source Python implementation of a new network representation introduced here. Its purpose is to facilitate data analysis by interpreting data in terms of network theory.

The basis of this software package is Pandas, a fast and flexible data analysis tool for the Python programming language. Utilizing one of its primary data structures, the DataFrame, we represent objects (i.e. the nodes of a network) by one DataFrame, and their pairwise relations (i.e. the edges of a network) by another DataFrame.

One of the main features of DeepGraph is an efficient and scalable creation of edges. Given a set of nodes in the form of a DataFrame (or an on disc HDFStore), DeepGraph’s core class provides methods to iteratively compute pairwise relations between the nodes (e.g. similarity/distance measures) using arbitrary, user-defined functions on the nodes’ features. These methods provide arguments to parallelize the computation and control memory consumption, making them suitable for very large data-sets and adjustable to whatever hardware you have at hand (from netbooks to cluster architectures).

Furthermore, once a graph is constructed, DeepGraph allows you to partition its nodes, edges or the entire graph by the graph’s properties and labels, enabling the aggregation, computation and allocation of information on and between arbitrary groups of nodes. These methods also let you express elaborate queries on the information contained in a deep graph.

DeepGraph is not meant to replace or compete with already existing Python network libraries, such as NetworkX or graph_tool, but rather to combine and extend their capabilities with the merits of Pandas. For that matter, the core class of DeepGraph provides interfacing methods to convert to common network representations and graph objects of popular Python network packages.

Deepgraph also implements a number of useful plotting methods, including drawings on geographical map projections.

It’s also possible to represent multilayer networks by deep graphs. We’re thinking of implementing an interface to a suitable package dedicated to the analysis of multilayer networks.

Note

Please acknowledge and cite the use of this software and its authors when results are used in publications or published elsewhere. You can use the following BibTex entry

@Article{traxl-2016-deep,
  author      = {Dominik Traxl AND Niklas Boers AND J\"urgen Kurths},
  title       = {Deep Graphs - A general framework to represent and analyze
                 heterogeneous complex systems across scales},
  journal     = {Chaos},
  year        = {2016},
  volume      = {26},
  number      = {6},
  eid         = {065303},
  doi         = {http://dx.doi.org/10.1063/1.4952963},
  eprinttype  = {arxiv},
  eprintclass = {physics.data-an, cs.SI, physics.ao-ph, physics.soc-ph},
  eprint      = {http://arxiv.org/abs/1604.00971v1},
  version     = {1},
  date        = {2016-04-04},
  url         = {http://arxiv.org/abs/1604.00971v1}
}

To get started, have a look at

Want to share feedback, or contribute?

So far the package has only been developed by me, a fact that I would like to change very much. So if you feel like contributing in any way, shape or form, please feel free to contact me, report bugs, create pull requestes, milestones, etc. You can contact me via email: dominik.traxl@posteo.org

Note

This documentation assumes general familiarity with NumPy and Pandas. If you haven’t used these packages, do invest some time in learning about them first.

Note

DeepGraph is free software; you can redistribute it and/or modify it under the terms of the BSD License. We highly welcome contributions from the community.

Installation

Quick Install

DeepGraph can be installed via pip from PyPI

$ pip install deepgraph

Depending on your system, you may need root privileges. On UNIX-based operating systems (Linux, Mac OS X etc.) this is achieved with sudo

$ sudo pip install deepgraph

Alternatively, if you’re using Conda, install with

$ conda install -c conda-forge deepgraph

Installing from Source

Alternatively, you can install DeepGraph from source by downloading a source archive file (tar.gz or zip).

Source Archive File
  1. Download the source (tar.gz or zip file) from https://pypi.python.org/pypi/deepgraph/ or https://github.com/deepgraph/deepgraph/
  2. Unpack and change directory to the source directory (it should have the files README.rst and setup.py).
  3. Run python setup.py install to build and install. As a developer, you may want to install using cython: python setup.py install --use-cython.
  4. (Optional) Run py.test to execute the tests if you have pytest installed.
GitHub
  1. Clone the deepgraph repostitory

  2. Change directory to deepgraph

  3. Run python setup.py install to build and install. As a developer, you may want to install using cython: python setup.py install --use-cython.

  4. (Optional) Run py.test to execute the tests if you have pytest installed.

Installing without Root Privileges

If you don’t have permission to install software on your system, you can install into another directory using the --user, --prefix, or --home flags to setup.py.

For example

$ python setup.py install --prefix=/home/username/python

or

$ python setup.py install --home=~

or

$ python setup.py install --user

Note: If you didn’t install in the standard Python site-packages directory you will need to set your PYTHONPATH variable to the alternate location. See here for further details.

Requirements

The easiest way to get Python and the required/optional packages is to use Conda (or Miniconda), a cross-platform (Linux, Mac OS X, Windows) Python distribution for data analytics and scientific computing.

Python

To use DeepGraph you need Python 2.7, 3.4 or later.

Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas is the core dependency of DeepGraph, and it is highly recommended to install the recommended and optional dependencies of Pandas as well.

NumPy

NumPy is the fundamental package for scientific computing with Python.

Needed for internal operations.

Optional Packages

The following packages are considered to provide very useful tools and methods.

Scikit-Learn

sklearn is a Python module integrating classical machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib).

Sklearn-pandas

sklearn-pandas provides a bridge between Scikit-Learn’s machine learning methods and pandas-style Data Frames.

Tutorials

10 Minutes to DeepGraph

[ipython notebook] [python script] [data]

This is a short introduction to DeepGraph. In the following, we demonstrate DeepGraph’s core functionalities by a toy data-set, “flying balls”.

First of all, we need to import some packages

# for plots
import matplotlib.pyplot as plt

# the usual
import numpy as np
import pandas as pd

import deepgraph as dg

# notebook display
%matplotlib inline
plt.rcParams['figure.figsize'] = 8, 6
pd.options.display.max_rows = 10
pd.set_option('expand_frame_repr', False)

Loading Toy Data

Then, we need data in the form of a pandas DataFrame, representing the nodes of our graph

v = pd.read_csv('flying_balls.csv', index_col=0)
print(v)
      time            x          y  ball_id
0        0  1692.000000   0.000000        0
1        0  8681.000000   0.000000        1
2        0   490.000000   0.000000        2
3        0  7439.000000   0.000000        3
4        0  4998.000000   0.000000        4
...    ...          ...        ...      ...
1163    45  2812.552734  16.503178       39
1164    46  5686.915998  14.161693       10
1165    46  3161.729086  19.381823       14
1166    46  5594.233413  57.701712       37
1167    47  5572.216748  20.588750       37

[1168 rows x 4 columns]

The data consists of 1168 space-time measurements of 50 different toy balls in two-dimensional space. Each space-time measurement (i.e. row of v) represents a node.

Let’s plot the data such that each ball has it’s own color

plt.scatter(v.x, v.y, s=v.time, c=v.ball_id)
_images/10min_to_deepgraph_10_1.png
Creating Edges

In order to create edges between these nodes, we now initiate a dg.DeepGraph instance

g = dg.DeepGraph(v)
g
<DeepGraph object, with n=1168 node(s) and m=0 edge(s) at 0x7facf3b35dd8>

and use it to create edges between the nodes given by g.v. For that matter, we may define a connector function

def x_dist(x_s, x_t):
    dx = x_t - x_s
    return dx

and pass it to g.create_edges in order to compute the distance in the x-coordinate of each pair of nodes

g.create_edges(connectors=x_dist)
g
<DeepGraph object, with n=1168 node(s) and m=681528 edge(s) at 0x7facf3b35dd8>
print(g.e)
                    dx
s    t
0    1     6989.000000
     2    -1202.000000
     3     5747.000000
     4     3306.000000
     5     2812.000000
...                ...
1164 1166   -92.682585
     1167  -114.699250
1165 1166  2432.504327
     1167  2410.487662
1166 1167   -22.016665

[681528 rows x 1 columns]

Let’s say we’re only interested in creating edges between nodes with a x-distance smaller than 1000. Then we may additionally define a selector

def x_dist_selector(dx, sources, targets):
    dxa = np.abs(dx)
    sources = sources[dxa <= 1000]
    targets = targets[dxa <= 1000]
    return sources, targets

and pass both the connector and selector to g.create_edges

g.create_edges(connectors=x_dist, selectors=x_dist_selector)
g
<DeepGraph object, with n=1168 node(s) and m=156938 edge(s) at 0x7facf3b35dd8>
print(g.e)
                   dx
s    t
0    6     416.000000
     7     848.000000
     19   -973.000000
     24    437.000000
     38    778.000000
...               ...
1162 1167  -44.033330
1163 1165  349.176351
1164 1166  -92.682585
     1167 -114.699250
1166 1167  -22.016665

[156938 rows x 1 columns]

There is, however, a much more efficient way of creating edges that involve a simple distance threshold such as the one above

Creating Edges on a FastTrack

In order to efficiently create edges including a selection of edges via a simple distance threshold as above, one should use the create_edges_ft method. It relies on a sorted DataFrame, so we need to sort g.v first

g.v.sort_values('x', inplace=True)
g.create_edges_ft(ft_feature=('x', 1000))
g
<DeepGraph object, with n=1168 node(s) and m=156938 edge(s) at 0x7facf3b35dd8>

Let’s compare the efficiency

%timeit -n3 -r3 g.create_edges(connectors=x_dist, selectors=x_dist_selector)
3 loops, best of 3: 557 ms per loop
%timeit -n3 -r3 g.create_edges_ft(ft_feature=('x', 1000))
3 loops, best of 3: 167 ms per loop

The create_edges_ft method also accepts connectors and selectors as input. Let’s connect only those measurements that are close in space and time

def y_dist(y_s, y_t):
    dy = y_t - y_s
    return dy

def time_dist(time_t, time_s):
    dt = time_t - time_s
    return dt

def y_dist_selector(dy, sources, targets):
    dya = np.abs(dy)
    sources = sources[dya <= 100]
    targets = targets[dya <= 100]
    return sources, targets

def time_dist_selector(dt, sources, targets):
    dta = np.abs(dt)
    sources = sources[dta <= 1]
    targets = targets[dta <= 1]
    return sources, targets
g.create_edges_ft(ft_feature=('x', 100),
                  connectors=[y_dist, time_dist],
                  selectors=[y_dist_selector, time_dist_selector])
g
<DeepGraph object, with n=1168 node(s) and m=1899 edge(s) at 0x7facf3b35dd8>
print(g.e)
         dt         dy       ft_r
s   t
890 867  -1  19.311136  33.415831
867 843  -1  17.678482  33.415831
843 818  -1  16.045829  33.415831
818 792  -1  14.413176  33.415831
792 766  -1  12.780523  33.415831
...      ..        ...        ...
244 203  -1 -10.825226  15.455612
203 159  -1 -12.457879  15.455612
159 114  -1 -14.090532  15.455612
114 65   -1 -15.723185  15.455612
65  16   -1 -17.355838  15.455612

[1899 rows x 3 columns]

We can now plot the flying balls and the edges we just created with the plot_2d method

obj = g.plot_2d('x', 'y', edges=True,
                kwds_scatter={'c': g.v.ball_id, 's': g.v.time})
obj['ax'].set_xlim(1000,3000)
_images/10min_to_deepgraph_37_1.png
Graph Partitioning

The DeepGraph class also offers methods to partition nodes, edges and an entire graph. See the docstrings and the other tutorials for details and examples.

Graph Interfaces

Furthermore, you may inspect the docstrings of return_cs_graph, return_nx_graph and return_gt_graph to see how to convert from DeepGraph’s DataFrame representation of a network to sparse adjacency matrices, NetworkX’s network representation and graph_tool’s network representation.

Plotting Methods

DeepGraph also offers a number of useful Plotting methods. See plotting methods for details and have a look at the other tutorials for examples.

Computing Very Large Correlation Matrices in Parallel

[ipython notebook] [python script]

In this short tutorial, we’ll demonstrate how DeepGraph can be used to efficiently compute very large correlation matrices in parallel, with full control over RAM usage.

Assume you have a set of n_samples samples, each comprised of n_features features and you want to compute the Pearson correlation coefficients between all pairs of features (for the Spearman’s rank correlation coefficients, see the Note-box below). If your data is small enough, you may use scipy.stats.pearsonr or numpy.corrcoef, but for large data, neither of these methods is feasible. Scipy’s pearsonr would be very slow, since you’d have to compute pair-wise correlations in a double loop, and numpy’s corrcoef would most likely blow your RAM.

Using DeepGraph’s create_edges method, you can compute all pair-wise correlations efficiently. In this tutorial, the data is stored on disc and only the relevant subset of features for each iteration will be loaded into memory by the computing nodes. Parallelization is achieved by using python’s standard library multiprocessing, but it should be straight-forward to modify the code to accommodate other parallelization libraries. It should also be straight-forward to modify the code in order to compute other correlation/distance/similarity-measures between a set of features.

First of all, we need to import some packages

# data i/o
import os

# compute in parallel
from multiprocessing import Pool

# the usual
import numpy as np
import pandas as pd

import deepgraph as dg

Let’s create a set of variables and store it as a 2d-matrix X (shape=(n_features, n_samples)) on disc. To speed up the computation of the correlation coefficients later on, we whiten each variable.

# create observations
from numpy.random import RandomState
prng = RandomState(0)
n_features = int(5e3)
n_samples = int(1e2)
X = prng.randint(100, size=(n_features, n_samples)).astype(np.float64)

# uncomment the next line to compute ranked variables for Spearman's correlation coefficients
# X = X.argsort(axis=1).argsort(axis=1)

# whiten variables for fast parallel computation later on
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# save in binary format
np.save('samples', X)

Note

On the computation of the Spearman’s rank correlation coefficients: Since the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables, it suffices to uncomment the indicated line in the above code-block in order to compute the Spearman’s rank correlation coefficients in the following.

Now we can compute the pair-wise correlations using DeepGraph’s create_edges method. Note that the node table v only stores references to the mem-mapped array containing the samples.

# parameters (change these to control RAM usage)
step_size = 1e5
n_processes = 100

# load samples as memory-map
X = np.load('samples.npy', mmap_mode='r')

# create node table that stores references to the mem-mapped samples
v = pd.DataFrame({'index': range(X.shape[0])})

# connector function to compute pairwise pearson correlations
def corr(index_s, index_t):
    features_s = X[index_s]
    features_t = X[index_t]
    corr = np.einsum('ij,ij->i', features_s, features_t) / n_samples
    return corr

# index array for parallelization
pos_array = np.array(np.linspace(0, n_features*(n_features-1)//2, n_processes), dtype=int)

# parallel computation
def create_ei(i):

    from_pos = pos_array[i]
    to_pos = pos_array[i+1]

    # initiate DeepGraph
    g = dg.DeepGraph(v)

    # create edges
    g.create_edges(connectors=corr, step_size=step_size,
                   from_pos=from_pos, to_pos=to_pos)

    # store edge table
    g.e.to_pickle('tmp/correlations/{}.pickle'.format(str(i).zfill(3)))

# computation
if __name__ == '__main__':
    os.makedirs("tmp/correlations", exist_ok=True)
    indices = np.arange(0, n_processes - 1)
    p = Pool()
    for _ in p.imap_unordered(create_ei, indices):
        pass

Let’s collect the computed correlation values and store them in an hdf file.

# store correlation values
files = os.listdir('tmp/correlations/')
files.sort()
store = pd.HDFStore('e.h5', mode='w')
for f in files:
    et = pd.read_pickle('tmp/correlations/{}'.format(f))
    store.append('e', et, format='t', data_columns=True, index=False)
store.close()

Let’s have a quick look at the correlations.

# load correlation table
e = pd.read_hdf('e.h5')
print(e)
               corr
s    t
0    1    -0.006066
     2     0.094063
     3    -0.025529
     4     0.074080
     5     0.035490
     6     0.005221
     7     0.032064
     8     0.000378
     9    -0.049318
     10   -0.084853
     11    0.026407
     12    0.028543
     13   -0.013347
     14   -0.180113
     15    0.151164
     16   -0.094398
     17   -0.124582
     18   -0.000781
     19   -0.044138
     20   -0.193609
     21    0.003877
     22    0.048305
     23    0.006477
     24   -0.021291
     25   -0.070756
     26   -0.014906
     27   -0.197605
     28   -0.103509
     29    0.071503
     30    0.120718
...             ...
4991 4998 -0.012007
     4999 -0.252836
4992 4993  0.202024
     4994 -0.046088
     4995 -0.028314
     4996 -0.052319
     4997 -0.010797
     4998 -0.025321
     4999 -0.093721
4993 4994 -0.027568
     4995  0.045602
     4996 -0.102075
     4997  0.035370
     4998 -0.069946
     4999 -0.031208
4994 4995  0.108063
     4996  0.144441
     4997  0.078353
     4998 -0.024799
     4999 -0.026432
4995 4996 -0.019991
     4997 -0.178458
     4998 -0.162406
     4999  0.102835
4996 4997  0.115812
     4998 -0.061167
     4999  0.018606
4997 4998 -0.151932
     4999 -0.271358
4998 4999  0.106453

[12497500 rows x 1 columns]

And finally, let’s see where most of the computation time is spent.

g = dg.DeepGraph(v)
p = %prun -r g.create_edges(connectors=corr, step_size=step_size)
p.print_stats(20)
         244867 function calls (239629 primitive calls) in 14.193 seconds

   Ordered by: internal time
   List reduced from 541 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      250    9.355    0.037    9.361    0.037 memmap.py:334(__getitem__)
      125    1.584    0.013    1.584    0.013 {built-in method numpy.core.multiarray.c_einsum}
      125    1.012    0.008   12.013    0.096 deepgraph.py:4558(map)
        2    0.581    0.290    0.581    0.290 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects}
        1    0.301    0.301    0.414    0.414 multi.py:795(_engine)
        5    0.157    0.031    0.157    0.031 {built-in method numpy.core.multiarray.concatenate}
      250    0.157    0.001    0.170    0.001 internals.py:5017(_stack_arrays)
        2    0.105    0.053    0.105    0.053 {pandas._libs.algos.take_1d_int64_int64}
      889    0.094    0.000    0.094    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      125    0.089    0.001   12.489    0.100 deepgraph.py:5294(_select_and_return)
      125    0.074    0.001    0.074    0.001 {deepgraph._triu_indices._reduce_triu_indices}
      125    0.066    0.001    0.066    0.001 {built-in method deepgraph._triu_indices._triu_indices}
        4    0.038    0.009    0.038    0.009 {built-in method pandas._libs.algos.ensure_int16}
      125    0.033    0.000   10.979    0.088 <ipython-input-3-26c4f59cd911>:12(corr)
        2    0.028    0.014    0.028    0.014 function_base.py:4703(delete)
        1    0.027    0.027   14.163   14.163 deepgraph.py:4788(_matrix_iterator)
        1    0.027    0.027    0.113    0.113 multi.py:56(_codes_to_ints)
45771/45222    0.020    0.000    0.043    0.000 {built-in method builtins.isinstance}
        1    0.019    0.019   14.193   14.193 deepgraph.py:183(create_edges)
        2    0.012    0.006    0.700    0.350 algorithms.py:576(factorize)

As you can see, most of the time is spent by getting the requested features in the corr-function, followed by computing the correlation values themselves.

Building a DeepGraph of Extreme Precipitation

[ipython notebook] [python script] [data]

In the following we build a deep graph of a high-resolution dataset of precipitation measurements.

The goal is to first detect spatiotemporal clusters of extreme precipitation events and then to create families of these clusters based on a spatial correlation measure. Finally, we create and plot some informative (intersection) partitions of the deep graph.

For further details see section V of the original paper: https://arxiv.org/abs/1604.00971

First of all, we need to import some packages

# data i/o
import os
import xarray

# for plots
import matplotlib.pyplot as plt

# the usual
import numpy as np
import pandas as pd

import deepgraph as dg

# notebook display
from IPython.display import HTML
%matplotlib inline
plt.rcParams['figure.figsize'] = 8, 6
pd.options.display.max_rows = 10
pd.set_option('expand_frame_repr', False)
Selecting and Preprocessing the Precipitation Data
Selection

If you want to select your own spatiotemporal box of precipitation events, you may follow the instructions below and change the filename in the next box of code.

  • Go to https://disc.gsfc.nasa.gov/datasets/TRMM_3B42_V7/summary?keywords=TRMM_3B42_V7
  • click on “Simple Subset Wizard”
  • select the “Date Range” (and if desired a “Spatial Bounding Box”) you’re interested in
  • click on “Search for Data Sets”
  • expand the list by clicking on the “+” symbol
  • mark the check box “precipitation”
  • (optional, but recommended) click on the selector to change from “netCDF” to “gzipped netCDF”
  • click on “Subset Selected Data Sets”
  • click on “View Subset Results”
  • right click on the “Get list of URLs for this subset in a file” link, and choose “Save Link As…”
  • the downloaded file will have a name similar to “SSW_download_2016-05-03T20_19_28_23621_2oIe06xp.inp”. Note which directory the downloaded file is saved to, and in your Unix shell, set your current working directory to that directory.
  • register an account to get authentication credentials using these instructions: https://disc.gsfc.nasa.gov/information/howto/5761bc6a5ad5a18811681bae?keywords=wget
  • get the files via
os.system("wget --content-disposition --directory-prefix=tmp --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --auth-no-challenge=on --keep-session-cookies -i SSW_download_2016-05-03T20_19_28_23621_2oIe06xp.inp")
Preprocessing

Next, we need to convert the downloaded netCDF files to a pandas DataFrame, which we can then use to initiate a dg.DeepGraph

# choose "wet times" threshold
r = .1
# choose "extreme" precipitation threshold
p = .9

v_list = []
for file in os.listdir('tmp'):
    if file.startswith('3B42.'):

        # open the downloaded netCDF file
        # unfortunately, we have to decode times ourselves, since
        # the format of the downloaded files doesn't work
        # see also: https://github.com/pydata/xarray/issues/521
        f = xarray.open_dataset('tmp/{}'.format(file), decode_times=False)

        # create integer-based (x,y) coordinates
        f['x'] = (('longitude'), np.arange(len(f.longitude)))
        f['y'] = (('latitude'), np.arange(len(f.latitude)))

        # convert to pd.DataFrame
        vt = f.to_dataframe()

        # we only consider "wet times", pcp >= 0.1mm/h
        vt = vt[vt.pcp >= r]

        # reset index
        vt.reset_index(inplace=True)

        # add correct times
        ftime = f.time.units.split()[2:]
        year, month, day = ftime[0].split('-')
        hour = ftime[1]
        time = pd.datetime(int(year), int(month), int(day), int(hour))
        vt['time'] = time

        # compute "area" for each event
        vt['area'] = 111**2 * .25**2 * np.cos(2*np.pi*vt.latitude / 360.)

        # compute "volume of water precipitated" for each event
        vt['vol'] = vt.pcp * 3 * vt.area

        # set dtypes -> economize ram
        vt['pcp'] = vt['pcp'] * 100
        vt['pcp'] = vt['pcp'].astype(np.uint16)
        vt['latitude'] = vt['latitude'].astype(np.float16)
        vt['longitude'] = vt['longitude'].astype(np.float16)
        vt['area'] = vt['area'].astype(np.uint16)
        vt['vol'] = vt['vol'].astype(np.uint32)
        vt['x'] = vt['x'].astype(np.uint16)
        vt['y'] = vt['y'].astype(np.uint16)

        # append to list
        v_list.append(vt)
        f.close()

# concatenate the DataFrames
v = pd.concat(v_list)

# append a column indicating geographical locations (i.e., supernode labels)
v['g_id'] = v.groupby(['longitude', 'latitude']).grouper.group_info[0]
v['g_id'] = v['g_id'].astype(np.uint32)

# select `p`th percentile of precipitation events for each geographical location
v = v.groupby('g_id').apply(lambda x: x[x.pcp >= x.pcp.quantile(p)])

# append integer-based time
dtimes = pd.date_range(v.time.min(), v.time.max(), freq='3H')
dtdic = {dtime: itime for itime, dtime in enumerate(dtimes)}
v['itime'] = v.time.apply(lambda x: dtdic[x])
v['itime'] = v['itime'].astype(np.uint16)

# sort by time
v.sort_values('time', inplace=True)

# set unique node index
v.set_index(np.arange(len(v)), inplace=True)

# shorten column names
v.rename(columns={'pcp': 'r',
                  'latitude': 'lat',
                  'longitude': 'lon',
                  'time': 'dtime',
                  'itime': 'time'},
         inplace=True)

The created DataFrame of extreme precipitation measurements looks like this

print(v)
           lat      lon      dtime     r    x    y  area    vol   g_id  time
0       15.125 -118.125 2004-08-20  1084   28  101   743  24174   5652     0
1       44.875  -30.625 2004-08-20   392  378  220   545   6433  85341     0
2       45.125  -30.625 2004-08-20   454  378  221   543   7416  85342     0
3       45.375  -30.625 2004-08-20   909  378  222   540  14767  85343     0
4       45.625  -30.625 2004-08-20   907  378  223   538  14669  85344     0
...        ...      ...        ...   ...  ...  ...   ...    ...    ...   ...
382306  26.875  -46.625 2004-09-27   503  314  148   686  10385  70380   304
382307  38.375  -37.125 2004-09-27   453  352  194   603   8222  79095   304
382308   8.125 -105.125 2004-09-27   509   80   73   762  11663  17007   304
382309  21.875  -42.875 2004-09-27   260  329  128   714   5595  73875   304
382310   6.625 -111.125 2004-09-27   192   56   67   764   4428  11790   304

[382311 rows x 10 columns]

We identify each row of this table as a node of our DeepGraph

g = dg.DeepGraph(v)
Plot the Data

Let’s take a look at the data by creating a video of the time-evolution of precipitation measurements. Using the plot_map_generator method, this is straight forward.

# configure map projection
kwds_basemap = {'llcrnrlon': v.lon.min() - 1,
                'urcrnrlon': v.lon.max() + 1,
                'llcrnrlat': v.lat.min() - 1,
                'urcrnrlat': v.lat.max() + 1,
                'resolution': 'i'}

# configure scatter plots
kwds_scatter = {'s': 1.5,
                'c': g.v.r.values / 100.,
                'edgecolors': 'none',
                'cmap': 'viridis_r'}

# create generator of scatter plots on map
objs = g.plot_map_generator('lon', 'lat', 'dtime',
                            kwds_basemap=kwds_basemap,
                            kwds_scatter=kwds_scatter)

# plot and store frames
for i, obj in enumerate(objs):

    # configure plots
    cb = obj['fig'].colorbar(obj['pc'], fraction=0.025, pad=0.01)
    cb.set_label('[mm/h]')
    obj['m'].fillcontinents(color='0.2', zorder=0, alpha=.4)
    obj['ax'].set_title('{}'.format(obj['group']))

    # store and close
    obj['fig'].savefig('tmp/pcp_{:03d}.png'.format(i),
                       dpi=300, bbox_inches='tight')
    plt.close(obj['fig'])
# create video with ffmpeg
cmd = "ffmpeg -y -r 5 -i tmp/pcp_%03d.png -c:v libx264 -r 20 -vf scale=2052:1004 {}.mp4"
os.system(cmd.format('precipitation_files/pcp'))
# embed video
HTML("""
<video width="700" height="350" controls>
  <source src="precipitation_files/pcp.mp4" type="video/mp4">
</video>
""")

[download video]

Detecting SpatioTemporal Clusters of Extreme Precipitation

In this tutorial, we’re interested in local formations of spatiotemporal clusters of extreme precipitation events. For that matter, we now use DeepGraph to identify such clusters and track their temporal evolution.

Create Edges

We now use DeepGraph to create edges between the nodes given by g.v.

The edges of g will be utilized to detect spatiotemporal clusters in the data, or in more technical terms: to partition the set of all nodes into subsets of connected grid points. One can imagine the nodes to be elements of a \(3\) dimensional grid box (x,y,time), where we allow every node to have \(26\) possible neighbours (\(8\) neighbours in the time slice of the measurement, \(t_i\), and \(9\) neighbours in each the time slice \(t_i − 1\) and \(t_i + 1\)).

For that matter, we pass the following connectors

def grid_2d_dx(x_s, x_t):
    dx = x_t - x_s
    return dx

def grid_2d_dy(y_s, y_t):
    dy = y_t - y_s
    return dy

and selectors

def s_grid_2d_dx(dx, sources, targets):
    dxa = np.abs(dx)
    sources = sources[dxa <= 1]
    targets = targets[dxa <= 1]
    return sources, targets

def s_grid_2d_dy(dy, sources, targets):
    dya = np.abs(dy)
    sources = sources[dya <= 1]
    targets = targets[dya <= 1]
    return sources, targets

to the create_edges_ft method

g.create_edges_ft(ft_feature=('time', 1),
                  connectors=[grid_2d_dx, grid_2d_dy],
                  selectors=[s_grid_2d_dx, s_grid_2d_dy],
                  r_dtype_dic={'ft_r': np.bool,
                               'dx': np.int8,
                               'dy': np.int8},
                  logfile='create_e',
                  max_pairs=1e7)

# rename fast track relation
g.e.rename(columns={'ft_r': 'dt'}, inplace=True)

To see how many nodes and edges our graph’s comprised of, one may simply type

g
<DeepGraph object, with n=382311 node(s) and m=567225 edge(s) at 0x7f7a4c3de160>

The edges we just created look like this

print(g.e)
               dx  dy     dt
s      t
0      1362     0   1  False
       1432     1   0  False
       1433     1   1  False
       1696     1   0   True
       1699     1   1   True
...            ..  ..    ...
382284 382291   0   1  False
382295 382296   0   1  False
382296 382299   0   1  False
382299 382309   0   1  False
382304 382306   0   1  False

[567225 rows x 3 columns]

Logfile Plot

To see how long it took to create the edges, one may use the plot_logfile method

g.plot_logfile('create_e')
_images/precipitation_38_1.png
Find the Connected Components

Having linked all neighbouring nodes, the spatiotemporal clusters can be identified as the connected components of the graph. For practical reasons, DeepGraph directly implements a method to find the connected components of a graph, append_cp

# all singular components (components comprised of one node only)
# are consolidated under the label 0
g.append_cp(consolidate_singles=True)
# we don't need the edges any more
del g.e

the node table now has a component membership column appended

print(g.v)
           lat      lon      dtime     r    x    y  area    vol   g_id  time     cp
0       15.125 -118.125 2004-08-20  1084   28  101   743  24174   5652     0    865
1       44.875  -30.625 2004-08-20   392  378  220   545   6433  85341     0   5079
2       45.125  -30.625 2004-08-20   454  378  221   543   7416  85342     0   5079
3       45.375  -30.625 2004-08-20   909  378  222   540  14767  85343     0   5079
4       45.625  -30.625 2004-08-20   907  378  223   538  14669  85344     0   5079
...        ...      ...        ...   ...  ...  ...   ...    ...    ...   ...    ...
382306  26.875  -46.625 2004-09-27   503  314  148   686  10385  70380   304    609
382307  38.375  -37.125 2004-09-27   453  352  194   603   8222  79095   304      0
382308   8.125 -105.125 2004-09-27   509   80   73   762  11663  17007   304    174
382309  21.875  -42.875 2004-09-27   260  329  128   714   5595  73875   304      8
382310   6.625 -111.125 2004-09-27   192   56   67   764   4428  11790   304  15610

[382311 rows x 11 columns]

Let’s see how many spatiotemporal clusters g is comprised of (discarding singular components)

g.v.cp.max()
33169

and how many nodes there are in the components

print(g.v.cp.value_counts())
0        64678
1        16460
2         8519
3         6381
4         3403
         ...
29601        2
27554        2
25507        2
23460        2
20159        2
Name: cp, dtype: int64
Partition the Nodes Into a Component Supernode Table

In order to aggregate and compute some information about the precipitiation clusters, we now partition the nodes by the type of feature cp, the component membership labels of the nodes just created. This can be done with the partition_nodes method

# feature functions, will be applied to each component of g
feature_funcs = {'dtime': [np.min, np.max],
                 'time': [np.min, np.max],
                 'vol': [np.sum],
                 'lat': [np.mean],
                 'lon': [np.mean]}

# partition the node table
cpv, gv = g.partition_nodes('cp', feature_funcs, return_gv=True)

# append geographical id sets
cpv['g_ids'] = gv['g_id'].apply(set)

# append cardinality of g_id sets
cpv['n_unique_g_ids'] = cpv['g_ids'].apply(len)

# append time spans
cpv['dt'] = cpv['dtime_amax'] - cpv['dtime_amin']

# append spatial coverage
def area(group):
    return group.drop_duplicates('g_id').area.sum()
cpv['area'] = gv.apply(area)

The clusters look like this

print(cpv)
       n_nodes          dtime_amin          dtime_amax  time_amin  time_amax   lat_mean    vol_sum   lon_mean                                              g_ids  n_unique_g_ids               dt      area
cp
0        64678 2004-08-20 00:00:00 2004-09-27 00:00:00          0        304  17.609375  627097323  -63.40625  {0, 1, 2, 6, 7, 10, 12, 13, 14, 22, 23, 24, 25...           49808 38 days 00:00:00  34781178
1        16460 2004-09-01 06:00:00 2004-09-17 18:00:00         98        230  17.281250  351187150  -65.12500  {65536, 65537, 65538, 65539, 65540, 65541, 655...            6629 16 days 12:00:00   4803624
2         8519 2004-09-17 03:00:00 2004-09-24 15:00:00        225        285  26.906250  133698579  -44.62500  {73728, 73729, 73730, 73731, 73732, 73733, 737...            3730  7 days 12:00:00   2507350
3         6381 2004-08-26 09:00:00 2004-09-06 03:00:00         51        137  21.062500  113782748  -64.12500  {65555, 65556, 65557, 65558, 65559, 65560, 655...            2442 10 days 18:00:00   1749673
4         3403 2004-08-21 21:00:00 2004-08-24 12:00:00         15         36  10.578125   66675326 -111.93750  {8141, 14654, 11805, 16363, 8142, 11806, 20490...            1294  2 days 15:00:00    978604
...        ...                 ...                 ...        ...        ...        ...        ...        ...                                                ...             ...              ...       ...
33165        2 2004-08-23 18:00:00 2004-08-23 18:00:00         30         30  15.500000      20212 -103.87500                                     {18115, 18116}               2  0 days 00:00:00      1483
33166        2 2004-09-05 18:00:00 2004-09-05 18:00:00        134        134  27.250000       9366 -121.87500                                       {2688, 2687}               2  0 days 00:00:00      1368
33167        2 2004-08-30 15:00:00 2004-08-30 15:00:00         85         85   9.250000      43096    0.62500                                   {112116, 112117}               2  0 days 00:00:00      1519
33168        2 2004-09-09 03:00:00 2004-09-09 03:00:00        161        161   6.750000      24156  -13.62500                                   {100613, 100614}               2  0 days 00:00:00      1528
33169        2 2004-09-11 03:00:00 2004-09-11 03:00:00        177        177  15.500000      46798  -16.12500                                     {98523, 98524}               2  0 days 00:00:00      1483

[33170 rows x 12 columns]
Plot the Largest Component

Let’s see how the largest cluster of extreme precipitation evolves over time, again using the plot_map_generator method

# temporary DeepGraph instance containing
# only the largest component
gt = dg.DeepGraph(g.v)
gt.filter_by_values_v('cp', 1)

# configure map projection
from mpl_toolkits.basemap import Basemap
m1 = Basemap(projection='ortho',
             lon_0=cpv.loc[1].lon_mean + 12,
             lat_0=cpv.loc[1].lat_mean + 8,
             resolution=None)
width = (m1.urcrnrx - m1.llcrnrx) * .65
height = (m1.urcrnry - m1.llcrnry) * .45

kwds_basemap = {'projection': 'ortho',
                'lon_0': cpv.loc[1].lon_mean + 12,
                'lat_0': cpv.loc[1].lat_mean + 8,
                'llcrnrx': -0.5 * width,
                'llcrnry': -0.5 * height,
                'urcrnrx': 0.5 * width,
                'urcrnry': 0.5 * height,
                'resolution': 'i'}

# configure scatter plots
kwds_scatter = {'s': 2,
                'c': np.log(gt.v.r.values / 100.),
                'edgecolors': 'none',
                'cmap': 'viridis_r'}

# create generator of scatter plots on map
objs = gt.plot_map_generator('lon', 'lat', 'dtime',
                              kwds_basemap=kwds_basemap,
                              kwds_scatter=kwds_scatter)

# plot and store frames
for i, obj in enumerate(objs):

    # configure plots
    obj['m'].fillcontinents(color='0.2', zorder=0, alpha=.4)
    obj['m'].drawparallels(range(-50, 50, 20), linewidth=.2)
    obj['m'].drawmeridians(range(0, 360, 20), linewidth=.2)
    obj['ax'].set_title('{}'.format(obj['group']))

    # store and close
    obj['fig'].savefig('tmp/cp1_ortho_{:03d}.png'.format(i),
                       dpi=300, bbox_inches='tight')
    plt.close(obj['fig'])
# create video with ffmpeg
cmd = "ffmpeg -y -r 5 -i tmp/cp1_ortho_%03d.png -c:v libx264 -r 20 -vf scale=1919:1406 {}.mp4"
os.system(cmd.format('precipitation_files/cp1_ortho'))
# embed video
HTML("""
<video width="700" height="500" controls>
  <source src="precipitation_files/cp1_ortho.mp4" type="video/mp4">
</video>
""")

[download video]

Create and Plot Informative (Intersection) Partitions

In this last section, we create some useful (intersection) partitions of the deep graph, which we then use to create some plots.

Geographical Locations
# how many components have hit a certain
# geographical location (discarding singular cps)
def count(cp):
    return len(set(cp[cp != 0]))

# feature functions, will be applied to each g_id
feature_funcs = {'cp': [count],
                 'vol': [np.sum],
                 'lat': np.min,
                 'lon': np.min}

gv = g.partition_nodes('g_id', feature_funcs)
gv.rename(columns={'lat_amin': 'lat',
                   'lon_amin': 'lon'}, inplace=True)
print(gv)
        n_nodes  cp_count     lat  vol_sum      lon
g_id
0             2         1 -10.125    10142 -125.125
1             2         1  -9.875     8716 -125.125
2             2         0  -9.625     4372 -125.125
3             2         2  -9.375     5310 -125.125
4             2         2  -9.125     6409 -125.125
...         ...       ...     ...      ...      ...
115618        2         1  48.875    14319    5.125
115619        1         1  49.125    10129    5.125
115620        2         1  49.375    12826    5.125
115621        2         2  49.625     9117    5.125
115622        2         1  49.875    12101    5.125

[115623 rows x 5 columns]
Plot GeoLocational Information
cols = {'n_nodes': gv.n_nodes,
        'vol sum': gv.vol_sum,
        'cp count': gv.cp_count}

for name, col in cols.items():

    # for easy filtering, we create a new DeepGraph instance for
    # each component
    gt = dg.DeepGraph(gv)

    # configure map projection
    kwds_basemap = {'llcrnrlon': v.lon.min() - 1,
                    'urcrnrlon': v.lon.max() + 1,
                    'llcrnrlat': v.lat.min() - 1,
                    'urcrnrlat': v.lat.max() + 1}

    # configure scatter plots
    kwds_scatter = {'s': 1,
                    'c': col.values,
                    'cmap': 'viridis_r',
                    'alpha': .5,
                    'edgecolors': 'none'}

    # create scatter plot on map
    obj = gt.plot_map(lon='lon', lat='lat',
                      kwds_basemap=kwds_basemap,
                      kwds_scatter=kwds_scatter)

    # configure plots
    obj['m'].drawcoastlines(linewidth=.8)
    obj['m'].drawparallels(range(-50, 50, 20), linewidth=.2)
    obj['m'].drawmeridians(range(0, 360, 20), linewidth=.2)
    obj['ax'].set_title(name)

    # colorbar
    cb = obj['fig'].colorbar(obj['pc'], fraction=.022, pad=.02)
    cb.set_label('{}'.format(name), fontsize=15)
_images/precipitation_83_0.png _images/precipitation_83_1.png _images/precipitation_83_2.png
Geographical Locations and Families

In order to create the intersection partition of geographical locations and families, we first need to append a family membership column to v

# create F col
v['F'] = np.ones(len(v), dtype=int) * -1
gcpv = cpv.groupby('F')
it = gcpv.apply(lambda x: x.index.values)

for F in range(len(it)):
    cp_index = v.cp.isin(it.iloc[F])
    v.loc[cp_index, 'F'] = F

Then we create the intersection partition

# feature funcs
def n_cp_nodes(cp):
    return len(cp.unique())

feature_funcs = {'vol': [np.sum],
                 'lat': np.min,
                 'lon': np.min,
                 'cp': n_cp_nodes}

# create family-g_id intersection graph
fgv = g.partition_nodes(['F', 'g_id'], feature_funcs=feature_funcs)
fgv.rename(columns={'lat_amin': 'lat',
                    'lon_amin': 'lon',
                    'cp_n_cp_nodes': 'n_cp_nodes'}, inplace=True)

which looks like this

print(fgv)
            n_nodes  n_cp_nodes     lat  vol_sum      lon
F    g_id
-1   0            2           2 -10.125    10142 -125.125
     1            2           2  -9.875     8716 -125.125
     2            2           1  -9.625     4372 -125.125
     3            2           2  -9.375     5310 -125.125
     4            2           2  -9.125     6409 -125.125
...             ...         ...     ...      ...      ...
 998 26685        1           1  -8.875      593  -93.625
     26686        1           1  -8.625      411  -93.625
     26887        1           1  -9.375      364  -93.375
     26888        1           1  -9.125      478  -93.375
     26889        1           1  -8.875      456  -93.375

[186903 rows x 5 columns]
Plot Family Information
families = [0,1,2,3]

for F in families:

    # for easy filtering, we create a new DeepGraph instance for
    # each component
    gt = dg.DeepGraph(fgv.loc[F])

    # configure map projection
    kwds_basemap = {'llcrnrlon': v.lon.min() - 1,
                    'urcrnrlon': v.lon.max() + 1,
                    'llcrnrlat': v.lat.min() - 1,
                    'urcrnrlat': v.lat.max() + 1}

    # configure scatter plots
    kwds_scatter = {'s': 1,
                    'c': gt.v.n_cp_nodes.values,
                    'cmap': 'viridis_r',
                    'edgecolors': 'none'}

    # create scatter plot on map
    obj = gt.plot_map(
        lat='lat', lon='lon',
        kwds_basemap=kwds_basemap, kwds_scatter=kwds_scatter)

    # configure plots
    obj['m'].drawcoastlines(linewidth=.8)
    obj['m'].drawparallels(range(-50, 50, 20), linewidth=.2)
    obj['m'].drawmeridians(range(0, 360, 20), linewidth=.2)
    cb = obj['fig'].colorbar(obj['pc'], fraction=.022, pad=.02)
    cb.set_label('n_cps', fontsize=15)
    obj['ax'].set_title('Family {}'.format(F))
_images/precipitation_92_0.png _images/precipitation_92_1.png _images/precipitation_92_2.png _images/precipitation_92_3.png
Geographical Locations and Components
# feature functions, will be applied on each [g_id, cp] group of g
feature_funcs = {'vol': [np.sum],
                 'lat': np.min,
                 'lon': np.min}

# create gcpv
gcpv = g.partition_nodes(['cp', 'g_id'], feature_funcs)

gcpv.rename(columns={'lat_amin': 'lat',
                     'lon_amin': 'lon'}, inplace=True)
print(gcpv)
              n_nodes     lat  vol_sum      lon
cp    g_id
0     0             1 -10.125     5071 -125.125
      1             1  -9.875     4415 -125.125
      2             2  -9.625     4372 -125.125
      6             3  -8.375     1026 -125.125
      7             1  -8.125      594 -125.125
...               ...     ...      ...      ...
33167 112117        1   9.375    24618    0.625
33168 100613        1   6.625    11450  -13.625
      100614        1   6.875    12706  -13.625
33169 98523         1  15.375    31057  -16.125
      98524         1  15.625    15741  -16.125

[287301 rows x 4 columns]
Plot Component Information
# select the components to plot
comps = [1, 2, 3, 4]

fig, axs = plt.subplots(2, 2, figsize=[10,8])
axs = axs.flatten()

for comp, ax in zip(comps, axs):

    # for easy filtering, we create a new DeepGraph instance for
    # each component
    gt = dg.DeepGraph(gcpv[gcpv.index.get_level_values('cp') == comp])

    # configure map projection
    kwds_basemap = {'projection': 'ortho',
                    'lon_0': cpv.loc[comp].lon_mean,
                    'lat_0': cpv.loc[comp].lat_mean,
                    'resolution': 'c'}

    # configure scatter plots
    kwds_scatter = {'s': .5,
                    'c': gt.v.vol_sum.values,
                    'cmap': 'viridis_r',
                    'edgecolors': 'none'}

    # create scatter plot on map
    obj = gt.plot_map(lon='lon', lat='lat',
                      kwds_basemap=kwds_basemap,
                      kwds_scatter=kwds_scatter,
                      ax=ax)

    # configure plots
    obj['m'].fillcontinents(color='0.2', zorder=0, alpha=.2)
    obj['m'].drawparallels(range(-50, 50, 20), linewidth=.2)
    obj['m'].drawmeridians(range(0, 360, 20), linewidth=.2)
    obj['ax'].set_title('cp {}'.format(comp))
_images/precipitation_97_0.png

From Multilayer Networks to Deep Graphs

[ipython notebook] [python script]

In this tutorial we exemplify the representation of multilayer networks (MLNs) by deep graphs and demonstrate some of the advantages of deepgraph’s network representation.

We start by converting the Noordin Top Terrorist MLN into a graph g - comprised of two DataFrames, a node table g.v and an edge table g.e - that corresponds to the supra-graph representation of the multilayer network.

We then partition the graph g by the information attributed to its layers, resulting in different supergraphs on the partition lattice of g that correpsond to different representations of a MLN (including its tensor representation).

In the next part, we demonstrate how additional information that might be at hand or computed during the analysis can be used to induce further supergraphs, or metaphorically speaking, how additional information corresponds to “hidden layers” of a MLN.

Finally, we briefly show how to use the nodes’ properties to partition the edges of a MLN.

** References **

For a short summary of the multilayer network representation, see Appendix C of the Deep Graphs paper.

For a more in-depth introduction to MLNs, I recommend the following papers:

For a discussion of how Deep Graphs relates to the multilayer network representation, see Sec. IV B and Appendix D of the Deep Graphs paper.

The Noordin Top Terrorist Data
The Noordin Top Terrorist Network

[high-res version] [python plot script]

The data we use in this tutorial is the Noordin Top Terrorist Network, which has previously been represented as a multilayer network (e.g., http://arxiv.org/abs/1308.3182)

It includes relational data on 79 Indonesian terrorists belonging to the so-called Noordin Top Terrorist Network.

For information about the individual’s attributes and their relations, see http://www.thearda.com/archive/files/codebooks/origCB/Noordin%20Subset%20Codebook.pdf and http://arxiv.org/pdf/1308.3182v3.pdf.

Preprocessing

We download the data from here, and process it into two pandas DataFrames, a node table and an edge table. The preprocessing is quite lengthy, so you might want to proceed directly to the next section.

First of all, we need to import some packages

# data i/o
import os
import subprocess
import zipfile

# for plots
import matplotlib.pyplot as plt

# the usual
import numpy as np
import pandas as pd

import deepgraph as dg

# notebook display
%matplotlib inline
pd.options.display.max_rows = 10
pd.set_option('expand_frame_repr', False)
Preprocessing the Nodes
# zip file containing node attributes
os.makedirs("tmp", exist_ok=True)
get_nodes_zip = ("wget -O tmp/terrorist_nodes.zip "
                 "https://sites.google.com/site/sfeverton18/"
                 "research/appendix-1/Noordin%20Subset%20%28ORA%29.zip?"
                 "attredirects=0&d=1")
subprocess.call(get_nodes_zip.split())

# unzip
zf = zipfile.ZipFile('tmp/terrorist_nodes.zip')
zf.extract('Attributes.csv', path='tmp/')
zf.close()

# create node table
v = pd.read_csv('tmp/Attributes.csv')
v.rename(columns={'Unnamed: 0': 'Name'}, inplace=True)

# create a copy of all nodes for each layer (i.e., create "node-layers")
# there are 10 layers and 79 nodes on each layer
v = pd.concat(10*[v])

# add "aspect" as column to v
layer_names = ['Business', 'Communication', 'O Logistics', 'O Meetings',
               'O Operations', 'O Training', 'T Classmates', 'T Friendship',
               'T Kinship', 'T Soulmates']
layers = [[name]*79 for name in layer_names]
layers = [item for sublist in layers for item in sublist]
v['layer'] = layers

# set unique node index
v.reset_index(inplace=True)
v.rename(columns={'index': 'V_N'}, inplace=True)

# swap columns
cols = list(v)
cols[1], cols[10] = cols[10], cols[1]
v = v[cols]

# get rid of the attribute columns for demonstrational purposes,
# will be inserted again later
v, vinfo = v.iloc[:, :2], v.iloc[:, 2:]
Preprocessing the Edges
# paj file containing edges for different layers
get_paj = ("wget -O tmp/terrorists.paj "
           "https://sites.google.com/site/sfeverton18/"
           "research/appendix-1/Noordin%20Subset%20%28Pajek%29.paj?"
           "attredirects=0&d=1")
subprocess.call(get_paj.split())

# get data blocks from paj file
with open('tmp/terrorists.paj') as txtfile:
    comments = []
    data = []
    part = []
    for line in txtfile:
        if line.startswith('*'):
            # comment lines
            comment = line
            comments.append(comment)
            if part:
                data.append(part)
                part = []
        else:
            # vertices
            if comment.startswith('*Vertices') and len(line.split()) > 1:
                sublist = line.split('"')
                sublist = sublist[:2] + sublist[-1].split()
                part.append(sublist)
            # edges or partitions
            elif not line.isspace():
                part.append(line.split())
    # append last block
    data.append(part)

# extract edge tables from data blocks
ecomments = []
eparts = []
for i, c in enumerate(comments):
    if c.startswith('*Network'):
        del data[0]
    elif c.startswith('*Partition'):
        del data[0]
    elif c.startswith('*Vector'):
        del data[0]
    elif c.startswith('*Arcs') or c.startswith('*Edges'):
        ecomments.append(c)
        eparts.append(data.pop(0))

# layer data parts (indices found manually via comments)
inds = [11, 10, 5, 6, 7, 8, 0, 1, 2, 3]
eparts = [eparts[ind] for ind in inds]

# convert to DataFrames
layer_frames = []
for name, epart in zip(layer_names, eparts):
    frame = pd.DataFrame(epart, dtype=np.int16)
    # get rid of self-loops, bidirectional edges
    frame = frame[frame[0] < frame[1]]
    # rename columns
    frame.rename(columns={0: 's', 1: 't', 2: name}, inplace=True)
    frame['s'] -= 1
    frame['t'] -= 1
    layer_frames.append(frame)

# set indices
for i, e in enumerate(layer_frames):
    e['s'] += i*79
    e['t'] += i*79
    e.set_index(['s', 't'], inplace=True)

# concat the layers
e = pd.concat(layer_frames)

# edge table as described in the paper
e_paper = e.copy()
# alternative representation of e
e['type'] = 0
e['weight'] = 0
for layer in layer_names:
    where = e[layer].notnull()
    e.loc[where, 'type'] = layer
    e.loc[where, 'weight'] = e.loc[where, layer]
e = e[['type', 'weight']]
DeepGraph’s Supra-Graph Representation of a MLN, \(G = (V, E)\)

Above, we have processed the downloaded data into a node table v and an edge table e, that correspond to the supra-graph representation of a multilayer network. This is the preferred representation of a MLN by a deep graph, since all other representations are entailed in the supra-graph’s partition lattice, as we will demonstrate below.

g = dg.DeepGraph(v, e)
print(g)
<DeepGraph object, with n=790 node(s) and m=1014 edge(s) at 0x7fb8e13499e8>

Let’s have a look at the node table first

print(g.v)
     V_N        layer
0      0     Business
1      1     Business
2      2     Business
3      3     Business
4      4     Business
..   ...          ...
785   74  T Soulmates
786   75  T Soulmates
787   76  T Soulmates
788   77  T Soulmates
789   78  T Soulmates

[790 rows x 2 columns]

As you can see, there are 790 nodes in total. Each of the 10 layers,

print(g.v.layer.unique())
['Business' 'Communication' 'O Logistics' 'O Meetings' 'O Operations'
 'O Training' 'T Classmates' 'T Friendship' 'T Kinship' 'T Soulmates']

is comprised of 79 nodes. Every node has a feature of type V_N, indicating the individual the node belongs to, and a feature of type layer, corresponding to the layer the node belongs to. Each of the 790 nodes corresponds to a node-layer of the MLN representation of this data.

The edge table,

print(g.e)
                type  weight
s   t
9   67      Business     2.0
    69      Business     1.0
    77      Business     1.0
11  61      Business     1.0
20  59      Business     1.0
...              ...     ...
733 769  T Soulmates     1.0
755 769  T Soulmates     1.0
    787  T Soulmates     1.0
771 788  T Soulmates     1.0
783 788  T Soulmates     1.0

[1014 rows x 2 columns]

is comprised of 1014 edges between the nodes in v. Each edge has two relations. The first relation (of type type) is determined by the tuple of features \((layer_i, layer_j)\) of the adjacent nodes \(V_i\) and \(V_j\). The second relation (of type weight) indicates the “weight” of the connection.

This representation of the edges of a MLN deviates from the one you can find in the paper, which is described in the last section.

There are 10 types of relations in the above edge table

g.e['type'].unique()
array(['Business', 'Communication', 'O Logistics', 'O Meetings',
       'O Operations', 'O Training', 'T Classmates', 'T Friendship',
       'T Kinship', 'T Soulmates'], dtype=object)

which - in the case of this data set - correspond to the layers of the nodes. This is due to the fact that there are no inter-layer connections in the Noordin Top Terrorist Network (such as, e.g., an edge from layer Business to layer Communication would be). The edges here are all (undirected) intra-layer edges (e.g., Business \(\rightarrow\) Business, Operations \(\rightarrow\) Operations).

To see how the edges are distributed among the different types, you can simply type

g.e['type'].value_counts()
O Operations     267
Communication    200
T Classmates     175
O Training       147
T Friendship      91
O Meetings        63
O Logistics       29
T Kinship         16
Business          15
T Soulmates       11
Name: type, dtype: int64

Let’s have a look at how many “actors” (nodes with at least one connection) there are within each layer

# append degree
gtg = g.return_gt_graph()
g.v['deg'] = gtg.degree_property_map('total').a

# how many "actors" are there per layer?
g.v[g.v.deg != 0].groupby('layer').size()
layer
Business         13
Communication    74
O Logistics      16
O Meetings       26
O Operations     39
O Training       38
T Classmates     39
T Friendship     61
T Kinship        24
T Soulmates       9
dtype: int64

For the purpose of this tutorial, the fact that the Noordin Top Terrorist Network is a MLN with only one aspect, and without inter-layer edges, is of little importance. The generalization of what we’re showing in the following to more general MLNs is straight-forward (and explained in detail in Appendix D of the paper).

Let’s illustrate the supra-graph representation of this MLN by a plot

# create graph_tool graph for layout
import graph_tool.draw as gtd
gtg = g.return_gt_graph()
gtg.set_directed(False)

# get sfdp layout postitions
pos = gtd.sfdp_layout(gtg, gamma=.5)
pos = pos.get_2d_array([0, 1])
g.v['x'] = pos[0]
g.v['y'] = pos[1]

# configure nodes
kwds_scatter = {'s': 1,
                'c': 'k'}

# configure edges
kwds_quiver = {'headwidth': 1,
               'alpha': .3,
               'cmap': 'prism'}
# color by type
C = g.e.groupby('type').grouper.group_info[0]

# plot
fig, ax = plt.subplots(1, 2, figsize=(15, 7))
g.plot_2d('x', 'y', edges=True, C=C,
          kwds_scatter=kwds_scatter,
          kwds_quiver=kwds_quiver, ax=ax[0])

# turn axis off, set x/y-lim
ax[0].axis('off')
ax[0].set_xlim((g.v.x.min() - 1, g.v.x.max() + 1))
ax[0].set_ylim((g.v.y.min() - 1, g.v.y.max() + 1))

# plot adjacency matrix
adj = g.return_cs_graph().todense()
adj = adj + adj.T
inds = np.where(adj != 0)
ax[1].scatter(inds[0], inds[1], c='k', marker='.')
ax[1].grid()
ax[1].set_xlim(-1, 791)
ax[1].set_ylim(-1,791)
_images/terrorists_33_2.png

The supra-graph representation of a MLN is by itself a powerful representation and exploitable in various ways (see, e.g., section 2.3 of this paper). However, in the following, we will demonstrate how to use the additional information attributed to the layers of the MLN, in order to “structure” and partition the MLN into different representations.

Redistributing Information on the Partition Lattice of the MLN

Based on the types of features V_N and layer, we can now redistribute the information contained in the supra-graph g. This redistribution allows for several representations of the graph, which we will demonstrate in the following.

The SuperGraph \(G^L = (V^L, E^L)\)

Partitioning by the type of feature layer leads to the supergraph \(G^L = (V^L,E^L)\), where every supernode \(V^{L}_{i^L} \in V^{L}\) corresponds to a distinct layer, encompassing all its respective nodes. Superedges \(E^{L}_{i^L, j^L} \in E^{L}\) with either \(i^L = j^L\) or \(i^L \neq j^L\) correspond to collections of intra- and inter-layer edges of the MLN, respectively.

# partition the graph
lv, le = g.partition_graph('layer',
                           relation_funcs={'weight': ['sum', 'mean', 'std']})
lg = dg.DeepGraph(lv, le)
print(lg)
<DeepGraph object, with n=10 node(s) and m=10 edge(s) at 0x7fb8e1349c50>
print(lg.v)
               n_nodes
layer
Business            79
Communication       79
O Logistics         79
O Meetings          79
O Operations        79
O Training          79
T Classmates        79
T Friendship        79
T Kinship           79
T Soulmates         79
print(lg.e)
                             n_edges  weight_sum  weight_mean  weight_std
layer_s       layer_t
Business      Business            15        16.0     1.066667    0.258199
Communication Communication      200       200.0     1.000000    0.000000
O Logistics   O Logistics         29        58.0     2.000000    0.000000
O Meetings    O Meetings          63       170.0     2.698413    1.612801
O Operations  O Operations       267       574.0     2.149813    0.699107
O Training    O Training         147       334.0     2.272109    0.763534
T Classmates  T Classmates       175       175.0     1.000000    0.000000
T Friendship  T Friendship        91        91.0     1.000000    0.000000
T Kinship     T Kinship           16        16.0     1.000000    0.000000
T Soulmates   T Soulmates         11        11.0     1.000000    0.000000

Let’s plot the graph g grouped by its layers.

# append layer_id to group nodes by layers
g.v['layer_id'] = g.v.groupby('layer').grouper.group_info[0].astype(np.int32)

# create graph_tool graph object
gtg = g.return_gt_graph(features=['layer_id'])
gtg.set_directed(False)

# get sfdp layout postitions
pos = gtd.sfdp_layout(gtg, groups=gtg.vp['layer_id'], mu=.15)
pos = pos.get_2d_array([0, 1])
g.v['x'] = pos[0]
g.v['y'] = pos[1]

# configure nodes
kwds_scatter = {'s': 10,
                'c': 'k'}

# configure edges
kwds_quiver = {'headwidth': 1,
               'alpha': .4,
               'cmap': 'viridis'}
# color by weight
C = g.e.weight.values

# plot
fig, ax = plt.subplots(figsize=(12, 12))
obj = g.plot_2d('x', 'y', edges=True, C=C,
          kwds_scatter=kwds_scatter,
          kwds_quiver=kwds_quiver, ax=ax)

# turn axis off, set x/y-lim and name layers
ax.axis('off')
margin = 10
ax.set_xlim((g.v.x.min() - margin, g.v.x.max() + margin))
ax.set_ylim((g.v.y.min() - margin, g.v.y.max() + margin))
for layer in layer_names:
    plt.text(g.v[g.v['layer'] == layer].x.mean() - margin * 3,
             g.v[g.v['layer'] == layer].y.max() + margin,
             layer, fontsize=15)
_images/terrorists_44_0.png

We can also plot the supergraph \(G^L = (V^L, E^L)\)

# create graph_tool graph of lg
gtg = lg.return_gt_graph(relations=True, node_indices=True, edge_indices=True)

# create plot
gtd.graph_draw(gtg,
               vertex_text=gtg.vp['i'], vertex_text_position=-2,
               vertex_fill_color='w',
               vertex_text_color='k',
               edge_text=gtg.ep['n_edges'],
               inline=True, fit_view=.8,
               output_size=(400,400))
_images/terrorists_45_0.png
The SuperGraph \(G^N = (V^N, E^N)\)

Partitioning by the type of feature V_N leads to the supergraph \(G^{N} = (V^{N}, E^{N})\), where each supernode \(V^{N}_{i^N} \in V^{N}\) corresponds to a node of the MLN. Superedges \(E^{N}_{i^N j^N} \in E^{N}\) with \(i^N = j^N\) correspond to the coupling edges of a MLN.

# partition by MLN's node indices
nv, ne, gv, ge = g.partition_graph('V_N', return_gve=True)

# for each superedge, get types of edges and their weights
def type_weights(group):
    index = group['type'].values
    data = group['weight'].values
    return pd.Series(data=data, index=index)
ne_weights = ge.apply(type_weights).unstack()
ne = pd.concat((ne, ne_weights), axis=1)

# create graph
ng = dg.DeepGraph(nv, ne)
ng
<DeepGraph object, with n=79 node(s) and m=623 edge(s) at 0x7fb8d1da8b70>
print(ng.v)
     n_nodes
V_N
0         10
1         10
2         10
3         10
4         10
..       ...
74        10
75        10
76        10
77        10
78        10

[79 rows x 1 columns]
print(ng.e)
             n_edges  Business  Communication  O Logistics  O Meetings  O Operations  O Training  T Classmates  T Friendship  T Kinship  T Soulmates
V_N_s V_N_t
0     15           3       NaN            1.0          2.0         NaN           NaN         NaN           NaN           NaN        1.0          NaN
1     4            1       NaN            NaN          NaN         NaN           NaN         NaN           1.0           NaN        NaN          NaN
      5            1       NaN            NaN          NaN         NaN           NaN         NaN           1.0           NaN        NaN          NaN
      16           1       NaN            NaN          NaN         NaN           2.0         NaN           NaN           NaN        NaN          NaN
      21           1       NaN            NaN          NaN         NaN           NaN         NaN           1.0           NaN        NaN          NaN
...              ...       ...            ...          ...         ...           ...         ...           ...           ...        ...          ...
72    73           4       NaN            1.0          NaN         NaN           2.0         2.0           NaN           NaN        1.0          NaN
      76           6       NaN            1.0          NaN         2.0           2.0         2.0           1.0           1.0        NaN          NaN
      77           2       NaN            NaN          2.0         NaN           NaN         NaN           NaN           NaN        NaN          1.0
73    76           2       NaN            NaN          NaN         NaN           2.0         2.0           NaN           NaN        NaN          NaN
75    78           2       NaN            NaN          NaN         NaN           NaN         2.0           NaN           1.0        NaN          NaN

[623 rows x 11 columns]

Let’s plot the graph g grouped by V_N.

# create graph_tool graph object
g.v['V_N'] = g.v['V_N'].astype(np.int32)  # sfpd only takes int32
g_tmp = dg.DeepGraph(v)
gtg = g_tmp.return_gt_graph(features='V_N')
gtg.set_directed(False)

# get sfdp layout postitions
pos = gtd.sfdp_layout(gtg, groups=gtg.vp['V_N'], mu=.3, gamma=.01)
pos = pos.get_2d_array([0, 1])
g.v['x'] = pos[0]
g.v['y'] = pos[1]

# configure nodes
kwds_scatter = {'c': 'k'}

# configure edges
kwds_quiver = {'headwidth': 1,
               'alpha': .2,
               'cmap': 'viridis_r'}
# color by type
C = g.e.groupby('type').grouper.group_info[0]

# plot
fig, ax = plt.subplots(figsize=(15,15))
g.plot_2d('x', 'y', edges=True,
          kwds_scatter=kwds_scatter, C=C,
          kwds_quiver=kwds_quiver, ax=ax)

# turn axis off, set x/y-lim and name nodes
name_dic = {i: name for i, name in enumerate(vinfo.iloc[:79].Name)}
ax.axis('off')
ax.set_xlim((g.v.x.min() - 1, g.v.x.max() + 1))
ax.set_ylim((g.v.y.min() - 1, g.v.y.max() + 1))
for node in g.v['V_N'].unique():
    plt.text(g.v[g.v['V_N'] == node].x.mean() - 1,
             g.v[g.v['V_N'] == node].y.max() + 1,
             name_dic[node], fontsize=12)
_images/terrorists_52_0.png

Let’s also plot the supergraph \(G^N = (V^N, E^N)\), where the color of the superedges corresponds to the number of edges within the respective superedge.

# get rid of isolated node for nicer layout
ng.v.drop(57, inplace=True, errors='ignore')

# create graph_tool graph object
gtg = ng.return_gt_graph(features=True, relations='n_edges')
gtg.set_directed(False)

# get sfdp layout postitions
pos = gtd.sfdp_layout(gtg)
pos = pos.get_2d_array([0, 1])
ng.v['x'] = pos[0]
ng.v['y'] = pos[1]

# configure nodes
kwds_scatter = {'s': 100,
                'c': 'k'}

# configure edges
# split edges with only one type of connection
C_split_0 = ng.e['n_edges'].values.copy()
C_split_0[C_split_0 == 1] = 0

# edges with one type of connection
kwds_quiver_0 = {'alpha': .3,
                 'width': .001}

# edges with more than one type
kwds_quiver = {'headwidth': 1,
               'width': .003,
               'alpha': .7,
               'cmap': 'Blues',
               'clim': (1, ng.e.n_edges.max())}

# create plot
fig, ax = plt.subplots(figsize=(15,15))
ng.plot_2d('x', 'y', edges=True, C_split_0=C_split_0,
           kwds_scatter=kwds_scatter, kwds_quiver_0=kwds_quiver_0,
           kwds_quiver=kwds_quiver, ax=ax)

# turn axis off, set x/y-lim and name nodes
ax.axis('off')
ax.set_xlim(ng.v.x.min() - 1, ng.v.x.max() + 1)
ax.set_ylim(ng.v.y.min() - 1, ng.v.y.max() + 1)
for i in ng.v.index:
    plt.text(ng.v.at[i, 'x'], ng.v.at[i, 'y'] + .3, i, fontsize=12)
_images/terrorists_54_0.png
The Tensor-Like Representation \(G^{NL} = (V^{NL}, E^{NL})\)

Considering only the information attributed to the layers of the MLN, and the fact that this MLN has just one aspect, there is only one more supergraph we can create of g. It is given by creating the intersection partition (see section III E of the Deep Graphs paper) of the types of features V_N and layer. The resulting supergraph \(G^{N \cdot L} = (V^{N \cdot L},E^{N \cdot L})\) corresponds one to one to the graph \(G = (V, E)\), and therefore to the supra-graph representation of the MLN. The only difference is the indexing, which is tensor-like for the supergraph \(G^{N \cdot L}\).

# partition the graph
relation_funcs = {'type': 'sum', 'weight': 'sum'}  # just to transfer relations
nlv, nle = g.partition_graph(['V_N', 'layer'], relation_funcs=relation_funcs)
nlg = dg.DeepGraph(nlv, nle)
nlg
<DeepGraph object, with n=790 node(s) and m=1014 edge(s) at 0x7fb8d5325550>
print(nlg.v)
                   n_nodes
V_N layer
0   Business             1
    Communication        1
    O Logistics          1
    O Meetings           1
    O Operations         1
...                    ...
78  O Training           1
    T Classmates         1
    T Friendship         1
    T Kinship            1
    T Soulmates          1

[790 rows x 1 columns]
print(nlg.e)
                                         n_edges  weight           type
V_N_s layer_s       V_N_t layer_t
0     Communication 15    Communication        1     1.0  Communication
      O Logistics   15    O Logistics          1     2.0    O Logistics
      T Kinship     15    T Kinship            1     1.0      T Kinship
1     O Operations  16    O Operations         1     2.0   O Operations
                    22    O Operations         1     2.0   O Operations
...                                          ...     ...            ...
72    T Soulmates   77    T Soulmates          1     1.0    T Soulmates
73    O Operations  76    O Operations         1     2.0   O Operations
      O Training    76    O Training           1     2.0     O Training
75    O Training    78    O Training           1     2.0     O Training
      T Friendship  78    T Friendship         1     1.0   T Friendship

[1014 rows x 3 columns]

This tensor-like index allows you to use the advanced indexing features of pandas.

print(nlg.e.loc[2, 'Communication', :, 'Communication'])
                                         n_edges  weight           type
V_N_s layer_s       V_N_t layer_t
2     Communication 5     Communication        1     1.0  Communication
                    12    Communication        1     1.0  Communication
                    30    Communication        1     1.0  Communication
                    58    Communication        1     1.0  Communication

In the future, we might implement a method to convert this tensor-representation of a MLN to some sparse-tensor data structure (e.g., https://github.com/mnick/scikit-tensor). Another idea is to create an interface to a suitable multilayer network package that implements the measures and models developed particularly for MLNs.

The “Hidden Layers” of a MLN

Partitioning a multilayer network solely based on the information attributed to its layers only gets us this far. If there is more information available, or computed during the analysis [e.g., by statistical measures, network measures or similarity/distance measures (see g.create_edges)], it can be used to induce further supergraphs and reach other elements of the partition lattice of g.

This is what we’ll demonstrate here, based on the additional information available about the individual’s attributes:

print(vinfo)
     Education Level  Contact with People   Military Training  Nationality  Current Status (ICG Article)  Role  Primary Group Affiliation  Noordin's Network              Name
0                  0                     5                  0            3                             1     7                          1                  0       Abdul Malik
1                  2                     3                  0            3                             2    10                          1                  0        Abdul Rauf
2                  0                    10                  0            3                             1     9                          0                  0       Abdul Rohim
3                  3                     5                  3            3                             2     1                          2                  0   Abdullah Sunata
4                  2                     3                  0            3                             0     1                          3                  0  Abdullah Sungkar
..               ...                   ...                ...          ...                           ...   ...                        ...                ...               ...
785                2                    12                  5            3                             1     3                          3                  1        Umar Patek
786                2                     1                  7            3                             2     4                          3                  0       Umar Wayan
787                2                     3                  3            3                             2     7                          3                  1             Urwah
788                2                    11                  3            3                             2    10                          3                  1     Usman bin Sef
789                2                     1                  7            4                             1     1                          3                  0        Zulkarnaen

[790 rows x 9 columns]

As you can see, there are 9 different attributes associated with each individual, such as their military training, nationality, education level, etc. Let’s append this information to the node table, and plot the nodes grouped by their education level.

# append node information to g
v = pd.concat((v, vinfo), axis=1)
g = dg.DeepGraph(v, e)
# create graph_tool graph object
g.v['Education Level'] = g.v['Education Level'].astype(np.int32)
g_tmp = dg.DeepGraph(g.v)
gtg = g_tmp.return_gt_graph(features=['Education Level'])
gtg.set_directed(False)

# get sfdp layout postitions
pos = gtd.sfdp_layout(gtg, groups=gtg.vp['Education Level'], mu=.3, gamma=.1)
pos = pos.get_2d_array([0, 1])
g.v['x'] = pos[0]
g.v['y'] = pos[1]

# configure nodes
kwds_scatter = {'s': 10,
                'c': 'k'}

# configure edges
kwds_quiver = {'width': 0.002,
               'headwidth': 1,
               'alpha': .2,
               'cmap': 'prism'}
# color by type
C = g.e.groupby('type').grouper.group_info[0]

# plot
fig, ax = plt.subplots(figsize=(13,12))
obj = g.plot_2d('x', 'y', edges=True,
          kwds_scatter=kwds_scatter, C=C,
          kwds_quiver=kwds_quiver, ax=ax)

# turn axis off, set x/y-lim and name layers
ax.axis('off')
ax.set_xlim((g.v.x.min() - 1, g.v.x.max() + 1))
ax.set_ylim((g.v.y.min() - 1, g.v.y.max() + 1))
for el in g.v['Education Level'].unique():
    plt.text(g.v[g.v['Education Level'] == el].x.mean() - 1,
             g.v[g.v['Education Level'] == el].y.max() + 1,
             'EL {}'.format(el), fontsize=20)
_images/terrorists_68_0.png

Let’s also append the information to the supergraph \(G^N\), and plot this supergraph grouped by education level.

# append info to ng.v
ng.v = pd.concat((ng.v, vinfo[:79]), axis=1)
# create graph_tool graph object
ng.v['Education Level'] = ng.v['Education Level'].astype(np.int32)
g_tmp = dg.DeepGraph(ng.v)
gtg = g_tmp.return_gt_graph(features=['Education Level'])
gtg.set_directed(False)

# get sfdp layout postitions
pos = gtd.sfdp_layout(gtg, groups=gtg.vp['Education Level'], mu=.3, gamma=.01)
pos = pos.get_2d_array([0, 1])
ng.v['x'] = pos[0]
ng.v['y'] = pos[1]

# configure nodes
kwds_scatter = {'s': 50,
                'c': 'k'}

# configure edges
# split edges with only one type of connection
C_split_0 = ng.e['n_edges'].values.copy()
C_split_0[C_split_0 == 1] = 0

# edges with one type of connection
kwds_quiver_0 = {'alpha': .3,
                 'width': .001}

# edges with more than one type
kwds_quiver = {'headwidth': 1,
               'width': .002,
               'alpha': .7,
               'cmap': 'Blues',
               'clim': (1, ng.e.n_edges.max())}

# create plot
fig, ax = plt.subplots(figsize=(15,15))
obj = ng.plot_2d('x', 'y', edges=True, C_split_0=C_split_0,
                 kwds_scatter=kwds_scatter, kwds_quiver_0=kwds_quiver_0,
                 kwds_quiver=kwds_quiver, ax=ax)

# turn axis off, set x/y-lim and name nodes
ax.axis('off')
ax.set_xlim(ng.v.x.min() - 1, ng.v.x.max() + 1)
ax.set_ylim(ng.v.y.min() - 1, ng.v.y.max() + 1)
for i in ng.v.index:
    plt.text(ng.v.at[i, 'x'],
             ng.v.at[i, 'y'] + .2,
             i, fontsize=8)

for el in ng.v['Education Level'].unique():
    plt.text(ng.v[ng.v['Education Level'] == el].x.mean() - .5,
             ng.v[ng.v['Education Level'] == el].y.max() + 1,
             'EL {}'.format(el), fontsize=20)
_images/terrorists_71_0.png

We can now further partition the supergraph \(G^N\) into groups with the same education level.

# partition ng by "Education Level"
relation_funcs = {l: lambda x: x.notnull().sum() for l in layer_names}
relation_funcs['n_edges'] = 'sum'
ELnv, ELne = ng.partition_graph('Education Level',
                                relation_funcs=relation_funcs,
                                n_edges=False)

# compute "undirected" weights
s = ELne.index.get_level_values(0)
t = ELne.index.get_level_values(1)
df1 = ELne[s <= t]
df2 = ELne[s > t].swaplevel(0,1)
df2.index.names = df2.index.names[::-1]
ELne = df1.add(df2, fill_value=0)

# set dtypes
for col in ELne.columns:
    ELne[col] = ELne[col].astype(int)

# find the type of connection most dominant between supernodes
ELne['dominant_type'] = ELne[layer_names].idxmax(axis=1)

# change column order
ELne = ELne[['n_edges'] + ['dominant_type'] + layer_names]

# create graph
ELng = dg.DeepGraph(ELnv, ELne)
ELng
<DeepGraph object, with n=8 node(s) and m=30 edge(s) at 0x7fb8d1d245c0>
print(ELng.v)
                 n_nodes
Education Level
0                     25
1                      1
2                     39
3                      5
4                      5
5                      1
6                      2
8                      1
print(ELng.e)
                                     n_edges  dominant_type  Business  Communication  O Logistics  O Meetings  O Operations  O Training  T Classmates  T Friendship  T Kinship  T Soulmates
Education Level_s Education Level_t
0                 0                       45   O Operations         0              7            2           1            16          15             1             1          2            0
                  1                        3   O Operations         0              0            0           0             2           1             0             0          0            0
                  2                      146   O Operations         1             31            3           7            43          32             9            16          4            0
                  3                       60     O Training         0             11            2           2            14          19             2             9          1            0
                  4                       16     O Training         0              0            0           0             6           9             1             0          0            0
...                                      ...            ...       ...            ...          ...         ...           ...         ...           ...           ...        ...          ...
4                 8                        1   O Operations         0              0            0           0             1           0             0             0          0            0
5                 6                        3   O Operations         0              1            0           0             2           0             0             0          0            0
                  8                        2   O Operations         0              0            0           0             1           1             0             0          0            0
6                 6                        3  Communication         0              1            0           1             1           0             0             0          0            0
                  8                        8   O Operations         1              1            0           1             2           0             1             1          0            1

[30 rows x 12 columns]

Let’s plot the supergraph of education levels, where the node size relates to the number of individuals, edge colors correspond to the number of edges, and edge labels correspond to the most dominant type of connection between nodes.

# create graph_tool graph object
gtg = ELng.return_gt_graph(features=True, relations=True, node_indices=True)
gtg.set_directed(False)

# get sfdp layout postitions
pos = gtd.sfdp_layout(gtg, vweight=gtg.vp['n_nodes'], eweight=gtg.ep['n_edges'])
pos = pos.get_2d_array([0, 1])

# create plot
gtg.vp['n_nodes'].a *= 3
gtd.graph_draw(gtg,
               vertex_text=gtg.vp['i'],
               vertex_text_color='k', vertex_size=gtg.vp['n_nodes'],
               edge_text=gtg.ep['dominant_type'],
               edge_color=gtg.ep['n_edges'],
               inline=True, output_size=(900,900), fit_view=True)
_images/terrorists_72_0.png
Partitioning Edges Based on Node Properties

Here, we demonstrate very briefly how to use the additional information of the nodes to perform queries on the edges.

# create "undirected" edge table (swap-copy all edges)
g.e = pd.concat((e, e.swaplevel(0,1)))
g.e.sort_index(inplace=True)
print(g.partition_edges(source_features=['Nationality']))
               n_edges
Nationality_s
3                 1655
4                  351
5                   22
print(g.partition_edges(source_features=['Nationality'], target_features=['Military Training']))
                                   n_edges
Nationality_s Military Training_t
3             0                        185
              1                         51
              3                        847
              4                         60
              5                        115
...                                    ...
5             4                          3
              5                          1
              7                          1
              9                          1
              10                         1

[26 rows x 1 columns]
print(g.partition_edges(source_features=['Nationality'],
                        target_features=['Military Training'],
                        relations='type'))
                                               n_edges
type        Nationality_s Military Training_t
Business    3             0                          3
                          3                         16
                          4                          1
                          9                          2
                          10                         2
...                                                ...
T Soulmates 3             9                          1
                          10                         2
            4             3                          3
                          9                          1
                          10                         3

[138 rows x 1 columns]
Alternative Representation of the MLN Edges

The edges of the supra-graph representation as presented in the paper look like this

print(e_paper)
         Business  Communication  O Logistics  O Meetings  O Operations  O Training  T Classmates  T Friendship  T Kinship  T Soulmates
s   t
9   67        2.0            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          NaN
    69        1.0            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          NaN
    77        1.0            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          NaN
11  61        1.0            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          NaN
20  59        1.0            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          NaN
...           ...            ...          ...         ...           ...         ...           ...           ...        ...          ...
733 769       NaN            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          1.0
755 769       NaN            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          1.0
    787       NaN            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          1.0
771 788       NaN            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          1.0
783 788       NaN            NaN          NaN         NaN           NaN         NaN           NaN           NaN        NaN          1.0

[1014 rows x 10 columns]

As you can see, the edge table is also comprised of 1014 edges between the nodes in v. However, every type of connection get’s its own column, where a “nan” value means that an edge does not have a relation of the corresponding type.

What Next

Now that you have an idea of what the DeepGraph package provides, you should investigate the parts of the package most useful for you. See API Reference for details.

API Reference

The API reference summarizes DeepGraph’s core class, its methods and the functions subpackage.

The DeepGraph class

DeepGraph([v, e, supernode_labels_by, …]) The core class of DeepGraph (dg).
Creating Edges
DeepGraph.create_edges([connectors, …]) Create an edge table e linking the nodes in v.
DeepGraph.create_edges_ft(ft_feature[, …]) Create (ft) an edge table e linking the nodes in v.
Graph Partitioning
DeepGraph.partition_nodes(features[, …]) Return a supernode DataFrame sv.
DeepGraph.partition_edges([relations, …]) Return a superedge DataFrame se.
DeepGraph.partition_graph(features[, …]) Return supergraph DataFrames sv and se.
Graph Interfaces
DeepGraph.return_cs_graph([relations, dropna]) Return scipy.sparse.coo_matrix representation(s).
DeepGraph.return_nx_graph([features, …]) Return a networkx.DiGraph representation.
DeepGraph.return_nx_multigraph([features, …]) Return a networkx.MultiDiGraph representation.
DeepGraph.return_gt_graph([features, …]) Return a graph_tool.Graph representation.
Plotting Methods
DeepGraph.plot_2d(x, y[, edges, C, …]) Plot nodes and corresponding edges in 2 dimensions.
DeepGraph.plot_2d_generator(x, y, by[, …]) Plot nodes and corresponding edges by groups.
DeepGraph.plot_map(lon, lat[, edges, C, …]) Plot nodes and corresponding edges on a basemap.
DeepGraph.plot_map_generator(lon, lat, by[, …]) Plot nodes and corresponding edges by groups, on basemaps.
DeepGraph.plot_hist(x[, bins, log_bins, …]) Plot a histogram (or pdf) of x.
DeepGraph.plot_logfile(logfile) Plot a logfile.
Other Methods
DeepGraph.append_binning_labels_v(col, col_name) Append a column with binning labels of the values in v[col].
DeepGraph.append_cp([directed, connection, …]) Append a component membership column to v.
DeepGraph.filter_by_values_v(col, values) Keep only nodes in v with features of type col in values.
DeepGraph.filter_by_values_e(col, values) Keep only edges in e with relations of type col in values.
DeepGraph.filter_by_interval_v(col, interval) Keep only nodes in v with features of type col in interval.
DeepGraph.filter_by_interval_e(col, interval) Keep only edges in e with relations of type col in interval.
DeepGraph.update_edges() After removing nodes in v, update e.

The Functions Module

deepgraph.functions Auxiliary connector and selector functions to create edges.
Connector Functions
great_circle_dist(lat_s, lat_t, lon_s, lon_t) Return the great circle distance between nodes.
cp_node_intersection(supernode_ids, sources, …) Work in progress!
cp_intersection_strength(n_unique_nodes, …) Work in progress!
hypergeometric_p_value(n_unique_nodes, …) Work in progress!
Selector Functions

Contact

Email

Please feel free to contact me if you have questions or suggestions regarding DeepGraph:

Dominik Traxl <dominik.traxl@posteo.org>

Authors

Deepgraph was written as part of a PhD thesis in physics by Dominik Traxl at Humboldt University Berlin, the Berstein Center for Computational Neuroscience and the Potsdam Institute for Climate Impact Research.

Indices and tables