deepgraph.deepgraph.DeepGraph.partition_nodes

DeepGraph.partition_nodes(features, feature_funcs=None, n_nodes=True, return_gv=False)

Return a supernode DataFrame sv.

This is essentially a wrapper around the pandas groupby method: sv = v.groupby(features).agg(feature_funcs). It creates an (intersection) partition of the nodes in v by the type(s) of feature(s) features, resulting in a supernode DataFrame sv. By passing feature_funcs, a dictionary of functions keyed by the features (columns) of v, one may aggregate user-defined values over the elements of each partition; these aggregates become the supernodes' features. If n_nodes is True, a column with the number of each supernode's constituent nodes is created. If return_gv is True, the created groupby object gv is returned as well, to facilitate additional operations such as gv.apply(func, *args, **kwargs).

For details, type help(v.groupby), and/or inspect the available methods of gv.

For examples, see below. For an in-depth description and mathematical details of graph partitioning, see https://arxiv.org/pdf/1604.00971v1.pdf, in particular Sec. III A, E and F.
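
The wrapper relationship described above can be sketched in plain pandas. This is only an illustration of the groupby/agg pattern named in the description, not the library's actual implementation; the way the n_nodes column is attached here (via gv.size()) is an assumption.

```python
import pandas as pd

# Node table v (same data as in the Examples section).
v = pd.DataFrame({'x': [-3.4, 2.1, -1.1, 0.9, 2.3],
                  'time': [0, 0, 2, 2, 9],
                  'color': ['g', 'g', 'b', 'g', 'r'],
                  'size': [1, 3, 2, 3, 1]})

# Group the nodes by the feature 'color'.
gv = v.groupby('color')

# Aggregate one feature per supernode (here: the latest time).
sv = gv.agg({'time': 'max'})

# Assumption: count the constituent nodes of each supernode with gv.size().
sv['n_nodes'] = gv.size()

print(sv)
```

Each row of sv corresponds to one supernode, i.e. one group of nodes sharing the same 'color'.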

Parameters:

features (str, int or array_like) – Column name(s) of v, indicating the type(s) of feature(s) used to induce an (intersection) partition. Creates a pandas groupby object, gv = v.groupby(features).
feature_funcs (dict, optional (default=None)) – Each key must be a column name of v, each value either a function, or a list of functions, working when passed a pandas.DataFrame or when passed to pandas.DataFrame.apply. See the docstring of gv.agg for details: help(gv.agg).
n_nodes (bool, optional (default=True)) – Whether to create an n_nodes column in sv, indicating the number of nodes in each supernode.
return_gv (bool, optional (default=False)) – If True, also return the v.groupby(features) object, gv.

Returns:

sv (pd.DataFrame) – The aggregated DataFrame of supernodes, sv.
gv (pandas.core.groupby.DataFrameGroupBy) – The pandas groupby object, v.groupby(features). Only returned if return_gv is True.
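
Since gv is an ordinary pandas groupby object, the usual groupby methods are available on it. The following sketch builds gv directly with v.groupby as a stand-in for the object returned when return_gv is True; the variable names green and x_range are purely illustrative.

```python
import pandas as pd

v = pd.DataFrame({'x': [-3.4, 2.1, -1.1, 0.9, 2.3],
                  'time': [0, 0, 2, 2, 9],
                  'color': ['g', 'g', 'b', 'g', 'r'],
                  'size': [1, 3, 2, 3, 1]})

# Stand-in for the gv returned by partition_nodes(..., return_gv=True).
gv = v.groupby('color')

# Retrieve the constituent nodes of a single supernode.
green = gv.get_group('g')

# Apply a custom function per group: the spread of x within each supernode.
x_range = gv['x'].apply(lambda s: s.max() - s.min())

print(green)
print(x_range)
```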

Notes

Currently, pandas GroupBy silently excludes NA groups. One workaround is to replace NA values with a placeholder (e.g., -1 or 'none') before doing the groupby (i.e., before calling this method). See http://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values and https://github.com/pydata/pandas/issues/3729.
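
The placeholder workaround can be sketched in plain pandas as follows. The small table and the placeholder string 'none' are made up for illustration; the same fillna step would be applied to v before constructing the DeepGraph or calling this method.

```python
import pandas as pd

# Hypothetical node table with missing values in the grouping column.
v = pd.DataFrame({'color': ['g', None, 'b', None],
                  'x': [1.0, 2.0, 3.0, 4.0]})

# By default, rows whose group key is NA are dropped without warning.
dropped = v.groupby('color').size()

# Workaround: replace NA keys with a placeholder before grouping.
filled = v.fillna({'color': 'none'}).groupby('color').size()

print(dropped)  # only 'b' and 'g' appear
print(filled)   # 'none' now forms its own group
```

pandas 1.1 and later also accept dropna=False in DataFrame.groupby; whether partition_nodes exposes that option is not stated here.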

Examples

First, we need a node table, in order to demonstrate its partitioning:

>>> import numpy as np
>>> import pandas as pd
>>> import deepgraph as dg
>>> v = pd.DataFrame({'x': [-3.4,2.1,-1.1,0.9,2.3],
...                   'time': [0,0,2,2,9],
...                   'color': ['g','g','b','g','r'],
...                   'size': [1,3,2,3,1]})
>>> g = dg.DeepGraph(v)
>>> g.v
  color  size  time    x
0     g     1     0 -3.4
1     g     3     0  2.1
2     b     2     2 -1.1
3     g     3     2  0.9
4     r     1     9  2.3

Create a partition by the type of feature ‘color’:

>>> g.partition_nodes('color')
       n_nodes
color
b            1
g            3
r            1

Create an intersection partition by the types of features ‘color’ and ‘size’ (which is a further refinement of the last partition):

>>> g.partition_nodes(['color', 'size'])
            n_nodes
color size
b     2           1
g     1           1
      3           2
r     1           1

Partition by ‘color’ and collect time values:

>>> g.partition_nodes('color', {'time': lambda x: list(x)})
       n_nodes       time
color
b            1        [2]
g            3  [0, 0, 2]
r            1        [9]

Partition by ‘color’ and aggregate with different functions:

>>> g.partition_nodes('color', {'time': [lambda x: list(x), np.max],
...                             'x': [np.mean, np.sum, np.std]})
       n_nodes    x_mean  x_sum     x_std time_<lambda>  time_amax
color
b            1 -1.100000   -1.1       NaN           [2]          2
g            3 -0.133333   -0.4  2.891943     [0, 0, 2]          2
r            1  2.300000    2.3       NaN           [9]          9
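
The flat column names in the last example (x_mean, x_sum, ...) combine the feature name with the function name. In plain pandas, aggregating with a list of functions yields two-level column labels instead. The sketch below shows one way to obtain comparable flat names; it is an assumption for illustration and not necessarily how partition_nodes builds them.

```python
import pandas as pd

v = pd.DataFrame({'x': [-3.4, 2.1, -1.1, 0.9, 2.3],
                  'color': ['g', 'g', 'b', 'g', 'r']})

# Several functions per feature give MultiIndex columns like ('x', 'mean').
sv = v.groupby('color').agg({'x': ['mean', 'sum', 'std']})

# Join the two levels into single names: 'x_mean', 'x_sum', 'x_std'.
sv.columns = ['_'.join(col) for col in sv.columns]

print(sv)
```

Supernodes with a single node get NaN for x_std, because the sample standard deviation is undefined for one observation.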