deepgraph.deepgraph.DeepGraph.partition_nodes
DeepGraph.partition_nodes(features, feature_funcs=None, n_nodes=True, return_gv=False)
Return a supernode DataFrame sv.
This is essentially a wrapper around the pandas groupby method: sv = v.groupby(features).agg(feature_funcs). It creates a (intersection) partition of the nodes in v by the type(s) of feature(s) features, resulting in a supernode DataFrame sv. By passing a dictionary of functions on the features of v, feature_funcs, one may aggregate user-defined values of the partition’s elements, the supernodes’ features. If n_nodes is True, create a column with the number of each supernode’s constituent nodes. If return_gv is True, return the created groupby object to facilitate additional operations, such as gv.apply(func, *args, **kwargs).
For details, type help(v.groupby), and/or inspect the available methods of gv.
For examples, see below. For an in-depth description and mathematical details of graph partitioning, see https://arxiv.org/pdf/1604.00971v1.pdf, in particular Sec. III A, E and F.
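The relation sv = v.groupby(features).agg(feature_funcs) can be illustrated in plain pandas. The following is a minimal sketch, not the library's implementation: it assumes the node table from the examples below, uses 'max' as an arbitrary aggregation, and builds the n_nodes column from the group sizes, which is one plausible way to obtain it.

```python
import pandas as pd

# Node table (same values as in the Examples section).
v = pd.DataFrame({'x': [-3.4, 2.1, -1.1, 0.9, 2.3],
                  'time': [0, 0, 2, 2, 9],
                  'color': ['g', 'g', 'b', 'g', 'r'],
                  'size': [1, 3, 2, 3, 1]})

# Partition the nodes by the feature 'color'.
gv = v.groupby('color')

# Aggregate one feature per supernode (illustrative choice: latest time).
sv = gv.agg({'time': 'max'})

# Assumed equivalent of n_nodes=True: count the nodes in each group.
sv.insert(0, 'n_nodes', gv.size())

print(sv)
```

Each row of sv corresponds to one supernode, indexed by its value of 'color'.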
Parameters:
- features (str, int or array_like) – Column name(s) of v, indicating the type(s) of feature(s) used to induce a (intersection) partition. Creates a pandas groupby object, gv = v.groupby(features).
- feature_funcs (dict, optional (default=None)) – Each key must be a column name of v, each value either a function, or a list of functions, working when passed a pandas.DataFrame or when passed to pandas.DataFrame.apply. See the docstring of gv.agg for details: help(gv.agg).
- n_nodes (bool, optional (default=True)) – Whether to create a n_nodes column in sv, indicating the number of nodes in each supernode.
- return_gv (bool, optional (default=False)) – If True, also return the v.groupby(features) object, gv.
Returns:
- sv (pd.DataFrame) – The aggregated DataFrame of supernodes, sv.
- gv (pandas.core.groupby.DataFrameGroupBy) – The pandas groupby object, v.groupby(features).
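As a rough illustration of what the returned groupby object allows, the sketch below builds gv directly with pandas (standing in for the object returned when return_gv is True) and applies a custom function to each group. The function, the range of x within each supernode, is a hypothetical example and not part of deepgraph.

```python
import pandas as pd

v = pd.DataFrame({'x': [-3.4, 2.1, -1.1, 0.9, 2.3],
                  'color': ['g', 'g', 'b', 'g', 'r']})

# Stand-in for the gv object: the same groupby the method wraps.
gv = v.groupby('color')

# Apply a user-defined function per group: the spread of x in each supernode.
x_range = gv['x'].apply(lambda s: s.max() - s.min())

print(x_range)
```

Supernodes with a single node ('b' and 'r' here) have a range of zero.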
See also
Notes
Currently, NA groups in GroupBy are automatically excluded (silently). One workaround is to use a placeholder (e.g., -1, ‘none’) for NA values before doing the groupby (calling this method). See http://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values and https://github.com/pydata/pandas/issues/3729.
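The placeholder workaround can be sketched in plain pandas as follows. This is an illustration under assumptions: the small table and the placeholder string 'none' are made up for the example, and the same filled table would then be passed to DeepGraph before calling this method.

```python
import numpy as np
import pandas as pd

# Hypothetical node table with missing values in the grouping column.
v = pd.DataFrame({'color': ['g', np.nan, 'b', 'g', np.nan],
                  'size': [1, 3, 2, 3, 1]})

# Default behaviour: rows whose group key is NA are silently dropped.
dropped = v.groupby('color').size()

# Workaround: replace NA with a placeholder before grouping.
filled = v.fillna({'color': 'none'}).groupby('color').size()

print(dropped.sum(), filled.sum())
```

Newer pandas releases (1.1 and later) also accept groupby(..., dropna=False); whether that option is reachable through partition_nodes is not stated here, so the placeholder approach remains the safer choice.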
Examples
First, we need a node table, in order to demonstrate its partitioning:
>>> import pandas as pd
>>> import deepgraph as dg
>>> v = pd.DataFrame({'x': [-3.4,2.1,-1.1,0.9,2.3],
...                   'time': [0,0,2,2,9],
...                   'color': ['g','g','b','g','r'],
...                   'size': [1,3,2,3,1]})
>>> g = dg.DeepGraph(v)
>>> g.v
  color  size  time    x
0     g     1     0 -3.4
1     g     3     0  2.1
2     b     2     2 -1.1
3     g     3     2  0.9
4     r     1     9  2.3
Create a partition by the type of feature ‘color’:
>>> g.partition_nodes('color')
       n_nodes
color
b            1
g            3
r            1
Create an intersection partition by the types of features ‘color’ and ‘size’ (which is a further refinement of the last partition):
>>> g.partition_nodes(['color', 'size'])
            n_nodes
color size
b     2           1
g     1           1
      3           2
r     1           1
Partition by ‘color’ and collect the ‘time’ values of each supernode:
>>> g.partition_nodes('color', {'time': lambda x: list(x)})
       n_nodes       time
color
b            1        [2]
g            3  [0, 0, 2]
r            1        [9]
Partition by ‘color’ and aggregate with different functions:
>>> import numpy as np
>>> g.partition_nodes('color', {'time': [lambda x: list(x), np.max],
...                             'x': [np.mean, np.sum, np.std]})
       n_nodes    x_mean  x_sum     x_std time_<lambda> time_amax
color
b            1 -1.100000   -1.1       NaN           [2]         2
g            3 -0.133333   -0.4  2.891943     [0, 0, 2]         2
r            1  2.300000    2.3       NaN           [9]         9