deepgraph.deepgraph.DeepGraph.create_edges

DeepGraph.create_edges(connectors=None, selectors=None, transfer_features=None, r_dtype_dic=None, no_transfer_rs=None, step_size=10000000, from_pos=0, to_pos=None, hdf_key=None, verbose=False, logfile=None)

Create an edge table e linking the nodes in v.

This method enables an iterative computation of pairwise relations (edges) between the nodes represented by v. It does so in a flexible, efficient and vectorized fashion, easily parallelizable and with full control over RAM usage.

  1. Connectors

The simplest use-case is to define a single connector function acting on a single column of the node table v. For instance, given a node table v

>>> import pandas as pd
>>> import deepgraph as dg
>>> v = pd.DataFrame({'time': [0.,2.,9.], 'x': [3.,1.,12.]})
>>> g = dg.DeepGraph(v)
>>> g.v
   time   x
0     0   3
1     2   1
2     9  12

one may define a function

>>> def time_difference(time_s, time_t):
...     dt = time_t - time_s
...     return dt

and pass it to create_edges, in order to compute the time difference of each pair of nodes

>>> g.create_edges(connectors=time_difference)
>>> g.e
     dt
s t
0 1   2
  2   9
1 2   7

As one can see, the connector function takes column names of v with additional ‘_s’ and ‘_t’ endings (indicating source node values and target node values, respectively) as input, and returns a variable with the computed values. The resulting edge table g.e is indexed by the node indices (‘s’ and ‘t’, representing source and target node indices, respectively), and has one column (‘dt’, the name of the returned variable) with the computed values of the given connector. Note that only the upper triangle adjacency matrix is computed, which is always the case. See Notes for further information.

One may also pass a list of functions to connectors, which are then computed in the list’s order. Generally, a connector function can take multiple column names of v (with ‘_s’ and/or ‘_t’ appended) as input, as well as already computed relations of former connectors. Also, any connector function may have multiple output variables. Every output variable has to be a 1-dimensional np.ndarray (with arbitrary dtype, including object). The return statement may not contain any operators, only references to each computed relation.

For instance, considering the above example, one may define an additional connector

>>> def velocity(dt, x_s, x_t):
...     dx = x_t - x_s
...     v = dx / dt
...     return v, dx

and then apply both connectors on v, resulting in

>>> g.create_edges(connectors=[time_difference, velocity])
>>> g.e
     dt  dx         v
s t
0 1   2  -2 -1.000000
  2   9   9  1.000000
1 2   7  11  1.571429
  1. Selectors

However, one is often only interested in a subset of all possible edges. In order to select edges during the iteration process - based on some conditions on the node’s features and their computed relations - one may pass a (list of) selector function(s) to create_edges. For instance, given the above example, one may define a selector

>>> def dt_thresh(dt, sources, targets):
...     sources = sources[dt > 5]
...     targets = targets[dt > 5]
...     return sources, targets

and apply it in conjunction with the time_difference connector

>>> g.create_edges(connectors=time_difference, selectors=dt_thresh)
>>> g.e
     dt
s t
0 2   9
1 2   7

leaving only edges with a time difference larger than 5.

Every selector function must have sources and targets as input arguments as well as in the return statement. Most generally, they may depend on column names of v (with ‘_s’ and/or ‘_t’ appended) and/or computed relations of connector functions, and/or computed relations of former selector functions. Apart from sources and targets, they may additionally return computed relations. Given this input/output flexibility of selectors, one could in fact compute all required relations, and select any desired subset of edges, with a single selector function. The purpose of splitting connectors and/or selectors, however, is to control the iteration’s performance by consecutively computing relations and selecting edges: hierarchical selection.

  1. Hierarchical Selection

As the algorithm iterates through the chunks of all possible source and target node indices ([0, g.n*(g.n-1)/2]), it goes through the list of selectors at each step. If a selector has a relation as input, it must have either been computed by a former selector, or the selector requests its computation by the corresponding connector function in connectors (this connector may not depend on any other not yet computed relations). Once the input relations are computed (if requested), the selector is applied and returns updated indices, which are then passed to the next selector. Hence, with each selector, the indices are reduced and consecutive computation of relations only consider the remaining indices. After all selectors have been applied, the connector functions that have not been requested by any selector are computed (on the final, reduced chunk of node and target indices).

  1. Transferring Features

The argument transfer_features, which takes a (list of) column name(s) of v, makes it possible to transfer features of v to the created edge table e

>>> g.create_edges(connectors=time_difference,
...                transfer_features=['x', 'time'])
>>> g.e
     dt  time_s  time_t  x_s  x_t
s t
0 1   2       0       2    3    1
  2   9       0       9    3   12
1 2   7       2       9    1   12

If computation time and memory consumption are of no concern, one might skip the remaing paragraphs.

  1. Logging

Clearly, the order of the hierarchical selection as described in 3. influences the computation’s efficiency. The complexity of a relation’s computation and the (expected average) number of deleted edges of a selector should be considered primarily. In order to track and benchmark the iteration process, the progress and time measurements are printed for each iteration step, if verbose is set to True. Furthermore, one may create a logfile (which can also be plot by dg.DeepGraph.plot_logfile) by setting the argument logfile to a string, indicating the file name of the created logfile.

  1. Parallelization and Memory Control

The arguments from_pos, to_pos and step_size control the range of processed pairs of nodes and the number of pairs of nodes to process at each iteration step. They may be used for parallel computation and to control RAM usage. See Parameters for details.

It is also possible to initiate dg.DeepGraph with a pandas.HDFStore containing the DataFrame representing the node table. Only the data requested by transfer_features and the user- defined connectors and selectors at each iteration step is then pulled from the store, which is particularly useful for large node tables and parallel computation. The only requirement is that the node table contained in the store is in table(t) format, not fixed(f) format. For instance, considering the above created node table, one may store it in a hdf file

>>> vstore = pd.HDFStore('vstore.h5')
>>> vstore.put('node_table', v, format='t', index=False)

initiate a DeepGraph instance with the store

>>> g = dg.DeepGraph(vstore)
>>> g.v
<class 'pandas.io.pytables.HDFStore'>
File path: vstore.h5
/node_table            frame_table  (typ->appendable,nrows->3,ncols->2,
indexers->[index])

and then create edges the same way as if g.v were a DataFrame

>>> g.create_edges(connectors=time_difference)
>>> g.e
     dt
s t
0 1   2
  2   9
1 2   7

In case the store has multiple nodes, hdf_key has to be set to the node corresponding to the node table of the graph.

Also, one may pass a (list of) name(s) of computed relations, no_transfer_rs, which should not be transferred to the created edge table e. This can be advantageous, for instance, if a selector depends on computed relations that are of no further interest.

Furthermore, it is possible to force the dtype of computed relations with the argument r_dtype_dic. The dtype of a relation is then set at each iteration step, but after all selectors and connectors were processed.

  1. Creating Edges on a Fast Track

If the selection of edges includes a simple distance threshold, i.e. a selector function defined as follows:

>>> def ft_selector(x_s, x_t, threshold, sources, targets):
...     dx = x_t - x_s
...     sources = sources[dx <= threshold]
...     targets = targets[dx <= threshold]
...     return sources, targets, dx

the method create_edges_ft should be considered, since it provides a much faster iteration algorithm.

Parameters:
  • connectors (function or array_like, optional (default=None)) – User defined connector function(s) that compute pairwise relations between the nodes in v. A connector accepts multiple column names of v (with ‘_s’ and/or ‘_t’ appended, indicating source node values and target node values, respectively) as input, as well as already computed relations of former connectors. A connector function may have multiple output variables. Every output variable has to be a 1-dimensional np.ndarray (with arbitrary dtype, including object). See above and dg.functions for examplary connector functions.
  • selectors (function or array_like, optional (default=None)) – User defined selector function(s) that select edges during the iteration process, based on some conditions on the node’s features and their computed relations. Every selector function must have sources and targets as input arguments as well as in the return statement. A selector may depend on column names of v (with ‘_s’ and/or ‘_t’ appended) and/or computed relations of connector functions, and/or computed relations of former selector functions. Apart from sources and targets, they may also return computed relations (see connectors). See above, and dg.functions for exemplary selector functions.
  • transfer_features (str, int or array_like, optional (default=None)) – A (list of) column name(s) of v, indicating which features of v to transfer to e (appending ‘_s’ and ‘_t’ to the column names of e, indicating source and target node features, respectively).
  • r_dtype_dic (dict, optional (default=None)) – A dictionary with names of computed relations of connectors and/or selectors as keys and dtypes as values. Forces the data types of the computed relations in e during the iteration (but after all selectors and connectors were processed), otherwise infers them.
  • no_transfer_rs (str or array_like, optional (default=None)) – Name(s) of computed relations that are not to be transferred to the created edge table e. Can be used to save memory, e.g., if a selector depends on computed relations that are of no interest otherwise.
  • step_size (int, optional (default=1e6)) – The number of pairs of nodes to process at each iteration step. Must be in [ 1, g.n*(g.n-1)/2 ]. Its value determines computation speed and memory consumption.
  • from_pos (int, optional (default=0)) – Determines from which pair of nodes to start the iteration process. Must be in [ 0, g.n*(g.n-1)/2 [. May be used in conjuction with to_pos for parallel computation.
  • to_pos (positive integer, optional (default=None)) – Determines at which pair of nodes to stop the iteration process (the endpoint is excluded). Must be in [ 1, g.n*(g.n-1)/2 ] and larger than from_pos. Defaults to None, which translates to the last pair of nodes, g.n*(g.n-1)/2. May be used in conjunction with from_pos for parallel computation.
  • hdf_key (str, optional (default=None)) – If you initialized dg.DeepGraph with a pandas.HDFStore and the store has multiple nodes, you must pass the key to the node in the store that corresponds to the node table.
  • verbose (bool, optional (default=False)) – Whether to print information at each step of the iteration process.
  • logfile (str, optional (default=None)) – Create a log-file named by logfile. Contains the time and date of the method’s call, the input arguments and time mesaurements for each iteration step. A plot of logfile can be created by dg.DeepGraph.plot_logfile.
Returns:

e – Set the created edge table e as attribute of dg.DeepGraph.

Return type:

pd.DataFrame

Notes

  1. Input and output data types

Since connectors (and selectors) take columns of a pandas DataFrame as input, there are no restrictions on the data types of which pairwise relations are computed. In the most general case, a DataFrame’s column has object as dtype, and its values may then be arbitrary Python objects. The same goes for the output variables of connectors (and selectors). The only requirement is that each ouput variable is 1-dimensional.

However, it is also possible to use the values of a column of v as references to arbitrary objects, which may sometimes be more convenient. In case a connector (or selector) needs the node’s original indices as input, one may simply copy them to a column, e.g.

>>> v['indices'] = v.index

and then define the connector’s (or selector’s) input arguments accordingly.

  1. Connectors and selectors

The only requirement on connectors and selectors is that their input arguments and return statements are consistent with the column names of v and the passing of computed relations (see above, 3. Hierarchical Selection).

Whatever happens inside the functions is entirely up to the user. This means, for instance, that one may wrap arbitrary functions within a connector (selector), such as optimized C functions or existing functions whose input/output is not consistent with the create_edges method (see, e.g., the methods provided in dg.functions, scipy or scikit learn’s sklearn.metrics and sklearn.neighbors.DistanceMetric). One could also store a connector’s (selector’s) computations directly within the function, or let the function print out any desired information during iteration.

  1. Why not compute the full adjacency matrix?

This is due to efficiency. For any asymmetric function (i.e., f(s, t) != f(t, s)), one can always create an additional connector (or output variable) that computes the mirrored values of that function.