deepgraph.deepgraph.DeepGraph.create_edges¶
-
DeepGraph.
create_edges
(connectors=None, selectors=None, transfer_features=None, r_dtype_dic=None, no_transfer_rs=None, step_size=10000000, from_pos=0, to_pos=None, hdf_key=None, verbose=False, logfile=None)[source]¶ Create an edge table
e
linking the nodes inv
.This method enables an iterative computation of pairwise relations (edges) between the nodes represented by
v
. It does so in a flexible, efficient and vectorized fashion, easily parallelizable and with full control over RAM usage.- Connectors
The simplest use-case is to define a single connector function acting on a single column of the node table
v
. For instance, given a node tablev
>>> import pandas as pd >>> import deepgraph as dg >>> v = pd.DataFrame({'time': [0.,2.,9.], 'x': [3.,1.,12.]}) >>> g = dg.DeepGraph(v)
>>> g.v time x 0 0 3 1 2 1 2 9 12
one may define a function
>>> def time_difference(time_s, time_t): ... dt = time_t - time_s ... return dt
and pass it to
create_edges
, in order to compute the time difference of each pair of nodes>>> g.create_edges(connectors=time_difference)
>>> g.e dt s t 0 1 2 2 9 1 2 7
As one can see, the connector function takes column names of
v
with additional ‘_s’ and ‘_t’ endings (indicating source node values and target node values, respectively) as input, and returns a variable with the computed values. The resulting edge tableg.e
is indexed by the node indices (‘s’ and ‘t’, representing source and target node indices, respectively), and has one column (‘dt’, the name of the returned variable) with the computed values of the given connector. Note that only the upper triangle adjacency matrix is computed, which is always the case. See Notes for further information.One may also pass a list of functions to
connectors
, which are then computed in the list’s order. Generally, a connector function can take multiple column names ofv
(with ‘_s’ and/or ‘_t’ appended) as input, as well as already computed relations of former connectors. Also, any connector function may have multiple output variables. Every output variable has to be a 1-dimensionalnp.ndarray
(with arbitrary dtype, includingobject
). The return statement may not contain any operators, only references to each computed relation.For instance, considering the above example, one may define an additional connector
>>> def velocity(dt, x_s, x_t): ... dx = x_t - x_s ... v = dx / dt ... return v, dx
and then apply both connectors on
v
, resulting in>>> g.create_edges(connectors=[time_difference, velocity])
>>> g.e dt dx v s t 0 1 2 -2 -1.000000 2 9 9 1.000000 1 2 7 11 1.571429
- Selectors
However, one is often only interested in a subset of all possible edges. In order to select edges during the iteration process - based on some conditions on the node’s features and their computed relations - one may pass a (list of) selector function(s) to
create_edges
. For instance, given the above example, one may define a selector>>> def dt_thresh(dt, sources, targets): ... sources = sources[dt > 5] ... targets = targets[dt > 5] ... return sources, targets
and apply it in conjunction with the
time_difference
connector>>> g.create_edges(connectors=time_difference, selectors=dt_thresh)
>>> g.e dt s t 0 2 9 1 2 7
leaving only edges with a time difference larger than 5.
Every selector function must have
sources
andtargets
as input arguments as well as in the return statement. Most generally, they may depend on column names ofv
(with ‘_s’ and/or ‘_t’ appended) and/or computed relations of connector functions, and/or computed relations of former selector functions. Apart fromsources
andtargets
, they may additionally return computed relations. Given this input/output flexibility of selectors, one could in fact compute all required relations, and select any desired subset of edges, with a single selector function. The purpose of splitting connectors and/or selectors, however, is to control the iteration’s performance by consecutively computing relations and selecting edges: hierarchical selection.- Hierarchical Selection
As the algorithm iterates through the chunks of all possible source and target node indices ([0, g.n*(g.n-1)/2]), it goes through the list of
selectors
at each step. If a selector has a relation as input, it must have either been computed by a former selector, or the selector requests its computation by the corresponding connector function inconnectors
(this connector may not depend on any other not yet computed relations). Once the input relations are computed (if requested), the selector is applied and returns updated indices, which are then passed to the next selector. Hence, with each selector, the indices are reduced and consecutive computation of relations only consider the remaining indices. After all selectors have been applied, the connector functions that have not been requested by any selector are computed (on the final, reduced chunk of node and target indices).- Transferring Features
The argument
transfer_features
, which takes a (list of) column name(s) ofv
, makes it possible to transfer features ofv
to the created edge tablee
>>> g.create_edges(connectors=time_difference, ... transfer_features=['x', 'time'])
>>> g.e dt time_s time_t x_s x_t s t 0 1 2 0 2 3 1 2 9 0 9 3 12 1 2 7 2 9 1 12
If computation time and memory consumption are of no concern, one might skip the remaing paragraphs.
- Logging
Clearly, the order of the hierarchical selection as described in 3. influences the computation’s efficiency. The complexity of a relation’s computation and the (expected average) number of deleted edges of a selector should be considered primarily. In order to track and benchmark the iteration process, the progress and time measurements are printed for each iteration step, if
verbose
is set to True. Furthermore, one may create a logfile (which can also be plot bydg.DeepGraph.plot_logfile
) by setting the argumentlogfile
to a string, indicating the file name of the created logfile.- Parallelization and Memory Control
The arguments
from_pos
,to_pos
andstep_size
control the range of processed pairs of nodes and the number of pairs of nodes to process at each iteration step. They may be used for parallel computation and to control RAM usage. See Parameters for details.It is also possible to initiate
dg.DeepGraph
with apandas.HDFStore
containing the DataFrame representing the node table. Only the data requested bytransfer_features
and the user- definedconnectors
andselectors
at each iteration step is then pulled from the store, which is particularly useful for large node tables and parallel computation. The only requirement is that the node table contained in the store is in table(t) format, not fixed(f) format. For instance, considering the above created node table, one may store it in a hdf file>>> vstore = pd.HDFStore('vstore.h5') >>> vstore.put('node_table', v, format='t', index=False)
initiate a DeepGraph instance with the store
>>> g = dg.DeepGraph(vstore)
>>> g.v <class 'pandas.io.pytables.HDFStore'> File path: vstore.h5 /node_table frame_table (typ->appendable,nrows->3,ncols->2, indexers->[index])
and then create edges the same way as if
g.v
were a DataFrame>>> g.create_edges(connectors=time_difference)
>>> g.e dt s t 0 1 2 2 9 1 2 7
In case the store has multiple nodes,
hdf_key
has to be set to the node corresponding to the node table of the graph.Also, one may pass a (list of) name(s) of computed relations,
no_transfer_rs
, which should not be transferred to the created edge tablee
. This can be advantageous, for instance, if a selector depends on computed relations that are of no further interest.Furthermore, it is possible to force the dtype of computed relations with the argument
r_dtype_dic
. The dtype of a relation is then set at each iteration step, but after all selectors and connectors were processed.- Creating Edges on a Fast Track
If the selection of edges includes a simple distance threshold, i.e. a selector function defined as follows:
>>> def ft_selector(x_s, x_t, threshold, sources, targets): ... dx = x_t - x_s ... sources = sources[dx <= threshold] ... targets = targets[dx <= threshold] ... return sources, targets, dx
the method
create_edges_ft
should be considered, since it provides a much faster iteration algorithm.Parameters: - connectors (function or array_like, optional (default=None)) – User defined connector function(s) that compute pairwise relations
between the nodes in
v
. A connector accepts multiple column names ofv
(with ‘_s’ and/or ‘_t’ appended, indicating source node values and target node values, respectively) as input, as well as already computed relations of former connectors. A connector function may have multiple output variables. Every output variable has to be a 1-dimensionalnp.ndarray
(with arbitrary dtype, includingobject
). See above anddg.functions
for examplary connector functions. - selectors (function or array_like, optional (default=None)) – User defined selector function(s) that select edges during the
iteration process, based on some conditions on the node’s features
and their computed relations. Every selector function must have
sources
andtargets
as input arguments as well as in the return statement. A selector may depend on column names ofv
(with ‘_s’ and/or ‘_t’ appended) and/or computed relations of connector functions, and/or computed relations of former selector functions. Apart fromsources
andtargets
, they may also return computed relations (see connectors). See above, anddg.functions
for exemplary selector functions. - transfer_features (str, int or array_like, optional (default=None)) – A (list of) column name(s) of
v
, indicating which features ofv
to transfer toe
(appending ‘_s’ and ‘_t’ to the column names ofe
, indicating source and target node features, respectively). - r_dtype_dic (dict, optional (default=None)) – A dictionary with names of computed relations of connectors and/or
selectors as keys and dtypes as values. Forces the data types of
the computed relations in
e
during the iteration (but after all selectors and connectors were processed), otherwise infers them. - no_transfer_rs (str or array_like, optional (default=None)) – Name(s) of computed relations that are not to be transferred to the
created edge table
e
. Can be used to save memory, e.g., if a selector depends on computed relations that are of no interest otherwise. - step_size (int, optional (default=1e6)) – The number of pairs of nodes to process at each iteration step. Must be in [ 1, g.n*(g.n-1)/2 ]. Its value determines computation speed and memory consumption.
- from_pos (int, optional (default=0)) – Determines from which pair of nodes to start the iteration process.
Must be in [ 0, g.n*(g.n-1)/2 [. May be used in conjuction with
to_pos
for parallel computation. - to_pos (positive integer, optional (default=None)) – Determines at which pair of nodes to stop the iteration process
(the endpoint is excluded). Must be in [ 1, g.n*(g.n-1)/2 ] and
larger than
from_pos
. Defaults to None, which translates to the last pair of nodes, g.n*(g.n-1)/2. May be used in conjunction withfrom_pos
for parallel computation. - hdf_key (str, optional (default=None)) – If you initialized
dg.DeepGraph
with apandas.HDFStore
and the store has multiple nodes, you must pass the key to the node in the store that corresponds to the node table. - verbose (bool, optional (default=False)) – Whether to print information at each step of the iteration process.
- logfile (str, optional (default=None)) – Create a log-file named by
logfile
. Contains the time and date of the method’s call, the input arguments and time mesaurements for each iteration step. A plot oflogfile
can be created bydg.DeepGraph.plot_logfile
.
Returns: e – Set the created edge table
e
as attribute ofdg.DeepGraph
.Return type: pd.DataFrame
See also
Notes
- Input and output data types
Since connectors (and selectors) take columns of a pandas DataFrame as input, there are no restrictions on the data types of which pairwise relations are computed. In the most general case, a DataFrame’s column has
object
as dtype, and its values may then be arbitrary Python objects. The same goes for the output variables of connectors (and selectors). The only requirement is that each ouput variable is 1-dimensional.However, it is also possible to use the values of a column of
v
as references to arbitrary objects, which may sometimes be more convenient. In case a connector (or selector) needs the node’s original indices as input, one may simply copy them to a column, e.g.>>> v['indices'] = v.index
and then define the connector’s (or selector’s) input arguments accordingly.
- Connectors and selectors
The only requirement on connectors and selectors is that their input arguments and return statements are consistent with the column names of
v
and the passing of computed relations (see above, 3. Hierarchical Selection).Whatever happens inside the functions is entirely up to the user. This means, for instance, that one may wrap arbitrary functions within a connector (selector), such as optimized C functions or existing functions whose input/output is not consistent with the
create_edges
method (see, e.g., the methods provided indg.functions
,scipy
or scikit learn’ssklearn.metrics
andsklearn.neighbors.DistanceMetric
). One could also store a connector’s (selector’s) computations directly within the function, or let the function print out any desired information during iteration.- Why not compute the full adjacency matrix?
This is due to efficiency. For any asymmetric function (i.e., f(s, t) != f(t, s)), one can always create an additional connector (or output variable) that computes the mirrored values of that function.