deepgraph.deepgraph.DeepGraph.create_edges_ft

DeepGraph.create_edges_ft(ft_feature, connectors=None, selectors=None, transfer_features=None, r_dtype_dic=None, no_transfer_rs=None, min_chunk_size=1000, max_pairs=10000000, from_pos=0, to_pos=None, hdf_key=None, verbose=False, logfile=None)

Create (ft) an edge table e linking the nodes in v.

This method provides the same functionality as create_edges, but uses a much quicker iteration algorithm based on a so-called fast-track feature. It is advised to read the docstring of create_edges before this one, since only the differences are explained in the following.

Apart from the hierarchical selection through connectors and selectors as described in the method create_edges (see 1.-3.), this method necessarily includes the (internal) selector function

>>> def ft_selector(ftf_s, ftf_t, ftt, sources, targets):
...     ft_r = ftf_t - ftf_s
...     sources = sources[ft_r <= ftt]
...     targets = targets[ft_r <= ftt]
...     return sources, targets, ft_r

where ftf is the fast-track feature (a column name of v), ftt the fast-track threshold (a positive number), and ft_r the computed fast-track relation. The argument ft_feature, which has to be a tuple (ftf, ftt), determines these variables.
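This selection can be reproduced with plain numpy. The following is a self-contained sketch, where the ftf values and the candidate index pairs are purely illustrative:

```python
import numpy as np

# Runnable sketch of the fast-track selection shown above. The ftf
# values and the candidate (source, target) index pairs are illustrative.
ftf = np.array([-3.6, -1.1, 1.4, 4.0, 6.3])  # fast-track feature
sources = np.array([0, 0, 0])                # candidate source indices
targets = np.array([1, 2, 3])                # candidate target indices
ftt = 5                                      # fast-track threshold

ft_r = ftf[targets] - ftf[sources]  # fast-track relation per pair
keep = ft_r <= ftt                  # threshold selection
sources, targets, ft_r = sources[keep], targets[keep], ft_r[keep]
```

Only the pairs whose fast-track relation does not exceed ftt survive the selection; the pair (0, 3) with ft_r = 7.6 is dropped.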

  1. The Fast-Track Feature

The simplest use-case, therefore, is to only pass ft_feature. For instance, given a node table

>>> import pandas as pd
>>> import deepgraph as dg
>>> v = pd.DataFrame({'time': [-3.6,-1.1,1.4,4., 6.3],
...                   'x': [-3.,3.,1.,12.,7.]})
>>> g = dg.DeepGraph(v)
>>> g.v
   time   x
0  -3.6  -3
1  -1.1   3
2   1.4   1
3   4.0  12
4   6.3   7

one may create and select edges by

>>> g.create_edges_ft(ft_feature=('time', 5))
>>> g.e
     ft_r
s t
0 1   2.5
  2   5.0
1 2   2.5
2 3   2.6
  4   4.9
3 4   2.3

leaving only edges with a time difference smaller than (or equal to) ftt = 5. Note that the node table always has to be sorted by the fast-track feature: the algorithm only processes pairs of nodes whose fast-track relation is smaller than (or equal to) the fast-track threshold, and the (pre)determination of these pairs relies on a sorted DataFrame.
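The role of the sort order can be made concrete. In a sorted fast-track feature, once a target node is too far from a given source node, all later target nodes are too, so the scan for candidate pairs can stop early. A minimal sketch (illustrative, not the actual implementation):

```python
import numpy as np

# Sketch of why sorting matters (not the actual implementation, which
# processes chunks of nodes): on a sorted fast-track feature, once a
# target exceeds the threshold, the inner scan can stop early.
ftf = np.array([-3.6, -1.1, 1.4, 4.0, 6.3])  # the sorted 'time' column
ftt = 5.0
pairs, ft_r = [], []
for i in range(len(ftf) - 1):
    for j in range(i + 1, len(ftf)):
        r = ftf[j] - ftf[i]
        if r > ftt:
            break  # sorted ftf: every later target is even farther away
        pairs.append((i, j))
        ft_r.append(r)
```

This reproduces exactly the edge table of the example above. On an unsorted table, the early break would skip valid pairs, which is why the sort requirement is essential.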

  2. Hierarchical Selection

Additionally, one may define connectors and selectors as described in create_edges (see 1.-3.). By default, the (internal) fast-track selector is applied first. Its order of application, however, may be determined by inserting the string ‘ft_selector’ in the desired position of the list of selectors.

The remaining arguments are as described in create_edges, apart from min_chunk_size, max_pairs, from_pos and to_pos. These only matter if computation time and/or memory consumption are a concern, in which case one should read the following paragraph.

  3. Parallelization and Memory Control on a Fast Track

At each iteration step, the algorithm takes a number of nodes (n = min_chunk_size, per default n=1000) and computes the fast-track relation (distance) between the last node and the first node, d_ftf = ftf_last - ftf_first. In case d_ftf > ftt, all nodes with a fast-track feature < ftf_last - ftt are considered source nodes, and their relations with all n nodes are computed (hierarchical selection). In case d_ftf <= ftt, n is increased such that d_ftf > ftt. This might lead to a large number of pairs of nodes to process at a given iteration step. In order to control memory consumption, one might therefore set max_pairs to a suitable value, triggering a subiteration whenever this value is exceeded.
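The chunk-growing rule can be sketched as follows. This is an illustrative helper with hypothetical names, not the actual implementation:

```python
import numpy as np

# Sketch of the chunk-growing rule described above (hypothetical helper,
# not the actual implementation): start with min_chunk_size nodes and
# enlarge the chunk until its fast-track span d_ftf exceeds ftt, or the
# end of the table is reached.
def grow_chunk(ftf, start, min_chunk_size, ftt):
    n = min(min_chunk_size, len(ftf) - start)
    while start + n < len(ftf) and ftf[start + n - 1] - ftf[start] <= ftt:
        n += 1
    return n

# With the example node table: the span of the first two nodes (2.5)
# and the first three nodes (5.0) is still <= ftt, so the chunk grows
# until the span (7.6) exceeds ftt.
ftf = np.array([-3.6, -1.1, 1.4, 4.0, 6.3])
chunk = grow_chunk(ftf, start=0, min_chunk_size=2, ftt=5.0)
```

A narrow distribution of fast-track values forces large chunks (many pairs per step), which is exactly the situation max_pairs is meant to keep in check.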

In order to parallelize the iterative computation, one may pass the arguments from_pos and to_pos. They determine the range of source nodes to process (endpoint excluded). Hence, from_pos has to be in [0, g.n[, and to_pos in [1,g.n]. For instance, given the node table above

>>> g.v
   time   x
0  -3.6  -3
1  -1.1   3
2   1.4   1
3   4.0  12
4   6.3   7

we can compute all relations of the source nodes in [1,3[ by

>>> g.create_edges_ft(ft_feature=('time', 5), from_pos=1, to_pos=3)
>>> g.e
     ft_r
s t
1 2   2.5
2 3   2.6
  4   4.9
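Since disjoint (from_pos, to_pos) ranges yield disjoint sets of source nodes, the partial edge tables can be computed independently (e.g., one range per process) and concatenated afterwards. A sketch of one way to split the source range, where the helper name is hypothetical:

```python
# Hypothetical helper (not part of deepgraph): split the source-node
# range [0, n) into n_jobs contiguous, non-overlapping (from_pos,
# to_pos) ranges. Each range could be handed to a separate
# create_edges_ft call, and the partial edge tables concatenated
# with pd.concat afterwards.
def split_ranges(n, n_jobs):
    bounds = [i * n // n_jobs for i in range(n_jobs + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_jobs)]

ranges = split_ranges(5, 2)  # e.g., two workers for the 5-node table
```

The ranges cover [0, n) exactly once, so no edge is computed twice and none is missed.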

Like create_edges, this method also works with a pd.HDFStore containing the DataFrame representing the node table. Only the data requested by ft_feature, transfer_features and the user-defined connectors and selectors at each iteration step is then pulled from the store. The node table in the store has to be in table(t) format, and additionally, the fast-track feature has to be a data column. For instance, storing the above node table

>>> vstore = pd.HDFStore('vstore.h5')
>>> vstore.put('node_table', v, format='t', data_columns=True,
...            index=False)

one may initiate a DeepGraph instance with the store

>>> g = dg.DeepGraph(vstore)
>>> g.v
<class 'pandas.io.pytables.HDFStore'>
File path: vstore.h5
/node_table            frame_table  (typ->appendable,nrows->5,ncols->2,
indexers->[index],dc->[time,x])

and then create edges the same way as if g.v were a DataFrame

>>> g.create_edges_ft(ft_feature=('time', 5), from_pos=1, to_pos=3)
>>> g.e
     ft_r
s t
1 2   2.5
2 3   2.6
  4   4.9

Warning

There is no assertion whether the node table in a store is sorted by the fast-track feature! The result of an unsorted table is unpredictable, and generally not correct.
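Since the method performs no such assertion, one may verify sortedness manually before creating edges. A minimal sketch (the helper name is hypothetical); for a node table in a store, the fast-track column can be read without loading the whole table via pd.HDFStore.select_column:

```python
import numpy as np

# Hypothetical helper: check that the fast-track feature is
# monotonically non-decreasing before creating edges. For a store,
# the column values could be obtained with
# store.select_column('node_table', 'time').values.
def is_ft_sorted(ftf_values):
    ftf_values = np.asarray(ftf_values)
    return bool(np.all(np.diff(ftf_values) >= 0))
```

For large stores, sorting on disc with ptrepack (see the Notes below) is the recommended remedy if this check fails.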

Parameters:
  • ft_feature (tuple) –

    A tuple (ftf, ftt), where ftf is a column name of v (the fast-track feature) and ftt a positive number (the fast-track threshold). The fast-track feature may contain integers or floats, but datetime-like values are also accepted. In that case, ft_feature has to be a tuple of length 3, (ftf, ftt, dt_unit), where dt_unit is one of {‘D’,’h’,’m’,’s’,’ms’,’us’,’ns’}:

    • D: days
    • h: hours
    • m: minutes
    • s: seconds
    • ms: milliseconds
    • us: microseconds
    • ns: nanoseconds

    determining the unit in which the temporal distance is measured. The variable name of the fast-track relation transferred to e is ft_r.

  • connectors (function or array_like, optional (default=None)) – User defined connector function(s) that compute pairwise relations between the nodes in v. A connector accepts multiple column names of v (with ‘_s’ and/or ‘_t’ appended, indicating source node values and target node values, respectively) as input, as well as already computed relations of former connectors. A connector function may have multiple output variables. Every output variable has to be a 1-dimensional np.ndarray (with arbitrary dtype, including object). A connector may also depend on the fast-track relations (‘ft_r’). See dg.functions for exemplary connector functions.
  • selectors (function or array_like, optional (default=None)) –

    User defined selector function(s) that select edges during the iteration process, based on some conditions on the node’s features and their computed relations. Every selector function must have sources and targets as input arguments as well as in the return statement. A selector may depend on column names of v (with ‘_s’ and/or ‘_t’ appended) and/or computed relations of connector functions, and/or computed relations of former selector functions. Apart from sources and targets, they may also return computed relations (see connectors). A selector may also depend on the fast-track relations (‘ft_r’). See dg.functions for exemplary selector functions.

    Note: To specify the hierarchical order of the selection by the fast-track selector, insert the string ‘ft_selector’ in the corresponding position of the selectors list. Otherwise, computation of ft_r and selection by the fast-track selector is carried out first.

  • transfer_features (str, int or array_like, optional (default=None)) – A (list of) column name(s) of v, indicating which features of v to transfer to e (appending ‘_s’ and ‘_t’ to the column names of e, indicating source and target node features, respectively).
  • r_dtype_dic (dict, optional (default=None)) – A dictionary with names of computed relations of connectors and/or selectors as keys and dtypes as values. Forces the data types of the computed relations in e during the iteration (but after all selectors and connectors were processed), otherwise infers them.
  • no_transfer_rs (str or array_like, optional (default=None)) – Name(s) of computed relations that are not to be transferred to the created edge table e. Can be used to save memory, e.g., if a selector depends on computed relations that are of no interest otherwise.
  • min_chunk_size (int, optional (default=1000)) – The minimum number of nodes to form pairs of at each iteration step. See above for details.
  • max_pairs (positive integer, optional (default=1e7)) – The maximum number of pairs of nodes to process at any given iteration step. If the number is exceeded, a memory-saving subiteration is applied.
  • from_pos (int, optional (default=0)) – The locational index (.iloc) of v to start the iteration. Determines the range of source nodes to process, in conjunction with to_pos. Has to be in [0, g.n[, and smaller than to_pos. See above for details and an example.
  • to_pos (int, optional (default=None)) – The locational index (.iloc) of v to end the iteration (excluded). Determines the range of source nodes to process, in conjunction with from_pos. Has to be in [1, g.n], and larger than from_pos. Defaults to None, which translates to the last node of v, to_pos=g.n. See above for details and an example.
  • hdf_key (str, optional (default=None)) – If you initialized dg.DeepGraph with a pandas.HDFStore and the store has multiple nodes, you must pass the key to the node in the store that corresponds to the node table.
  • verbose (bool, optional (default=False)) – Whether to print information at each step of the iteration process.
  • logfile (str, optional (default=None)) – Create a log-file named by logfile. Contains the time and date of the method’s call, the input arguments and time measurements for each iteration step. A plot of logfile can be created by dg.DeepGraph.plot_logfile.
Returns:

e – Set the created edge table e as attribute of dg.DeepGraph.

Return type:

pd.DataFrame

See also

create_edges()

Notes

The parameter min_chunk_size enforces a vectorized iteration, and changing its value can either accelerate or slow down computation. This depends mostly on the distribution of values of the fast-track feature and the complexity of the given connectors and selectors. Use the logging capabilities to determine a good value.

When using a pd.HDFStore for the computation, the following advice may be helpful. Recall that the only requirements on the node table in the store are: the format is table (t), not fixed (f); the table is sorted by the fast-track feature; and the fast-track feature is a data column.

The recommended procedure of storing a given node table v in a store is the following (using the above node table):

>>> vstore = pd.HDFStore('vstore.h5')
>>> vstore.put('node_table', v, format='t', data_columns=True,
...            index=False)

Setting index=False significantly decreases the time to construct the node in the store, and also reduces the resulting file size. It has no impact, however, on the capability of querying the store (with the pd.HDFStore.select* methods).

However, there are two reasons one might want to create a pytables index of the fast-track feature:

1. The node table might be too large to be sorted in memory. To sort it on disc, one may proceed as follows. Assuming an unsorted (large) node table

>>> v = pd.DataFrame({'time': [6.3,-3.6,4.,-1.1,1.4],
...                   'x': [-3.,3.,1.,12.,7.]})
>>> v
   time   x
0   6.3  -3
1  -3.6   3
2   4.0   1
3  -1.1  12
4   1.4   7

one stores it as recommended

>>> vstore = pd.HDFStore('vstore.h5')
>>> vstore.put('node_table', v, format='t', data_columns=True,
...            index=False)
>>> vstore.get_storer('node_table').group.table
/node_table/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "time": Float64Col(shape=(), dflt=0.0, pos=1),
  "x": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)

creates a (full) pytables index of the fast-track feature

>>> vstore.create_table_index('node_table', columns=['time'],
...                           kind='full')
>>> vstore.get_storer('node_table').group.table
/node_table/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "time": Float64Col(shape=(), dflt=0.0, pos=1),
  "x": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "time": Index(6, full, shuffle, zlib(1)).is_csi=True}

and then sorts it on disc with

>>> vstore.close()
>>> !ptrepack --chunkshape=auto --sortby=time vstore.h5 s_vstore.h5
>>> s_vstore = pd.HDFStore('s_vstore.h5')
>>> s_vstore.node_table
   time   x
1  -3.6   3
3  -1.1  12
4   1.4   7
2   4.0   1
0   6.3  -3
2. To speed up the internal queries on the fast-track feature, one may create the index on the sorted store
>>> s_vstore.create_table_index('node_table', columns=['time'],
...                             kind='full')

See http://stackoverflow.com/questions/17893370/ptrepack-sortby-needs-full-index and https://gist.github.com/michaelaye/810bd0720bb1732067ff for details, benchmarks, and the effects of compressing the store.