deepgraph.deepgraph.DeepGraph.create_edges_ft

DeepGraph.create_edges_ft(ft_feature, connectors=None, selectors=None, transfer_features=None, r_dtype_dic=None, no_transfer_rs=None, min_chunk_size=1000, max_pairs=10000000, from_pos=0, to_pos=None, hdf_key=None, verbose=False, logfile=None)

Create (ft) an edge table e linking the nodes in v.

This method implements the same functionalities as create_edges, with the difference of providing a much quicker iteration algorithm based on a so-called fast-track feature. It is advised to read the docstring of create_edges before this one, since only the differences are explained in the following.

Apart from the hierarchical selection through connectors and selectors as described in the method create_edges (see 1.-3.), this method necessarily includes the (internal) selector function

>>> def ft_selector(ftf_s, ftf_t, ftt, sources, targets):
...     ft_r = ftf_t - ftf_s
...     sources = sources[ft_r <= ftt]
...     targets = targets[ft_r <= ftt]
...     return sources, targets, ft_r

where ftf is the fast-track feature (a column name of v), ftt the fast-track threshold (a positive number), and ft_r the computed fast-track relation. The argument ft_feature, which has to be a tuple (ftf, ftt), determines these variables.

The Fast-Track Feature
The simplest use-case, therefore, is to only pass ft_feature. For instance, given a node table

>>> import pandas as pd
>>> import deepgraph as dg
>>> v = pd.DataFrame({'time': [-3.6,-1.1,1.4,4., 6.3],
...                   'x': [-3.,3.,1.,12.,7.]})
>>> g = dg.DeepGraph(v)

>>> g.v
   time   x
0  -3.6  -3
1  -1.1   3
2   1.4   1
3   4.0  12
4   6.3   7

one may create and select edges by

>>> g.create_edges_ft(ft_feature=('time', 5))

>>> g.e
     ft_r
s t
0 1   2.5
  2   5.0
1 2   2.5
2 3   2.6
  4   4.9
3 4   2.3

leaving only edges with a time difference smaller than (or equal to) ftt = 5. Note that the node table always has to be sorted by the fast-track feature. This is because the algorithm only processes pairs of nodes whose fast-track relation is smaller than (or equal to) the fast-track threshold, and the (pre)determination of these pairs relies on a sorted DataFrame.

Hierarchical Selection
Additionally, one may define connectors and selectors as described in create_edges (see 1.-3.). By default, the (internal) fast-track selector is applied first. Its order of application, however, may be determined by inserting the string ‘ft_selector’ in the desired position of the list of selectors.

The remaining arguments are as described in create_edges, apart from min_chunk_size, max_pairs, from_pos and to_pos. If computation time and/or memory consumption are a concern, one may therefore read the remaining paragraph.

Parallelization and Memory Control on a FastTrack
At each iteration step, the algorithm takes a number of nodes (n = min_chunk_size, per default n=1000) and computes the fast-track relation (distance) between the last node and the first node, d_ftf = ftf_last - ftf_first. In case d_ftf > ftt, all nodes with a fast-track feature < ftf_last - ftt are considered source nodes, and their relations with all n nodes are computed (hierarchical selection). In case d_ftf <= ftt, n is increased such that d_ftf > ftt. This might lead to a large number of pairs of nodes to process at a given iteration step. In order to control memory consumption, one might therefore set max_pairs to a suitable value, triggering a subiteration if this value is exceeded.

In order to parallelize the iterative computation, one may pass the arguments from_pos and to_pos. They determine the range of source nodes to process (endpoint excluded). Hence, from_pos has to be in [0, g.n[, and to_pos in [1, g.n]. For instance, given the node table above

>>> g.v
   time   x
0  -3.6  -3
1  -1.1   3
2   1.4   1
3   4.0  12
4   6.3   7

we can compute all relations of the source nodes in [1,3[ by

>>> g.create_edges_ft(ft_feature=('time', 5), from_pos=1, to_pos=3)

>>> g.e
     ft_r
s t
1 2   2.5
2 3   2.6
  4   4.9

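To make the role of the sorted fast-track feature and of from_pos/to_pos more concrete, the following is a minimal NumPy sketch. It is not deepgraph's implementation: the helper ft_pairs is hypothetical and processes one source node at a time, whereas the method described above works on chunks of nodes. It only illustrates why sorting allows the admissible targets of each source to be found with a binary search, and why disjoint source ranges can be processed independently.

```python
import numpy as np

def ft_pairs(values, ftt, from_pos=0, to_pos=None):
    """Hypothetical helper (not part of deepgraph).

    `values` must be sorted in ascending order. Returns the arrays
    (sources, targets, ft_r) of all pairs s < t with
    values[t] - values[s] <= ftt, restricted to sources in [from_pos, to_pos).
    """
    values = np.asarray(values, dtype=float)
    to_pos = len(values) if to_pos is None else to_pos
    sources, targets, ft_r = [], [], []
    for s in range(from_pos, to_pos):
        # Differences to all later nodes; sorted because `values` is sorted.
        diff = values[s + 1:] - values[s]
        # Binary search for the number of targets within the threshold.
        k = int(np.searchsorted(diff, ftt, side="right"))
        sources.extend([s] * k)
        targets.extend(range(s + 1, s + 1 + k))
        ft_r.extend(diff[:k])
    return (np.array(sources, dtype=int),
            np.array(targets, dtype=int),
            np.array(ft_r, dtype=float))

time = np.array([-3.6, -1.1, 1.4, 4.0, 6.3])

# All edges, as in the first example.
s, t, r = ft_pairs(time, 5)

# Only the source nodes in [1, 3[, as in the from_pos/to_pos example.
ps, pt, pr = ft_pairs(time, 5, from_pos=1, to_pos=3)

# Disjoint source ranges can be computed separately (e.g. by different
# workers) and concatenated afterwards.
chunks = [ft_pairs(time, 5, a, b) for a, b in [(0, 2), (2, 5)]]
merged_s = np.concatenate([c[0] for c in chunks])
merged_t = np.concatenate([c[1] for c in chunks])
```

Under these assumptions, the pairs in (s, t) correspond to the index of the first edge table shown above, and (ps, pt) to the one obtained with from_pos=1 and to_pos=3.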
Like create_edges, this method also works with a pd.HDFStore containing the DataFrame representing the node table. Only the data requested by ft_feature, transfer_features and the user-defined connectors and selectors at each iteration step is then pulled from the store. The node table in the store has to be in table(t) format, and additionally, the fast-track feature has to be a data column. For instance, storing the above node table

>>> vstore = pd.HDFStore('vstore.h5')
>>> vstore.put('node_table', v, format='t', data_columns=True,
...            index=False)

one may initiate a DeepGraph instance with the store

>>> g = dg.DeepGraph(vstore)

>>> g.v
<class 'pandas.io.pytables.HDFStore'>
File path: vstore.h5
/node_table            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[time,x])

and then create edges the same way as if g.v were a DataFrame

>>> g.create_edges_ft(ft_feature=('time', 5), from_pos=1, to_pos=3)

>>> g.e
     ft_r
s t
1 2   2.5
2 3   2.6
  4   4.9

Warning
There is no assertion whether the node table in a store is sorted by the fast-track feature! The result of an unsorted table is unpredictable, and generally not correct.
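Since sortedness is not checked, it may be worth verifying it before creating edges. The following is a small sketch using plain pandas, assuming the node table (or at least its fast-track column) fits in memory. The helper is_sorted_by is hypothetical and not part of deepgraph. For a node table kept in a store, the fast-track column could presumably be read on its own first, e.g. with pd.HDFStore.select_column.

```python
import pandas as pd

def is_sorted_by(df, column):
    """Hypothetical helper: True if `column` is in ascending order."""
    return bool(df[column].is_monotonic_increasing)

# An unsorted node table.
v = pd.DataFrame({"time": [6.3, -3.6, 4.0, -1.1, 1.4],
                  "x": [-3.0, 3.0, 1.0, 12.0, 7.0]})

was_sorted = is_sorted_by(v, "time")

if not was_sorted:
    # A stable sort keeps the original order of nodes with equal values.
    v = v.sort_values("time", kind="mergesort")
```

After sorting, the original index labels are preserved, so edges can still be traced back to the original rows.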
Parameters:

- ft_feature (tuple) – A tuple (ftf, ftt), where ftf is a column name of v (the fast-track feature) and ftt a positive number (the fast-track threshold). The fast-track feature may contain integers or floats, but datetime-like values are also accepted. In that case, ft_feature has to be a tuple of length 3, (ftf, ftt, dt_unit), where dt_unit is one of {‘D’, ‘h’, ‘m’, ‘s’, ‘ms’, ‘us’, ‘ns’}:
  - D: days
  - h: hours
  - m: minutes
  - s: seconds
  - ms: milliseconds
  - us: microseconds
  - ns: nanoseconds
  determining the unit in which the temporal distance is measured. The variable name of the fast-track relation transferred to e is ft_r.
- connectors (function or array_like, optional (default=None)) – User defined connector function(s) that compute pairwise relations between the nodes in v. A connector accepts multiple column names of v (with ‘_s’ and/or ‘_t’ appended, indicating source node values and target node values, respectively) as input, as well as already computed relations of former connectors. A connector function may have multiple output variables. Every output variable has to be a 1-dimensional np.ndarray (with arbitrary dtype, including object). A connector may also depend on the fast-track relations (‘ft_r’). See dg.functions for exemplary connector functions.
- selectors (function or array_like, optional (default=None)) – User defined selector function(s) that select edges during the iteration process, based on some conditions on the nodes’ features and their computed relations. Every selector function must have sources and targets as input arguments as well as in the return statement. A selector may depend on column names of v (with ‘_s’ and/or ‘_t’ appended) and/or computed relations of connector functions, and/or computed relations of former selector functions. Apart from sources and targets, they may also return computed relations (see connectors). A selector may also depend on the fast-track relations (‘ft_r’). See dg.functions for exemplary selector functions.
  Note: To specify the hierarchical order of the selection by the fast-track selector, insert the string ‘ft_selector’ in the corresponding position of the selectors list. Otherwise, computation of ft_r and selection by the fast-track selector is carried out first.
- transfer_features (str, int or array_like, optional (default=None)) – A (list of) column name(s) of v, indicating which features of v to transfer to e (appending ‘_s’ and ‘_t’ to the column names of e, indicating source and target node features, respectively).
- r_dtype_dic (dict, optional (default=None)) – A dictionary with names of computed relations of connectors and/or selectors as keys and dtypes as values. Forces the data types of the computed relations in e during the iteration (but after all selectors and connectors were processed), otherwise infers them.
- no_transfer_rs (str or array_like, optional (default=None)) – Name(s) of computed relations that are not to be transferred to the created edge table e. Can be used to save memory, e.g., if a selector depends on computed relations that are of no interest otherwise.
- min_chunk_size (int, optional (default=1000)) – The minimum number of nodes to form pairs of at each iteration step. See above for details.
- max_pairs (positive integer, optional (default=10000000)) – The maximum number of pairs of nodes to process at any given iteration step. If the number is exceeded, a memory saving subiteration is applied.
- from_pos (int, optional (default=0)) – The locational index (.iloc) of v to start the iteration. Determines the range of source nodes to process, in conjunction with to_pos. Has to be in [0, g.n[, and smaller than to_pos. See above for details and an example.
- to_pos (int, optional (default=None)) – The locational index (.iloc) of v to end the iteration (excluded). Determines the range of source nodes to process, in conjunction with from_pos. Has to be in [1, g.n], and larger than from_pos. Defaults to None, which translates to the last node of v, to_pos=g.n. See above for details and an example.
- hdf_key (str, optional (default=None)) – If you initialized dg.DeepGraph with a pandas.HDFStore and the store has multiple nodes, you must pass the key to the node in the store that corresponds to the node table.
- verbose (bool, optional (default=False)) – Whether to print information at each step of the iteration process.
- logfile (str, optional (default=None)) – Create a log-file named by logfile. Contains the time and date of the method’s call, the input arguments and time measurements for each iteration step. A plot of logfile can be created by dg.DeepGraph.plot_logfile.
Returns: e – Set the created edge table e as attribute of dg.DeepGraph.

Return type: pd.DataFrame
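As an illustration of the conventions for connectors and selectors described in the parameter list, here is a hedged sketch with hypothetical functions (x_dist, velocity and dx_at_least_5 are made up for this example and are not part of dg.functions). The functions are called by hand on the six edges of the first example rather than through deepgraph, so the way arguments are matched to columns and relations by name is an assumption based on the description above.

```python
import numpy as np

# Features of the example node table.
time = np.array([-3.6, -1.1, 1.4, 4.0, 6.3])
x = np.array([-3.0, 3.0, 1.0, 12.0, 7.0])

def x_dist(x_s, x_t):
    """Connector: difference of the feature 'x' (target minus source)."""
    dx = x_t - x_s
    return dx

def velocity(dx, ft_r):
    """Connector using a former relation (dx) and the fast-track relation."""
    v = dx / ft_r
    return v

def dx_at_least_5(dx, sources, targets):
    """Selector: keep only pairs with |dx| >= 5."""
    keep = np.abs(dx) >= 5
    return sources[keep], targets[keep]

# Manual emulation on the edges selected by the fast-track selector.
sources = np.array([0, 0, 1, 2, 2, 3])
targets = np.array([1, 2, 2, 3, 4, 4])
ft_r = time[targets] - time[sources]

dx = x_dist(x[sources], x[targets])
v = velocity(dx, ft_r)
kept_s, kept_t = dx_at_least_5(dx, sources, targets)

# With deepgraph, such functions would presumably be passed as
# g.create_edges_ft(ft_feature=('time', 5),
#                   connectors=[x_dist, velocity],
#                   selectors=['ft_selector', dx_at_least_5])
```

Placing ‘ft_selector’ first in the selectors list reproduces the default order, in which the fast-track selection is carried out before any user-defined selector.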
See also
Notes
The parameter min_chunk_size enforces a vectorized iteration, and changing its value can either accelerate or slow down the computation. This depends mostly on the distribution of values of the fast-track feature, and the complexity of the given connectors and selectors. Use the logging capabilities to determine a good value.

When using a pd.HDFStore for the computation, the following advice might be considered. Recall that the only requirements on the node in the store are: the format is table(t), not fixed(f); the node is sorted by the fast-track feature; and the fast-track feature is a data column.

The recommended procedure of storing a given node table v in a store is the following (using the above node table):

>>> vstore = pd.HDFStore('vstore.h5')
>>> vstore.put('node_table', v, format='t', data_columns=True,
...            index=False)

Setting index=False significantly decreases the time to construct the node in the store, and also reduces the resulting file size. It has no impact, however, on the capability of querying the store (with the pd.HDFStore.select* methods).
However, there are two reasons one might want to create a pytables index of the fast-track feature:
1. The node table might be too large to be sorted in memory. To sort it on disc, one may proceed as follows. Assuming an unsorted (large) node table

>>> v = pd.DataFrame({'time': [6.3,-3.6,4.,-1.1,1.4],
...                   'x': [-3.,3.,1.,12.,7.]})

>>> v
   time   x
0   6.3  -3
1  -3.6   3
2   4.0   1
3  -1.1  12
4   1.4   7

one stores it as recommended

>>> vstore = pd.HDFStore('vstore.h5')
>>> vstore.put('node_table', v, format='t', data_columns=True,
...            index=False)
>>> vstore.get_storer('node_table').group.table
/node_table/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "time": Float64Col(shape=(), dflt=0.0, pos=1),
  "x": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)

creates a (full) pytables index of the fast-track feature

>>> vstore.create_table_index('node_table', columns=['time'],
...                           kind='full')
>>> vstore.get_storer('node_table').group.table
/node_table/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "time": Float64Col(shape=(), dflt=0.0, pos=1),
  "x": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "time": Index(6, full, shuffle, zlib(1)).is_csi=True}

and then sorts it on disc with

>>> vstore.close()
>>> !ptrepack --chunkshape=auto --sortby=time vstore.h5 s_vstore.h5
>>> s_vstore = pd.HDFStore('s_vstore.h5')

>>> s_vstore.node_table
   time   x
1  -3.6   3
3  -1.1  12
4   1.4   7
2   4.0   1
0   6.3  -3

2. To speed up the internal queries on the fast-track feature, one may index the sorted store:

>>> s_vstore.create_table_index('node_table', columns=['time'],
...                             kind='full')

See http://stackoverflow.com/questions/17893370/ptrepack-sortby-needs-full-index and https://gist.github.com/michaelaye/810bd0720bb1732067ff for details, benchmarks, and the effects of compressing the store.