G-TAD: Sub-Graph Localization for Temporal Action Detection
Figure 1. Graph formulation of a video. Nodes: video snippets (a video snippet is defined as consecutive frames within a short time period). Edges: snippet-snippet correlations. Sub-graphs: actions associated with context. There are 4 types of nodes: action, start, end, and background, shown as colored dots. There are 2 types of edges: (1) temporal edges, which are pre-defined according to the snippets' temporal order; (2) semantic edges, which are learned from node features.
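The graph formulation in Figure 1 can be illustrated with a minimal sketch. It assumes that temporal edges link each snippet to its immediate predecessor and successor, and that semantic edges link each snippet to its k nearest neighbours in feature space under Euclidean distance. The caption only states that semantic edges are learned from node features, so the k-nearest-neighbour rule and the function names `temporal_edges` and `semantic_edges` are illustrative assumptions.

```python
import numpy as np


def temporal_edges(num_nodes):
    """Edges between temporally adjacent snippets, in both directions."""
    forward = [(i, i + 1) for i in range(num_nodes - 1)]
    backward = [(i + 1, i) for i in range(num_nodes - 1)]
    return forward + backward


def semantic_edges(features, k):
    """Connect each node to its k nearest neighbours in feature space.

    Assumption: nearest neighbours under squared Euclidean distance;
    the caption only says these edges depend on node features.
    """
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(features)) for j in neighbours[i]]


# Toy example: four snippets whose features form two tight clusters.
toy_features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
toy_temporal = temporal_edges(len(toy_features))
toy_semantic = semantic_edges(toy_features, k=1)
```

In this toy case the semantic edges stay within each cluster, whereas the temporal edges also connect snippets 1 and 2 across the cluster boundary.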
Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design an SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks.
GCNeXt block. The input feature is processed by temporal and semantic graphs with the same cardinality. Black and purple boxes represent Edge Convolutions and 1D Convolutions, respectively. We display (input channel, output channel) in each box. Both convolution streams follow a split-transform-merge strategy with 32 paths each, which increases the diversity of transformations. The module output is the sum of both streams and the input.
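The data flow of such a block can be sketched as follows. This is a simplified, single-path illustration, not the GCNeXt implementation: it omits the 32-path split-transform-merge structure and the 1D convolutions, and it assumes an edge convolution that applies a linear map and ReLU to the pair (node feature, neighbour feature minus node feature) and then max-pools over neighbours. The names `edge_conv` and `two_stream_residual_block` and the weight shapes are hypothetical.

```python
import numpy as np


def edge_conv(x, neighbours, weight):
    """Assumed edge convolution.

    x: (N, C) node features; neighbours: (N, K) neighbour indices;
    weight: (2C, C_out). Returns (N, C_out).
    """
    k = neighbours.shape[1]
    centre = np.repeat(x[:, None, :], k, axis=1)        # (N, K, C)
    neigh = x[neighbours]                               # (N, K, C)
    edge_feat = np.concatenate([centre, neigh - centre], axis=-1)  # (N, K, 2C)
    activated = np.maximum(edge_feat @ weight, 0.0)     # linear map + ReLU
    return activated.max(axis=1)                        # max over neighbours


def two_stream_residual_block(x, temporal_nbrs, semantic_nbrs, w_t, w_s):
    """Output = input + temporal stream + semantic stream."""
    return x + edge_conv(x, temporal_nbrs, w_t) + edge_conv(x, semantic_nbrs, w_s)


# Toy example: 4 nodes, 3 channels, 2 neighbours per node in each graph.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
temporal_nbrs = np.array([[1, 1], [0, 2], [1, 3], [2, 2]])
semantic_nbrs = np.array([[2, 3], [3, 0], [0, 1], [1, 0]])
w_t = rng.normal(size=(6, 3))
w_s = rng.normal(size=(6, 3))
out = two_stream_residual_block(x, temporal_nbrs, semantic_nbrs, w_t, w_s)
```

Because the output is the sum of the input and both streams, the output must have the same number of channels as the input, which is why both weight matrices here map 2C channels back to C.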
SGAlign layer. SGAlign extracts sub-graph features using a set of anchors. In the graphs above, the colored dots represent node features, grey arrows are semantic edges, and the orange highlighted arc is the anchor. Both circles represent the same graph. SGAlign arranges node features along the temporal and semantic graphs, and concatenates both features as output. When using the temporal graph, the order of nodes is preserved in the final representation (black lines). This is not always true for the semantic graph, since node features are represented by their feature neighbors (purple lines).
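A rough sketch of this alignment for a single anchor is given below. It assumes that the temporal branch linearly interpolates node features at evenly spaced positions inside the anchor, and that the semantic branch first replaces each node feature by the mean of its semantic neighbours' features before applying the same interpolation. The sampling scheme, the mean aggregation, and the names `align_anchor` and `sgalign_sketch` are assumptions made for illustration, not details given in the caption.

```python
import numpy as np


def align_anchor(features, start, end, num_samples):
    """Sample num_samples evenly spaced, linearly interpolated features.

    features: (N, C); start and end are positions in [0, N - 1].
    """
    positions = np.linspace(start, end, num_samples)
    lower = np.floor(positions).astype(int)
    upper = np.minimum(lower + 1, len(features) - 1)
    frac = (positions - lower)[:, None]
    return (1.0 - frac) * features[lower] + frac * features[upper]


def sgalign_sketch(features, semantic_nbrs, start, end, num_samples):
    """Concatenate temporally aligned and semantically aligned features."""
    temporal = align_anchor(features, start, end, num_samples)
    # Assumed semantic branch: each node is represented by its neighbours.
    neighbour_mean = features[semantic_nbrs].mean(axis=1)
    semantic = align_anchor(neighbour_mean, start, end, num_samples)
    return np.concatenate([temporal.ravel(), semantic.ravel()])


# Toy example: 5 nodes with 1 channel; each node's only "neighbour" is itself.
toy_nodes = np.arange(5, dtype=float)[:, None]
self_nbrs = np.arange(5)[:, None]
sampled = align_anchor(toy_nodes, 1, 3, 5)
embedded = sgalign_sketch(toy_nodes, self_nbrs, 1, 3, 5)
```

Every anchor yields a vector of the same length regardless of its duration, which is what allows sub-graphs of different sizes to be compared in a common Euclidean space.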
Action detection results on the validation set of ActivityNet-1.3, measured by mAP (%) at different tIoU thresholds and the average mAP. G-TAD achieves a higher average mAP than the other methods, including the recent BMN and P-GCN shown in the second-to-last block.
Action detection results on the testing set of THUMOS14, measured by mAP (%) at different tIoU thresholds.
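For reference, the temporal intersection over union (tIoU) behind these thresholds can be computed as in the sketch below. Segments are assumed to be (start, end) pairs in seconds, and the function names `temporal_iou` and `matches_at_threshold` are illustrative. The full mAP computation (ranking predictions by confidence and averaging precision over classes) is not shown.

```python
def temporal_iou(pred, gt):
    """tIoU between two temporal segments given as (start, end)."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_threshold(pred, gt, threshold):
    """A prediction can count as correct only if its tIoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


# Toy example: a prediction overlapping half of a ground-truth action.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))  # 5 s shared, 15 s spanned
```

Here the prediction would be accepted at a tIoU threshold of 0.3 but rejected at 0.5, which is why mAP decreases as the threshold increases.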
Ablating GCNeXt and SGAlign components. We disable different functionalities of GCNeXt and SGAlign in G-TAD for action detection on ActivityNet-1.3.
Semantic graphs and context. Given two untrimmed videos (left and right), we combine action frames from one video with context frames from the second (middle). This yields a synthetic video in which the action has no related context. As expected, the semantic graph of the synthetic video contains no edges between action and background snippets.
Action-Background Semantic Edge Ratio vs. Context Amount. In the scatter plot, each purple dot corresponds to a different video graph. A strong positive correlation is observed between context amount and action-background semantic edge ratio, which means that, on average, a larger share of the predicted semantic edges connects action and background snippets when a video contains more context.
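One plausible way to compute such a ratio is sketched below. It assumes the ratio is the number of semantic edges joining an action snippet to a background snippet, divided by the total number of semantic edges. The caption does not spell out this definition, so both it and the function name `action_background_edge_ratio` are assumptions.

```python
def action_background_edge_ratio(edges, is_action):
    """Fraction of edges whose endpoints carry different labels.

    edges: list of (i, j) node index pairs;
    is_action: list of booleans, True for action snippets.
    """
    if not edges:
        return 0.0
    cross = sum(1 for i, j in edges if is_action[i] != is_action[j])
    return cross / len(edges)


# Toy example: snippets 0-1 are action, snippets 2-3 are background.
toy_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
toy_labels = [True, True, False, False]
toy_ratio = action_background_edge_ratio(toy_edges, toy_labels)
```

In the toy graph two of the four edges cross the action-background boundary, giving a ratio of 0.5.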
Semantic graph evolution during G-TAD training. We visualize the semantic graphs at the first, middle, and last layers at training epochs 0, 3, 6, and 9. The semantic edges at the first layer are always the same, while the semantic graphs at the middle and last layers evolve to incorporate more context.