Recent endeavors in video temporal grounding enforce strong cross-modal interactions through attention mechanisms to overcome the modality gap between the video and the text query. However, previous works ...