Action understanding is an important task in computer vision and artificial intelligence, with significant potential for real-world applications such as smart-home monitoring, video summarization, and skill assessment. In recent years, deep learning methods have made great progress in this field, but many challenges remain. In daily life, human actions are continuous and can be very dense: every minute is filled with potential actions to be detected and labelled. Modelling the relations between action instances is therefore very important for action detection in densely annotated videos. In this presentation, I will talk about some recent works on action detection in real-world videos, including complex action relation modelling and cross-modal action representation modelling.