IEEE 4th International Conference on Multimedia Information Processing and Retrieval (IEEE MIPR 2021)

September 8-10, 2021, online (originally scheduled for March 22-24, 2021 in Tokyo, Japan).


Tutorial: Human-centric Media Understanding: Processing, Generation, Re-identification, and Prediction


    • 09:00 - 09:10 Opening - Zheng Wang
    • 09:10 - 09:40 Image restoration and generation for human analysis - Jing Xiao
    • 09:40 - 10:10 Total Generate: Guided Generative Adversarial Networks for Generating Human Faces, Hands, Bodies and Natural Scenes - Hao Tang
    • 10:10 - 10:30 Coffee Break
    • 10:30 - 11:00 New trends of person re-ID system - Zheng Wang
    • 11:00 - 11:30 Delving into spatio-temporal dependencies: agent behavior forecasting in a video sequence - Yuke Li

  • Image restoration and generation for human analysis
    Jing Xiao
    Wuhan University, China

    Dr. Jing Xiao is an Associate Professor at Wuhan University, China. She received the B.S. and M.S. degrees from Wuhan University in 2006 and 2008, respectively, and the Ph.D. degree from the Institute of Geo-Information Science and Earth Observation, University of Twente, The Netherlands, in 2013. She was a Project Researcher with the National Institute of Informatics, Japan, from 2019 to 2020. Her research interests include image/video processing, restoration, and generation.
    Practical applications of human-centric understanding usually depend on the quality of the media data. Low resolution, occlusion, or defects can hinder the performance of understanding tasks. Image restoration and generation techniques help recover the missing human information for media understanding tasks. There are two main challenges: 1) how to generate natural-looking images and 2) how to preserve the identity information of humans after image restoration. I will introduce some image restoration techniques, including human face hallucination and image inpainting, that address these challenges.

  • Total Generate: Guided Generative Adversarial Networks for Generating Human Faces, Hands, Bodies and Natural Scenes
    Hao Tang
    ETH Zurich, Switzerland

    Dr. Hao Tang is currently a postdoctoral researcher with the Computer Vision Lab, ETH Zurich, Switzerland. He received the master's degree from the School of Electronics and Computer Engineering, Peking University, China, and the Ph.D. degree from the Multimedia and Human Understanding Group, University of Trento, Italy. He was a visiting scholar in the Department of Engineering Science at the University of Oxford. His current research interests include computer vision and pattern recognition, with emphasis on image (including faces, hands, bodies, and natural scenes) generation, audio-to-video translation, 3D object generation, text-guided image editing, and image/video super-resolution. He publishes in prestigious computer vision and multimedia conferences and journals such as CVPR, ECCV, ICCV, BMVC, ACM MM, TNNLS, TMM, and TIP. For more information about Hao, please visit
    In an image-to-image translation problem, we aim to translate an image from one domain to another. Many problems in computer vision, graphics, and image processing can be formulated as image-to-image translation tasks, including semantic image synthesis, style transfer, colorization, and sketch-to-photo translation. An extension of these problems adds guidance that makes the translation controllable. The guidance typically reflects the desired visual effects or constraints specified by a user. This task has many application scenarios, such as human-computer interaction, entertainment, virtual reality, and data augmentation. However, the task is challenging because it requires a high-level semantic understanding of the image mapping between the input domain and the output domain. In this talk, I will present how to enable machines to generate images of human faces, hands, bodies, and natural scenes conditioned on different forms of guidance, e.g., facial landmarks, segmentation maps, or human skeletons. I will also talk about my future research plans on combining multiple modalities such as text, audio, images, video, and 3D objects to build "systems for AI generation".

  • New trends of person re-ID system
    Zheng Wang
    The University of Tokyo, Japan

    Dr. Zheng Wang is a Project Assistant Professor at The University of Tokyo, Japan. He was a JSPS Fellowship Researcher at the National Institute of Informatics, Japan. He received the B.E., M.S., and Ph.D. degrees from Wuhan University. His current research interests include multimedia analysis and retrieval. He has served as a PC member for top-tier conferences, including ACM MM, ACM MM Asia, ICCV, IJCAI, AAAI, and others. He won the Best Paper Award at the Pacific-Rim Conference on Multimedia 2014, the ACM Wuhan Doctoral Dissertation Award 2018, and the ICME Best Reviewer award 2019. He organized the ACM MM 2020 Tutorial "Effective and Efficient: Toward Open-world Instance Re-identification" and the CVPR 2020 Tutorial "Image Retrieval in the Wild".
    He will give a brief review of general person re-ID, where the person's appearance variation, short-term environment change, and intra-modality discrepancy are the main challenges. He will then introduce new trends in person re-ID systems that are more practical in open-world conditions, including group, long-term, and cross-modality re-ID. Representative approaches, comparisons, and discussions will be given.

  • Delving into spatio-temporal dependencies: agent behavior forecasting in a video sequence
    Yuke Li
    University of California, Berkeley, U.S.

    Dr. Yuke Li is a postdoctoral researcher at UC Berkeley. He focuses mainly on computer vision and artificial intelligence, especially machine learning for autonomous driving. Prior to joining Berkeley, Dr. Li did research on crowd behavior analysis, video understanding, and their application to real-world requirements such as self-driving vehicles. He obtained his Ph.D. from Wuhan University, China. He has also worked as a researcher in France, Italy, Canada, and China.
    Human agent behavior forecasting has emerged as an intriguing problem because of the rising demands of intelligent systems such as autonomous vehicles and smart cities. The key challenges include: 1) modeling spatio-temporal features; and 2) modeling how these features propagate through time. In this talk, I will briefly summarize these points. In particular, I will introduce my work on path forecasting and action forecasting in detail.

Tutorial: A Master's Toolbox and Algorithms for Low-Latency Live Streaming

Ali C. Begen
Ozyegin University, Turkey

Ali C. Begen is currently a computer science professor at Ozyegin University and a technical consultant in the Advanced Technology and Standards group at Comcast. Previously, he was a research and development engineer at Cisco. Begen received his PhD in electrical and computer engineering from Georgia Tech in 2006. To date, he has received a number of academic and industry awards and has been granted 30+ US patents. In 2020, he was listed among the world's most influential scientists in the subfield of networking and telecommunications. More details are at
Today, a glass-to-glass latency of 10-30 seconds is practically achievable in live streaming and such a range is acceptable in most cases. Yet, the increasing number of cord-cutters is putting pressure on streaming providers to offer low-latency (2-10 seconds) streaming, especially for sports content.
In the last few years, the streaming industry has produced a number of solutions: (i) DASH Low-Latency (DASH-LL) was introduced by DASH-IF and DVB in 2019; (ii) Low-latency HTTP Live Streaming (LHLS) was first introduced by Twitter's Periscope application in 2018, then improved by Twitch in 2019 and eventually abandoned; (iii) Low-Latency HTTP Live Streaming (LL-HLS) was announced by Apple in June 2019 and became available in Sept. 2020; (iv) High Efficiency Streaming Protocol (HESP) was announced and promoted by the HESP Alliance in mid-2020. At a high level, these solutions share common features, but there are also obvious as well as subtle differences in their requirements and implementations.
Independent of the technology, low-latency live streaming brings up new challenges. This tutorial covers many of these aspects, shows examples and surveys various efforts that are underway in the streaming industry.