Dense 3D+t motion trajectories as kinematic primitives for the analysis of depth video sequences
DOI: https://doi.org/10.33571/rpolitec.v15n29a7
Keywords: RGB-D, scene flows, dense motion trajectories, tracking, kinematic features
Abstract
RGB-D sensors have enabled novel approaches to many classical problems in computer vision, such as segmentation, scene representation, and human-computer interaction, among others. Regarding motion characterization, typical RGB-D strategies are limited to analyzing global shape changes and capturing scene-flow fields between consecutive frames. Such strategies therefore recover motion information only between pairs of frames, limiting the analysis of coherent large displacements over time. This work presents a novel strategy to compute dense, long-term 3D+t motion trajectories as fundamental kinematic primitives for representing video sequences. Each motion trajectory models local kinematic word primitives that together describe complex gestures performed along a video. These kinematic words were processed within a bag-of-kinematic-words framework to obtain an occurrence-based video descriptor, which achieved an average accuracy of 80% on a dataset of 5 gestures and 100 videos.
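As a concrete illustration of the pipeline the abstract describes, the sketch below shows one plausible way to turn 3D+t trajectories into an occurrence descriptor: extract simple kinematic features (speed, acceleration, mean direction) per trajectory, cluster them into a codebook of kinematic words, and histogram codeword occurrences per video. This is a minimal sketch under stated assumptions, not the authors' implementation; all function names are illustrative, and each trajectory is assumed to be a (T, 3) NumPy array of 3D points tracked over T frames, e.g. obtained by propagating points along per-frame scene-flow fields.

```python
# Hypothetical sketch of a bag-of-kinematic-words descriptor.
# Assumes each trajectory is a (T, 3) array of tracked 3D points, T >= 3.
import numpy as np
from sklearn.cluster import KMeans

def kinematic_features(trajectory):
    """One kinematic word per 3D+t trajectory: speed/acceleration
    statistics plus the trajectory's mean unit direction."""
    velocity = np.diff(trajectory, axis=0)        # (T-1, 3) frame-to-frame displacements
    acceleration = np.diff(velocity, axis=0)      # (T-2, 3) changes in velocity
    speed = np.linalg.norm(velocity, axis=1)
    accel = np.linalg.norm(acceleration, axis=1)
    mean_vel = velocity.mean(axis=0)
    direction = mean_vel / (np.linalg.norm(mean_vel) + 1e-8)
    return np.concatenate([[speed.mean(), speed.std(),
                            accel.mean(), accel.std()], direction])

def build_codebook(all_features, k=64):
    """Cluster kinematic words from training videos into k codewords."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_features)

def describe_video(trajectories, codebook):
    """Occurrence descriptor: normalized histogram counting how often
    each kinematic codeword appears among the video's trajectories."""
    feats = np.stack([kinematic_features(t) for t in trajectories])
    labels = codebook.predict(feats)
    hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()   # length-invariant across videos
```

In a setting like the one reported, such normalized occurrence histograms would typically be fed to a standard classifier (e.g., an SVM) to recognize the gesture performed in each video.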