LIU Yi, JIN Xiaofeng*. A mapping model of facial features and speech features based on Bi-LSTM[J]. Journal of Yanbian University, 2020, 46(3): 215-220.
- Title:
- A mapping model of facial features and speech features based on Bi-LSTM
- Article ID:
- 1004-4353(2020)03-0215-06
- Keywords:
- facial animation; MFCC; Bi-LSTM; fine-tuning
- CLC number:
- TP391
- Document code:
- A
- Abstract:
- To address the problem of mapping between facial features and speech features in facial animation, a mapping-model learning method based on the bidirectional long short-term memory network (Bi-LSTM) is proposed. First, the MFCC parameters of the speech signal and the facial landmark parameters of the video frame sequence are extracted synchronously from training videos. Second, the mapping model is trained by feeding the MFCC parameters to the Bi-LSTM network as input and taking the corresponding facial landmark parameters as the expected output; a parameter-tuning procedure is introduced to experimentally optimize the number of epochs, the number of hidden units, the batch size, and the optimizer type, yielding the optimal mapping model. Experiments with the optimal mapping model show that the bidirectional Bi-LSTM network clearly outperforms a unidirectional LSTM network, and that the mapping accuracy reaches 0.895 after parameter tuning. The proposed method can therefore provide effective predicted facial landmark parameters for subsequent speech-driven face video synthesis applications.
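The abstract's first step is the synchronized extraction of MFCC and facial landmark parameters from a training clip. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: librosa, OpenCV, and dlib's 68-point predictor are assumed stand-ins for the authors' tools, and the frame rate, sample rate, and file paths are illustrative.

```python
# Sketch of step 1: per-video-frame MFCC vectors and 68-point facial
# landmarks, aligned by choosing the MFCC hop so that one MFCC frame
# corresponds to one video frame. Tools and constants are assumptions.
import cv2
import dlib
import librosa
import numpy as np

FPS = 25    # assumed video frame rate
SR = 16000  # assumed audio sample rate

def extract_mfcc(wav_path, n_mfcc=13):
    """One MFCC vector per video frame: hop = audio samples per frame."""
    y, sr = librosa.load(wav_path, sr=SR)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=SR // FPS)
    return mfcc.T  # shape: (frames, n_mfcc)

def extract_landmarks(video_path, predictor_path):
    """68 (x, y) landmark points per frame, flattened to 136-dim vectors."""
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            shape = predictor(gray, faces[0])
            pts = [(p.x, p.y) for p in shape.parts()]
            frames.append(np.array(pts, dtype=np.float32).ravel())
    cap.release()
    return np.stack(frames)  # shape: (frames, 136)
```

The second step, training the Bi-LSTM mapping network and experimentally tuning epochs, hidden units, batch size, and optimizer, might look like the following sketch. Keras is an assumed framework choice; the MSE loss, the search grid, and validation-loss model selection are illustrative, not the paper's reported settings.

```python
# Sketch of step 2: a Bi-LSTM mapping per-frame MFCC sequences to
# facial-landmark sequences, with a small grid search over the four
# hyperparameters the abstract tunes. All concrete values are assumed.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_mfcc=13, n_landmarks=136, hidden_units=128,
                optimizer="adam"):
    model = keras.Sequential([
        layers.Input(shape=(None, n_mfcc)),  # variable-length MFCC sequence
        layers.Bidirectional(layers.LSTM(hidden_units,
                                         return_sequences=True)),
        layers.TimeDistributed(layers.Dense(n_landmarks)),
    ])
    model.compile(optimizer=optimizer, loss="mse")
    return model

def tune(x_train, y_train, x_val, y_val):
    """Grid search over hidden units, batch size, and optimizer; the best
    epoch count is picked via the minimum validation loss over training."""
    best, best_loss = None, np.inf
    for units in (64, 128, 256):
        for batch in (16, 32):
            for opt in ("adam", "rmsprop"):
                model = build_model(hidden_units=units, optimizer=opt)
                hist = model.fit(x_train, y_train, epochs=50,
                                 batch_size=batch,
                                 validation_data=(x_val, y_val),
                                 verbose=0)
                loss = min(hist.history["val_loss"])
                if loss < best_loss:
                    best, best_loss = model, loss
    return best, best_loss
```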
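In this design, the bidirectional wrapper lets each output frame condition on both past and future speech context, which is the abstract's stated reason Bi-LSTM outperforms the unidirectional LSTM; swapping `Bidirectional(layers.LSTM(...))` for a plain `layers.LSTM(...)` in the sketch reproduces the unidirectional baseline being compared against.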
Memo
Received: 2020-03-21 *Corresponding author: JIN Xiaofeng (b. 1970), male, professor; research interests: machine perception, image and audio processing.
Funding: "13th Five-Year Plan" Science and Technology Project of the Jilin Provincial Department of Education (JJKH20191126KJ); World-Class Discipline Construction Cultivation Project of Yanbian University (18YLPY14)