Merlin basics
Glossary
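- Front end: the text-processing stage of a TTS system
- Vocoder: voice coder; analyzes speech into parameters and synthesizes a waveform from them
- MFCC: Mel-Frequency Cepstral Coefficients
- RBM: restricted Boltzmann machine
- bap: band aperiodicity
- ASR: Automatic Speech Recognition
- AM: acoustic model
- LM: language model
- HMM: Hidden Markov Model. The output (observation) sequence is the sequence of feature vectors describing the speech; the state sequence represents the corresponding text.
- HTS: HMM-based Speech Synthesis System, a speech synthesis toolkit
- HTK: Hidden Markov Model Toolkit, a toolkit for speech recognition
- Autoencoder
- SPTK: Speech Signal Processing Toolkit
- SPSS: statistical parametric speech synthesis
- Pitch: how high or low the (fundamental) frequency of a sound is
- Timbre: tone quality
- Zero crossing rate: the rate at which the waveform changes sign
- Volume: loudness
- sil: silence
- Syllable
- Intonation: tone, the rise and fall of pitch in speech
- POS: part of speech
- mgc: mel-generalized cepstrum (Mel-Generalized Cepstral representation)
- mcep: mel-cepstrum
- mcc: mel-cepstral coefficients
- LSP: Line Spectral Pair parameters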
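- Naming by number of phones:
- monophone: single phone
- biphone / diphone: two phones
- triphone: three phones
- quadphone: four phones
- utterance: a spoken stretch of speech
- ToBI (Tones and Break Indices): a prosody annotation system for English
- CD-DNN-HMM: Context-Dependent DNN-HMM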
- frontend: the part of a TTS system that transforms plain text into a linguistic representation
- .wpa: word to phonetic alphabet
- .cmp: composed acoustic features
- .scp: script file (in HTK, a list of file paths passed to a tool)
- .mlf: master label file
- .pam: phonetic alphabet to model
- .mgc: mel-generalized cepstral features
- .lf0: log F0, a representation of pitch (pitch is represented by the fundamental frequency)
- .utt: the linguistic representation of the text that Festival outputs (full-context training labels)
- .cfg: configuration file
- initial and final: the two parts of a Mandarin syllable (shengmu and yunmu)
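Two of the glossary entries, log F0 and zero crossing rate, are simple enough to sketch in a few lines. The following is a minimal illustration, not code taken from Merlin: the function names are hypothetical, and the `-1.0e10` value used to mark unvoiced frames is an assumption based on the convention commonly seen in HTS/SPTK-style `.lf0` files.

```python
import math

# Assumed sentinel for unvoiced frames (no F0); hypothetical constant name.
UNVOICED_LF0 = -1.0e10


def f0_to_lf0(f0_hz):
    """Convert per-frame F0 values in Hz to log F0.

    Frames with F0 <= 0 are treated as unvoiced and mapped to the sentinel.
    """
    return [math.log(f) if f > 0 else UNVOICED_LF0 for f in f0_hz]


def zero_crossing_rate(samples):
    """Return the fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return float(crossings) / (len(samples) - 1)
```

For example, `zero_crossing_rate([1, -1, 1, -1])` counts a sign change between every pair of neighbours, while a constant-sign signal has a rate of zero.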
AM: Acoustic Model
ACR: Absolute Category Rating
ASR: Automatic Speech Recognition
CART: Classification and Regression Tree
CCR: Comparison Category Rating
CFHMM: Continuous F0 HMM
CMLLR: Constrained Maximum Likelihood Linear Regression
CMOS: Comparison Mean Opinion Score
CORC: Correlation Coefficient
CR: Command-Response
CSMAPLR: Constrained Structural Maximum A Posteriori Linear Regression
DBN: Dynamic Bayesian Network
DCR: Degradation Category Rating
DCT: Discrete Cosine Transform
DMOS: Degradation Mean Opinion Score
ED: Emotion Dependent
EM: Expectation Maximization
F0: Fundamental Frequency
GMM: Gaussian Mixture Model
GTD: Global Tied Distribution
HMM: Hidden Markov Model
HNR: Harmonics-to-Noise Ratio
HSS: HMM-based Speech Synthesis
HSMM: Hidden Semi-Markov Model
HTK: Hidden Markov Model Toolkit
HTS: HMM-based Speech Synthesis System
LPC: Linear Prediction Coefficient
MAP: Maximum A Posteriori
MCD: Mel-Cepstral Distortion
MDL: Minimum Description Length
MDS: Multi-Dimensional Scaling
MGCC: Mel-Generalized Cepstral Coefficient
MLI: Maximum Likelihood Increase
MLSA: Mel Log Spectrum Approximation
MLLR: Maximum Likelihood Linear Regression
MLPG: Maximum Likelihood Parameter Generation
MOS: Mean Opinion Score
MSD: Multi-Space Distribution
PiTAR: Pitch Target Realisation
PM: Prosodic Model
RMSE: Root Mean Square Error
SA: Speaker Adaptation
SI: Speaker Independent
SMAP: Structural Maximum A Posteriori
SMAPLR: Structural Maximum A Posteriori Linear Regression
SPTK: Speech Signal Processing Toolkit
SSM: Supra-Segmental Model
SSML: Speech Synthesis Markup Language
TA: Target Approximation
ToBI: Tones and Break Indices
TTS: Text-To-Speech
VC: Voice Conversion
VFS: Vector Field Smoothing
VPR: Voice Print Recognition
VTLN: Vocal Tract Length Normalization
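Two of the objective measures in this list, RMSE and MCD, can be written down compactly. The sketch below assumes the commonly used MCD definition, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames and skipping coefficient 0 (energy). The function names are hypothetical, and the frames are assumed to be already time-aligned.

```python
import math


def rmse(ref, pred):
    """Root mean square error between two equal-length sequences."""
    if len(ref) != len(pred) or not ref:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)) / float(len(ref)))


def mel_cepstral_distortion(ref_frames, pred_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a list of cepstral coefficients; index 0 is skipped
    (assumed here to be the energy term).
    """
    if len(ref_frames) != len(pred_frames) or not ref_frames:
        raise ValueError("frame lists must be non-empty and of equal length")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, pred in zip(ref_frames, pred_frames):
        squared = sum((r - p) ** 2 for r, p in zip(ref[1:], pred[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)
```

RMSE is typically reported for F0 or duration, MCD for the spectral stream; both are zero when the prediction matches the reference exactly.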
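MTTS Merlin/Mandarin Text-to-Speech Document
- Merlin provides only the acoustic-modeling part of a TTS system: normalization of acoustic and linguistic features, and training of and generation from a neural network acoustic model.
- Front-end text processing (frontend)
  - Festival
  - Festvox
  - HTS
  - HTK
- Vocoder
  - WORLD
  - SPTK
  - MagPhase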
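The feature normalization mentioned above can be illustrated with per-dimension mean-variance normalization. This is a minimal sketch of the general technique, not Merlin's own implementation; the function names and the `eps` floor for constant dimensions are assumptions made for the example.

```python
import math


def compute_mean_std(frames):
    """Return per-dimension mean and standard deviation over all frames."""
    n = float(len(frames))
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means, stds


def normalise(frames, means, stds, eps=1e-8):
    """Shift each dimension to zero mean and scale it to unit variance.

    `eps` keeps constant dimensions (std == 0) from dividing by zero.
    """
    return [
        [(x - m) / max(s, eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


def denormalise(frames, means, stds, eps=1e-8):
    """Invert `normalise`, e.g. on generated features before vocoding."""
    return [
        [x * max(s, eps) + m for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]
```

The statistics are computed once on the training set and reused at synthesis time, so that the network's normalized outputs can be mapped back to the vocoder's parameter range.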
Merlin installation tutorial
# Install Merlin under Python 2.7
# A dedicated conda environment for Merlin is recommended
sudo apt-get install csh cmake realpath autotools-dev automake
pip install numpy scipy matplotlib lxml theano bandmat
git clone https://github.com/CSTR-Edinburgh/merlin.git
cd merlin/tools
./compile_tools.sh
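After the pip step, a quick import check can confirm that the Python dependencies are visible in the active environment. The sketch below is a hypothetical helper, not part of Merlin; the module list simply mirrors the pip command above.

```python
import importlib

# Module names taken from the pip install line above.
REQUIRED_MODULES = ["numpy", "scipy", "matplotlib", "lxml", "theano", "bandmat"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: %s" % ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter that will run Merlin, so that the check reflects the right conda environment.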
Running the Merlin demo
$ sudo bash ~/merlin/egs/slt_arctic/s1/run_demo.sh
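Understanding the Merlin source code
The corpus contains text and audio files. The text must first be processed by the front end into data the neural network can accept. This step is fairly tedious and differs from language to language, so it is covered in detail below. The audio files are converted by a vocoder (STRAIGHT is used here) into vocoder parameters, including mel-cepstral coefficients (mgc), fundamental frequency (f0), and band aperiodicity (bap), which then take part in neural network training.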
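To make the vocoder-parameter side concrete, the sketch below reads per-utterance feature streams and concatenates them frame by frame. It assumes the streams are stored as headerless little-endian float32 values, which is an assumption about the file layout rather than something stated in these notes. The function names, file names, and tiny dimensions are hypothetical, and the dynamic features and voicing flag that a real composed feature file may include are left out. The code is written for Python 3.

```python
import os
import struct
import tempfile


def write_float32_frames(path, frames):
    """Write frames (lists of floats) as headerless little-endian float32."""
    with open(path, "wb") as fh:
        for frame in frames:
            fh.write(struct.pack("<%df" % len(frame), *frame))


def read_float32_frames(path, dim):
    """Read a headerless float32 file and split it into frames of `dim` values."""
    with open(path, "rb") as fh:
        data = fh.read()
    if len(data) % (4 * dim) != 0:
        raise ValueError("file size is not a multiple of the frame size")
    count = len(data) // 4
    values = struct.unpack("<%df" % count, data)
    return [list(values[i:i + dim]) for i in range(0, count, dim)]


def compose_frames(*streams):
    """Concatenate several per-frame streams; the shortest stream sets the length."""
    n_frames = min(len(s) for s in streams)
    return [[v for s in streams for v in s[t]] for t in range(n_frames)]


def demo_round_trip():
    """Write two tiny made-up streams, read them back, and compose them."""
    with tempfile.TemporaryDirectory() as tmp:
        mgc_path = os.path.join(tmp, "utt1.mgc")
        lf0_path = os.path.join(tmp, "utt1.lf0")
        write_float32_frames(mgc_path, [[1.0, 2.0], [3.0, 4.0]])
        write_float32_frames(lf0_path, [[5.0], [6.0]])
        mgc = read_float32_frames(mgc_path, dim=2)
        lf0 = read_float32_frames(lf0_path, dim=1)
    return compose_frames(mgc, lf0)
```

In a real setup the per-stream dimensions come from the vocoder configuration; passing the wrong `dim` raises an error here only when the file size does not divide evenly, so the value should be checked against the configuration.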
- Paper: Merlin: An Open Source Neural Network Speech Synthesis System
Basics of the Merlin algorithm
GitHub repository: https://github.com/CSTR-Edinburgh/merlin
front-end text processor: Festival
- URL: http://www.cstr.ed.ac.uk/projects/festival/
vocoder: STRAIGHT or WORLD
install
$ bash tools/compile_tools.sh #SPTK, WORLD
$ bash tools/compile_other_speech_tools.sh #speech tools, festival and festvox
$ bash tools/compile_htk.sh <htk_username> <htk_password> #align labels, Merlin requires installation of HTK; pass your own HTK account credentials
$ pip install numpy
$ pip install -r requirements.txt
Example: Getting started with the Merlin Speech Synthesis Toolkit
- URL: https://jrmeyer.github.io/tts/2017/02/14/Installing-Merlin.html
Tutorial
- URL: http://www.speech.zone/courses/one-off/merlin-interspeech2017
- CMU_ARCTIC databases: https://cstr-edinburgh.github.io/merlin/getting-started/slt-arctic-voice/