Text-Based Speech Video Synthesis From A Single Face Image