Vision And Language Understanding Through Generative Modeling