Addressing Misalignment in Language Model Deployments Through Context-Specific Evaluations