The Alexa Content action speaks a reply to an Alexa Trigger (i.e. an Alexa Intent). Images are currently displayed only on devices with large screens (i.e. not the Echo Spot).
You can attach an MP3 file, which will be played instead of the text-to-speech reply, but you must encode it according to Amazon's requirements: https://developer.amazon.com/docs/custom-skills/speech-synthesis-markup-language-ssml-reference.html#h3_converting_mp3
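If you do not have an audio editor to hand, the conversion can also be done from the command line with ffmpeg. The sketch below is an illustration only: it assumes ffmpeg is installed with the libmp3lame encoder, and it assumes the settings Amazon's page listed at the time of writing (48 kbps bit rate and a sample rate of 16000, 22050 or 24000 Hz), so check the link above before relying on them. The function name and file names are made up for the example.

```python
import shlex

# Assumption: sample rates accepted by Alexa for SSML audio, as listed in
# Amazon's documentation at the time of writing. Verify against the link above.
ALLOWED_SAMPLE_RATES_HZ = (16000, 22050, 24000)


def build_ffmpeg_command(input_path, output_path, sample_rate_hz=16000):
    """Return an ffmpeg argument list that re-encodes audio as a 48 kbps MP3.

    Hypothetical helper for illustration; it only builds the command and
    does not run ffmpeg itself.
    """
    if sample_rate_hz not in ALLOWED_SAMPLE_RATES_HZ:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz}")
    return [
        "ffmpeg",
        "-i", input_path,          # source audio file
        "-ac", "2",                # two audio channels
        "-codec:a", "libmp3lame",  # encode with the LAME MP3 encoder
        "-b:a", "48k",             # 48 kbps bit rate (assumed requirement)
        "-ar", str(sample_rate_hz),
        output_path,
    ]


if __name__ == "__main__":
    # Example file names are placeholders.
    command = build_ffmpeg_command("reply.wav", "reply_alexa.mp3")
    print(shlex.join(command))
    # To actually convert the file, pass the list to subprocess.run:
    # import subprocess; subprocess.run(command, check=True)
```

Running the script prints the command so you can inspect it or paste it into a terminal before converting anything.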
The image below shows the export settings to use in Adobe Audition.