Semantic congruency is a phenomenon by which a stimulus is perceived, semantically speaking, in a way that perception is close related to contents provided by other sources. This way, when an observer is looking at an image, its perception can be influenced by the semantic load of auditory stimuli, if their content is related to the meaning of the image. When it happens, the semantic congruency effect arises. This study was aimed at establishing the modulating effect that tones of voice can exert on the perception of an ambiguous image, wanting to vindicate the semantic congruency effect. Thirty-two participants viewed the bistable image My girlfriend or my mother-in-law, while listening to two incomprehensible modulating tones of voice. An eye-tracker device was used to measure the time of visual percepts that were congruent with the semantic load of the audios. There were significant differences between the durations of the congruent and incongruent visual percepts with the modulating audio. Therefore, it is concluded that tones of voice can bring about an impact on the perception of ambiguous images. An incomprehensible spoken language modulates the understanding of bistable visual designs, where, by means of top-down perceptual modulating processes, the semantic congruency effect emerges.