Audio Samples of Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

Abstract

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces “Listen, Chat, and Edit” (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to edit multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for editing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles it into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse editing tasks like extraction, removal, and volume control. Our experiments demonstrate significant improvements in signal quality across all editing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
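As a rough illustration of the editing tasks named above (extraction, removal, and volume control), a target mixture can be thought of as a weighted sum of the individual sources, with one gain per source. The sketch below is a toy under stated assumptions, not the authors' implementation: the helper name `apply_gains`, the gain values, and the three-sample waveforms are all hypothetical, and it presumes the clean sources are available, which is only true when constructing targets.

```python
from typing import Dict, List


def apply_gains(sources: Dict[str, List[float]], gains: Dict[str, float]) -> List[float]:
    """Sum equal-length sources, scaling each by its gain (default gain is 1.0)."""
    if not sources:
        raise ValueError("at least one source is required")
    lengths = {len(samples) for samples in sources.values()}
    if len(lengths) != 1:
        raise ValueError("all sources must have the same number of samples")
    mixed = [0.0] * lengths.pop()
    for name, samples in sources.items():
        gain = gains.get(name, 1.0)  # unchanged sources keep unit gain
        for i, value in enumerate(samples):
            mixed[i] += gain * value
    return mixed


# Hypothetical toy waveforms standing in for real audio.
sources = {
    "speech": [0.5, -0.5, 0.25],
    "helicopter": [0.25, 0.25, -0.25],
}

# Removal: silence the helicopter, keep everything else.
removal = apply_gains(sources, {"helicopter": 0.0})
# Extraction: keep only the helicopter by silencing every other source.
extraction = apply_gains(sources, {"speech": 0.0})
# Volume control: assumed gains of 2.0 (louder) and 0.5 (quieter).
volume = apply_gains(sources, {"speech": 2.0, "helicopter": 0.5})
```

In this view, extraction and removal are the special cases where some gains are exactly zero, and a prompt that combines several instructions maps to one gain per source.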

This page contains a set of audio samples in support of the paper.

We provide five samples for the in-distribution composition, 2 Speech (TextrolSpeech) + 2 Audio (VGGSound), and two samples for each of the zero-shot sound mixture compositions.

For every sample, we provide four or six text prompts. For each text prompt, we present the edited sound mixture alongside the target sound mixture.
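One common way to quantify how close an edited mixture is to its target is the signal-to-noise ratio (SNR) in decibels, with the target treated as the reference signal. The sketch below is illustrative only: this page does not state which metric the paper reports, and `snr_db` is a hypothetical helper name.

```python
import math
from typing import Sequence


def snr_db(target: Sequence[float], edited: Sequence[float]) -> float:
    """Return 10*log10(target energy / error energy) in dB."""
    if len(target) != len(edited):
        raise ValueError("target and edited must have the same length")
    signal = sum(t * t for t in target)
    noise = sum((t - e) ** 2 for t, e in zip(target, edited))
    if noise == 0.0:
        return math.inf  # edited mixture matches the target exactly
    if signal == 0.0:
        return -math.inf  # silent target but non-silent output
    return 10.0 * math.log10(signal / noise)


# Hypothetical example: the edited mixture is the target scaled by 0.9.
target = [1.0, -1.0, 1.0, -1.0]
edited = [0.9, -0.9, 0.9, -0.9]
score = snr_db(target, edited)  # roughly 20 dB
```

A higher value means the edited mixture is closer to the target; comparing it against the SNR of the unedited input mixture would give an improvement figure.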

Spectrograms for all samples can be found in Figure 2 and in Figures 6 to 13 (Appendix G) of the paper.

We recommend opening this website with Chrome and wearing headphones for the best audio experience.

Sound Mixture Compositions

In-distribution
  • 2 Speech + 2 Audio (VGGSound)

Zero-shot
  • 2 Speech
  • 2 Audio (VGGSound)
  • 2 Speech + 1 Audio (VGGSound)
  • 1 Speech + 2 Audio (VGGSound)
  • 2 Speech + 2 Audio (FSD50K, seen audio labels)
  • 2 Speech + 2 Audio (FSD50K, unseen audio labels)

    2 Speech + 2 Audio (VGGSound)


    Input Mixture #1: female speaker with high pitch, normal tempo, high energy, and neutral emotion; male speaker with low pitch, high tempo, normal energy, and neutral emotion; helicopter; turkey gobbling

    Text prompt A: "Increase the volume of the speeches and decrease the volume of the background sounds."

    Text prompt B: "Let's pull out the sound of the fast-talking man and the turkey."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Remove all people talking."

    Text prompt D: "Why not get rid of the man's voice and the turkey's noise, and reduce the helicopter's volume?"

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "Could you kindly eliminate the sound of the helicopter? I appreciate it."

    Text prompt F: "Please extract the person with an elevated tone."

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    Input Mixture #2: male speaker with low pitch, low tempo, low energy, and sad emotion; female speaker with high pitch, normal tempo, normal energy, and neutral emotion; playing accordion; playing drum kit

    Text prompt A: "Is it possible to single out the accordion's performance?"

    Text prompt B: "Lower the volume of the live accordion music that is currently being played, please."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Could you raise the decibel level of the gloomy speaker that has a subdued tone?"

    Text prompt D: "Please raise the sound for the female speaker with a standard tempo, amplify the playing accordion, reduce the playing drum kit, and decrease the volume for the male speaker with a sluggish pace."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "I'd like you to exclude the speaker with a high-frequency voice and average vitality, conveying a neutral tone."

    Text prompt F: "Make everything quieter."

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    Input Mixture #3: male speaker with low pitch, normal tempo, normal energy, and neutral emotion; male speaker with low pitch, high tempo, normal energy, and neutral emotion; underwater bubbling; train horning

    Text prompt A: "Enhance this recording by removing all the noises."

    Text prompt B: "Could you raise the audio level of the underwater bubbling sound exclusively?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Can you adjust the sound so that both speakers are louder, the train horn is quieter, and the underwater bubbling is completely removed from the recording?"

    Text prompt D: "Is it possible to turn down the speakers' volume and crank up the background ambiance?"

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "I'd like you to edit out the speaker characterized by a faster tempo and the train horn sound altogether."

    Text prompt F: "Can you remove the speaker with a rapid rhythm?"

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    Input Mixture #4: female speaker with normal pitch, normal tempo, normal energy, and neutral emotion; male speaker with low pitch, normal tempo, high energy, and neutral emotion; playing hammond organ; rain

    Text prompt A: "Can you edit the recording to extract the sound of the organ and rainfall?"

    Text prompt B: "Can you modify the sound so that the rain and Hammond organ are quieter, the female speaker with normal pitch and energy is louder, and the male speaker with low pitch and high energy is entirely eliminated from the recording?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "I'd like you to turn down the volume for the lady with the average pitch."

    Text prompt D: "Please remove the the organ music and both the female and male speakers in the audio track."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "Please make the organ music louder."

    Text prompt F: "Let's extract the part featuring the speaker characterized by typical tone?"

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    Input Mixture #5: female speaker with normal pitch, normal tempo, low energy, and neutral emotion; female speaker with normal pitch, low tempo, normal energy, and neutral emotion; church bell ringing; playing theremin

    Text prompt A: "Extract the bell ringing from the rest of the audio."

    Text prompt B: "Could you amplify the audio level of the speaker with normal energy and slow tempo, and also raise the church bell ringing sound, but lower the volume of the speaker with low vitality?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Let's extract all human voices from the recording."

    Text prompt D: "Reduce the background audio and turn up the volume on the talking parts."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "Is it feasible to erase the theremin's playing sound?"

    Text prompt F: "I'd like you to amplify the playing theremin sound and reduce the church bell ringing sound."

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    2 Speech


    Input Mixture #6: female speaker with normal pitch, normal tempo, low energy, and contempt emotion; female speaker with normal pitch, normal tempo, normal energy, and neutral emotion

    Text prompt A: "Can you extract the speaker characterized by their contemptuous manner?"

    Text prompt B: "Is it possible to turn up the volume of the speaker exhibiting typical enthusiasm and reduce the volume of the speaker showing low energy?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Could you isolate the speaker without emotion and speaking in a normal volume?"

    Text prompt D: "Why not turn up the sound of the contemptuous speaker while removing the speaker maintaining a neutral emotional state?"

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Input Mixture #7: female speaker with normal pitch, high tempo, normal energy, and neutral emotion; male speaker with low pitch, low tempo, high energy, and neutral emotion

    Text prompt A: "I'd appreciate it if you could eliminate the speaker who is speaking at a rapid pace."

    Text prompt B: "The female speaker talks is loud. Could you turn down the volume?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Remove the gentleman with a deep tone."

    Text prompt D: "Begin by decreasing the volume of the female speaker with a fast tempo, and then increase the volume of the male speaker with a low pitch."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    2 Audio (VGGSound)


    Input Mixture #8: playing tabla; missile launch

    Text prompt A: "Please turn up the volume of the playing tabla sound and remove the missile launch sound."

    Text prompt B: "Kindly turn down the sound of the rocket being launched."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Extract the tabla music for me."

    Text prompt D: "Could you decrease the volume for both the missile launch and the playing tabla sounds?"

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Input Mixture #9: fireworks banging; vacuum cleaner cleaning floors

    Text prompt A: "Is it possible to remove the noise from the vacuum and increase the volume of the fireworks?"

    Text prompt B: "Please take out the sound of the fireworks banging and enhance the volume of the vacuum cleaner cleaning the floors."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Could you eliminate the noise from the fireworks explosions, please?"

    Text prompt D: "Just extract the firework for me."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    2 Speech + 1 Audio (VGGSound)


    Input Mixture #10: male speaker with normal pitch, low tempo, normal energy, and surprised emotion; female speaker with high pitch, high tempo, low energy, and neutral emotion; playing banjo

    Text prompt A: "Lower the sound level of the banjo playing, remove the woman with a high-paced, low-energy delivery, and increase the volume of the surprised man who speaks slowly with normal enthusiasm."

    Text prompt B: "Try to delete the surprised male speaker, if you can."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Could you decrease the audio level of the female speaker with a fast tempo and low vitality who maintains a neutral emotion?"

    Text prompt D: "Extract the banjo music from the audio."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Input Mixture #11: male speaker with low pitch, high tempo, low energy, and neutral emotion; female speaker with high pitch, normal tempo, normal energy, and neutral emotion; wind chime

    Text prompt A: "Extract the background sound but make it quieter."

    Text prompt B: "Can you extract the audio of the speaker with a low-pitched voice and a brisk tempo?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Increase the audio of the wind chime, decrease the volume of the male speaker with a fast pace and low enthusiasm, and remove the female speaker with a regular pace and average enthusiasm."

    Text prompt D: "Please take out the individual with a low pitch."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    1 Speech + 2 Audio (VGGSound)


    Input Mixture #12: male speaker with low pitch, high tempo, normal energy, and neutral emotion; cuckoo bird calling; playing oboe

    Text prompt A: "I'd like to extract the audio of an oboe being played, please."

    Text prompt B: "Please remove all non-human sounds."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Enhance the sound level of the cuckoo bird's call, please."

    Text prompt D: "First, volume up the man with a high tempo and regular enthusiasm. Second, volume down the cuckoo bird's calling. Third, remove the oboe music."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Input Mixture #13: female speaker with high pitch, high tempo, low energy, and neutral emotion; dog growling; chicken crowing

    Text prompt A: "Kindly remove the menacing growl produced by the canine."

    Text prompt B: "Could you extract the sound of the woman speaking and the chicken, and then decrease the chicken's sound?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Please quieten down the woman's voice, but make the dog and chicken's voices louder."

    Text prompt D: "I only want to keep the animal voices of the dog and chicken in the mix."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    2 Speech + 2 Audio (FSD50K, seen audio labels)


    Input Mixture #14: male speaker with low pitch, normal tempo, low energy, and neutral emotion; female speaker with high pitch, normal tempo, normal energy, and neutral emotion; acoustic guitar; (dog) bark

    Text prompt A: "Make the conversation as clean as possible."

    Text prompt B: "Boost the volume of the conversation, and also quieten down those distracting background sounds."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Could you pull out the dog barking sound for me? Thanks."

    Text prompt D: "Can you isolate the speaker with a deep tone and low enthusiasm?"

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "I'd like you to turn down the volume of the low-pitch male speaker and remove the dog barking noise."

    Text prompt F: "Please extract the sound of the guitar and the dog's barking."

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    Input Mixture #15: female speaker with high pitch, normal tempo, normal energy, and neutral emotion; female speaker with normal pitch, normal tempo, normal energy, and neutral emotion; toilet flush; siren

    Text prompt A: "Let's remove the annoying siren sound."

    Text prompt B: "Can you edit the audio to extract the speaker characterized by standard pitch?"

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Could you extract the high-pitched speaker and the wailing siren?"

    Text prompt D: "I want you to single out the siren sound."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "I'd like you to decrease the siren volume, lower the sound of the toilet flushing, reduce the volume of the speaker with normal pitch, and boost the volume of the speaker with high pitch."

    Text prompt F: "Please raise the sound level of the high-pitched speaker, remove the speaker with typical pitch, and erase the siren sound."

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    2 Speech + 2 Audio (FSD50K, unseen audio labels)


    Input Mixture #16: male speaker with low pitch, low tempo, high energy, and neutral emotion; female speaker with high pitch, normal tempo, high energy, and neutral emotion; scissors; bowed string instrument

    Text prompt A: "Please get rid of the sound of scissors."

    Text prompt B: "Pump up the volume on the talks and reduce other noises."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Please separate the part featuring someone playing a string instrument."

    Text prompt D: "I'd like you to isolate both the female speaker and the bowed string instrument sound from the mixture."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "Remove all speakers from the audio, can you?"

    Text prompt F: "Is it possible to turn up the volume of the string instrument and reduce the volume of the scissors?"

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F

    Input Mixture #17: male speaker with low pitch, normal tempo, normal energy, and neutral emotion; female speaker with normal pitch, normal tempo, high energy, and neutral emotion; musical keyboard; seagull

    Text prompt A: "Is it possible to extract the song of the seagull?"

    Text prompt B: "Add more volume to the surrounding, and decrease the speech volume."

    Edited Mixture A Target Mixture A Edited Mixture B Target Mixture B

    Text prompt C: "Could you eliminate the music from a keyboard?"

    Text prompt D: "Identify and isolate the lady speaking in a high energy."

    Edited Mixture C Target Mixture C Edited Mixture D Target Mixture D

    Text prompt E: "Eliminate any non-speech sounds in the surroundings."

    Text prompt F: "Can you edit the audio to increase the volume of the female speaker with normal pitch and high energy, decrease the sound of the male speaker with low pitch and normal energy, raise the volume of the seagull noise, and lower the volume of the musical keyboard?"

    Edited Mixture E Target Mixture E Edited Mixture F Target Mixture F