ISSE: An Instruction-guided Speech Style Editing Dataset and Benchmark

Abstract

Speech style editing refers to modifying the stylistic properties of speech while preserving its linguistic content and speaker identity. However, most existing approaches depend on explicit labels or reference audio, which limits both flexibility and scalability. More recent attempts to use natural language descriptions remain constrained by oversimplified instructions and coarse style control. To address these limitations, we introduce ISSE, an Instruction-guided Speech Style Editing dataset. The dataset comprises nearly 400 hours of speech and over 100,000 source-target pairs, each aligned with diverse and detailed textual editing instructions. We also build a systematic instructed speech data generation pipeline that leverages large language models, expressive text-to-speech, and voice conversion technologies to construct high-quality paired samples. Furthermore, we train an instruction-guided autoregressive speech model on ISSE and evaluate it in terms of instruction adherence, timbre preservation, and content consistency. Experimental results demonstrate that training on ISSE enables more accurate, controllable, and generalizable speech style editing than training on other datasets. The full ISSE dataset can be downloaded here: Hugging Face link.
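
Of the three evaluation criteria named above, content consistency is the easiest to illustrate. One common proxy is the word error rate (WER) between the original transcript and an ASR transcript of the edited speech. The sketch below is an assumption about how such a check could look, not the evaluation code used for ISSE; the function name `word_error_rate` is hypothetical, and the ASR step is left out, so the function simply takes two strings.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    # Guard against an empty reference to avoid division by zero.
    return prev[-1] / max(len(ref), 1)
```

A WER of 0.0 means the edited speech was transcribed to exactly the reference words; higher values indicate that the style edit changed or dropped words.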

Framework

Dataset Distribution

Dataset Speech Samples

Transcript	Source Speech	Target Speech	Instruction
No, snow is not expected on Saturday.			Convert the source speech to a monotonous, crisp, guttural, measured, loud style with vocal fry.
You don't have to be sorry for my loss. You can find me a flight to Houston by typing into your computer there.			Convert the source speech to a calm, slow-paced style.
Gotcha, okay. I can't believe I didn't get that the first time. I thought it was a literal pot of gold.			Convert the source speech to a animated, happy tone with a measured pace.
I'm so grateful. I know that you have other friends that you could have chosen to bring with you, but...			Convert the source speech to a booming, crisp, animated, angry tone with a shrill, loud volume.
And so you get combinations of primary colors as the air pressure, as you get higher, the air pressure gets lower.			Convert the source speech to a flowing, sing-song, slow, confused style.
She picked up a letter opener and she thrust it right into Rachel, time and time again!			Convert the source speech to a nasal, enunciated, authoritative, flowing, singsong style with a slow speaking speed.
It's gonna be sunny, with temperatures going from sixty eight to seventy eight today in Covington.			Convert the source speech to a slow, calm, smooth, and rhythmic style.
I mean, you paid for it, you have to take a bite.			Convert the source speech to a slow, sad tone.
Did you purchase a doll at a store called "Toys and Treasures"?			Convert the source speech to a measured, high-pitched, fast-paced style.
Done! Purchase is complete. Is there anything else?			Convert the source speech to a measured, fast-paced style.
I'm about out of patience right now, so if you don't politely walk yourself to the very back of the line right now...			Convert the source speech to a calm, measured tone.
What do you wanna talk about, rising interest rates or global warming?			Convert the source speech to an authoritative, measured, nasal tone with whispered, singsong passages.
Tuesday, February eleventh, twenty twenty five will be in four years and seven months.			Convert the source speech to a fast-paced, happy tone.
Aw, man, look what I done did now. They all know it was me. I can't get out of this one. They're just all going to tell each other it's just going to go around the whole school sooner or later. Man, I shouldn't have did that today.			Convert the source speech to a loud, rapid, measured, crisp, and shrill tone.
Oh, that thing's not baked. That thing's not baked.			Convert the source speech to a animated, singsong, expressive, crisp with a measured speed, silky tone, loud projection.
It's about a charge of malicious mischief.			Convert the source speech to a cheerful, slow, animated tone.
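
Each row of the table above pairs a transcript, a source clip, a target clip, and an editing instruction. A minimal sketch of how one such sample could be represented in Python follows. The class name `EditSample`, its field names, and the file paths are hypothetical and do not reflect the released dataset's actual schema; the only thing taken from the table is that every instruction shown begins with "Convert the source speech to".

```python
from dataclasses import dataclass

# Prefix shared by every instruction shown in the sample table.
PREFIX = "Convert the source speech to "


@dataclass(frozen=True)
class EditSample:
    transcript: str
    source_audio: str  # hypothetical path to the source clip
    target_audio: str  # hypothetical path to the style-edited clip
    instruction: str

    def style_description(self) -> str:
        """Return the instruction without the shared prefix and final period."""
        text = self.instruction
        if text.startswith(PREFIX):
            text = text[len(PREFIX):]
        return text.rstrip(".")


sample = EditSample(
    transcript="I mean, you paid for it, you have to take a bite.",
    source_audio="source/0001.wav",  # placeholder path
    target_audio="target/0001.wav",  # placeholder path
    instruction="Convert the source speech to a slow, sad tone.",
)
print(sample.style_description())  # a slow, sad tone
```

Keeping the full instruction in the record and deriving the style description on demand leaves the original text untouched for training, while still making the target style easy to inspect.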

Speech Style Edit

Source Speech	Instruction	Generated Speech	Target Speech
	Convert the source speech to a flowing, singsong, authoritative, and happy tone with a high-pitched, nasal, clear voice, delivered slowly.
	Convert the source speech to a clear, enunciated, slow, high-pitched style.
	Convert the source speech to a booming, enunciated, crisp, high-pitched, with a slow speed and loud volume.
	Convert the source speech to a high-pitched, silky, crisp, calm, and sing-song style.
	Convert the source speech to a high-pitched, silky, singsong, slow, crisp, and clear tone.
	Convert the source speech to a slow, high-pitched, slightly hushed tone.
	Convert the source speech to a high-pitched, flowing, singsong, slow-paced, disgusted tone in an American accent with balanced clarity.
	Convert the source speech to a high-pitched, confused, slow-paced style.
	Convert the source speech to a deep, calm, raspy, high-pitched, monotonous, slowly enunciated style.
	Convert the source speech to a sing-song, authoritative, measured tone with a nasal, high-pitched quality.
	Convert the source speech to a high-pitched, slow, sad tone.

Speech Synthesis

Transcript	Llasa-1B	Llasa-1B_TTS	Llasa-1B_ESS
Mm-hmm. Exactly. Exactly. You got that exactly right. You know what? Since you know it so well, why don't you go first?			Caption: A male speaker delivers a singing, flowing speech at a fast speed.
And we're good. All right. He's in there right now. He's unconscious from the shock, the trauma. So we're going to want to make sure that we cover all of our bases now. Do you know this guy's name?			Caption: A male speaker delivers a flowing, fast-paced speech with a measured rhythm, displaying a singsong, high-pitched tone.
He defended me during the trial.			Caption: A male American's speech is confused and delivered slowly.
So I had to get myself in that emotional place and get sad again.			Caption: male American's voice is heard, speaking with a measured pace. His speech is tinged with sadness.
It was so spectacular you didn't even know if, like, is this digital like or what's going on here?			Caption: A female speaker delivers a whispered, high-pitched speech with a measured speed.
I think RuPaul has a thing down here.			Caption: A sleepy male speaks in a measured, flowing manner, displaying a singsong accent.
Enjoy with some candy and popcorn and make sure not to steal Smaug's gold!			Caption: A male speaker delivers a happy and animated speech.
In Seattle today, there isn't any rain, but it's partly cloudy, with a very low chance of ice pellets.			Caption: A female speaks in a whispered, yet crisp and enunciated manner. Her voice is booming, with a measured speed and occasional shrillness. Despite her whispered tone, her speech remains loud and expressive.