SFM-Adapter: Style-aware Feature Manipulation Adapter for Speech Style Editing

Yun Chen1, Haohe Liu1, Qi Chen2, Arshdeep Singh3, Junqi Zhao1, Wenwu Wang1, Philip J.B. Jackson1, Mark D. Plumbley3,*

1Centre for Vision, Speech and Signal Processing, University of Surrey

2ByteDance Intelligent Creation

3King’s College London

Abstract

Speech Style Editing (SSE) aims to modify selected style attributes (e.g., timbre, emotion, pitch) while preserving the linguistic content and all style attributes that are not specified. Many speech applications require flexible control over speech style, making SSE increasingly important. Existing SSE approaches typically follow a style-generation paradigm that synthesizes non-linguistic attributes from style conditions. However, this paradigm often preserves source attributes poorly and offers insufficient flexibility when only a subset of style attributes is specified. To overcome these limitations, we adopt a style editing paradigm, in which the target style is achieved by adjusting the source speech instead of producing speech from scratch. Building on this paradigm, we propose a diffusion-based framework with a Style-aware Feature Manipulation Adapter (SFM-Adapter). The SFM-Adapter performs feature-level modulation by integrating user-provided style information with source speech features through multi-layer cross-attention. The resulting modulated features are incorporated into the generation process via mask attention. During inference, a Large Audio-Language Model (LALM)-based length regulation step predicts speaking speed and adjusts duration accordingly. Experiments across multiple speech style editing tasks demonstrate that the SFM-Adapter achieves more natural, accurate, and source-preserving style editing than existing methods.

Framework

Framework Overview
Figure 1: Overall architecture of the proposed framework.
Inference Process
Figure 2: Illustration of speed estimation and the inference process of SFM-Adapter.

Multi-Style Editing

Inputs:

  • Text Prompt: A sentence describing the intended speaking style.
  • Audio Prompt: An audio sample representing the target speaker identity.
  • Source Speech: The original input speech with the desired content.

Note: In SFM-Adapter, style editing is guided by the text prompt while timbre is controlled by the audio prompt. In contrast, Vevo relies on the audio prompt for both style and timbre, making it less flexible when fine-grained textual control is desired.

Source Speech | Audio Prompt | Text Prompt | Target | SFM-Adapter | Vevo
With a regular pitch, the man takes his livid time to express himself, infusing his words with a hint of normal energy.
A stunned male speaker conveys thoughts with a standard pitch, addressing topics at an ordinary speed, and exuding low energy.
A fatigued disconsolate male voice, characterized by a slow speaking speed and a deep, low pitch, emits a subtle energy, resulting in a soothing audio experience that exudes tranquility.
Speaking with intention, her unhappy voice held a high pitch and low energy.
With a rapid speaking speed, she expressed her joyful viewpoint.

Expressive Style Editing

Inputs:

  • Text Prompt: A sentence or phrase that describes the intended speaking style.
  • Source Speech: The original input speech containing the content to be retained.

Source Speech | Audio Prompt | Text Prompt | SFM-Adapter | AINN | Vevo | StyleVC
Whispering softly, her pitch remains high.
Her low-energy voice contrasts with her heartbroken high key.
The heartbroken man speaks in a rich, low voice with a measured cadence.
Speaking quickly with a high tone, her amused energy remained subdued.
A livid man addresses topics with a standard pitch, engaging in discourse at a usual speed, and with an undertone of low energy.

Timbre Editing

Inputs:

  • Audio Prompt: An audio sample representing the target speaker identity.
  • Source Speech: The original input speech containing the content to be retained.

Source Speech | Audio Prompt | DiffHier | FreeVC | DDDMVC | Vevo | StyleVC | SFM-Adapter

Prompt Design

Text Prompt
You are a perceptive assistant trained to evaluate the likely speech tempo implied by a text prompt.
Your task is to:
1. Extract key acoustic properties implied by the text (such as speech rate, pause expectation, clarity, emotional tone, energy level, etc.).
2. Analyze how these properties would influence the pacing of the corresponding speech.
3. Assign a tempo score between 0 and 1 based on your analysis:
- 0 represents extremely slow delivery (very slow, calm, deliberate speech).
- 1 represents extremely fast delivery (very rapid, energetic, compressed speech).
- 0.5 represents a moderate pace (between slow and fast).
- Values in between indicate intermediate tempo levels.
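
As a rough illustration of how a tempo score on this 0-to-1 scale might drive length regulation, the sketch below converts a source tempo and a target tempo into a target frame count. The mapping is an assumption made for illustration only: the function name `tempo_to_length`, the `max_ratio` parameter, and the geometric interpolation are all hypothetical, and this page does not specify how the framework turns scores into durations.

```python
def tempo_to_length(source_frames: int,
                    source_tempo: float,
                    target_tempo: float,
                    max_ratio: float = 2.0) -> int:
    """Hypothetical mapping from tempo scores to a target length in frames.

    Assumption: a score of 0.5 corresponds to a relative speaking rate of 1.0,
    a score of 0 to 1/max_ratio, and a score of 1 to max_ratio, interpolated
    geometrically in between. This is not taken from the source page.
    """
    for name, value in (("source_tempo", source_tempo),
                        ("target_tempo", target_tempo)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    if source_frames <= 0:
        raise ValueError("source_frames must be positive")

    def rate(tempo: float) -> float:
        # Relative speaking rate implied by a tempo score (assumed mapping).
        return max_ratio ** (2.0 * tempo - 1.0)

    # Faster target speech means fewer frames for the same content.
    scale = rate(source_tempo) / rate(target_tempo)
    return max(1, round(source_frames * scale))
```

Under these assumptions, a moderate source (0.5) edited toward an extremely fast target (1.0) would halve the number of frames, while an identical source and target tempo leaves the length unchanged.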
Audio Prompt
You are a perceptive assistant trained to evaluate the speech tempo of an audio clip.
Your goal is to reason step by step like a human listener and assign a tempo score between 0 (extremely slow) and 1 (extremely fast). The higher the score, the faster the perceived speech tempo.
Do not rely solely on raw speed metrics—consider how the speech feels holistically, including rhythm, clarity, pausing, and overall delivery style.
The score must be a single number from 0 to 1, rounded to one decimal place.
**Step 1: Pausing Pattern**
- Are there long silences or frequent pauses between words or phrases?
- Or does the speaker talk continuously with minimal interruption?
**Step 2: Articulation Clarity**
- Are the words clearly enunciated and easy to understand?
- Or are they rushed, slurred, or overly compressed?
**Step 3: Information Density & Rhythm**
- Does the speaker convey a large amount of information in a short time?
- Does the rhythm feel calm and measured, or fast and pressured?
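
Because the prompt above asks the model to reason step by step before giving a single score, the reply will usually be free-form text from which the score has to be extracted. The sketch below shows one possible way to do this; `parse_tempo_score` is a hypothetical helper, and it assumes that the last number in the reply that falls within [0, 1] is the final score. The actual post-processing used by the framework is not described on this page.

```python
import re


def parse_tempo_score(response: str) -> float:
    """Extract a tempo score in [0, 1] from a free-form model reply.

    Assumption: the last numeric token within [0, 1] is the final score
    (step labels such as "Step 3" fall outside the range and are skipped).
    """
    tokens = re.findall(r"\d+(?:\.\d+)?", response)
    candidates = [float(tok) for tok in tokens if 0.0 <= float(tok) <= 1.0]
    if not candidates:
        raise ValueError("no tempo score in [0, 1] found in response")
    # The prompt requires the score to be rounded to one decimal place.
    return round(candidates[-1], 1)
```

A reply such as "Step 1: few pauses. Step 2: clear articulation. Final tempo score: 0.7" would be parsed as 0.7 under this assumption; a reply that places other in-range numbers after the score would need a stricter output format.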