Typeless is cool! However, Linux users are too often left out of the fun. This post strives to recreate Typeless on Linux using Voxtype.
If you want to skip the details and get my customized Voxtype working immediately, refer to the Installation Guide.
Introduction to Voxtype
Voxtype is highly customizable, which is why I chose it. A good voice-to-text tool, in my opinion, should have at least the following features:
Accurate, fast transcription. It should transcribe voice to text with high accuracy and at high speed, supporting not only English but also other languages (especially Chinese, in my case). OpenAI Whisper has reasonable accuracy, and small models like base (142 MB) and small (466 MB) run in real time on my poor laptop1 even on the CPU; the integrated GPU speeds things up further.
LLM post-processing. This is critical: an LLM polishes the raw transcription, adds punctuation, corrects grammar, and makes the text more polite and formal. This is the key to mimicking the Typeless experience. A large enough model like DeepSeek-V3.2 is required, since tiny models are poor at instruction following. You don't want your model to answer the question in your voice note, or to tell you who it is, instead of polishing the text2!
A user-friendly interface. Press a hotkey to start
recording.
Customizability. Post-processing, model selection, and the hotkey should all be customizable. Moreover, sometimes I don't want post-processing at all; then I should be able to skip it with another hotkey, or by composing the hotkey with a modifier like Shift or Ctrl.
Voxtype does well on most of these points. The last feature is not perfectly supported, so I modified the source code to make it work (see my fork; for Arch Linux users, try my PKGBUILD).
After installing Voxtype, use the following configuration at
~/.config/voxtype/config.toml:
# Voxtype Configuration
#
# Location: ~/.config/voxtype/config.toml
# All settings can be overridden via CLI flags

# State file for external integrations (Waybar, polybar, etc.)
# Use "auto" for default location ($XDG_RUNTIME_DIR/voxtype/state),
# a custom path, or "disabled" to turn off. The daemon writes state
# ("idle", "recording", "transcribing") to this file whenever it changes.
# Required for `voxtype record toggle` and `voxtype status` commands.
state_file = "auto"

[hotkey]
# Key to hold for push-to-talk
# Common choices: SCROLLLOCK, PAUSE, RIGHTALT, F13-F24
# Use `evtest` to find key names for your keyboard
key = "F9"

# Optional modifier keys that must also be held
# Example: modifiers = ["LEFTCTRL", "LEFTALT"]
modifiers = []

# Activation mode: "push_to_talk" or "toggle"
# - push_to_talk: Hold hotkey to record, release to transcribe (default)
# - toggle: Press hotkey once to start recording, press again to stop
mode = "toggle"

# Enable built-in hotkey detection (default: true)
# Set to false when using compositor keybindings (Hyprland, Sway) instead
# When disabled, use `voxtype record start/stop/toggle` to control recording
# enabled = true

# Modifier key to select secondary model (evdev input mode only)
# When held while pressing the hotkey, uses whisper.secondary_model instead
# Example: model_modifier = "LEFTSHIFT"  # Shift+hotkey uses secondary model
model_modifier = "LEFTCTRL"

complex_post_process_modifier = "LEFTSHIFT"

[audio]
# Audio input device ("default" uses system default)
# List devices with: pactl list sources short
device = "default"

[whisper]
# Transcription backend: "local" or "remote"
# - local: Use whisper.cpp locally (default)
# - remote: Send audio to a remote whisper.cpp server or OpenAI-compatible API
# backend = "local"

# Model to use for transcription (local backend)
# Options: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v3, large-v3-turbo
# .en models are English-only but faster and more accurate for English
# large-v3-turbo is faster than large-v3 with minimal accuracy loss (recommended for GPU)
# Or provide absolute path to a custom .bin model file
model = "base"

# Language for transcription
# Options:
# - Single language: "en", "fr", "de", etc.
# - Auto-detect all: "auto"
# - Constrained auto-detect: ["en", "fr"] (detects from allowed set only)
# The array form helps with multilingual users where Whisper might misdetect
# the language, especially for short sentences.
# See: https://github.com/openai/whisper#available-models-and-languages
language = ["en", "zh"]

# Translate non-English speech to English
translate = false

# Number of CPU threads for inference (omit for auto-detection)
# threads = 4

# Initial prompt to provide context for transcription
# Use this to hint at terminology, proper nouns, or formatting conventions.
# Example: "Technical discussion about Rust, TypeScript, and Kubernetes."
# initial_prompt = ""

# --- Multi-model settings ---
#
# Secondary model for difficult audio (used with hotkey.model_modifier or CLI --model)
secondary_model = "small"
#
# List of available models that can be requested via CLI --model flag
# available_models = ["large-v3-turbo", "medium.en"]
#
# Maximum models to keep loaded in memory (LRU eviction when exceeded)
# Default: 2 (primary + one secondary). Only applies when gpu_isolation = false.
# max_loaded_models = 2
#
# Seconds before unloading idle secondary models (0 = never auto-unload)
# Default: 300 (5 minutes). Only applies when gpu_isolation = false.
# cold_model_timeout_secs = 300

# --- Eager processing settings ---
#
# Enable eager input processing (transcribe chunks while recording continues)
# Reduces perceived latency on slower machines by processing audio in parallel.
# eager_processing = false
#
# Duration of each audio chunk in seconds (default: 5.0)
# eager_chunk_secs = 5.0
#
# Overlap between chunks in seconds (helps catch words at boundaries, default: 0.5)
# eager_overlap_secs = 0.5

# --- Remote backend settings (used when backend = "remote") ---
#
# Remote server endpoint URL (required for remote backend)
# Examples:
# - whisper.cpp server: "http://192.168.1.100:8080"
# - OpenAI API: "https://api.openai.com"
# remote_endpoint = "http://192.168.1.100:8080"
#
# Model name to send to remote server (default: "whisper-1")
# remote_model = "whisper-1"
#
# API key for remote server (optional, or use VOXTYPE_WHISPER_API_KEY env var)
# remote_api_key = ""
#
# Timeout for remote requests in seconds (default: 30)
# remote_timeout_secs = 30

[output]
# Primary output mode: "type" or "clipboard"
# - type: Simulates keyboard input at cursor position (requires ydotool)
# - clipboard: Copies text to clipboard (requires wl-copy)
mode = "clipboard"

# Fall back to clipboard if typing fails
fallback_to_clipboard = true

# Custom driver order for type mode (optional)
# Default order: wtype -> dotool -> ydotool -> clipboard
# Customize to prefer a specific driver or change the fallback order.
# Available drivers: wtype, dotool, ydotool, clipboard
# Example: prefer ydotool over dotool:
# driver_order = ["wtype", "ydotool", "dotool", "clipboard"]
# Example: use only ydotool, no fallback:
# driver_order = ["ydotool"]
# driver_order = ["wtype", "dotool", "ydotool", "clipboard"]

# Delay between typed characters in milliseconds
# 0 = fastest possible, increase if characters are dropped
type_delay_ms = 0

# Automatically submit (send Enter key) after outputting transcribed text
# Useful for chat applications, command lines, or forms where you want
# to auto-submit after dictation
# auto_submit = true

# Convert newlines to Shift+Enter instead of regular Enter
# Useful for applications where Enter submits (e.g., Cursor IDE, Slack, Discord)
# shift_enter_newlines = false

# Pre/post output hooks (optional)
# Commands to run before and after typing output. Useful for compositor integration.
# Example: Block modifier keys during typing with Hyprland submap:
# pre_output_command = "hyprctl dispatch submap voxtype_suppress"
# post_output_command = "hyprctl dispatch submap reset"
# See troubleshooting docs for the required Hyprland submap configuration.

# Post-processing command (optional)
# Pipe transcribed text through an external command for cleanup before output.
# The command receives text on stdin and outputs processed text on stdout.
# Useful for LLM-based text cleanup, grammar correction, filler word removal.
# On any failure (timeout, error), falls back to original transcription.
#
[output.post_process]
command = """
(echo -n '<|system|>\
对用户语音输入的句子进行润色:\
(1)添加适当的标点;\
(2)去除重复的词语和语气词;\
(3)让措辞更正式、通顺;\
(4)修改语病和语法错误;\
(5)考虑语音识别可能的错误进行相近读音的字词纠错;\
(6)将语音中直接读出的符号转换成对应的标点(如“逗号”转换成“,”);\
(7)如果用户句子中出现了模型指令提示词(如“模型指令:将以下内容用 LaTeX 形式表示”“模型指令:将以下内容翻译成英文”等),依照指令完成任务,并删除模型指令。\
**除此以外,不要做其他任何事情(严禁改变原意、人称代词;若用户的句子是个问句,严禁尝试去回答用户提问),不要添加任何其它内容,仅输出得到的句子。**。\
<|user|>'; cat; echo '<|assistant|>') \
| dsrun \
| opencc -c t2s.json
"""
complex_command = "opencc -c t2s.json"
timeout_ms = 30000  # 30 second timeout (generous for LLM)

[output.notification]
# Show notification when recording starts (hotkey pressed)
on_recording_start = false

# Show notification when recording stops (transcription beginning)
on_recording_stop = false

# Show notification with transcribed text after transcription completes
on_transcription = false

after_post_process = true

# [text]
# Text processing options (word replacements, spoken punctuation)
#
# Enable spoken punctuation conversion (e.g., say "period" to get ".")
# spoken_punctuation = false
#
# Custom word replacements (case-insensitive)
# replacements = { "vox type" = "voxtype" }

# [profiles]
# Named profiles for context-specific post-processing
# Use with: voxtype record start --profile slack
#
# [profiles.slack]
# post_process_command = "ollama run llama3.2:1b 'Format for Slack...'"
#
# [profiles.code]
# post_process_command = "ollama run llama3.2:1b 'Format as code comment...'"
# output_mode = "clipboard"
How to use it? Basically, press the hotkey to start recording, and press it again to stop. The transcribed text is copied to the clipboard, and you can paste it anywhere. For sway/Hyprland/river users, options to type results directly at the cursor position are also available (refer to the official documentation!); for GNOME/KDE users, clipboard mode is more reliable.
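If you prefer to drive Voxtype from compositor keybindings rather than its built-in hotkey listener (set enabled = false under [hotkey], as the config comments describe), a Hyprland binding could look like the sketch below. The key choice is just an example.

```ini
# hyprland.conf — sketch assuming `enabled = false` under [hotkey] in Voxtype
bind = , F9, exec, voxtype record toggle
```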
I also configured four use cases:
F9: the base model with complex post-processing (LLM-based polishing).
Shift+F9: the base model with easy post-processing (just converting traditional Chinese to simplified Chinese using opencc).
Ctrl+F9: the small model with complex post-processing.
Ctrl+Shift+F9: the small model with easy post-processing.
In fact, since LLM post-processing is the more common choice for me, I swap command and complex_command.
dsrun is a command-line tool I wrote for running
DeepSeek models.
#!/usr/bin/env python3
import json
import os
import sys

import requests


def load_private_env():
    private_file = os.path.expanduser("~/.private_infos")
    if os.path.exists(private_file):
        with open(private_file) as f:
            for line in f:
                line = line.strip()
                if line.startswith("export "):
                    line = line[len("export "):]
                if "=" in line:
                    key, val = line.split("=", 1)
                    val = val.strip('"').strip("'")
                    os.environ.setdefault(key.strip(), val.strip())


def main():
    load_private_env()

    API_KEY = os.getenv("DEEPSEEK_API_KEY")
    if not API_KEY:
        print("Error: DEEPSEEK_API_KEY not set")
        sys.exit(1)

    # NOTE: these definitions were lost in extraction; restored here based on
    # the public DeepSeek chat completions API (an assumption on my part).
    url = "https://api.deepseek.com/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": sys.stdin.read()}],
        "stream": True,
    }

    # Stream the response and print content deltas as they arrive
    with requests.post(url, headers=headers, json=payload, stream=True) as r:
        for line in r.iter_lines():
            if line:
                line = line.decode("utf-8")
                if line.startswith("data: "):
                    data = line[6:]
                    if data == "[DONE]":
                        break
                    try:
                        obj = json.loads(data)
                        delta = obj["choices"][0]["delta"].get("content", "")
                        print(delta, end="", flush=True)
                    except Exception:
                        pass
    print()


if __name__ == "__main__":
    main()
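The streaming loop in dsrun parses server-sent events line by line. A self-contained sketch of just that parsing step, using canned lines instead of a live request (the JSON payloads here are illustrative, not recorded API output):

```python
import json

# Canned SSE lines in the shape a DeepSeek-style streaming endpoint returns
lines = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]

def collect_deltas(sse_lines):
    """Join the content deltas carried by 'data: ...' SSE lines."""
    parts = []
    for line in sse_lines:
        if line.startswith("data: "):
            data = line[6:]
            if data == "[DONE]":
                break
            obj = json.loads(data)
            parts.append(obj["choices"][0]["delta"].get("content", ""))
    return "".join(parts)

print(collect_deltas(lines))  # → Hello, world
```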
Further Improvements
Better models
I have heard that FunASR is more accurate than OpenAI Whisper on Chinese transcription. Voxtype can use many ONNX models as its ASR backend, including Paraformer-based models.
After replacing the ASR backend with the Paraformer-zh model, recognition accuracy on Chinese improves significantly. However, this model's English support is poor. An easy solution is to fall back to a secondary model such as OpenAI Whisper for English. This is not implemented in upstream Voxtype yet -- it only allows two Whisper models to work together, not a Paraformer primary paired with a Whisper secondary. Hence I implemented this feature in my fork. Arch Linux users can try my PKGBUILD.
Use the following configuration to set Paraformer-zh as the primary model and Whisper's small.en as the secondary model:
# Voxtype Configuration
#
# Location: ~/.config/voxtype/config.toml
# All settings can be overridden via CLI flags

# State file for external integrations (Waybar, polybar, etc.)
# Use "auto" for default location ($XDG_RUNTIME_DIR/voxtype/state),
# a custom path, or "disabled" to turn off. The daemon writes state
# ("idle", "recording", "transcribing") to this file whenever it changes.
# Required for `voxtype record toggle` and `voxtype status` commands.
engine = "paraformer"

state_file = "auto"

[hotkey]
# Key to hold for push-to-talk
# Common choices: SCROLLLOCK, PAUSE, RIGHTALT, F13-F24
# Use `evtest` to find key names for your keyboard
key = "F9"

# Optional modifier keys that must also be held
# Example: modifiers = ["LEFTCTRL", "LEFTALT"]
modifiers = []

# Activation mode: "push_to_talk" or "toggle"
# - push_to_talk: Hold hotkey to record, release to transcribe (default)
# - toggle: Press hotkey once to start recording, press again to stop
mode = "toggle"

# Enable built-in hotkey detection (default: true)
# Set to false when using compositor keybindings (Hyprland, Sway) instead
# When disabled, use `voxtype record start/stop/toggle` to control recording
# enabled = true

# Modifier key to select secondary model (evdev input mode only)
# When held while pressing the hotkey, uses whisper.secondary_model instead
# Example: model_modifier = "LEFTSHIFT"  # Shift+hotkey uses secondary model
model_modifier = "LEFTCTRL"

complex_post_process_modifier = "LEFTSHIFT"

[audio]
# Audio input device ("default" uses system default)
# List devices with: pactl list sources short
device = "default"

[whisper]
# Transcription backend: "local" or "remote"
# - local: Use whisper.cpp locally (default)
# - remote: Send audio to a remote whisper.cpp server or OpenAI-compatible API
# backend = "local"

# Model to use for transcription (local backend)
# Options: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v3, large-v3-turbo
# .en models are English-only but faster and more accurate for English
# large-v3-turbo is faster than large-v3 with minimal accuracy loss (recommended for GPU)
# Or provide absolute path to a custom .bin model file
model = "small"

# Language for transcription
# Options:
# - Single language: "en", "fr", "de", etc.
# - Auto-detect all: "auto"
# - Constrained auto-detect: ["en", "fr"] (detects from allowed set only)
# The array form helps with multilingual users where Whisper might misdetect
# the language, especially for short sentences.
# See: https://github.com/openai/whisper#available-models-and-languages
language = ["en", "zh"]

# Translate non-English speech to English
translate = false

# Number of CPU threads for inference (omit for auto-detection)
# threads = 4

# Initial prompt to provide context for transcription
# Use this to hint at terminology, proper nouns, or formatting conventions.
# Example: "Technical discussion about Rust, TypeScript, and Kubernetes."
# initial_prompt = ""

# --- Multi-model settings ---
#
# Secondary model for difficult audio (used with hotkey.model_modifier or CLI --model)
secondary_model = "small.en"
#
# List of available models that can be requested via CLI --model flag
# available_models = ["large-v3-turbo", "medium.en"]
#
# Maximum models to keep loaded in memory (LRU eviction when exceeded)
# Default: 2 (primary + one secondary). Only applies when gpu_isolation = false.
# max_loaded_models = 2
#
# Seconds before unloading idle secondary models (0 = never auto-unload)
# Default: 300 (5 minutes). Only applies when gpu_isolation = false.
# cold_model_timeout_secs = 300

# --- Eager processing settings ---
#
# Enable eager input processing (transcribe chunks while recording continues)
# Reduces perceived latency on slower machines by processing audio in parallel.
# eager_processing = false
#
# Duration of each audio chunk in seconds (default: 5.0)
# eager_chunk_secs = 5.0
#
# Overlap between chunks in seconds (helps catch words at boundaries, default: 0.5)
# eager_overlap_secs = 0.5

# --- Remote backend settings (used when backend = "remote") ---
#
# Remote server endpoint URL (required for remote backend)
# Examples:
# - whisper.cpp server: "http://192.168.1.100:8080"
# - OpenAI API: "https://api.openai.com"
# remote_endpoint = "http://192.168.1.100:8080"
#
# Model name to send to remote server (default: "whisper-1")
# remote_model = "whisper-1"
#
# API key for remote server (optional, or use VOXTYPE_WHISPER_API_KEY env var)
# remote_api_key = ""
#
# Timeout for remote requests in seconds (default: 30)
# remote_timeout_secs = 30

[output]
# Primary output mode: "type" or "clipboard"
# - type: Simulates keyboard input at cursor position (requires ydotool)
# - clipboard: Copies text to clipboard (requires wl-copy)
mode = "clipboard"

# Fall back to clipboard if typing fails
fallback_to_clipboard = true

# Custom driver order for type mode (optional)
# Default order: wtype -> dotool -> ydotool -> clipboard
# Customize to prefer a specific driver or change the fallback order.
# Available drivers: wtype, dotool, ydotool, clipboard
# Example: prefer ydotool over dotool:
# driver_order = ["wtype", "ydotool", "dotool", "clipboard"]
# Example: use only ydotool, no fallback:
# driver_order = ["ydotool"]
# driver_order = ["wtype", "dotool", "ydotool", "clipboard"]

# Delay between typed characters in milliseconds
# 0 = fastest possible, increase if characters are dropped
type_delay_ms = 0

# Automatically submit (send Enter key) after outputting transcribed text
# Useful for chat applications, command lines, or forms where you want
# to auto-submit after dictation
# auto_submit = true

# Convert newlines to Shift+Enter instead of regular Enter
# Useful for applications where Enter submits (e.g., Cursor IDE, Slack, Discord)
# shift_enter_newlines = false

# Pre/post output hooks (optional)
# Commands to run before and after typing output. Useful for compositor integration.
# Example: Block modifier keys during typing with Hyprland submap:
# pre_output_command = "hyprctl dispatch submap voxtype_suppress"
# post_output_command = "hyprctl dispatch submap reset"
# See troubleshooting docs for the required Hyprland submap configuration.

# Post-processing command (optional)
# Pipe transcribed text through an external command for cleanup before output.
# The command receives text on stdin and outputs processed text on stdout.
# Useful for LLM-based text cleanup, grammar correction, filler word removal.
# On any failure (timeout, error), falls back to original transcription.
#
[output.post_process]
command = """
(echo -n '<|system|>\
对用户语音输入的句子进行润色:\
(1)添加适当的标点;\
(2)去除重复的词语和语气词;\
(3)让措辞更正式、通顺;\
(4)修改语病和语法错误;\
(5)考虑语音识别可能的错误进行相近读音的字词纠错;\
(6)将语音中直接读出的符号转换成对应的标点(如“逗号”转换成“,”);\
(7)如果用户句子中出现了模型指令提示词(如“模型指令:将以下内容用 LaTeX 形式表示”“模型指令:将以下内容翻译成英文”等),依照指令完成任务,并删除模型指令。\
**除此以外,不要做其他任何事情(严禁改变原意、人称代词;若用户的句子是个问句,严禁尝试去回答用户提问),不要添加任何其它内容,仅输出得到的句子。**。\
<|user|>'; cat; echo '<|assistant|>') \
| dsrun \
| opencc -c t2s.json
"""
complex_command = "opencc -c t2s.json"
timeout_ms = 30000  # 30 second timeout (generous for LLM)

[output.notification]
# Show notification when recording starts (hotkey pressed)
on_recording_start = false

# Show notification when recording stops (transcription beginning)
on_recording_stop = false

# Show notification with transcribed text after transcription completes
on_transcription = false

after_post_process = true

# [text]
# Text processing options (word replacements, spoken punctuation)
#
# Enable spoken punctuation conversion (e.g., say "period" to get ".")
# spoken_punctuation = false
#
# Custom word replacements (case-insensitive)
# replacements = { "vox type" = "voxtype" }

# [profiles]
# Named profiles for context-specific post-processing
# Use with: voxtype record start --profile slack
#
# [profiles.slack]
# post_process_command = "ollama run llama3.2:1b 'Format for Slack...'"
#
# [profiles.code]
# post_process_command = "ollama run llama3.2:1b 'Format as code comment...'"
# output_mode = "clipboard"

[paraformer]
model = "zh"
Use voice to modify existing text
A cool feature of Typeless is that you can select existing text and ask Typeless to modify it with voice instructions. For example, you can say "Make this sentence more polite" or "Format this content as an email". This is very useful for writing emails, reports, and so on.
Now I have implemented this functionality in my fork. Arch Linux users
can try my PKGBUILD.
It works as follows:
Select and copy a segment of text;
Press a hotkey, such as F10, to start voice input; press F10 again to stop;
After a short while, Voxtype copies the result to the clipboard, and you can paste it directly.
The configuration ~/.config/voxtype/config.toml is as
follows:
# Voxtype Configuration
#
# Location: ~/.config/voxtype/config.toml
# All settings can be overridden via CLI flags

# State file for external integrations (Waybar, polybar, etc.)
# Use "auto" for default location ($XDG_RUNTIME_DIR/voxtype/state),
# a custom path, or "disabled" to turn off. The daemon writes state
# ("idle", "recording", "transcribing") to this file whenever it changes.
# Required for `voxtype record toggle` and `voxtype status` commands.
engine = "paraformer"

state_file = "auto"

[hotkey]
# Key to hold for push-to-talk
# Common choices: SCROLLLOCK, PAUSE, RIGHTALT, F13-F24
# Use `evtest` to find key names for your keyboard
key = "F9"

edit_key = "F10"

# Optional modifier keys that must also be held
# Example: modifiers = ["LEFTCTRL", "LEFTALT"]
modifiers = []

# Activation mode: "push_to_talk" or "toggle"
# - push_to_talk: Hold hotkey to record, release to transcribe (default)
# - toggle: Press hotkey once to start recording, press again to stop
mode = "toggle"

# Enable built-in hotkey detection (default: true)
# Set to false when using compositor keybindings (Hyprland, Sway) instead
# When disabled, use `voxtype record start/stop/toggle` to control recording
# enabled = true

# Modifier key to select secondary model (evdev input mode only)
# When held while pressing the hotkey, uses whisper.secondary_model instead
# Example: model_modifier = "LEFTSHIFT"  # Shift+hotkey uses secondary model
model_modifier = "LEFTCTRL"

complex_post_process_modifier = "LEFTSHIFT"

[audio]
# Audio input device ("default" uses system default)
# List devices with: pactl list sources short
device = "default"

[whisper]
# Transcription backend: "local" or "remote"
# - local: Use whisper.cpp locally (default)
# - remote: Send audio to a remote whisper.cpp server or OpenAI-compatible API
# backend = "local"

# Model to use for transcription (local backend)
# Options: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v3, large-v3-turbo
# .en models are English-only but faster and more accurate for English
# large-v3-turbo is faster than large-v3 with minimal accuracy loss (recommended for GPU)
# Or provide absolute path to a custom .bin model file
model = "small"

# Language for transcription
# Options:
# - Single language: "en", "fr", "de", etc.
# - Auto-detect all: "auto"
# - Constrained auto-detect: ["en", "fr"] (detects from allowed set only)
# The array form helps with multilingual users where Whisper might misdetect
# the language, especially for short sentences.
# See: https://github.com/openai/whisper#available-models-and-languages
language = "en"

# Translate non-English speech to English
translate = false

# Number of CPU threads for inference (omit for auto-detection)
# threads = 4

# Initial prompt to provide context for transcription
# Use this to hint at terminology, proper nouns, or formatting conventions.
# Example: "Technical discussion about Rust, TypeScript, and Kubernetes."
# initial_prompt = ""

# --- Multi-model settings ---
#
# Secondary model for difficult audio (used with hotkey.model_modifier or CLI --model)
secondary_model = "small.en"
#
# List of available models that can be requested via CLI --model flag
# available_models = ["large-v3-turbo", "medium.en"]
#
# Maximum models to keep loaded in memory (LRU eviction when exceeded)
# Default: 2 (primary + one secondary). Only applies when gpu_isolation = false.
# max_loaded_models = 2
#
# Seconds before unloading idle secondary models (0 = never auto-unload)
# Default: 300 (5 minutes). Only applies when gpu_isolation = false.
# cold_model_timeout_secs = 300

# --- Eager processing settings ---
#
# Enable eager input processing (transcribe chunks while recording continues)
# Reduces perceived latency on slower machines by processing audio in parallel.
# eager_processing = false
#
# Duration of each audio chunk in seconds (default: 5.0)
# eager_chunk_secs = 5.0
#
# Overlap between chunks in seconds (helps catch words at boundaries, default: 0.5)
# eager_overlap_secs = 0.5

# --- Remote backend settings (used when backend = "remote") ---
#
# Remote server endpoint URL (required for remote backend)
# Examples:
# - whisper.cpp server: "http://192.168.1.100:8080"
# - OpenAI API: "https://api.openai.com"
# remote_endpoint = "http://192.168.1.100:8080"
#
# Model name to send to remote server (default: "whisper-1")
# remote_model = "whisper-1"
#
# API key for remote server (optional, or use VOXTYPE_WHISPER_API_KEY env var)
# remote_api_key = ""
#
# Timeout for remote requests in seconds (default: 30)
# remote_timeout_secs = 30

[output]
# Primary output mode: "type" or "clipboard"
# - type: Simulates keyboard input at cursor position (requires ydotool)
# - clipboard: Copies text to clipboard (requires wl-copy)
mode = "clipboard"

# Fall back to clipboard if typing fails
fallback_to_clipboard = true

# Custom driver order for type mode (optional)
# Default order: wtype -> dotool -> ydotool -> clipboard
# Customize to prefer a specific driver or change the fallback order.
# Available drivers: wtype, dotool, ydotool, clipboard
# Example: prefer ydotool over dotool:
# driver_order = ["wtype", "ydotool", "dotool", "clipboard"]
# Example: use only ydotool, no fallback:
# driver_order = ["ydotool"]
# driver_order = ["wtype", "dotool", "ydotool", "clipboard"]

# Delay between typed characters in milliseconds
# 0 = fastest possible, increase if characters are dropped
type_delay_ms = 0

# Automatically submit (send Enter key) after outputting transcribed text
# Useful for chat applications, command lines, or forms where you want
# to auto-submit after dictation
# auto_submit = true

# Convert newlines to Shift+Enter instead of regular Enter
# Useful for applications where Enter submits (e.g., Cursor IDE, Slack, Discord)
# shift_enter_newlines = false

# Pre/post output hooks (optional)
# Commands to run before and after typing output. Useful for compositor integration.
# Example: Block modifier keys during typing with Hyprland submap:
# pre_output_command = "hyprctl dispatch submap voxtype_suppress"
# post_output_command = "hyprctl dispatch submap reset"
# See troubleshooting docs for the required Hyprland submap configuration.

# Post-processing command (optional)
# Pipe transcribed text through an external command for cleanup before output.
# The command receives text on stdin and outputs processed text on stdout.
# Useful for LLM-based text cleanup, grammar correction, filler word removal.
# On any failure (timeout, error), falls back to original transcription.
#
[output.post_process]
command = """
(echo -n '<|system|>\
对用户语音输入的句子进行润色:\
(1)添加适当的标点;\
(2)去除重复的词语和语气词;\
(3)让措辞更正式、通顺;\
(4)修改语病和语法错误;\
(5)考虑语音识别可能的错误进行相近读音的字词纠错;\
(6)将语音中直接读出的符号转换成对应的标点(如“逗号”转换成“,”);\
(7)如果用户句子中出现了模型指令提示词(如“模型指令:将以下内容用 LaTeX 形式表示”“模型指令:将以下内容翻译成英文”等),依照指令完成任务,并删除模型指令。\
**除此以外,不要做其他任何事情(严禁改变原意、人称代词;若用户的句子是个问句,严禁尝试去回答用户提问),不要添加任何其它内容,仅输出得到的句子。**。\
<|user|>'; cat; echo '<|assistant|>') \
| dsrun \
| opencc -c t2s.json
"""
complex_command = "opencc -c t2s.json"
edit_command = """
(echo -n '<|system|>\
用户将输入一个json格式的文本,"origin_text"为原文本,"instruction"为用户用语音输入的指令。你需要做:\
(1)根据"instruction"对"origin_text"进行修改和润色,满足指令要求;\
(2)"instruction"可能因语音识别而有相近读音的字词的错误,注意甄别;\
(3)输出"origin_text"修改和润色后的文本;\
**除此以外,不要添加任何其它内容,仅输出得到的句子。**。\
<|user|>'; cat; echo '<|assistant|>') \
| dsrun \
| opencc -c t2s.json
"""
timeout_ms = 30000  # 30 second timeout (generous for LLM)

[output.notification]
# Show notification when recording starts (hotkey pressed)
on_recording_start = false

# Show notification when recording stops (transcription beginning)
on_recording_stop = false

# Show notification with transcribed text after transcription completes
on_transcription = false

after_post_process = true

# [text]
# Text processing options (word replacements, spoken punctuation)
#
# Enable spoken punctuation conversion (e.g., say "period" to get ".")
# spoken_punctuation = false
#
# Custom word replacements (case-insensitive)
# replacements = { "vox type" = "voxtype" }

# [profiles]
# Named profiles for context-specific post-processing
# Use with: voxtype record start --profile slack
#
# [profiles.slack]
# post_process_command = "ollama run llama3.2:1b 'Format for Slack...'"
#
# [profiles.code]
# post_process_command = "ollama run llama3.2:1b 'Format as code comment...'"
# output_mode = "clipboard"

[paraformer]
model = "zh"
As you can see, edit_key (F10) and output.post_process.edit_command are new options in the configuration file. The input to edit_command is JSON-formatted text, similar to:
{
  "origin_text": "The original text information",
  "instruction": "The voice input instruction"
}
If you want to parse it yourself, or put it to other uses, refer to the format above.
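Parsing this payload in a custom edit_command is a one-liner with the standard library. A minimal sketch (the field names come from the format above; the text values and the prompt layout are made up for illustration):

```python
import json

# Example payload in the shape edit mode pipes to edit_command
raw = '{"origin_text": "pls fix this sent", "instruction": "make it formal"}'

payload = json.loads(raw)
origin = payload["origin_text"]
instruction = payload["instruction"]

# A custom edit_command could combine the two fields into a prompt for
# whatever downstream tool you prefer (hypothetical layout):
prompt = f"Instruction: {instruction}\nText: {origin}"
print(prompt)
```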
Integrate with input method
To make Voxtype more like Typeless, we should reduce reliance on the clipboard and have Voxtype type directly at the cursor position (and, in edit mode, read the selected content directly). My solution is to develop an add-on for the fcitx5 input method framework.
When using the add-on, avoid setting the same hotkey in both Voxtype
and fcitx5.
You can change the hotkey configuration (it is now managed by fcitx5 instead of Voxtype) in fcitx5-configtool. The defaults are F9 for normal mode and F10 for edit mode, the same as in my Voxtype configuration.
Installation Guide
Step 1: Install Voxtype (my fork)
For Arch Linux users, you can use my PKGBUILD:
$ git clone https://github.com/rijuyuezhu/voxtype-git.pkg
$ cd voxtype-git.pkg
$ makepkg -si
$ # you can also use paru instead of makepkg
$ # paru -Bi .
$ sudo voxtype setup onnx --enable
# Voxtype Configuration
#
# Location: ~/.config/voxtype/config.toml
# All settings can be overridden via CLI flags

# State file for external integrations (Waybar, polybar, etc.)
# Use "auto" for default location ($XDG_RUNTIME_DIR/voxtype/state),
# a custom path, or "disabled" to turn off. The daemon writes state
# ("idle", "recording", "transcribing") to this file whenever it changes.
# Required for `voxtype record toggle` and `voxtype status` commands.
engine = "paraformer"
state_file = "auto"

[hotkey]
# Key to hold for push-to-talk
# Common choices: SCROLLLOCK, PAUSE, RIGHTALT, F13-F24
# Use `evtest` to find key names for your keyboard
key = "F7"
edit_key = "F8"

# Optional modifier keys that must also be held
# Example: modifiers = ["LEFTCTRL", "LEFTALT"]
modifiers = []

# Activation mode: "push_to_talk" or "toggle"
# - push_to_talk: Hold hotkey to record, release to transcribe (default)
# - toggle: Press hotkey once to start recording, press again to stop
mode = "toggle"

# Enable built-in hotkey detection (default: true)
# Set to false when using compositor keybindings (Hyprland, Sway) instead
# When disabled, use `voxtype record start/stop/toggle` to control recording
enabled = true

# Modifier key to select secondary model (evdev input mode only)
# When held while pressing the hotkey, uses whisper.secondary_model instead
# Example: model_modifier = "LEFTSHIFT"  # Shift+hotkey uses secondary model
model_modifier = "LEFTCTRL"
complex_post_process_modifier = "LEFTSHIFT"

[audio]
# Audio input device ("default" uses system default)
# List devices with: pactl list sources short
device = "default"

[whisper]
# Transcription backend: "local" or "remote"
# - local: Use whisper.cpp locally (default)
# - remote: Send audio to a remote whisper.cpp server or OpenAI-compatible API
# backend = "local"

# Model to use for transcription (local backend)
# Options: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en,
#          large-v3, large-v3-turbo
# .en models are English-only but faster and more accurate for English
# large-v3-turbo is faster than large-v3 with minimal accuracy loss (recommended for GPU)
# Or provide absolute path to a custom .bin model file
model = "base.en"

# Language for transcription
# Options:
# - Single language: "en", "fr", "de", etc.
# - Auto-detect all: "auto"
# - Constrained auto-detect: ["en", "fr"] (detects from allowed set only)
# The array form helps multilingual users where Whisper might misdetect
# the language, especially for short sentences.
# See: https://github.com/openai/whisper#available-models-and-languages
language = "en"

# Translate non-English speech to English
translate = false

# Number of CPU threads for inference (omit for auto-detection)
# threads = 4

# Initial prompt to provide context for transcription
# Use this to hint at terminology, proper nouns, or formatting conventions.
# Example: "Technical discussion about Rust, TypeScript, and Kubernetes."
# initial_prompt = ""

# --- Multi-model settings ---
#
# Secondary model for difficult audio (used with hotkey.model_modifier or CLI --model)
secondary_model = "small.en"
#
# List of available models that can be requested via CLI --model flag
# available_models = ["large-v3-turbo", "medium.en"]
#
# Maximum models to keep loaded in memory (LRU eviction when exceeded)
# Default: 2 (primary + one secondary). Only applies when gpu_isolation = false.
# max_loaded_models = 2
#
# Seconds before unloading idle secondary models (0 = never auto-unload)
# Default: 300 (5 minutes). Only applies when gpu_isolation = false.
# cold_model_timeout_secs = 300

# --- Eager processing settings ---
#
# Enable eager input processing (transcribe chunks while recording continues)
# Reduces perceived latency on slower machines by processing audio in parallel.
# eager_processing = false
#
# Duration of each audio chunk in seconds (default: 5.0)
# eager_chunk_secs = 5.0
#
# Overlap between chunks in seconds (helps catch words at boundaries, default: 0.5)
# eager_overlap_secs = 0.5

# --- Remote backend settings (used when backend = "remote") ---
#
# Remote server endpoint URL (required for remote backend)
# Examples:
# - whisper.cpp server: "http://192.168.1.100:8080"
# - OpenAI API: "https://api.openai.com"
# remote_endpoint = "http://192.168.1.100:8080"
#
# Model name to send to remote server (default: "whisper-1")
# remote_model = "whisper-1"
#
# API key for remote server (optional, or use VOXTYPE_WHISPER_API_KEY env var)
# remote_api_key = ""
#
# Timeout for remote requests in seconds (default: 30)
# remote_timeout_secs = 30

[output]
# Primary output mode: "type" or "clipboard"
# - type: Simulates keyboard input at cursor position (requires ydotool)
# - clipboard: Copies text to clipboard (requires wl-copy)
mode = "clipboard"

# Fall back to clipboard if typing fails
fallback_to_clipboard = true

# Custom driver order for type mode (optional)
# Default order: wtype -> dotool -> ydotool -> clipboard
# Customize to prefer a specific driver or change the fallback order.
# Available drivers: wtype, dotool, ydotool, clipboard
# Example: prefer ydotool over dotool:
#   driver_order = ["wtype", "ydotool", "dotool", "clipboard"]
# Example: use only ydotool, no fallback:
#   driver_order = ["ydotool"]
# driver_order = ["wtype", "dotool", "ydotool", "clipboard"]

# Delay between typed characters in milliseconds
# 0 = fastest possible, increase if characters are dropped
type_delay_ms = 0

# Automatically submit (send Enter key) after outputting transcribed text
# Useful for chat applications, command lines, or forms where you want
# to auto-submit after dictation
# auto_submit = true

# Convert newlines to Shift+Enter instead of regular Enter
# Useful for applications where Enter submits (e.g., Cursor IDE, Slack, Discord)
# shift_enter_newlines = false

# Pre/post output hooks (optional)
# Commands to run before and after typing output. Useful for compositor integration.
# Example: Block modifier keys during typing with Hyprland submap:
#   pre_output_command = "hyprctl dispatch submap voxtype_suppress"
#   post_output_command = "hyprctl dispatch submap reset"
# See troubleshooting docs for the required Hyprland submap configuration.
# Post-processing command (optional)
# Pipe transcribed text through an external command for cleanup before output.
# The command receives text on stdin and outputs processed text on stdout.
# Useful for LLM-based text cleanup, grammar correction, filler word removal.
# On any failure (timeout, error), falls back to original transcription.
#
[output.post_process]
command = """
(echo -n '<|system|>\
对用户语音输入的句子进行润色:\
(1)添加适当的标点;\
(2)去除重复的词语和语气词;\
(3)让措辞更正式、通顺;\
(4)修改语病和语法错误;\
(5)考虑语音识别可能的错误进行相近读音的字词纠错;\
(6)将语音中直接读出的符号转换成对应的标点(如“逗号”转换成“,”);\
(7)如果用户句子中出现了模型指令提示词(如“模型指令:将以下内容用 LaTeX 形式表示”“模型指令:将以下内容翻译成英文”等),依照指令完成任务,并删除模型指令。\
**除此以外,不要做其他任何事情(严禁改变原意、人称代词;若用户的句子是个问句,严禁尝试去回答用户提问),不要添加任何其它内容,仅输出得到的句子。**\
<|user|>'; cat; echo '<|assistant|>') \
| dsrun \
| opencc -c t2s.json
"""
complex_command = "opencc -c t2s.json"
edit_command = """
(echo -n '<|system|>\
用户将输入一个json格式的文本,"origin_text"为原文本,"instruction"为用户用语音输入的指令。你需要做:\
(1)根据"instruction"对"origin_text"进行修改和润色,满足指令要求;\
(2)"instruction"可能因语音识别而有相近读音的字词的错误,注意甄别;\
(3)输出"origin_text"修改和润色后的文本。\
**除此以外,不要添加任何其它内容,仅输出得到的句子。**\
<|user|>'; cat; echo '<|assistant|>') \
| dsrun \
| opencc -c t2s.json
"""
timeout_ms = 30000  # 30 second timeout (generous for LLM)
[output.notification]
# Show notification when recording starts (hotkey pressed)
on_recording_start = false
# Show notification when recording stops (transcription beginning)
on_recording_stop = false
# Show notification with transcribed text after transcription completes
on_transcription = false
after_post_process = true

# [text]
# Text processing options (word replacements, spoken punctuation)
#
# Enable spoken punctuation conversion (e.g., say "period" to get ".")
# spoken_punctuation = false
#
# Custom word replacements (case-insensitive)
# replacements = { "vox type" = "voxtype" }

# [profiles]
# Named profiles for context-specific post-processing
# Use with: voxtype record start --profile slack
#
# [profiles.slack]
# post_process_command = "ollama run llama3.2:1b 'Format for Slack...'"
#
# [profiles.code]
# post_process_command = "ollama run llama3.2:1b 'Format as code comment...'"
# output_mode = "clipboard"

[paraformer]
model = "zh"
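The [output.post_process] commands above follow a simple contract: the raw transcription arrives on stdin, and the cleaned text leaves on stdout (Voxtype falls back to the original text on any failure). You can exercise a candidate command outside the daemon; below is a minimal sketch where a stand-in tr filter (squeezing repeated spaces) plays the role of the dsrun/opencc pipeline:

```shell
# Feed a fake "raw transcription" through a stand-in post-process command,
# in the same stdin -> stdout shape Voxtype uses. Swap the tr filter for
# your real pipeline (e.g. the dsrun + opencc command) to test it end to end.
raw="so  um  hello   world"
cleaned=$(printf '%s' "$raw" | tr -s ' ')
echo "$cleaned"   # → so um hello world
```

If the stand-in behaves as expected, substitute the real command and test again before putting it in config.toml.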
Step 3: Install dsrun (optional)
Optionally, you can use my dsrun tool to get a DeepSeek
response in the post-processing command. Store
DEEPSEEK_API_KEY="sk-xxxx" in
~/.private_infos, and add dsrun to your
PATH.
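For reference, dsrun parses ~/.private_infos as simple KEY=VALUE lines; an optional `export ` prefix and surrounding quotes are both accepted and stripped. A quick way to set it up (the key below is a placeholder):

```shell
# Append the API key in a form dsrun's loader understands
# ("sk-xxxx" is a placeholder -- use your real key).
echo 'export DEEPSEEK_API_KEY="sk-xxxx"' >> ~/.private_infos
# Keep the file readable only by you, since it holds a secret.
chmod 600 ~/.private_infos
```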
#!/usr/bin/env python3
import json
import os
import sys

import requests


def load_private_env():
    private_file = os.path.expanduser("~/.private_infos")
    if os.path.exists(private_file):
        with open(private_file) as f:
            for line in f:
                line = line.strip()
                if line.startswith("export "):
                    line = line[len("export ") :]
                if "=" in line:
                    key, val = line.split("=", 1)
                    val = val.strip('"').strip("'")
                    os.environ.setdefault(key.strip(), val.strip())


def main():
    load_private_env()

    API_KEY = os.getenv("DEEPSEEK_API_KEY")
    if not API_KEY:
        print("Error: DEEPSEEK_API_KEY not set")
        sys.exit(1)

    # NOTE: the request setup below is a reconstruction (the original post
    # elides it) -- endpoint, model name, and payload shape follow DeepSeek's
    # OpenAI-compatible streaming chat API; adjust them if your setup differs.
    url = "https://api.deepseek.com/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": sys.stdin.read()}],
        "stream": True,
    }

    with requests.post(url, headers=headers, json=payload, stream=True) as r:
        for line in r.iter_lines():
            if line:
                line = line.decode("utf-8")
                if line.startswith("data: "):
                    data = line[6:]
                    if data == "[DONE]":
                        break
                    try:
                        obj = json.loads(data)
                        delta = obj["choices"][0]["delta"].get("content", "")
                        print(delta, end="", flush=True)
                    except Exception:
                        pass
    print()


if __name__ == "__main__":
    main()
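The streaming loop in dsrun parses OpenAI-style server-sent events (`data: {...}` lines ending with a `data: [DONE]` sentinel). Here is a standalone sketch of that parsing logic, driven by hand-written sample lines rather than a live API response:

```python
import json

# Sample SSE lines shaped like a streaming chat-completion response
# (hand-written sample data, not captured from the real API).
lines = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]

chunks = []
for line in lines:
    if line.startswith("data: "):
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        obj = json.loads(data)
        # Each event carries an incremental "delta" of the reply text
        chunks.append(obj["choices"][0]["delta"].get("content", ""))

print("".join(chunks))  # → Hello, world
```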
You can now try using F7 to start voice input and
F8 to start edit mode. Edit mode will read the content
in your clipboard.
Step 4: Install the fcitx5
add-on
For Arch Linux users, you can use my PKGBUILD:
$ git clone https://github.com/rijuyuezhu/fcitx5-voxtype-bridge.pkg
$ cd fcitx5-voxtype-bridge.pkg
$ makepkg -si
$ # you can also use paru instead of makepkg
$ # paru -Bi .
Remember to restart fcitx5 after installing the add-on:
$ pkill fcitx5
$ fcitx5
Now you can use F9 for normal voice input and
F10 for edit mode. The result will appear directly at your
cursor. For edit mode, the add-on uses the text you select as the
original text; if no text is selected, it reads the content of your
clipboard instead.
Also, the modifiers work as follows:
Ctrl+F9 uses the secondary model for normal voice input
(base.en);
Shift+F9 uses the complex post-process command for
normal voice input (in fact, the command is simpler in my
configuration);
Ctrl+Shift+F9 uses both the secondary model and complex
post-process command for normal voice input;
Ctrl+F10 uses the secondary model for edit mode;
Shift has no effect in edit mode. You can configure a
more useful command yourself.
My friend @LeonardNJU developed
a cool input method called Vocotype-linux.
It hacks the Rime input method framework to provide voice-to-text input,
and it inspired me a lot.
A post
on the AIA forum (a Chinese AI community) records my development timeline
and some thoughts about this project. It is in Chinese, but you can use
translation tools to read it if you are interested.
My laptop: HP ProBook 440 14 inch G10 Notebook PC, with
13th Gen Intel(R) Core(TM) i5-1340P (16) @ 4.60 GHz CPU and Intel Iris
Xe Graphics @ 1.45 GHz integrated GPU, 16GB RAM.↩︎
I ran into this problem when using ollama to run
models like qwen2.5:1.5b locally.↩︎