Make Voxtype Work Like Typeless on Linux

Typeless is cool! However, Linux users are too often left out of the fun. This post strives to recreate the Typeless experience on Linux using Voxtype.

Introduction

Voxtype is highly customizable, and that is why I chose it. A good voice-to-text tool, in my opinion, should have at least the following features:

  1. Accurate, fast transcription. It should transcribe voice to text with high accuracy and at high speed, supporting not only English but also other languages (especially Chinese, in my case). OpenAI Whisper has reasonable accuracy, and small models like base (142 MB) and small (466 MB) run in real time on my poor laptop1, even on CPU alone. Using the integrated GPU speeds things up further.

  2. LLM post-processing. This is critical: the LLM polishes the raw transcription, adds punctuation, corrects grammar, and makes the text more polite and formal. This is the key to mimicking the Typeless experience. A large-enough model like DeepSeek-V3.2 is required, since tiny models are bad at instruction following. You do not want the model to answer the question in your voice note or tell you who it is; you want it to polish the text2!

  3. A user-friendly interface: press a hotkey to start recording.

  4. Customizability. Post-processing, model selection, and the hotkey should all be configurable. Moreover, sometimes I do not want post-processing at all; in that case I should be able to skip it with another hotkey or an added modifier like Shift or Ctrl.

Voxtype works well in most of these aspects. The last one is not perfect out of the box, so I modified the source code to make it work (see my fork; for Arch Linux users, try my PKGBUILD).

After installing Voxtype, use the following configuration at ~/.config/voxtype/config.toml:

~/.config/voxtype/config.toml
# Voxtype Configuration
#
# Location: ~/.config/voxtype/config.toml
# All settings can be overridden via CLI flags

# State file for external integrations (Waybar, polybar, etc.)
# Use "auto" for default location ($XDG_RUNTIME_DIR/voxtype/state),
# a custom path, or "disabled" to turn off. The daemon writes state
# ("idle", "recording", "transcribing") to this file whenever it changes.
# Required for `voxtype record toggle` and `voxtype status` commands.
state_file = "auto"

[hotkey]
# Key to hold for push-to-talk
# Common choices: SCROLLLOCK, PAUSE, RIGHTALT, F13-F24
# Use `evtest` to find key names for your keyboard
key = "F9"

# Optional modifier keys that must also be held
# Example: modifiers = ["LEFTCTRL", "LEFTALT"]
modifiers = []

# Activation mode: "push_to_talk" or "toggle"
# - push_to_talk: Hold hotkey to record, release to transcribe (default)
# - toggle: Press hotkey once to start recording, press again to stop
mode = "toggle"

# Enable built-in hotkey detection (default: true)
# Set to false when using compositor keybindings (Hyprland, Sway) instead
# When disabled, use `voxtype record start/stop/toggle` to control recording
# enabled = true

# Modifier key to select secondary model (evdev input mode only)
# When held while pressing the hotkey, uses whisper.secondary_model instead
# Example: model_modifier = "LEFTSHIFT" # Shift+hotkey uses secondary model
model_modifier = "LEFTSHIFT"

complex_post_process_modifier = "LEFTCTRL"

[audio]
# Audio input device ("default" uses system default)
# List devices with: pactl list sources short
device = "default"

# Sample rate in Hz (whisper expects 16000)
sample_rate = 16000

# Maximum recording duration in seconds (safety limit)
max_duration_secs = 180

# [audio.feedback]
# Enable audio feedback sounds (beeps when recording starts/stops)
# enabled = true
#
# Sound theme: "default", "subtle", "mechanical", or path to custom theme directory
# theme = "default"
#
# Volume level (0.0 to 1.0)
# volume = 0.7

[whisper]
# Transcription backend: "local" or "remote"
# - local: Use whisper.cpp locally (default)
# - remote: Send audio to a remote whisper.cpp server or OpenAI-compatible API
# backend = "local"

# Model to use for transcription (local backend)
# Options: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v3, large-v3-turbo
# .en models are English-only but faster and more accurate for English
# large-v3-turbo is faster than large-v3 with minimal accuracy loss (recommended for GPU)
# Or provide absolute path to a custom .bin model file
model = "base"

# Language for transcription
# Options:
# - Single language: "en", "fr", "de", etc.
# - Auto-detect all: "auto"
# - Constrained auto-detect: ["en", "fr"] (detects from allowed set only)
# The array form helps with multilingual users where Whisper might misdetect
# the language, especially for short sentences.
# See: https://github.com/openai/whisper#available-models-and-languages
language = ["en", "zh"]

# Translate non-English speech to English
translate = false

# Number of CPU threads for inference (omit for auto-detection)
# threads = 4

# Initial prompt to provide context for transcription
# Use this to hint at terminology, proper nouns, or formatting conventions.
# Example: "Technical discussion about Rust, TypeScript, and Kubernetes."
# initial_prompt = ""

# --- Multi-model settings ---
#
# Secondary model for difficult audio (used with hotkey.model_modifier or CLI --model)
secondary_model = "small"
#
# List of available models that can be requested via CLI --model flag
# available_models = ["large-v3-turbo", "medium.en"]
#
# Maximum models to keep loaded in memory (LRU eviction when exceeded)
# Default: 2 (primary + one secondary). Only applies when gpu_isolation = false.
# max_loaded_models = 2
#
# Seconds before unloading idle secondary models (0 = never auto-unload)
# Default: 300 (5 minutes). Only applies when gpu_isolation = false.
# cold_model_timeout_secs = 300

# --- Eager processing settings ---
#
# Enable eager input processing (transcribe chunks while recording continues)
# Reduces perceived latency on slower machines by processing audio in parallel.
# eager_processing = false
#
# Duration of each audio chunk in seconds (default: 5.0)
# eager_chunk_secs = 5.0
#
# Overlap between chunks in seconds (helps catch words at boundaries, default: 0.5)
# eager_overlap_secs = 0.5

# --- Remote backend settings (used when backend = "remote") ---
#
# Remote server endpoint URL (required for remote backend)
# Examples:
# - whisper.cpp server: "http://192.168.1.100:8080"
# - OpenAI API: "https://api.openai.com"
# remote_endpoint = "http://192.168.1.100:8080"
#
# Model name to send to remote server (default: "whisper-1")
# remote_model = "whisper-1"
#
# API key for remote server (optional, or use VOXTYPE_WHISPER_API_KEY env var)
# remote_api_key = ""
#
# Timeout for remote requests in seconds (default: 30)
# remote_timeout_secs = 30

[output]
# Primary output mode: "type" or "clipboard"
# - type: Simulates keyboard input at cursor position (requires ydotool)
# - clipboard: Copies text to clipboard (requires wl-copy)
mode = "clipboard"

# Fall back to clipboard if typing fails
fallback_to_clipboard = true

# Custom driver order for type mode (optional)
# Default order: wtype -> dotool -> ydotool -> clipboard
# Customize to prefer a specific driver or change the fallback order.
# Available drivers: wtype, dotool, ydotool, clipboard
# Example: prefer ydotool over dotool:
# driver_order = ["wtype", "ydotool", "dotool", "clipboard"]
# Example: use only ydotool, no fallback:
# driver_order = ["ydotool"]
# driver_order = ["wtype", "dotool", "ydotool", "clipboard"]

# Delay between typed characters in milliseconds
# 0 = fastest possible, increase if characters are dropped
type_delay_ms = 0

# Automatically submit (send Enter key) after outputting transcribed text
# Useful for chat applications, command lines, or forms where you want
# to auto-submit after dictation
# auto_submit = true

# Convert newlines to Shift+Enter instead of regular Enter
# Useful for applications where Enter submits (e.g., Cursor IDE, Slack, Discord)
# shift_enter_newlines = false

# Pre/post output hooks (optional)
# Commands to run before and after typing output. Useful for compositor integration.
# Example: Block modifier keys during typing with Hyprland submap:
# pre_output_command = "hyprctl dispatch submap voxtype_suppress"
# post_output_command = "hyprctl dispatch submap reset"
# See troubleshooting docs for the required Hyprland submap configuration.

# Post-processing command (optional)
# Pipe transcribed text through an external command for cleanup before output.
# The command receives text on stdin and outputs processed text on stdout.
# Useful for LLM-based text cleanup, grammar correction, filler word removal.
# On any failure (timeout, error), falls back to original transcription.
#
[output.post_process]
command = "opencc -c t2s.json"
# The Chinese prompt below instructs the LLM to polish the input: add
# punctuation, remove repeated words and filler words, make the wording more
# formal and fluent, and fix grammatical errors -- and to do strictly nothing
# else (no changing the meaning or pronouns, no answering questions).
complex_command = "(echo -n '<|system|>对用户输入的句子进行润色:(1)添加适当的标点 (2)去除重复的词语和语气词 (3)让措辞更正式、通顺 (4)修改语病。**不要做其他任何事情(严禁改变原意、人称代词,严禁尝试去回答用户提问,只需要润色。)**。\n<|user|>'; cat; echo '\n<|assistant|>') | dsrun | opencc -c t2s.json"
timeout_ms = 30000 # 30 second timeout (generous for LLM)

[output.notification]
# Show notification when recording starts (hotkey pressed)
on_recording_start = false

# Show notification when recording stops (transcription beginning)
on_recording_stop = false

# Show notification with transcribed text after transcription completes
on_transcription = true

# [text]
# Text processing options (word replacements, spoken punctuation)
#
# Enable spoken punctuation conversion (e.g., say "period" to get ".")
# spoken_punctuation = false
#
# Custom word replacements (case-insensitive)
# replacements = { "vox type" = "voxtype" }

# [vad]
# Voice Activity Detection - filters silence-only recordings
# Prevents Whisper hallucinations on silent audio
#
# enabled = false # Enable VAD (off by default)
# threshold = 0.5 # 0.0 = sensitive, 1.0 = aggressive
# min_speech_duration_ms = 100 # Minimum speech required

# [status]
# Status display icons for Waybar/tray integrations
#
# Icon theme (or path to custom theme file):
# Font-based (require specific fonts):
# - "emoji" - Default emoji icons (🎙️ 🎤 ⏳)
# - "nerd-font" - Nerd Font icons (requires Nerd Font)
# - "material" - Material Design Icons (requires MDI font)
# - "phosphor" - Phosphor Icons (requires Phosphor font)
# - "codicons" - VS Code icons (requires Codicons font)
# - "omarchy" - Omarchy distro icons
# Universal (no special fonts needed):
# - "minimal" - Simple Unicode (○ ● ◐ ×)
# - "dots" - Geometric shapes (◯ ⬤ ◔ ◌)
# - "arrows" - Media player style (▶ ● ↻ ■)
# - "text" - Plain text ([MIC] [REC] [...] [OFF])
# icon_theme = "emoji"
#
# Per-state icon overrides (optional, takes precedence over theme)
# [status.icons]
# idle = "🎙️"
# recording = "🎤"
# transcribing = "⏳"
# stopped = ""

# [profiles]
# Named profiles for context-specific post-processing
# Use with: voxtype record start --profile slack
#
# [profiles.slack]
# post_process_command = "ollama run llama3.2:1b 'Format for Slack...'"
#
# [profiles.code]
# post_process_command = "ollama run llama3.2:1b 'Format as code comment...'"
# output_mode = "clipboard"
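
The [output.post_process] contract is simple: the command receives the transcription on stdin and must write the processed text to stdout, so any small script can slot in here. Here is a minimal sketch of such a filter; the filler-word list and clean-up rules are my own illustration, not part of Voxtype:

```python
#!/usr/bin/env python3
"""Minimal post_process filter: reads text on stdin, writes cleaned text to stdout."""
import re
import sys

# Hypothetical filler words to drop; adjust to your own speech habits.
FILLERS = {"um", "uh", "erm"}

def clean(text: str) -> str:
    # Drop filler words, collapse whitespace, and ensure terminal punctuation.
    words = [w for w in text.split() if w.lower() not in FILLERS]
    out = re.sub(r"\s+", " ", " ".join(words)).strip()
    if out and out[-1] not in ".!?":
        out += "."
    return out

if __name__ == "__main__":
    sys.stdout.write(clean(sys.stdin.read()))
```

Point `command` at a script like this and Voxtype will pipe every transcription through it, falling back to the raw text if the script fails or times out.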

How to use it? Press the hotkey to start recording, and press it again to stop. The transcribed text is copied to the clipboard, ready to paste anywhere. For sway/Hyprland/river users, options to type the result directly at the cursor position are also available (refer to the official documentation!); for GNOME/KDE users, clipboard mode is more reliable.
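Since the daemon writes its state ("idle", "recording", "transcribing") to $XDG_RUNTIME_DIR/voxtype/state, a Waybar/polybar indicator only needs to read that file. A rough sketch, where the icon mapping is my own choice:

```python
import os

# State values the Voxtype daemon writes, mapped to status-bar icons
# (the icons themselves are an arbitrary choice).
ICONS = {"idle": "🎙", "recording": "🎤", "transcribing": "⏳"}

def voxtype_icon(state_dir=None):
    """Return an icon for the current Voxtype state, or '' if unavailable."""
    base = state_dir or os.environ.get("XDG_RUNTIME_DIR", "/tmp")
    path = os.path.join(base, "voxtype", "state")
    try:
        with open(path) as f:
            state = f.read().strip()
    except OSError:
        return ""
    return ICONS.get(state, state)

if __name__ == "__main__":
    print(voxtype_icon())
```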

I also configured four use cases:

  • F9: base model with easy post-processing (just converting traditional Chinese to simplified Chinese using opencc).
  • Ctrl+F9: base model with complex post-processing (LLM-based polishing).
  • Shift+F9: small model with easy post-processing.
  • Ctrl+Shift+F9: small model with complex post-processing.

dsrun is a command-line tool I wrote for running DeepSeek models:

dsrun
#!/usr/bin/env python3
import json
import os
import sys

import requests


def load_private_env():
    """Load KEY=VALUE pairs (optionally prefixed with `export `) from ~/.private_infos."""
    private_file = os.path.expanduser("~/.private_infos")
    if os.path.exists(private_file):
        with open(private_file) as f:
            for line in f:
                line = line.strip()
                if line.startswith("export "):
                    line = line[len("export ") :]
                if "=" in line:
                    key, val = line.split("=", 1)
                    val = val.strip('"').strip("'")
                    os.environ.setdefault(key.strip(), val.strip())


def main():
    load_private_env()

    API_KEY = os.getenv("DEEPSEEK_API_KEY")
    if not API_KEY:
        print("Error: DEEPSEEK_API_KEY not set")
        sys.exit(1)

    # Prefer piped stdin; fall back to command-line arguments.
    if not sys.stdin.isatty():
        user_input = sys.stdin.read().strip()
    elif len(sys.argv) > 1:
        user_input = " ".join(sys.argv[1:])
    else:
        print('Usage: dsrun "your prompt" OR echo "text" | dsrun')
        sys.exit(1)

    url = "https://api.deepseek.com/chat/completions"

    payload = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": user_input}],
        "stream": True,
    }

    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_KEY}"}

    # Stream the response and print each content delta as it arrives.
    with requests.post(url, headers=headers, json=payload, stream=True) as r:
        for line in r.iter_lines():
            if line:
                line = line.decode("utf-8")
                if line.startswith("data: "):
                    data = line[6:]
                    if data == "[DONE]":
                        break
                    try:
                        obj = json.loads(data)
                        delta = obj["choices"][0]["delta"].get("content", "")
                        print(delta, end="", flush=True)
                    except Exception:
                        pass

    print()


if __name__ == "__main__":
    main()
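
The streaming loop above hinges on parsing the server-sent-event lines that OpenAI-compatible APIs emit. Isolated, the parsing logic looks like this (extract_delta is my own helper name, not part of dsrun):

```python
import json

def extract_delta(line):
    """Parse one SSE line from an OpenAI-compatible streaming API.

    Returns the text delta, or None for non-data lines and the [DONE] marker.
    """
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data == "[DONE]":
        return None
    try:
        obj = json.loads(data)
        return obj["choices"][0]["delta"].get("content", "")
    except (json.JSONDecodeError, KeyError, IndexError):
        return None

# Example SSE line as sent by the streaming endpoint:
# data: {"choices": [{"delta": {"content": "Hello"}}]}
```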

BTW

Better models

I have heard that FunASR is more powerful than OpenAI Whisper at Chinese transcription, though I have not tried it yet. Voxtype is capable of using any ONNX model as its backend, so it may be worth trying to replace the OpenAI Whisper backend with FunASR.

Some more discussions about better models for voice-to-text: https://forum.nju-aia.lsamc.website/t/topic/136

Voice-to-text input method

My friend @LeonardNJU created a cool input method called Vocotype-linux. It hacks the Rime input method framework to provide voice-to-text input, which is more seamless than Voxtype. Maybe it is the future. (For now, however, it is not very stable and is buggy under the fcitx5 framework.)


  1. My laptop: HP ProBook 440 14 inch G10 Notebook PC, with a 13th Gen Intel(R) Core(TM) i5-1340P (16) @ 4.60 GHz CPU, an Intel Iris Xe Graphics @ 1.45 GHz integrated GPU, and 16 GB RAM.↩︎

  2. I ran into this problem when using ollama to run models like qwen2.5:1.5b locally.↩︎