Make Voxtype Work Like Typeless on Linux
Typeless is cool! However, Linux users are too often left out of the fun. This post recreates the Typeless experience on Linux using Voxtype.
Introduction
Voxtype is highly customizable, and that's why I chose it. A fancy voice-to-text tool, in my opinion, should at least have the following features:
- Accurate, fast transcription. It transcribes voice to text with high accuracy and at high speed, supporting not only English but also other languages (especially Chinese, in my case). OpenAI Whisper has reasonable accuracy, and tiny models like `base` (142 MB) and `small` (466 MB) run in real time on my poor laptop¹ even on the CPU; using the integrated GPU speeds things up further.
- LLM post-processing. This is critical: the LLM polishes the raw transcription, adds punctuation, corrects grammar, and makes the text more polite and official. This is the key to mimicking the Typeless experience. A large-enough model like DeepSeek-V3.2 is required, since tiny models are not good at instruction following. You don't want the model to answer the question in your voice note, or to tell you who it is; you want it to polish the text²!
- A user-friendly interface: press a hotkey to start recording.
- Customizability. Post-processing, model selection, and the hotkey should all be configurable. Moreover, sometimes I do not want post-processing at all, and then I should be able to skip it with another hotkey or a modifier like Shift or Ctrl.
Voxtype works well in most of these aspects. The last one is not perfect, so I modified the source code to make it work (see my fork; Arch Linux users can try my PKGBUILD).
After installing Voxtype, use the following configuration at `~/.config/voxtype/config.toml`:
```toml
# Voxtype Configuration
```
How to use it? Basically, press the hotkey to start recording, and press it again to stop. The transcribed text is copied to the clipboard, and you can paste it anywhere. For sway/Hyprland/river users, options to type the result directly at the cursor position are also available (refer to the official documentation!); for GNOME/KDE users, clipboard mode is more reliable.
I also configured four use cases:
- F9: use the `base` model with easy post-processing (just convert traditional Chinese to simplified Chinese using `opencc`).
- Ctrl+F9: use the `base` model with complex post-processing (LLM-based polishing).
- Shift+F9: use the `small` model with easy post-processing.
- Ctrl+Shift+F9: use the `small` model with complex post-processing.
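To give a feel for how such bindings might be laid out, here is a hypothetical config fragment. The key names below are my placeholders, not Voxtype's actual schema; check the real `config.toml` documentation before copying anything:

```toml
# HYPOTHETICAL sketch of per-hotkey profiles; Voxtype's real schema
# likely differs, so treat every key name here as a placeholder.
[[profiles]]
hotkey = "F9"
model = "base"
post_process = "opencc -c t2s.json"   # traditional -> simplified Chinese

[[profiles]]
hotkey = "Ctrl+F9"
model = "base"
post_process = "dsrun"                # LLM-based polishing

[[profiles]]
hotkey = "Shift+F9"
model = "small"
post_process = "opencc -c t2s.json"

[[profiles]]
hotkey = "Ctrl+Shift+F9"
model = "small"
post_process = "dsrun"
```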
`dsrun` is a command-line tool I wrote for running DeepSeek models:

```python
#!/usr/bin/env python3
```
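The full script is omitted here, but the idea can be sketched as a small wrapper around DeepSeek's OpenAI-compatible chat completions endpoint. The model name, system prompt, and `DEEPSEEK_API_KEY` variable below are illustrative assumptions, not the actual `dsrun` code:

```python
#!/usr/bin/env python3
"""Sketch of a dsrun-style wrapper: read a raw transcript on stdin,
ask a DeepSeek model to polish it, and print the result."""
import json
import os
import sys
import urllib.request

# DeepSeek exposes an OpenAI-compatible chat completions endpoint.
API_URL = "https://api.deepseek.com/chat/completions"

# Illustrative prompt: polish, never answer. Tiny models tend to ignore
# this instruction, which is why a large model is needed.
SYSTEM_PROMPT = (
    "You are a transcription post-processor. Polish the user's raw voice "
    "transcript: add punctuation, fix grammar, and keep the meaning. "
    "Do NOT answer questions in the text; only return the polished text."
)

def build_request(raw_text: str, model: str = "deepseek-chat") -> dict:
    """Assemble the JSON payload for the chat completions call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_text},
        ],
        # Low temperature: we want faithful polishing, not creativity.
        "temperature": 0.2,
    }

def polish(raw_text: str) -> str:
    """Send the transcript to the API and return the polished text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(raw_text)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Only hit the network when a key is actually configured.
if __name__ == "__main__" and "DEEPSEEK_API_KEY" in os.environ:
    print(polish(sys.stdin.read()))
```

A hotkey's post-processing step can then pipe the raw transcription through this script and receive the polished text on stdout.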
BTW
Better models
I have heard that FunASR is more powerful than OpenAI Whisper for Chinese transcription, though I haven't tried it yet. Voxtype is capable of using any ONNX model as the backend, so it may be worth trying to replace the Whisper backend with FunASR.
Some more discussions about better models for voice-to-text: https://forum.nju-aia.lsamc.website/t/topic/136
Voice-to-text input method
My friend @LeonardNJU created a cool input method called Vocotype-linux. It hacks the Rime input method framework to provide voice-to-text input, and it is more seamless than Voxtype. Maybe it is the future. (For now, however, it is not very stable and is buggy with the fcitx5 framework.)