I recently implemented a new spellchecker module to improve on our current implementation, which is based on a node.js native addon. Both are built on the well-known spellchecking library Hunspell, but the new module makes performant spellchecking possible even in web browsers via WebAssembly. This post is a short retrospective on the considerations involved and on possible further improvements.
The modules I implemented are Hunspell-asm, a wasm binding for the Hunspell spellchecker, and Cld3-asm, a binding for Chromium’s language detector CLD3. Additionally, Electron-hunspell is published to make hunspell-asm usable in Electron-based applications.
Build tools and configurations
There are a couple of known ways to build C/C++ code into a WebAssembly binary. Among them, Emscripten was the tool used to build hunspell-asm and cld3-asm. It has out-of-the-box support for reusing an existing makefile, and it provides a lot of convenient helpers for interop between javascript and C/C++ code. For example, hunspell’s C interface for spelling suggestions returns a pointer to a string array,
int Hunspell_suggest(Hunhandle* pHunspell, char*** slst, const char* word);
which can easily be consumed via Emscripten helper functions like getValue / Pointer_stringify. Emscripten also supports some level of file system access API, including for node.js.
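As a rough sketch of what walking that char*** result looks like, the snippet below simulates 32-bit linear memory with a plain Uint8Array instead of a real Emscripten module. readPointer and readCString are hypothetical stand-ins for getValue(ptr, '*') and Pointer_stringify (newer Emscripten releases expose UTF8ToString for the same job), and the 4-byte pointer size is an assumption that holds for 32-bit wasm targets.

```javascript
// Simulated 32-bit linear memory, standing in for the Emscripten heap.
const memory = new Uint8Array(256);
const view = new DataView(memory.buffer);

// Stand-in for getValue(ptr, '*'): read a little-endian 32-bit pointer.
const readPointer = (ptr) => view.getUint32(ptr, true);

// Stand-in for Pointer_stringify / UTF8ToString: decode bytes up to the NUL terminator.
const readCString = (ptr) => {
  let end = ptr;
  while (memory[end] !== 0) end++;
  return new TextDecoder('utf-8').decode(memory.subarray(ptr, end));
};

// Demo helper: lay out a char** table followed by its strings, as native code would.
function writeStringArray(words, base) {
  const encoder = new TextEncoder();
  let cursor = base + words.length * 4; // strings are placed after the pointer table
  words.forEach((word, i) => {
    const bytes = encoder.encode(word);
    memory.set(bytes, cursor);
    memory[cursor + bytes.length] = 0; // NUL terminator
    view.setUint32(base + i * 4, cursor, true);
    cursor += bytes.length + 1;
  });
  return base;
}

// Read `count` strings from the char** stored at *slstPtr (the char*** out-parameter).
function readSuggestions(slstPtr, count) {
  const listPtr = readPointer(slstPtr);
  const result = [];
  for (let i = 0; i < count; i++) {
    result.push(readCString(readPointer(listPtr + i * 4)));
  }
  return result;
}

const slstPtr = 8; // address where the callee stored the char**
view.setUint32(slstPtr, writeStringArray(['hello', 'help'], 16), true);
console.log(readSuggestions(slstPtr, 2)); // [ 'hello', 'help' ]
```

With a real module, the list and each string would also have to be released afterwards (hunspell's C interface has Hunspell_free_list for that).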
For the actual build, neither hunspell-asm nor cld3-asm builds the WebAssembly binary in its javascript build script. Instead, each has logic similar to node-pre-gyp that downloads a prebuilt binary. Still, compared to a node.js native addon, there is no need to care about per-platform builds in most cases, so the build process is simplified by using a Docker image. There are ongoing efforts to seamlessly import native code from javascript and vice versa, some of them already with early implementations via webpack loaders and the like. Eventually that will be the way to integrate WebAssembly with javascript code, but for hunspell-asm and cld3-asm I would like to reuse the existing build setups, including the makefile from each project's configuration.
node.js vs. browser, or isomorphic
Hunspell-asm and Cld3-asm aim to be isomorphic. They support node.js, browsers, and even Electron, as long as the runtime supports WebAssembly natively, and they include some handling for a few specific target environments.
After the native code is built with Emscripten, the build output contains xxx.wasm, the WebAssembly binary, and xxx.js, which contains Emscripten's helpers. This javascript module takes care of actually loading the module: it detects the running environment, calls fetch or require as appropriate to import the binary, then compiles and initializes it. Environments like Electron are an edge case, because the renderer process has fetch and the node.js globals at the same time, so Emscripten's script can be confused about which way to load the binary. In addition, fetch is not the way to go if all resources are packaged on the local system and the application cannot call a remote endpoint to pull down binaries. Hunspell-asm uses the SINGLE_FILE option for this purpose.
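To illustrate why Electron is ambiguous, here is a minimal sketch of environment detection. It is not Emscripten's actual detection code; detectEnvironment and its return labels are hypothetical, and it only relies on process.versions.node, process.versions.electron and process.type, which Electron sets to 'renderer' in renderer processes.

```javascript
// Classify the runtime from a globals-like object (hypothetical helper).
// Note that fetch alone is not a browser signal: Electron renderers and
// recent node.js versions expose it too.
function detectEnvironment(g = globalThis) {
  const versions = (g.process && g.process.versions) || {};
  if (versions.electron && g.process.type === 'renderer') return 'electron-renderer';
  if (versions.node) return 'node';
  if (typeof g.fetch === 'function') return 'browser';
  return 'unknown';
}

console.log(detectEnvironment());
```

A loader could use the returned label to pick between require, fetch, or an embedded binary instead of guessing from the presence of a single global.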
SINGLE_FILE basically embeds the WebAssembly binary into the javascript module emitted by Emscripten, encoding the binary as a base64 string. It removes the need to load a separate binary, but it potentially increases the initial module import time, because the binary is no longer lazily loaded but included in the module directly. It is also worth noting that it defeats some loading optimization strategies for WebAssembly, such as on-the-fly compilation via WebAssembly.instantiateStreaming() or caching the compiled binary in IndexedDB to reduce subsequent loading times. For those reasons, hunspell-asm and cld3-asm aim to split the binary out again in the next version.
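The embedding itself can be sketched in a few lines. This is only an illustration of the idea, not the code Emscripten generates: the base64 constant below is the smallest valid module (just the magic number and version) standing in for a real binary, and decodeBase64 is a hypothetical helper.

```javascript
// Smallest valid wasm module (magic number + version), base64-encoded.
// A stand-in for the real binary that SINGLE_FILE would embed.
const EMBEDDED_WASM_BASE64 = 'AGFzbQEAAAA=';

// Decode base64 into bytes; atob is available in browsers and recent node.js.
function decodeBase64(b64) {
  return Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
}

// The whole string has to be decoded before compilation can start,
// so nothing can be streamed or compiled while it downloads.
const bytes = decodeBase64(EMBEDDED_WASM_BASE64);
const wasmModule = new WebAssembly.Module(bytes); // synchronous compile, fine for a tiny module
const instance = new WebAssembly.Instance(wasmModule, {});

console.log(bytes.length, WebAssembly.validate(bytes)); // 8 true
```

Base64 also inflates the payload by roughly a third (here 8 bytes become 12 characters), which adds to the import cost described above.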
Unlike cld3-asm, hunspell-asm needs additional input resources to operate: dictionaries for spellchecking. Its spellchecker instantiation interface takes paths to the dictionary files and then reads those files from the system internally. Since WebAssembly has the same limits as any other javascript runtime environment, there is no default way to access the file system from it.
Hunhandle* Hunspell_create(const char* affpath, const char* dpath);
Emscripten thankfully exposes a few workarounds for this. The easiest way is to package the necessary resources into the binary at build time, but that wasn’t an option for hunspell, since it has nearly 40 dictionaries available and in a lot of cases only a few of them are used in an application. Without packaging, Emscripten still provides its own virtual filesystem that hunspell can use.
In a node.js environment, that filesystem can be mounted onto the real filesystem, and then any file API used in C/C++ gains access to the actual files.
FS.mount(FS.filesystems.NODEFS, { root: path.resolve(dirPath) }, mountedDirPath);
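A slightly fuller sketch of that mount, assuming FS is the Emscripten filesystem object exposed by the generated module. mountDirectory and the returned cleanup function are hypothetical names; FS.mkdir, FS.mount, FS.unmount and FS.rmdir are part of Emscripten's File System API.

```javascript
const path = require('node:path');

// Mount a real directory into the virtual filesystem (node.js only).
// FS is assumed to be the Emscripten filesystem object.
function mountDirectory(FS, dirPath, mountPoint) {
  FS.mkdir(mountPoint); // the mount point must exist in the virtual FS first
  FS.mount(FS.filesystems.NODEFS, { root: path.resolve(dirPath) }, mountPoint);
  // Hand back a cleanup function so the caller can undo the mount.
  return () => {
    FS.unmount(mountPoint);
    FS.rmdir(mountPoint);
  };
}
```

Returning the cleanup function keeps mounting and unmounting paired, which matters once dictionaries are swapped at runtime.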
For browser environments, Emscripten provides the MEMFS filesystem type, which maps in-memory objects to the file system. Once the data is loaded in memory as an ArrayBufferView, it can be written as a virtual file, and the C/C++ file API will then access it as a plain file.
FS.writeFile(mountedFilePath, contents, { encoding: 'binary' });
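Wrapped into a small helper, the write could look like the sketch below. FS is again assumed to be the Emscripten filesystem object, writeDictionaryFile is a hypothetical name, and the dictionary URL in the comment is made up for illustration.

```javascript
// Copy in-memory dictionary bytes into the virtual (MEMFS) filesystem.
function writeDictionaryFile(FS, filePath, contents) {
  if (!ArrayBuffer.isView(contents)) {
    throw new TypeError('contents must be an ArrayBufferView, e.g. Uint8Array');
  }
  // Same call as above; MEMFS keeps the bytes in memory.
  FS.writeFile(filePath, contents, { encoding: 'binary' });
  // Return a cleanup function that removes the virtual file again.
  return () => FS.unlink(filePath);
}

// In a browser the bytes would typically come from fetch (hypothetical URL):
//   const res = await fetch('/dictionaries/en-US.dic');
//   const contents = new Uint8Array(await res.arrayBuffer());
```

Since the bytes live in memory twice for a moment (the fetched buffer and the MEMFS copy), removing unused dictionaries via the cleanup function helps keep memory usage down.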
A few considerations around performance
A lot of talks about WebAssembly mention a huge performance boost. That obviously varies. In the case of hunspell-asm, expectations weren’t high, since the original module in use was already a node.js native addon, but I was reasonably satisfied after replacing it with hunspell-asm. One thing to note is that this is very subjective, because the comparison doesn’t account for surrounding factors such as how the native addon was composed, Electron’s behavior when integrating node.js in the renderer process, and so on. If code is really compute-intensive, it may be worth running a more strictly controlled benchmark instead.
Another thing that allows hunspell and cld3 to perform well via WebAssembly is that the nature of their interfaces doesn’t require a lot of interop with javascript. Directly accessing memory across the WebAssembly and javascript memory areas is not trivial, which creates overhead each time data is passed or returned. As a naïve example, Hunspell_suggest internally stores suggestions as std::vector<std::string> and hands them back to the caller as a pointer to a string array through the slst parameter.
int Hunspell_suggest(Hunhandle* pHunspell, char*** slst, const char* word);
To read those string values in javascript, the caller needs to iterate over the array and convert each pointer to a string value via Pointer_stringify or a similar helper. The more often such interop between the two contexts is required, the more it erodes the performance gain from WebAssembly.
One performance factor other than runtime performance is the size of the module and its loading / init time. Theoretically, a WebAssembly build of the same code should be smaller, since it is stored in a binary format, but in real cases the binary can be larger than expected. For example, hunspell-asm takes around 800KB and cld3-asm around 930KB, including Emscripten's javascript glue. This is mostly due to the dependencies the native code relies on. As with Hunspell_suggest, if the code uses std::vector or std::string, or at a deeper level memory allocation like malloc, the supporting code for those has to be included in the output. Rust aims to provide first-class WebAssembly support and even has experimental work such as a memory allocator targeted at WebAssembly, which isn’t something immediately available for C/C++ with existing codebases. If the intent is to write new code for a WebAssembly target, it is worth trying to minimize dependencies that bring size concerns.
It is very important to remember that WebAssembly’s memory is not garbage collected. From buffers like the hunspell dictionary to class instances and internally allocated resources, all of them must be freed manually. Destructors of C++ classes exposed through embind are not called automatically either, and there are other edge cases that require a careful eye to avoid memory leaks.
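One way to keep manual frees from being forgotten is to pair every allocation with a try/finally. The sketch below assumes a module object that exports _malloc and _free, as Emscripten modules do when those functions are exported; withAllocation and the fake allocator used for the demo are hypothetical.

```javascript
// Run `use` with a temporary allocation and always free it, even if `use` throws.
function withAllocation(wasm, size, use) {
  const ptr = wasm._malloc(size);
  try {
    return use(ptr);
  } finally {
    wasm._free(ptr);
  }
}

// Fake allocator for the demo; it only tracks which pointers are still live.
const fakeModule = {
  live: new Set(),
  next: 8,
  _malloc(size) {
    const ptr = this.next;
    this.next += size;
    this.live.add(ptr);
    return ptr;
  },
  _free(ptr) {
    this.live.delete(ptr);
  },
};

const doubled = withAllocation(fakeModule, 16, (ptr) => ptr * 2);
try {
  withAllocation(fakeModule, 16, () => { throw new Error('boom'); });
} catch (e) {
  // The allocation was still released before the error propagated.
}
console.log(doubled, fakeModule.live.size); // 16 0
```

The same shape works for embind objects, where the finally block would call the object's delete() method instead of _free.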