Unambitious Site Search

🔍

VitePress has a default slot for search (powered by Algolia.com).
I looked into it, found that provisioning the data was more complicated than the search itself, and started my own search.

This is a description of the full-text search function that I built into this VitePress site. To use it, follow the navigation link at the top right of any page (or press Ctrl+K). This feature is not intended to compete with real search engines. It is merely a simple exercise in frontend programming.

Why Run Your Own Search?

A well-structured sidebar can make it easier to find information, but when it comes to speed of retrieval, we are better off with a keyword search.

There are not many good reasons to craft a homemade website search. There are professional solutions out there, and you can't expect a hobby project to produce anything comparable. I did it out of curiosity.

First, I looked at available solutions. Search for small data sets is offered by Algolia as a free service. So I created an account and tried it out. My initial expectation was that I could just point a web crawler at ergberg.tk and everything would be fine. Algolia does offer a crawler, but only as a paid feature. The standard way to use Algolia is to send them data in a format they can recognize.

There are also other well-known alternatives like Apache Lucene and its descendants such as Elasticsearch. All the major cloud providers like Amazon, IBM, Google and Microsoft offer multiple products. And there are also lightweight solutions like lunr.js.

Crawling & Indexing ain't Easy

When you start to think about it, getting the data into a format suitable for searching might be more complicated than implementing a naive search on that data.

When crawling a website, you may find that a lot of the text that comes up belongs not to the page's content but to the layout and boilerplate. Fortunately, this site is built from Markdown, so I already have plain text, and finding the sections in it shouldn't be too complicated. True. But finding the words can be nontrivial, even in Markdown. For one thing, I use embedded HTML now and then, for example to add a <figcaption> to a figure or to help word splitting with <wbr/>. I also use HTML entities like &shy; to improve hyphenation inside words. And some fenced code is converted into something completely different, like the graphviz and mermaid graphs.
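The cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the code used on this site: the function names, the list of diagram languages and the exact regular expressions are assumptions.

```typescript
// Three backticks, built at runtime so the sketch can mention fences safely.
const FENCE = "`".repeat(3);

// Assumption: fenced blocks with these languages are rendered as diagrams.
const DIAGRAM_LANGS = ["graphviz", "dot", "mermaid"];

// Drop fenced blocks whose source never shows up as readable text.
function dropDiagramFences(markdown: string): string {
  const kept: string[] = [];
  let skipping = false;
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (!skipping && DIAGRAM_LANGS.some((lang) => trimmed === FENCE + lang)) {
      skipping = true; // opening fence of a diagram block
      continue;
    }
    if (skipping) {
      if (trimmed === FENCE) skipping = false; // closing fence
      continue;
    }
    kept.push(line);
  }
  return kept.join("\n");
}

// Hypothetical helper: reduce Markdown source to roughly the visible text.
function stripMarkup(markdown: string): string {
  return dropDiagramFences(markdown)
    // Soft hyphens and <wbr> sit inside words, so remove them without a space.
    .replace(/&shy;|\u00AD|<wbr\s*\/?>/gi, "")
    // Any other embedded HTML tag becomes a word boundary.
    .replace(/<\/?[a-zA-Z][^>]*>/g, " ")
    // Remaining named or numeric entities are dropped as well.
    .replace(/&(?:[a-zA-Z]+|#\d+|#x[0-9a-fA-F]+);/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}
```

Removing soft hyphens and <wbr/> without inserting a space matters: otherwise a single word would be indexed as two fragments.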

Even if I use a search engine, crawling and indexing are up to me. On the other hand, search engines have fancy features that I can well do without, like search with misspelled words and logical expressions over keywords.

Requirements

My own demands and requirements are rather low. I have a bunch of web pages with a small index that can be kept in the users' browsers. A search engine could provide many benefits, such as fuzzy search, synonyms or better ranking of results. But there are still tasks that I have to fulfill manually when using a search engine:

Cleaning up site content from Markdown & Markup
Integrating the search into my pages

From that perspective, the extra effort required to build an undemanding index seems comparatively small. And that's How I did it… [1]

On the search page, I want …

… auto-completion of keywords as you type and
… a list of search results sorted by relevance

Selecting a search result should lead to the source of the information. Typical targets on my website are the individual topics and the individual subheadings. Conveniently, the rendered Markdown already provides a link target for every section.
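To illustrate how a section heading can be turned into a link target, here is a simplified sketch. The slug rules below are an assumption for illustration only; the anchors actually generated by the site's Markdown pipeline may differ in detail, and both function names are hypothetical.

```typescript
// Simplified slug: lower-case, runs of non-alphanumerics become one hyphen.
function slugify(heading: string): string {
  return heading
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical helper: URL of a section inside a rendered page.
function sectionUrl(pagePath: string, heading: string): string {
  return pagePath + "#" + slugify(heading);
}
```

With this sketch, a search result for the section "Building the index" on a page /search.html would point to /search.html#building-the-index.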

Implementation

We need a suitable data model, a GUI component for user interaction, and an automated approach for gathering the search information.

Data model

My data model maps keywords to sections of Markdown pages sorted by weight. A little special handling is required to match the page itself with its first section, which in most cases is synonymous, but does not have to be. For each section, the model provides the URL, the heading, and the first few characters of text.

Entity-relationship diagram of the search index
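Expressed as types, the model could look something like this. The property names and the JSON layout are assumptions made for the sake of the example, not the real structure of the index.

```typescript
// One searchable section of a Markdown page.
interface Section {
  url: string;      // page path plus anchor of the heading
  heading: string;  // text of the heading
  excerpt: string;  // first few characters of the section text
}

// A keyword occurrence: which section, and how much it counts there.
interface Hit {
  section: number;  // position in SearchIndex.sections
  weight: number;
}

interface SearchIndex {
  sections: Section[];
  // For each keyword, the hits are stored sorted by descending weight.
  keywords: Record<string, Hit[]>;
}

// Look up a single keyword and return the matching sections in stored order.
function lookup(index: SearchIndex, keyword: string): Section[] {
  const key = keyword.toLowerCase();
  const known = Object.prototype.hasOwnProperty.call(index.keywords, key);
  const hits = known ? index.keywords[key] : [];
  return hits.map((hit) => index.sections[hit.section]);
}
```

Referring to sections by position keeps the JSON small, because a URL, heading and excerpt are stored once no matter how many keywords point to them.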

The information is mainly copied from the Markdown text, but some precautions are needed to parse and replace typical Markdown syntax and embedded HTML. The goal is that only text that is visible in the rendered pages is part of the search index. For example, a search for the word digraph, which is part of many graphviz diagrams, should return only one hit, for just this sentence here, but not for the various graphviz diagrams it is used in. On the other hand, it should be possible to find all words in .

The JSON representation of the index is about 1 MB in size. It is loaded via a dynamic import. Vue converts it to a 412 kB JavaScript file and Vite delivers it as about 80 kB of compressed data.

UI

Since the site is based on VitePress, the search itself is implemented as a Vue component. It consists of an input field, a dropdown for the keyword completion, and a list for search results. The input field, dropdown and result list are all reactive and change immediately when the model changes. Navigation and selection in the dropdown are supported via event listeners.

The search UI with input field, auto-completion and search results
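The keyword completion behind the dropdown can be as simple as a prefix filter over the known keywords. The following is a sketch under assumptions: the function name, the alphabetical ordering and the default limit are illustrative choices.

```typescript
// Return up to `limit` keywords that start with what the user has typed.
function complete(prefix: string, keywords: string[], limit = 10): string[] {
  const typed = prefix.trim().toLowerCase();
  if (typed === "") return []; // nothing typed, nothing to suggest
  return keywords
    .filter((keyword) => keyword.startsWith(typed))
    .sort() // alphabetical; ordering by weight would be another option
    .slice(0, limit);
}
```

In a Vue component, a function like this would typically feed a computed property, so that the dropdown updates whenever the input changes.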

You can enter two or more keywords in the search field. This will result in multiple searches, one for each word. The results are combined by calculating derived weights for all sections found in this way.
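How the derived weights are calculated is not spelled out here. One plausible way, shown purely as an assumption, is to add up the per-keyword weights of each section and sort by the total:

```typescript
interface WeightedSection {
  section: string; // URL of the section
  weight: number;
}

// Merge the result lists of several single-keyword searches.
// Assumption: the derived weight of a section is the sum of its weights.
function combine(resultsPerKeyword: WeightedSection[][]): WeightedSection[] {
  const totals = new Map<string, number>();
  for (const results of resultsPerKeyword) {
    for (const result of results) {
      totals.set(result.section, (totals.get(result.section) ?? 0) + result.weight);
    }
  }
  const combined: WeightedSection[] = [];
  totals.forEach((weight, section) => combined.push({ section, weight }));
  return combined.sort((a, b) => b.weight - a.weight);
}
```

Summing favors sections that match several of the entered words, without excluding sections that match only one of them.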

The Search Results list can be used independently of the search field. The search string is a property and can be set from the parent component. This is used for the search results on the glossary pages.

Building the index

The index is built by a Vite plugin. The plugin analyzes the Markdown files before they are converted to HTML. Content that does not qualify as a search result is filtered out. Examples are the HTML tags and entities mentioned above.
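The general shape of such a plugin might look like the sketch below. It only uses the standard `name`, `enforce` and `transform` properties of a Vite plugin; the plugin name, the `rawPages` store and the local `PluginLike` type are hypothetical, and how the finished index is written out is left open.

```typescript
// Minimal structural type so the sketch does not depend on Vite being
// installed; a real project would use the Plugin type exported by "vite".
interface PluginLike {
  name: string;
  enforce?: "pre" | "post";
  transform(code: string, id: string): null;
}

// Hypothetical store: raw Markdown source per file, to be indexed later.
const rawPages = new Map<string, string>();

function searchIndexPlugin(): PluginLike {
  return {
    name: "search-index", // hypothetical plugin name
    enforce: "pre",       // assumption: run before Markdown becomes HTML
    transform(code, id) {
      if (id.endsWith(".md")) {
        rawPages.set(id, code); // remember the untouched Markdown
      }
      return null; // null means: leave the module unchanged
    },
  };
}
```

Returning null from `transform` tells Vite that this plugin does not change the module, so the regular Markdown handling continues as usual.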

I decided against the use of stop-word lists. I simply collect all character sequences consisting of letters, digits and some punctuation characters like -. Some combinations are discarded, such as words consisting only of digits. The frequency of keywords is counted per section and then aggregated for supersections and the whole page. Occurrences with the same word stem are combined.

✔️ check
❌ ~~checked~~
❌ ~~checking~~
❌ ~~checks~~

✔️ technology
❌ ~~technologies~~

✔️ http
✔️ https

And no, https isn't the plural form of http.
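The collecting, counting and stem matching described above can be sketched as follows. This is a deliberately naive illustration: the ASCII-only word pattern, the suffix rules and the minimum stem length are assumptions, chosen so that the examples above come out as shown. How the site really decides that https is not a form of http is not stated.

```typescript
// Words: letters and digits, optionally joined by hyphens (ASCII only here).
function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+(?:-[a-z0-9]+)*/g) || [];
  return words.filter((word) => !/^\d+$/.test(word)); // discard pure numbers
}

// Assumption: never strip a suffix if fewer than 5 characters would remain.
const MIN_STEM_LENGTH = 5;

// Very naive stemmer: the first matching suffix rule wins.
function stem(word: string): string {
  const rules: [RegExp, string][] = [
    [/ies$/, "y"],
    [/ing$/, ""],
    [/ed$/, ""],
    [/s$/, ""],
  ];
  for (const [suffix, replacement] of rules) {
    if (suffix.test(word)) {
      const candidate = word.replace(suffix, replacement);
      return candidate.length >= MIN_STEM_LENGTH ? candidate : word;
    }
  }
  return word;
}

// Keyword frequencies of one section, keyed by word stem.
function countKeywords(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of tokenize(text)) {
    const key = stem(word);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Aggregate the counts of subsections into their supersection or page.
function mergeCounts(parts: Map<string, number>[]): Map<string, number> {
  const total = new Map<string, number>();
  for (const part of parts) {
    part.forEach((n, key) => total.set(key, (total.get(key) ?? 0) + n));
  }
  return total;
}
```

A real stemmer handles far more cases; the point here is only that counting happens per stem, so check, checked, checking and checks all add to the same entry.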

Words that appear on almost all pages are also ignored because they don't contribute to the disambiguation of search results. Ah, there are the stop words again, but they are not statically defined. They are inferred from the content.

It almost reads like a poem:

All also an are as at be
but by can content do for from
have in is it like my name not of on
or page see site some that the there this
to us use want web when with you
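A list like this can be inferred by counting, for every word, the number of pages it appears on. The sketch below assumes that "almost all pages" means a fixed share of the pages; the threshold of 0.9 and the function name are illustrative assumptions.

```typescript
// pages: the words of each page (duplicates within a page are fine).
// Returns the words that occur on at least `threshold` of all pages.
function inferStopWords(pages: string[][], threshold = 0.9): string[] {
  const pageCount = new Map<string, number>();
  for (const words of pages) {
    // Count each word once per page, however often it occurs there.
    new Set(words).forEach((word) => {
      pageCount.set(word, (pageCount.get(word) ?? 0) + 1);
    });
  }
  const stopWords: string[] = [];
  pageCount.forEach((count, word) => {
    if (count / pages.length >= threshold) stopWords.push(word);
  });
  return stopWords.sort();
}
```

Because the list is derived from the content, it adapts automatically when pages are added, which a static stop-word list would not do.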

[1] Have you also always been told not to write How we did it? 😏
