Super-powered Search via Couchbase

Sunstone crystal

Ok. If you get that reference, then this article is right up your alley. The Sunstone crystal came into the possession of Superman, and was later used to create the famed Fortress of Solitude. That’s not to say that you will be on a solitary road wielding the immense power that will be available to you at the conclusion of this article, but you will certainly be in an elite community.

Beginning with Couchbase 5, you now have access to some extremely powerful tools to set your search capabilities apart from the pack. These tools are powerful enough on their own, but when combined they possess enough raw power to destroy the universe, or at a minimum make your search results much more accurate and intuitive.

Your powers: tokenizers, analyzers, indexes… oh my

In the same way Superman draws his powers by basking in the rays of the fiery orb we call the Sun, the full text search (FTS) at the heart of the Couchbase engine draws its powers from the customizability of the various tuning knobs at a developer’s disposal. Let’s step through some of them.

Indexes

Every FTS is performed on a user-created full text index, which contains the data targets on which searches are to be performed. Through this knob you can configure your Couchbase documents to only expose certain pieces of information to the FTS Engine. Think of this as your first opportunity to begin contouring the shape of your search behavior. Don’t want to expose email addresses to the search engine? Don’t. Want to include Last Name and leave out First Name? Pow! It’s completely up to you.
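To make the idea concrete, here is a minimal sketch of what "exposing only certain fields" means. It is a conceptual model only, not the Couchbase index API: the exposeFields helper and the Doc type are hypothetical names made up for this illustration.

```typescript
// Hypothetical helper (not part of Couchbase): models which fields of a
// document an index would "see" when only some fields are mapped.
type Doc = Record<string, string>;

function exposeFields(doc: Doc, indexedFields: string[]): Doc {
  const visible: Doc = {};
  for (const field of indexedFields) {
    const value = doc[field];
    if (value !== undefined) {
      visible[field] = value;
    }
  }
  return visible;
}

const person: Doc = {
  firstName: "Clark",
  lastName: "Kent",
  email: "clark.kent@dailyplanet.example.com",
};

// Only lastName is searchable; firstName and email never reach the engine.
const searchable = exposeFields(person, ["lastName"]);
console.log(searchable);
```

In a real deployment this choice lives in the index definition rather than in application code, so the sketch only shows the effect, not the mechanism.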

Tokenizers

Tokenizers split input strings into individual tokens, which are then passed along as a token stream for the query. Here are some popular tokenizers.

Letter: Creates tokens by breaking input text into subsets that consist of letters only; characters such as punctuation marks and numbers are omitted. Creation of a token ends whenever a non-letter character is encountered; for example, the text “Reqmnt: 7-element phrase” would return the following tokens: Reqmnt, element and phrase.
Single: Creates a single token from the entirety of the input text. For example, the text “in each place” would return the following token: in each place. This may be useful for handling URLs or email addresses, which can be prevented from being broken at punctuation or special-character boundaries. It may also be used to prevent multi-word phrases (for example, placenames such as “Milton Keynes” or “San Francisco”) from being broken up due to whitespace, so that they become indexed as a single term.
Unicode: Creates tokens by performing Unicode text segmentation on word boundaries, using the segment library.
Web: Creates tokens by identifying and removing HTML tags.
Whitespace: Creates tokens by breaking input text into subsets according to where whitespace occurs. For example, the text “in each place” would return the following tokens: in, each and place.
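To get a feel for how these differ, here is a rough sketch of three of them as plain functions. These are simplified stand-ins written for this article, not the engine's implementations; in particular, the letter tokenizer below assumes ASCII letters only.

```typescript
type Tokenizer = (text: string) => string[];

// Letter: keep runs of letters; anything else ends the current token.
// Simplifying assumption: ASCII letters only.
const letterTokenizer: Tokenizer = (text) => text.match(/[A-Za-z]+/g) || [];

// Single: the whole input becomes one token.
const singleTokenizer: Tokenizer = (text) => [text];

// Whitespace: break wherever whitespace occurs, dropping empty pieces.
const whitespaceTokenizer: Tokenizer = (text) =>
  text.split(/\s+/).filter((token) => token.length > 0);

console.log(letterTokenizer("Reqmnt: 7-element phrase")); // Reqmnt, element, phrase
console.log(singleTokenizer("in each place"));            // in each place
console.log(whitespaceTokenizer("in each place"));        // in, each, place
```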

So now you’ve got your indexes only exposing the information you want searchable. You’ve got your tokenizers purifying incoming query strings into a representation ideal for your context. You can now begin to combine these tools into a formidable weapon, greater together than the sum of their parts (think Captain Planet or the Power Rangers).

Analyzers

Analyzers wrap the tokenizer with a few more controls to end up with the delicately crafted final representation of the search query with which to match. This is accomplished through character filtering and token filtering. Character filtering, as the name implies, removes unwanted characters from the input before it reaches the tokenizer. Token filtering takes the tokenizer’s output (called a token stream) and makes some additional modifications to the tokens themselves. An easy way to think about it is that the tokenizer tells the engine how to break up a large string into individual tokens, while the token filter further refines each of those output tokens.

So the final recipe looks like this:

Original Input -> Character Filter -> Tokenizer -> Token Filter = Final Query

BOOM!
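Here is that recipe as a small sketch: character filters run first, then the tokenizer, then the token filters. Every name in it (analyze, stripHtml, dropStopWords and friends) is hypothetical and only illustrates the order of operations; the real engine ships its own filters.

```typescript
type CharFilter = (input: string) => string;
type Tokenizer = (input: string) => string[];
type TokenFilter = (tokens: string[]) => string[];

interface Analyzer {
  charFilters: CharFilter[];
  tokenizer: Tokenizer;
  tokenFilters: TokenFilter[];
}

// Original Input -> Character Filter -> Tokenizer -> Token Filter
function analyze(analyzer: Analyzer, input: string): string[] {
  const cleaned = analyzer.charFilters.reduce((text, filter) => filter(text), input);
  const tokens = analyzer.tokenizer(cleaned);
  return analyzer.tokenFilters.reduce((stream, filter) => filter(stream), tokens);
}

// Illustrative building blocks (assumptions, not Couchbase's built-ins).
const stripHtml: CharFilter = (input) => input.replace(/<[^>]*>/g, " ");
const onWhitespace: Tokenizer = (input) =>
  input.split(/\s+/).filter((token) => token.length > 0);
const toLower: TokenFilter = (tokens) => tokens.map((token) => token.toLowerCase());
const STOP_WORDS = ["the", "of", "a", "an"];
const dropStopWords: TokenFilter = (tokens) =>
  tokens.filter((token) => STOP_WORDS.indexOf(token) === -1);

const demoAnalyzer: Analyzer = {
  charFilters: [stripHtml],
  tokenizer: onWhitespace,
  tokenFilters: [toLower, dropStopWords],
};

console.log(analyze(demoAnalyzer, "<b>The</b> Man of Steel")); // [ 'man', 'steel' ]
```

Order matters here: lowercasing runs before the stop-word filter, so "The" is dropped as well.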

Now here is where the magic happens. This highly configurable set of filters and tokenizers can be applied at an index level! This means that if your data model looks something like this:

{
  "firstName": "Clark",
  "lastName": "Kent",
  "email": "clark.kent@dailyplanet.example.com",
  "allergies": "kryptonite dust"
}

…you can set up unique indexes for firstName, lastName, and email. This means that you can configure completely separate variations of filters/tokenizers/analyzers for each data attribute above.
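As a rough sketch of what per-attribute configuration buys you, the snippet below gives each field its own analysis function. The names (fieldAnalyzers, analyzeDocument) are hypothetical, and the choices (whole-value matching for email, lowercased words for names) are assumptions picked for illustration.

```typescript
type FieldAnalyzer = (value: string) => string[];

// Keep the value intact as one lowercased token (handy for email addresses).
const wholeValue: FieldAnalyzer = (value) => [value.toLowerCase()];

// Lowercase, then split on whitespace.
const lowercaseWords: FieldAnalyzer = (value) =>
  value.toLowerCase().split(/\s+/).filter((token) => token.length > 0);

// Each indexed attribute gets its own analysis; unlisted fields are skipped.
const fieldAnalyzers: Record<string, FieldAnalyzer> = {
  firstName: lowercaseWords,
  lastName: lowercaseWords,
  email: wholeValue,
};

function analyzeDocument(doc: Record<string, string>): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const field of Object.keys(fieldAnalyzers)) {
    const value = doc[field];
    const analyzer = fieldAnalyzers[field];
    if (value !== undefined && analyzer !== undefined) {
      result[field] = analyzer(value);
    }
  }
  return result;
}

const analyzed = analyzeDocument({
  firstName: "Clark",
  lastName: "Kent",
  email: "clark.kent@dailyplanet.example.com",
  allergies: "kryptonite dust",
});
console.log(analyzed);
```

Note that allergies never shows up in the result, because no analyzer was configured for it.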

Bring it all together

Here’s an actual example of preparing the FTS query.

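public async ftsSearch(searchCriteria: string): Promise<any> {
  // Build a query string in which every term is required ('+term').
  let query = searchCriteria
     .replace(/[`~!#$%&()_|\-=?;'",.<>\{\}\[\]\\\/]/g, ' ') // punctuation -> space
     .trim()                         // trim after the replace, so stripped punctuation leaves no edge spaces
     .split(/\s+/)                   // split on runs of whitespace
     .filter(val => val.length > 0)  // empty input yields no terms, not a bare '+'
     .map(val => '+' + val)
     .join(' ')
     .toLowerCase();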

The searchCriteria input is being tweaked, massaged and sanitized into a more suitable context for optimal results. Going a step further we can add something like this.

// Prepend a firstName-scoped clause to the original search criteria.
searchCriteria = 'firstName:'
  .concat(query.trim())
  .concat(' ')
  .replace(/\s+/g, ' ') // collapse runs of whitespace
  .concat(searchCriteria);

As shown above, you can add extra clauses to the original search string. In this case, a clause scoped to the firstName field is prepended, which adds weight to that field so that documents with strong firstName matches are returned.
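For a standalone feel of the same idea, here is a sketch that sanitizes the input and then adds field-scoped clauses. It is a variation, not the snippet above: it assumes the query-string syntax's +term (required), field:term (field scoping) and term^N (boost) forms, and the names sanitizeTerms and buildQuery are made up for this illustration.

```typescript
// Strip punctuation, split on whitespace, lowercase, and drop empty terms.
function sanitizeTerms(raw: string): string[] {
  return raw
    .replace(/[`~!#$%&()_|\-=?;'",.<>{}\[\]\\\/]/g, " ")
    .trim()
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .map((term) => term.toLowerCase());
}

// Require every term, then add boosted clauses scoped to one field.
// Assumption: the target engine accepts "+term", "field:term" and "^boost".
function buildQuery(raw: string, boostField: string, boost: number): string {
  const terms = sanitizeTerms(raw);
  const required = terms.map((term) => "+" + term);
  const boosted = terms.map((term) => boostField + ":" + term + "^" + boost);
  return required.concat(boosted).join(" ");
}

console.log(buildQuery("Clark, Kent!", "firstName", 2));
// +clark +kent firstName:clark^2 firstName:kent^2
```

Trimming after the punctuation is stripped, plus the empty-term filter, keeps input like "Kent!" from producing a dangling "+" at the end of the query.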

“With great power comes great responsibility.” Use this information to craft an unwavering search engine suitable for the gods.

If you’d like more information and want to dive even deeper into the archives of ancient knowledge, check out the complete Couchbase FTS documentation. Godspeed, and good luck.

American Express Technology