af83

Autocomplete with Tire

Full-text databases can do more than full-text search. Elasticsearch provides a full toolbox for indexing and manipulating text and data, and Tire provides a nice, Ruby-ish interface to it.

Autocomplete is one of its basic features: it suggests words starting with the letters you have just typed, and refines the suggestions as you type.

Iterate and filter

Most databases handle that feature with a filter (the LIKE keyword in SQL, a regular expression search in MongoDB). The strategy is simple: iterate over all the values and keep only the ones that match the filter. This is brutal and hurts the hard drive. Elasticsearch can do it too, with the prefix query.

With a small index, this works well. With a large index, it becomes slow and painful.
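To make the iterate-and-filter strategy concrete, here is a minimal sketch in plain Ruby. It does not talk to any database; scan_for_prefix is a hypothetical helper, and the names are the coral names used later in this post.

```ruby
# Brute-force autocomplete: visit every name and keep the ones that start
# with the typed letters. The work grows with the number of names.
# scan_for_prefix is a hypothetical helper, not part of Tire or Elasticsearch.
def scan_for_prefix(names, typed)
  needle = typed.downcase
  names.select { |name| name.downcase.start_with?(needle) }
end

names = [
  'Hydractinia echinata',
  'Heterastridium',
  'Hydractinia symbiolongicarpus',
  'Hydrichthys'
]

p scan_for_prefix(names, 'hydr')
# prints ["Hydractinia echinata", "Hydractinia symbiolongicarpus", "Hydrichthys"]
```

Every keystroke triggers a full pass over the list, which is exactly what becomes painful as the data grows.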

Look for what you want

Elasticsearch is an index, and iterating over an index is a bit paradoxical. The main purpose of an index is to find something quickly and give back a reference to the complete information. Just like the index of a book: your eyes quickly scan an alphabetically sorted list of words, and you get the page number where the information can be read.

Building an index is hard, but once it is done, finding something is very fast. Data is usually read far more often than it is modified, so the cost of indexing is small compared with the gain in search speed.
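The trade-off can be sketched with a toy index in plain Ruby: all the work happens once, at write time, and each read is a single hash lookup. This is only an illustration of the idea, not how Elasticsearch stores its data; build_index and the sample ids are made up for the example.

```ruby
# A toy inverted index: each lowercased word points to the ids of the
# documents that contain it. build_index is an illustrative helper only.
def build_index(documents)
  index = Hash.new { |hash, token| hash[token] = [] }
  documents.each do |id, text|
    text.downcase.split.uniq.each { |token| index[token] << id }
  end
  index
end

documents = {
  1 => 'Hydractinia echinata',
  2 => 'Heterastridium',
  3 => 'Hydractinia symbiolongicarpus',
  4 => 'Hydrichthys'
}

index = build_index(documents)        # the expensive part, done once
p index.fetch('hydractinia', [])      # a single lookup at read time
# prints [1, 3]
```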

Elasticsearch indexes documents with tokens extracted from their properties. Prefix search is a specific need, handled with edge n-grams, a variant of n-grams. An n-gram is a group of contiguous letters extracted from a word. An edge n-gram is an n-gram anchored at the start or the end of a word.

For example, say you are a biologist and want to index the word Heterastridium, with a minimum size of 3 and a maximum size of 6. With too few letters, a token is not selective enough; with too many, it is a waste.

The word Heterastridium is tokenized as:

  • het
  • hete
  • heter
  • hetera

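A prefix search then becomes a plain lookup: typing "hete" matches the indexed token hete directly. As with any Elasticsearch query, more criteria, filters or facets can be added.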
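The tokenization above can be reproduced in a few lines of plain Ruby. edge_ngrams is a hypothetical helper written for this post; it only mimics front edge n-grams and is not the Elasticsearch implementation.

```ruby
# Front edge n-grams: every prefix of the word whose length lies between
# min_gram and max_gram. edge_ngrams is a hypothetical helper.
def edge_ngrams(word, min_gram, max_gram)
  longest = [max_gram, word.length].min
  (min_gram..longest).map { |size| word[0, size] }
end

p edge_ngrams('heterastridium', 3, 6)
# prints ["het", "hete", "heter", "hetera"]
```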

Tire in action

Edge n-gram tokenization is very specific: it can't be used for full-text search, or even for sorting. But one property can be indexed more than once, and Elasticsearch handles that nicely.
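Here is a toy model, in plain Ruby, of what indexing one property twice buys you: prefix tokens are used for matching, and a separate key built from the whole value is used for sorting. index_name and autocomplete are hypothetical helpers, and nothing here calls Elasticsearch.

```ruby
# Toy model of one property indexed twice (plain Ruby, not Elasticsearch).
# :start holds the prefix tokens used for matching,
# :sort holds a single key used for ordering.
def index_name(name)
  prefixes = name.downcase.split.flat_map do |word|
    (1..word.length).map { |size| word[0, size] }
  end
  { name: name, start: prefixes, sort: name.downcase }
end

def autocomplete(entries, typed)
  needle = typed.downcase
  matches = entries.select { |entry| entry[:start].include?(needle) }
  matches.sort_by { |entry| entry[:sort] }.map { |entry| entry[:name] }
end

entries = ['Hydrichthys', 'Heterastridium', 'Hydractinia echinata'].map do |name|
  index_name(name)
end

p autocomplete(entries, 'hydr')
# prints ["Hydractinia echinata", "Hydrichthys"]
```

Sorting on the prefix tokens would make no sense, since each name has many of them, which is why the sort key is kept apart.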

Code example

Boilerplate.

#encoding: utf-8
require 'rubygems'
require 'tire'
require 'json'
require 'active_support/core_ext/object/to_query'
require 'active_support/core_ext/object/to_param'

One analyzer for prefix matching, one for sorting.

conf = {
  settings: {
    analysis: {
      analyzer: {
        my_start: {
          tokenizer: 'whitespace', # I know it's a title, with no punctuation
          filter: %w{asciifolding lowercase my_edge} # no accents, downcase, edge n-grams
        },
        my_sort: {
          tokenizer: 'keyword', #cut nothing, it's just for sorting
          filter: %w{asciifolding lowercase}
        }
      },
      filter: {
        my_edge: {
          type: 'edgeNGram', #1 to 10 letters, from the start
          min_gram: 1,
          max_gram: 10,
          side: 'front'
        }
      }
    }
  },
  mappings: {
    coral: {
      properties: {
        id: { type: 'string', index: 'not_analyzed', include_in_all: false },
        name: {
          type: 'multi_field', # this property needs several indexes
          fields: {
            start: {
              type: 'string', analyzer: 'my_start', include_in_all: false
            },
            sort: {
              type: 'string', analyzer: 'my_sort', include_in_all: false
            }
          }
        }
      }
    }
  }
}
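To get a feel for what the two analyzers produce, here is a rough emulation in plain Ruby. It is a sketch under assumptions: fold only approximates the asciifolding and lowercase filters by stripping combining accents, and the method names simply reuse the analyzer names from the configuration. The real analysis happens inside Elasticsearch.

```ruby
# Rough plain-Ruby emulation of the two analyzers (not Elasticsearch code).

# Approximates asciifolding + lowercase: decompose accented letters,
# drop the combining marks, then downcase.
def fold(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# my_start: split on whitespace, fold, then emit front edge n-grams.
def my_start(text, min_gram = 1, max_gram = 10)
  text.split.flat_map do |word|
    folded = fold(word)
    longest = [max_gram, folded.length].min
    (min_gram..longest).map { |size| folded[0, size] }
  end
end

# my_sort: the keyword tokenizer keeps the whole value as a single token.
def my_sort(text)
  [fold(text)]
end

p my_start('Hydractinia echinata')
p my_sort('Hydractinia echinata')
```

With max_gram set to 10, the 11-letter word "hydractinia" is never indexed in full under name.start; only its first ten prefixes are.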

Some data: scientific names of corals, stolen from Wikipedia.

corals = [
  'Hydractinia echinata',
  'Heterastridium',
  'Hydractinia symbiolongicarpus',
  'Hydrichthys'
]

Feeding an empty index with corals.

Tire.index 'corals' do
  delete
  create conf

  cpt = 0
  import corals.map{ |coral|
    cpt += 1
    {id: cpt.to_s, name: coral, type: 'coral'}
  }
  refresh
end

Most of Tire's job is to build complex JSON and send it to Elasticsearch.

Searching and sorting: which words start with "hydr"?

s = Tire.search 'corals' do
  query do
    string 'name.start:hydr'
  end
  sort do
    by :'name.sort', 'asc'
  end
end

s.results.each do |document|
  p document.name
end

The example prints the matching names to standard output, quickly and in sorted order.
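Since most of Tire's job is to produce JSON, it can help to see roughly what the search above sends to Elasticsearch. The body below is written by hand, assuming that Tire's string query maps to a query_string query; the JSON Tire actually generates may differ in details. search_body is a hypothetical helper.

```ruby
require 'json'

# Hand-written approximation of the request body for the search above.
# Assumption: Tire's `string` query becomes a `query_string` query.
def search_body(field, typed, sort_field)
  {
    query: { query_string: { query: "#{field}:#{typed}" } },
    sort: [{ sort_field => 'asc' }]
  }
end

puts JSON.pretty_generate(search_body('name.start', 'hydr', 'name.sort'))
```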

Conclusion

Brute force is always an option, but dedicated tools can do more, with less.

Elasticsearch provides specialized tools for common needs. Use them!
