Full-text databases can do more than full-text search. Elasticsearch provides a full toolbox for indexing and manipulating text and data, and Tire provides a nice, rubyish interface to that toolbox.
Autocomplete is one of its basic features: it suggests words starting with the letters you have just typed and refines the suggestions as you type.
Iterate and filter
Most databases handle that feature with a filter (the LIKE keyword in SQL, a regular expression search in MongoDB).
The strategy is simple: iterate over all the values and keep only the words that match the filter.
This is brutal and hurts the hard drive. Elasticsearch can do it too, with the prefix query.
With a small index, it works well. With a large index, it gets slow and painful.
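The iterate-and-filter strategy can be sketched in a few lines of plain Ruby. This is only an illustration of the idea, not what a database does internally; `filter_by_prefix` and the `WORDS` list are names made up for the example.

```ruby
# Hypothetical word list, standing in for the rows of a table.
WORDS = ['Hydractinia echinata', 'Heterastridium',
         'Hydractinia symbiolongicarpus', 'Hydrichthys'].freeze

# Brute force: visit every word and keep those starting with the prefix.
# The cost grows linearly with the number of words.
def filter_by_prefix(words, prefix)
  needle = prefix.downcase
  words.select { |word| word.downcase.start_with?(needle) }
end

puts filter_by_prefix(WORDS, 'hydr')
```

Every keystroke triggers a new pass over the whole list, which is why this approach stops being pleasant as the data grows.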
Look for what you want
Elasticsearch is an index, and iterating over an index is a bit paradoxical. The main purpose of an index is to find something quickly and return a reference to the complete information. Just like the index of a book: your eyes quickly scan a list of alphabetically sorted words, and you get the page number where the information is.
An index is costly to build, but once it is built, lookups are very fast. Data usually changes rarely and is read far more often than it is written, so the cost of indexing is small compared with the gain in search speed.
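The book analogy can be made concrete with a toy index in plain Ruby. It is a minimal sketch, assuming a made-up `PAGES` hash and a hypothetical `build_index` helper; a real search engine stores far more than this.

```ruby
# Hypothetical pages of a small book: page number => text.
PAGES = {
  1 => 'corals live in the sea',
  2 => 'the sea is salty',
  3 => 'corals build reefs'
}.freeze

# Built once (the expensive part): word => pages where it appears.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = [] }
  pages.each do |number, text|
    text.split.uniq.each { |word| index[word] << number }
  end
  index
end

INDEX = build_index(PAGES)

# Reading is a single hash lookup, whatever the size of the book.
p INDEX.fetch('corals', [])
p INDEX.fetch('sea', [])
```

Building the hash reads every page once; after that, each lookup goes straight to the answer without scanning the text again.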
Elasticsearch indexes documents with tokens extracted from their properties. Prefix search is a specific need, handled with edge ngrams, a variant of ngrams. An ngram is a group of contiguous letters extracted from a word. An edge ngram is an ngram anchored at the start or the end of a word.
For example, you are a biologist and want to index the word Heterastridium with a min size of 3 and a max size of 6. Too few letters are not discriminating enough; too many are a waste.
The word Heterastridium is tokenized as:
- het
- hete
- heter
- hetera
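The tokens listed above can be reproduced with a small Ruby function. It is a sketch of the front edge ngram idea only; `edge_ngrams` is a hypothetical helper, not part of Tire or Elasticsearch.

```ruby
# Front edge ngrams of a word, lowercased, between min and max letters.
def edge_ngrams(word, min, max)
  letters = word.downcase
  longest = [max, letters.length].min
  (min..longest).map { |size| letters[0, size] }
end

p edge_ngrams('Heterastridium', 3, 6)
# prints ["het", "hete", "heter", "hetera"]
```

A word shorter than the min size produces no token at all, so it can never be suggested; that is one reason to keep the min size small.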
As with any Elasticsearch query, I can add more criteria, filters or facets.
Tire in action
Edge ngram tokenization is very specific: it can't be used for full-text search, or even for sorting. But one property can be indexed more than once, and Elasticsearch handles that nicely.
Code example
Boilerplate.
#encoding: utf-8
require 'rubygems'
require 'tire'
require 'json'
require 'active_support/core_ext/object/to_query'
require 'active_support/core_ext/object/to_param'
One analyzer for matching the start of words, one for sorting.
conf = {
  settings: {
    analysis: {
      analyzer: {
        my_start: {
          tokenizer: 'whitespace', # I know it's a title, with no punctuation
          filter: %w{asciifolding lowercase my_edge} # no accent, downcase
        },
        my_sort: {
          tokenizer: 'keyword', # cut nothing, it's just for sorting
          filter: %w{asciifolding lowercase}
        }
      },
      filter: {
        my_edge: {
          type: 'edgeNGram', # 1 to 10 letters, from the start
          min_gram: 1,
          max_gram: 10,
          side: 'front'
        }
      }
    }
  },
  mappings: {
    coral: {
      properties: {
        id: { type: 'string', index: 'not_analyzed', include_in_all: false },
        name: {
          type: 'multi_field',
          fields: { # this property needs several indexes
            start: {
              type: 'string', analyzer: 'my_start', include_in_all: false
            },
            sort: {
              type: 'string', analyzer: 'my_sort', include_in_all: false
            }
          }
        }
      }
    }
  }
}
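To see what the two analyzers do to a name, here is a rough simulation in plain Ruby. It is an approximation under stated assumptions, not Elasticsearch's implementation: `fold` only strips combining accents (the real asciifolding filter covers more characters), and `fold`, `my_start` and `my_sort` are hypothetical method names borrowed from the analyzer names above.

```ruby
# Rough stand-in for the asciifolding and lowercase filters:
# decompose accented letters, drop the combining marks, downcase.
def fold(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# my_start: whitespace tokenizer, then folding, then front edge ngrams
# of 1 to 10 letters for each word.
def my_start(text, min = 1, max = 10)
  text.split.flat_map do |word|
    folded = fold(word)
    (min..[max, folded.length].min).map { |size| folded[0, size] }
  end
end

# my_sort: keyword tokenizer (the whole value is one token), then folding.
def my_sort(text)
  [fold(text)]
end

p my_start('Hydractinia echinata')
p my_sort('Hydractinia echinata')
```

The same name yields many short tokens for prefix matching and one long token for sorting, which is why the property is indexed twice.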
Some data: scientific names of corals, stolen from Wikipedia.
corals = [
  'Hydractinia echinata',
  'Heterastridium',
  'Hydractinia symbiolongicarpus',
  'Hydrichthys'
]
Feeding an empty index with corals.
Tire.index 'corals' do
  delete
  create conf
  cpt = 0
  import corals.map { |coral|
    cpt += 1
    { id: cpt.to_s, name: coral, type: 'coral' }
  }
  refresh
end
Most of Tire's job is to build the complex JSON that Elasticsearch expects.
Searching and sorting: which words start with "hydr"?
s = Tire.search 'corals' do
  query do
    string 'name.start:hydr'
  end
  sort do
    by :'name.sort', 'asc'
  end
end
s.results.each do |document|
  p document.name
end
The code example prints the list of words to standard output, quickly and sorted.
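For readers curious about the JSON behind the search, the hash below is a hand-written approximation of the request body, assuming that Tire's `string` method maps to a `query_string` query. It was not captured from Tire, and `search_body` is a hypothetical helper; printing the real payload from the search object is the way to check it.

```ruby
require 'json'

# Hand-written approximation of the request body; not captured from Tire.
def search_body(prefix)
  {
    query: { query_string: { query: "name.start:#{prefix}" } },
    sort: [{ 'name.sort' => 'asc' }]
  }
end

puts JSON.pretty_generate(search_body('hydr'))
```

The query targets the `name.start` sub-field and the sort targets `name.sort`, so each index of the `name` property is used for the job it was built for.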
Conclusion
Brute force is always an option, but ad hoc tools can do more with less.
Elasticsearch provides specialized tools for common needs: use them!