
TYPO3 and Apache Solr – The Indexing Process

One key to a successful in-site search with Apache Solr is to understand how indexing works. In this post I explain how content is split into terms, which Solr then uses to find relevant content.

In the first post of this series, I described how easy it is to get Apache Solr up and running together with TYPO3. If you missed it or have no clue what this is all about, you can read it here: /2018/03/typo3-apache-solr-advanced-search-introduction/.

But the real fun starts when you know some details about how Apache Solr deals with incoming data. It took me a while to understand this. In this post I will share the “big picture” and explain what happens behind the scenes.

Fields and field types

This is the most obvious part: data is passed from TYPO3 to Apache Solr. Each transferred record is saved as a document by Apache Solr. Each of these documents consists of several fields.

The TYPO3 Solr extension provides a good and reasonable configuration for TYPO3 standard content and some extensions, like EXT:news. Fields that are well known from the TYPO3 backend, like page title, abstract, description and author, are pushed to Solr. Of course the content of a page finds its way to Solr too.

On the Solr side each of these fields is connected with a field type. The field type defines how Solr handles the data internally. The available field types are defined in the schema.xml of a Solr core.
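To illustrate this, a field definition in schema.xml connects a field name with a field type. The names in this sketch are just examples, not the actual definitions of the TYPO3 Solr extension:

<!-- the field "title" uses the analysis rules of the field type "text" -->
<field name="title" type="text" indexed="true" stored="true"/>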

A field type can define separate analyzers for indexing and for querying the Solr server. This means Solr can handle the incoming data differently, depending on whether it is being indexed or searched.

Tokenizer and filters

Within each analyzer one tokenizer and a series of filters is defined. A tokenizer splits the incoming text at defined boundaries and returns one or more so-called tokens. In a second step these tokens are piped through the defined filters. The result is a set of terms.

During indexing these terms are saved to the Solr index and connected with the documents. During querying the terms are looked up and the related documents are passed back to the TYPO3 extension and displayed in the search result.
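Putting this together, a simplified field type definition in schema.xml could look like the following sketch, with one analyzer for indexing and one for querying. The field type name and the concrete tokenizer and filters are illustrative; the definitions shipped with the TYPO3 Solr extension are more elaborate:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- analyzer used while documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <!-- analyzer used while search queries are processed -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>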

Tokenizers

As already said, tokenizers split the incoming text into tokens. There is only one tokenizer per analyzer. Apache Solr provides fourteen different tokenizers out of the box. If you know Java (the programming language), you can code custom tokenizers.

In the following sections I will explain some of these tokenizers and show what their output looks like.

Standard Tokenizer

The Standard Tokenizer splits the incoming text at whitespace and punctuation characters, including dots, colons, commas and the @ sign. If a dot is not followed by whitespace, the text is not split at this position. The split characters are not part of the resulting tokens. Here is an example:

Text incoming:
Please, email john.doe@foo.com by 03-09, re: m37-xq.

Tokens outgoing:
"Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

Classic Tokenizer

The Classic Tokenizer is quite similar to the Standard Tokenizer, except that it keeps email addresses and internet hostnames as single tokens and does not split at hyphens when the token contains a digit. In contrast to the Standard Tokenizer it does not apply the Unicode word boundary rules defined in http://unicode.org/reports/tr29/#Word_Boundaries. The incoming text is the same as above, the result is slightly different:

Text incoming:
Please, email john.doe@foo.com by 03-09, re: m37-xq.

Tokens outgoing:
"Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

White Space Tokenizer

As the name says, this tokenizer splits the incoming text only at whitespace. Given the example from above, the result will be:

Tokens outgoing:
"Please,", "email", "john.doe@foo.com", "by", "03-09,", "re:", "m37-xq."

The previous tokenizers were quite intuitive, because they split text into tokens roughly the way we do while reading. The next two should give an idea of how flexibly tokenizing can be handled.

Path Hierarchy Tokenizer

The Path Hierarchy Tokenizer builds tokens from a path string, starting with the topmost level of the path. Furthermore it is possible to replace the path delimiter of the incoming string with a character of your choice. An example would be:

Text incoming:
"c:\usr\local\apache"

Tokens outgoing:
"c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"

Regular Expression Pattern Tokenizer

There are also some more “exotic” tokenizers, like the Regular Expression Pattern Tokenizer. It takes an argument that is understood by java.util.regex.Pattern and splits the incoming text according to the defined regex.
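As a small sketch, a tokenizer that splits a comma-separated list into tokens could be configured like this (the pattern attribute accepts any java.util.regex.Pattern expression):

<analyzer>
  <!-- split at commas, optionally surrounded by whitespace -->
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>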

For a complete overview of the tokenizers, you can head over to https://lucene.apache.org/solr/guide/6_6/tokenizers.html and find more fancy stuff.

Filters

The result of the tokenizers is further processed by filters. Filters are used to create terms out of the tokens. The result of the filters is one or more terms.

They are quite similar to tokenizers, but they act on single tokens instead of the complete text. Overall there are 46 filters available. Here are some examples of filters that are used within the TYPO3 Solr configuration; they also illustrate why different languages need different cores.

Stemmer Filter

There are several of them, like the Porter Stem Filter or the Snowball Porter Stemmer Filter. Both try to reduce words to their stems and convert them into terms, which then land in the Solr index. But stemming is very language specific, so the results of the filter will differ a lot depending on the language.

Here are two examples of the SnowballPorterFilterFactory. The first one is for English.

Text incoming
"flip flipped flipping"

Tokens out
"flip", "flipped", "flipping"

Filter output
"flip", "flip", "flip"

If French is used for stemming with the same incoming text, the filter output will look like this:

Filter output
"flip", "flipped", "flipping"

This example shows why it does not make sense to mix different languages in a single Solr core: all field definitions, and therefore the stemming rules, are fixed per core. (Well, technically you can mix languages, but the results will not meet your expectations.)

(Edge-) N-Gram Filters

The N-Gram filters split the incoming terms into smaller portions. The size of the resulting terms is defined by a minimum and a maximum limit. This example makes it clearer:

The minimum size of a resulting term is “3” and the maximum size is “5”.

Tokens incoming
""four", "score"

Filter output
"fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"

The difference between the N-Gram and the Edge N-Gram filter is that the Edge N-Gram filter always starts at the beginning of the string. The output for the above example would be:

Filter output
"fou", "four", "sco", "scor", "score"

Managed synonym and stop-word filters

Both filters work with a defined set of words. The stop word filter removes any term from the index that matches an entry of its list. The synonym filter uses the same mechanism, but works in the opposite direction: every matching term is replaced with all of the defined synonyms.

In the current version of the TYPO3 Solr extension, it is possible to manage both lists per core from within the backend module.
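In the schema both filters reference a named, managed word list that can be edited at runtime; a sketch of such definitions (the name of the managed set, here "english", depends on the core):

<!-- removes every token that is listed in the managed stop word set -->
<filter class="solr.ManagedStopFilterFactory" managed="english"/>

<!-- replaces matching tokens with the synonyms from the managed synonym set -->
<filter class="solr.ManagedSynonymFilterFactory" managed="english"/>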

Remove Duplicate Token Filter

This is a very useful filter at the end of the filter chain. It removes all duplicate tokens / terms that were produced, so only unique terms are pushed into the index.

An example:

Text incoming
"Watch TV"

Tokenizer output
"Watch"(1) "TV"(2)

Synonym Filter Output
"Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)

Stem Filter Output
"Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)

Remove Duplicate Filter Output
"Watch"(1) "Television"(2) "TV"(2)

The last step of this example shows how the “Remove Duplicates Filter” works. But it also shows how tokens mutate when they are piped through a series of filters. The digit in parentheses indicates the position of the original token each term comes from.
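An analyzer chain that behaves roughly like this example could look like the following sketch. It assumes a synonyms.txt that maps TV to Television, Televisions, TV, TVs, and the concrete stemmer is only an example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- expands "TV" to all synonyms defined in synonyms.txt -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
  <!-- reduces plural forms like "Televisions" and "TVs" to their stems -->
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <!-- removes the duplicates created by the two previous filters -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>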

These are just a few examples of filters. They should give an impression of what is possible. The full list of all 46 filters is available at https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html

Dynamic fields

On the one hand, Apache Solr makes it possible to define every single field in the schema.xml, which can become quite unhandy if you need many, many fields. On the other hand, many of them probably have quite similar requirements for tokenizing and filtering. So why should it be necessary to create an extra field definition for each field of a document?

This is where the dynamic fields come into play. Dynamic field definitions are provided by the Solr configuration. They allow you to reuse the same tokenizer and filter definitions for different fields.

The TypoScript configuration allows you to use these fields. The new fields are defined in the indexing part:

plugin.tx_solr.index.queue.news.fields.teaserResult_textS {
   ...
}

This snippet creates a field teaserResult_textS in a Solr document of the type news. The text after the underscore defines how the content of the field is handled by Solr: text says that it uses the tokenizer and filter definitions of the text field type. The last character can have two values: S or M. The S suffix stands for “single valued”, in other words a single string. The M suffix is used for multi-valued fields, which can be compared to an array that holds multiple values.
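On the Solr side these suffixes are backed by dynamic field definitions in schema.xml. A simplified sketch of how the two variants could be declared (the actual definitions shipped with the extension may differ in details):

<!-- *_textS: single valued, analyzed with the "text" field type -->
<dynamicField name="*_textS" type="text" indexed="true" stored="true"/>

<!-- *_textM: same analysis, but multi-valued, comparable to an array -->
<dynamicField name="*_textM" type="text" indexed="true" stored="true" multiValued="true"/>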

And again: The complete list of dynamic fields is available on https://docs.typo3.org/typo3cms/extensions/solr/8.0.2/Appendix/DynamicFieldTypes.html.

Final words and outlook

I hope you have got a good impression of how flexibly an Apache Solr core can be filled with data. The next post about TYPO3 and Apache Solr will be about how to influence the score and thus the sorting of the results. The fourth post will be about “Debugging Solr indexing and scoring”.

Credits

I want to thank my supporters, who make this blog post possible. For this blog post I welcome Paul Kamma as a bronze sponsor of my blog.

If you also appreciate my blog and want to support me, you can say “Thank You!”. Find out the possibilities here:

I found the blog post image on pixabay. It was published by Tomasz Proszek under the CC0 Creative Commons License. It was modified by myself using pablo on buffer.
