
Multi language handling in Solr

If you are building a full-text search application for your website on top of Apache Solr, sooner or later you will definitely face the problem of how to handle multiple languages. There are at least two ways to do it, and here I will describe both of them, along with the bottlenecks you might run into and solutions to the problems that come up.

1. Multicore approach

Perhaps the most popular and best-described option is the multicore approach. It is really easy to set up, and you can start using it almost immediately if you already have one core configured (schema.xml and solrconfig.xml). You just need to duplicate the existing core once for each language you have. Then, upon reading or writing, select the proper core depending on the incoming language. That is all the magic. By the way, a demo of a multicore Solr configuration is included in each downloadable Solr package under solr-4.X.X/example/multicore
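To make this more concrete, here is a minimal sketch of what a solr.xml for two language cores could look like in the legacy Solr 4.x format used by the multicore example. The core names and directories (en, de) are my own assumption, not something Solr prescribes, and newer releases discover cores through core.properties files instead.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Legacy (Solr 4.x) solr.xml: one core per language.
     Core names and instanceDir values are illustrative assumptions. -->
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <!-- each instanceDir holds its own copy of conf/schema.xml and conf/solrconfig.xml -->
    <core name="en" instanceDir="en" />
    <core name="de" instanceDir="de" />
  </cores>
</solr>
```

With a layout like this the application would send English requests to /solr/en/... and German requests to /solr/de/..., which is exactly the "select proper core" step described above.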

2. Single core approach with language separation by document language field

This approach assumes that you use one core to store documents in all languages and differentiate them by storing additional information about the language inside the Solr document (e.g. language, ISO code). This requires a bit more configuration, so I will show you the key changes that you will have to make to your schema.xml and solrconfig.xml files.

Let's assume we have two languages: English and German.

1. Add configuration of language specific field types to schema.xml

<!-- fieldType for each language (default, en, de) -->
<fieldType name="text" class="solr.TextField" />
<fieldType name="text_en" class="solr.TextField" />
<fieldType name="text_de" class="solr.TextField" />

Additional language-specific settings, such as stopwords and synonyms, can be added here.

2. Add fields that might contain language-specific data (e.g. title, description)
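As an illustration of such language-specific settings, the bare text_en and text_de definitions above could be replaced with something like the following sketch. The analysis chains and the stopword file paths (lang/stopwords_en.txt, lang/stopwords_de.txt) are assumptions modelled on the example configuration shipped with Solr, so adjust them to the files you actually have in your conf directory.

```xml
<!-- English: lowercase, remove stopwords, apply Porter stemming -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- stopword file path is an assumption -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>

<!-- German: lowercase, remove stopwords, normalize umlauts, apply light stemming -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- stopword file path is an assumption; format="snowball" matches the bundled German list -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" />
    <filter class="solr.GermanNormalizationFilterFactory" />
    <filter class="solr.GermanLightStemFilterFactory" />
  </analyzer>
</fieldType>
```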

<!-- configuration of fields in schema.xml for each language (default, en, de) -->
<field name="title" type="text" indexed="true"  stored="true" />
<field name="title_en" type="text_en" indexed="true" stored="true" />
<field name="title_de" type="text_de" indexed="true" stored="true" /> 

3. If your schema uses dynamicField definitions that might also be language specific, do not forget to declare them too

<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_s_en" type="string" indexed="true" stored="true" />
<dynamicField name="*_s_de" type="string" indexed="true" stored="true" />

4. Add fields that store the language and, if necessary, the ISO code

<field name="language" type="string"  indexed="true" stored="true" />
<field name="iso_code" type="string"  indexed="true" stored="true" />

When writing to Solr, do not forget to fill the language and/or iso_code field, and always fill both the default field (without a language suffix) and the language-specific field (with a suffix). So in the end, depending on the document language, you will have something like:

<!-- for english document -->
<str name="title">some english title</str>
<str name="title_en">some english title</str>
<str name="language">en</str>
<str name="iso_code">en-us</str>

and, conversely, for a German document

<!-- for german document -->
<str name="title">einige deutsch titel</str>
<str name="title_de">einige deutsch titel</str>
<str name="language">de</str>
<str name="iso_code">de-de</str>
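On the indexing side, a document like the German one above could be sent to Solr's XML update handler roughly as follows. This is only a sketch: the id field and its value are assumptions (your schema's uniqueKey may be named differently), while the other fields are the ones declared in the steps above.

```xml
<add>
  <doc>
    <!-- "id" is a hypothetical uniqueKey field; use whatever your schema defines -->
    <field name="id">doc-de-1</field>
    <!-- default field and language-specific field carry the same text -->
    <field name="title">einige deutsch titel</field>
    <field name="title_de">einige deutsch titel</field>
    <!-- language markers used to tell documents apart at query time -->
    <field name="language">de</field>
    <field name="iso_code">de-de</field>
  </doc>
</add>
```

An English document would look the same, except that it fills title_en instead of title_de and sets language and iso_code to en and en-us.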

That's basically all.

Some additional hints for those who are using the Solr SpellCheckComponent.

Some people may be a bit disappointed to see German suggestions in an English search and the other way around, although I must say it can sometimes be pretty funny. Nevertheless, to make the SpellCheckComponent work properly again you need some additional configuration, starting with the SpellCheckComponent in solrconfig.xml:

1. You need to add a spellchecker for each language (in our case English and German)

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <str name="buildOnOptimize">true</str>
        <str name="buildOnCommit">true</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
    <lst name="spellchecker">
        <str name="name">spell_en</str>
        <str name="field">spell_en</str>
        <str name="buildOnOptimize">true</str>
        <str name="buildOnCommit">true</str>
        <str name="spellcheckIndexDir">./spellchecker_en</str>
    </lst>
    <lst name="spellchecker">
        <str name="name">spell_de</str>
        <str name="field">spell_de</str>
        <str name="buildOnOptimize">true</str>
        <str name="buildOnCommit">true</str>
        <str name="spellcheckIndexDir">./spellchecker_de</str>
    </lst>
</searchComponent>

2. Register the additional spellchecker dictionaries in the requestHandler

<requestHandler name="/spellCheckCompRH" class="solr.SearchHandler" lazy="true">
    <lst name="defaults">
        <str name="spellcheck.dictionary">default</str>
        <str name="spellcheck.dictionary">spell_en</str>
        <str name="spellcheck.dictionary">spell_de</str>
        <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
</requestHandler>

3. Add language-suffixed spell fields for each language to the Solr schema.xml

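<!-- type must name a fieldType declared in schema.xml (not a Java class such as solr.TextField); -->
<!-- textSpell is the type already referenced by queryAnalyzerFieldType in the spellcheck component -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true" />
<field name="spell_en" type="textSpell" indexed="true" stored="false" multiValued="true" />
<field name="spell_de" type="textSpell" indexed="true" stored="false" multiValued="true" />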
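The spell fields also have to be filled with something. One common way, and this is my assumption rather than something required by the setup above, is to copy the title fields into them with copyField rules in schema.xml, so that each spellchecker index only ever sees text in its own language:

```xml
<!-- Feed each spell field from the matching title field (sources are assumptions;
     add one rule per field you want suggestions to be built from) -->
<copyField source="title" dest="spell" />
<copyField source="title_en" dest="spell_en" />
<copyField source="title_de" dest="spell_de" />
```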

4. In the application that generates Solr queries and communicates with the Solr server, add the &spellcheck.dictionary=spell_ param with the language code appended (spell_en or spell_de in our case) to the search query, depending on the language. Please note that if you do not provide this parameter, the default spellchecker will be used, which contains the wrong index!
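If you would rather not pass the dictionary from the application on every request, a possible alternative is to register one request handler per language and put the dictionary into its defaults. The handler names below (/spell_en, /spell_de) are hypothetical; the dictionaries are the ones defined in the spellcheck component above.

```xml
<!-- Hypothetical per-language handlers: each one defaults to its own dictionary -->
<requestHandler name="/spell_en" class="solr.SearchHandler" lazy="true">
    <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">spell_en</str>
        <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
</requestHandler>

<requestHandler name="/spell_de" class="solr.SearchHandler" lazy="true">
    <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">spell_de</str>
        <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
</requestHandler>
```

The application then only has to pick the handler that matches the language, and a request-level spellcheck.dictionary parameter can still override the default.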

Conclusion

Multicore approach

  • Very easy and straightforward to start playing with. Requires almost ZERO configuration effort and does not require a deep understanding of Solr configuration options;
  • However, as you might already guess, this approach has at least one main disadvantage: maintenance, which grows proportionally to the number of languages you have to deal with;

Single core approach with language separation by document language field

  • Requires a bit more configuration and understanding of Solr in the beginning;
  • Easy to maintain/deploy;

In my opinion both approaches work well for any number of languages; however, I would prefer the second one because of its lower maintenance costs.
