The ultimate linguistic guide to software localisation for developers

There are lots of great guides out there for how to prep your product for internationalisation and localisation from an engineering perspective. Building software localisation into your product right from the start, even if you’re not ready to expand beyond one locale just yet, saves you a tonne of work and headaches down the line.

The effects of software localisation cascade down to every aspect of development and post-development, from UX and interface design to the basic engineering and core functionality of your product, and to documentation, support and marketing. With this in mind, getting a good grounding in the repercussions that designing for different locales has for the development process is a great idea for any software developer.

We’ll start by explaining some basic concepts. Then, we’ll look at examples of strings from different languages and explore the requirements that different locales have. Throughout this post, we’ll refer to our fictional app “SuperApp” in our examples.

Locales vs language variants

It might be helpful to start by looking at what we mean by a locale. This is a term used both in the tech and translation industries to refer to a country-specific variant of a language. If you’re not from a multilingual background, you’d be forgiven for thinking that it’s sufficient to think about languages such as English, Spanish and Swedish. If we want to make SuperApp available in one of these languages, surely it’s enough to translate the strings and be done with it?

The thing is, “language” is a fuzzy term and nowhere near granular enough for our needs. Let’s start with English. It’s spoken natively by over 400 million people and is an official language in 55 sovereign states, a group of countries commonly referred to as the Anglosphere. The language isn’t uniform across the Anglosphere: there are dozens of national varieties, each with its own conventions for things like pronunciation, grammar and spelling standards, and even how dates and numbers are formatted.

You’re more than likely already familiar with the two biggest varieties: British English and American English. These national standards can be expressed with the language tags en-GB and en-US, respectively. The story is similar (albeit on a much smaller scale) for languages like Swedish, which is an official language in both Sweden (sv-SE) and Finland (sv-FI).

But is this a locale? Well, not quite. The tags above refer to the language variant only and do not include the user’s selected region settings. Region settings affect things such as how the date and time are expressed (e.g. “31 December” being written as 31/12 or 12/31, and whether to use the 12- or 24-hour clock by default), how numbers are formatted (e.g. using a dot or a comma as the decimal separator) and where currency symbols are placed (e.g. before or after the amount, with or without a space). If we bundle these region settings up with the language variety, then we get our locale.

On most operating systems, users can independently select their interface language and preferred region settings, meaning they can end up with locales that don’t necessarily align with national language variants. For example, many Icelanders use their computers in English but with their region set to Iceland. This locale would be expressed as en_IS (note the use of an underscore as opposed to a hyphen).
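
To make the split between interface language and region settings concrete, here is a minimal JavaScript sketch built on the standard Intl.Locale API. It assumes a runtime with Intl support (a modern browser or Node.js), and the helper name parseLocale is ours, purely for illustration. Intl expects hyphenated tags, so the underscore form is converted first.

```javascript
// Hypothetical helper: split a locale identifier into its language and region.
function parseLocale(identifier) {
  // Intl.Locale expects hyphen-separated tags, so convert underscores first.
  const locale = new Intl.Locale(identifier.replace(/_/g, '-'));
  return { language: locale.language, region: locale.region };
}

const icelander = parseLocale('en_IS');
console.log(icelander.language); // "en": the interface language
console.log(icelander.region);   // "IS": the region settings
```

Keeping the two parts separate lets you pick strings by language and formatting rules by region.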

Although it’s important to understand the distinction between language variants and locales, thankfully, the hard work of accounting for all the different date and number formats is done for you on most platforms. Apple, for example, provides a wide range of formatters that adjust things like the decimal indicator and date format automatically for the user’s selected region settings, even if those region settings don’t correspond to the interface language.

One final consideration is the aspect of hierarchy when it comes to language variants. Your app may only support one broad variety of English (en) or Spanish (es), for example, rather than country-specific variants. Even though you don’t support their local variant, most users will still prefer to use the broad international or regional variant of their language rather than a different language altogether.

Let’s take Spanish as an example. Most often, software is localised into Peninsular Spanish (the variety spoken in Spain) first. This national standard also acts as the ‘broad’ variety and would sit at the top of the hierarchy, designated es. Now that we’ve made SuperApp available in Spanish, we have decided to offer a more tailored experience for our Latin American users by supporting their regional language variant, which is designated es-419. Going further, we’ve decided to offer our Mexican users an even more localised experience and translate our strings into Mexican Spanish, meaning we end up with an es-MX variant as well. If a user’s preferred variant is not available, then they can cascade back up the list until they find their closest preferred language variant.
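
The cascade described above could be sketched in JavaScript along these lines. This is an illustration under assumptions rather than how any particular platform resolves languages: the PARENTS map and the resolveVariant function are hypothetical names, and the map only lists the variants needed for the example.

```javascript
// Hypothetical parent map: each variant points to the next broader variant.
const PARENTS = new Map([
  ['es-MX', 'es-419'],
  ['es-AR', 'es-419'],
  ['es-419', 'es'],
]);

// Return the closest supported variant, or null if nothing matches.
function resolveVariant(preferred, supported) {
  let tag = preferred;
  while (tag) {
    if (supported.includes(tag)) return tag;
    // Use the explicit parent if there is one; otherwise drop the last subtag.
    const cut = tag.lastIndexOf('-');
    tag = PARENTS.get(tag) || (cut > 0 ? tag.slice(0, cut) : null);
  }
  return null;
}

console.log(resolveVariant('es-MX', ['es', 'es-419', 'es-MX'])); // "es-MX"
console.log(resolveVariant('es-MX', ['es', 'es-419']));          // "es-419"
console.log(resolveVariant('es-ES', ['es', 'es-419']));          // "es"
```

An explicit map is used because simply trimming es-MX would jump straight to es and skip the regional es-419 variant.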

Things to consider when writing strings for software localisation

Now that we’ve got a firm grip on locales, we can take a look at the ramifications of software localisation when it comes to writing and concatenating (or segmenting) your software strings.

Numbers and dates

We’ve already briefly touched on the subject, so we should probably get this out of the way early. In almost all situations, there is essentially one golden rule to follow here: never hard-code date, time or number formats.

No matter what programming language and dev environment you’re using, there are fantastic date and number formatters available, either native or added through libraries, that take care of all the hard work for you, returning perfectly localised dates and numbers that respect your users’ region settings. The best advice here is to rely solidly on these and save yourself a world of trouble.
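
As one illustration of the golden rule, the sketch below leans on JavaScript’s built-in Intl formatters instead of hand-written patterns. The helper names (formatDayMonth, formatNumber, formatPrice) and the sample values are ours, and the exact output depends on the locale data shipped with your runtime, so treat this as a starting point rather than a reference implementation.

```javascript
// Hypothetical helpers: let the platform decide the format for each locale.
function formatDayMonth(date, locale) {
  return new Intl.DateTimeFormat(locale, {
    day: 'numeric',
    month: 'numeric',
    timeZone: 'UTC', // fixed so the example does not depend on the machine's time zone
  }).format(date);
}

function formatNumber(value, locale) {
  return new Intl.NumberFormat(locale).format(value);
}

function formatPrice(value, currency, locale) {
  return new Intl.NumberFormat(locale, { style: 'currency', currency }).format(value);
}

// 31 December: en-GB puts the day first, en-US puts the month first.
const newYearsEve = new Date(Date.UTC(2024, 11, 31));
for (const locale of ['en-GB', 'en-US', 'sv-SE']) {
  console.log(
    locale,
    formatDayMonth(newYearsEve, locale),
    formatNumber(1234.5, locale),
    formatPrice(1234.5, 'EUR', locale)
  );
}
```

The locale is passed in explicitly here; in a real app it would come from the user’s region settings.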

Word endings

In English, there are relatively few word endings (inflections) to consider. The vast majority of nouns are made plural by adding an -s or -es. When it comes to most verbs, we have only two forms in the present tense, for example, sings and sing. However, many languages have a greater variety of endings than English, and these can affect more classes of words than nouns and verbs; for example, many languages also inflect adjectives. The distribution can also vary by language: some languages, particularly Scandinavian ones, have less inflection than English on verbs but more on adjectives.

Let’s take this example from Swedish:

Där finns 1 rum ledigt på denne prisen.
There is 1 room available at this price.

Där finns 10 rum lediga på denne prisen.
There are 10 rooms available at this price.

Here we can see that the verb finns is the same in both sentences, whereas in English we have two different forms, is and are. On the other hand, the adjective has changed: in the first sentence, it is singular (ledigt), and in the second, it is plural (lediga).

This affects how we concatenate our strings. As a general rule, it’s always best to avoid chopping strings up wherever you can. The translator will be able to offer a better-quality translation if we leave the string as intact as possible. Another reason for this, as we’ll see below, is that word order can vary hugely between languages, so we should never assume that, for example, numbers will occur in the same position in the sentence.

Plurals

In the Swedish example above, we saw how word endings can change between singular and plural forms. In the Scandinavian languages and Finnish, we only have to worry about a singular and non-singular form. For other languages, the situation is slightly more complex. Let’s take an example from Icelandic:

1 bíll fannst á þessu verði í nágrenninu.
1 car was found at this price nearby.

12 bílar fundust á þessu verði í nágrenninu.
12 cars were found at this price nearby.

21 bíll fannst á þessu verði í nágrenninu.
21 cars were found at this price nearby.

The first two sentences in this example show the same singular–plural distinction we’ve seen so far: when the number is more than 1, there is a different ending for the word. The singular word is bíll “car”, and the plural word is bílar “cars”. However, Icelandic also requires numbers ending in -1 (with the exception of 11) to use the singular form, whereas other languages, including English, might have the plural form. This is because of the way the number is constructed in Icelandic: 21 expands to tuttugu og einn “twenty and one”, so we’re literally saying “twenty and one car”. This is something we need to take into consideration in our logic when deciding which form of a string to serve up in Icelandic.
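
One way to handle the Icelandic rule without writing the logic by hand is JavaScript’s built-in Intl.PluralRules, which returns a plural category for a number in a given locale. The sketch below is a suggestion under assumptions: the CARS_FOUND_IS table and the carsFoundIcelandic function are hypothetical names, and each category holds one complete sentence for the translator.

```javascript
// One complete sentence per plural category used for whole numbers in Icelandic.
const CARS_FOUND_IS = {
  one: '{count} bíll fannst á þessu verði í nágrenninu.',
  other: '{count} bílar fundust á þessu verði í nágrenninu.',
};

const icelandicRules = new Intl.PluralRules('is');

function carsFoundIcelandic(count) {
  // select() returns "one" for 1, 21, 31 and so on (but not 11), otherwise "other".
  const category = icelandicRules.select(count);
  return CARS_FOUND_IS[category].replace('{count}', String(count));
}

console.log(carsFoundIcelandic(1));
console.log(carsFoundIcelandic(12));
console.log(carsFoundIcelandic(21));
```

Because the rule lives in the platform’s locale data, the same function shape works for other languages by swapping the locale and the string table.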

In the Slavic languages, we have to consider a different, even more complex set of rules. In Polish, for example, there are three possible forms to choose from, depending on the number used:

A singular form (e.g. samochód “car”);
A form used with 2, 3 and 4, and any numbers ending in -2, -3 or -4, except for those ending in -12, -13 or -14 (samochody);
A form used with all other numbers (samochodów).

In JavaScript, we could express this rule as follows:

function returnPolishForm(i) {
  // Returns the Polish plural form for a non-negative whole number i
  var lastDigit = i % 10;
  var lastTwoDigits = i % 100;
  if (i === 1) {
    return 'singular'; // If i is exactly 1
  }
  var endsInTeen = lastTwoDigits >= 12 && lastTwoDigits <= 14;
  if (lastDigit >= 2 && lastDigit <= 4 && !endsInTeen) {
    return 'plural'; // If i ends in -2, -3, -4 but not in -12, -13, -14
  }
  return 'genPlural'; // All other numbers, including 12, 13, 14 and 112
}

Let’s take the example we used for Icelandic from above and apply it to Polish:

W okolicy znaleziono 1 samochód w tej cenie.
1 car was found at this price nearby.

W okolicy znaleziono 2 samochody w tej cenie.
2 cars were found at this price nearby.

W okolicy znaleziono 5 samochodów w tej cenie.
5 cars were found at this price nearby.

W okolicy znaleziono 23 samochody w tej cenie.
23 cars were found at this price nearby.

W okolicy znaleziono 25 samochodów w tej cenie.
25 cars were found at this price nearby.

Note how the word for “car” changes with the number. To serve the correct form of the string to the user, we need to add some logic that is specific to Polish. If we don’t do this, then we’ll introduce a grammatical error that, in the best case, detracts from the user’s experience and, in the worst case, creates a severe misunderstanding.
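
As an alternative to hand-written rules, the Polish-specific logic can be delegated to JavaScript’s built-in Intl.PluralRules. In the sketch below, CARS_FOUND_PL and carsFoundPolish are hypothetical names, and only whole numbers are handled: the fallback for any other category is a placeholder choice, not a linguistic recommendation.

```javascript
// One complete sentence per plural category used for whole numbers in Polish.
const CARS_FOUND_PL = {
  one: 'W okolicy znaleziono {count} samochód w tej cenie.',
  few: 'W okolicy znaleziono {count} samochody w tej cenie.',
  many: 'W okolicy znaleziono {count} samochodów w tej cenie.',
};

const polishRules = new Intl.PluralRules('pl');

function carsFoundPolish(count) {
  const category = polishRules.select(count); // "one", "few" or "many" for whole numbers
  // Assumption: fall back to "many" when no string exists for the category.
  const template = CARS_FOUND_PL[category] || CARS_FOUND_PL.many;
  return template.replace('{count}', String(count));
}

for (const count of [1, 2, 5, 12, 23, 25, 112]) {
  console.log(carsFoundPolish(count));
}
```

The category names one, few and many are the platform’s labels for the three forms listed earlier.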

Gender

Many languages have a feature called grammatical gender. These are essentially classes of nouns that inflect in a similar way. While they may be labelled masculine, feminine or neuter, a word’s grammatical gender doesn’t always align with its natural gender. In German, for example, the word for “girl”, Mädchen, is neuter. Gender doesn’t only affect nouns, though; it has knock-on effects on adjective endings and pronouns as well.

Pronouns

In English, we use the neuter pronoun it to refer to inanimate objects. A typical string in SuperApp might look something like this:

This document is over 50 MB in size. Would you like to send it anyway?

In Icelandic, this would be:

Þetta skjal (n.) er yfir 50 MB að stærð. Viltu senda það (n.) samt?

The word for ‘document’, skjal, is grammatically neuter (n.). As a programmer, it may be tempting to split this message into two strings, as we have two sentences. Then, if we need to swap out the first string, say, to refer to a photo instead of a document, we can just concatenate them at runtime. However, if we change ‘document’ to ‘photo’ here, we get an ungrammatical construction in Icelandic (indicated by the asterisk):

Þessi mynd (f.) er yfir 50 MB að stærð. Viltu senda *það (n.) samt?

The problem stems from the fact that mynd is feminine (f.), but það is neuter. This means that the genders don’t agree, making this pair of sentences ungrammatical. Instead of það, we should have the feminine pronoun hana (literally ‘her’), which refers back to mynd. The better solution then is to keep these sentences together in one string and allow the linguist to translate it as one block.
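
In code, keeping the sentences together might look like the sketch below: one complete message per attachment type, so the translator can pick the right pronoun for each. The names LARGE_FILE_WARNING_IS and largeFileWarning, and the {limit} placeholder, are assumptions made for this illustration.

```javascript
// One complete, separately translated message per attachment type (Icelandic).
const LARGE_FILE_WARNING_IS = {
  document: 'Þetta skjal er yfir {limit} MB að stærð. Viltu senda það samt?',
  photo: 'Þessi mynd er yfir {limit} MB að stærð. Viltu senda hana samt?',
};

function largeFileWarning(kind, limitMb) {
  if (!Object.prototype.hasOwnProperty.call(LARGE_FILE_WARNING_IS, kind)) {
    // Fail loudly rather than stitching a sentence together at runtime.
    throw new Error('No warning string for attachment type: ' + kind);
  }
  return LARGE_FILE_WARNING_IS[kind].replace('{limit}', String(limitMb));
}

console.log(largeFileWarning('document', 50));
console.log(largeFileWarning('photo', 50));
```

This costs one extra string per attachment type, but every sentence stays grammatical.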

Adjectives

Gender also affects how we address users. In English, particularly in user interfaces, we tend to see a lot of structures like this:

Are you sure you want to delete this folder?
Are you ready to turn on your camera and microphone?

These kinds of sentences work great in English regardless of the gender and number of the people we’re addressing. However, in languages such as Spanish that mark gender on adjectives, we need to account for feminine and masculine forms in order to be inclusive:

¿Estás seguro/segura que quieres eliminar esta carpeta?
¿Estás listo/lista para encender tu cámara y tu micrófono?

In the first example, the translator can solve the problem somewhat creatively by rephrasing it to ¿Seguro que quieres eliminar esta carpeta?, which can be translated as ‘Is it certain that you want to delete this folder?’. This construction avoids addressing the user directly with an adjective.

However, the second phrase is more challenging to rework without addressing the user directly, so here we need to include both the masculine listo and the feminine lista to avoid excluding female users.

When writing strings, it’s good practice to avoid addressing the user directly with adjectives if you can help it. While a good translator will always find a solution, sometimes it might not be as neat as in English, and it could use more characters and subsequently take up more space in the UI.

Text expansion and contraction

As we’ve seen above, translation can drastically alter the length of software strings. Some languages require more words or characters to express the same meaning as in English, whereas others may require fewer. The number of characters in a string may increase by up to 200%, and this is most likely to happen in the shortest strings, typically those below 10 characters. French, Italian and Spanish are all languages that see character expansions in this range. For the Nordic languages, your strings may actually contract in certain contexts as well. For example:

Language	String	Character count	Expansion
English	3 photos were deleted from the album “New York”.	48	–
French	3 photos ont été supprimées de l’album « New York ».	52	+8%
Spanish	Se eliminaron 3 fotos del álbum “Nueva York”.	45	-6%
Danish	3 fotos blev slettet fra albummet “New York”.	45	-6%
Finnish	Albumista ”New York” poistettiin 3 valokuvaa.	45	-6%
Icelandic	3 myndum var eytt úr safninu „New York“.	40	-17%
Norwegian	3 bilder ble slettet fra albumet «New York».	44	-8%
Swedish	3 bilder har tagits bort från albumet ”New York”.	49	+2%
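
The expansion figure for a string can be worked out with a few lines of JavaScript. This is a rough sketch: expansionPercent is a hypothetical helper that compares character counts and rounds to the nearest whole percent, and it says nothing about rendered width, which also depends on the font.

```javascript
// Hypothetical helper: percentage change in length from source to translation.
function expansionPercent(source, translation) {
  // Spread into arrays to count code points rather than UTF-16 code units.
  const sourceLength = [...source].length;
  const translationLength = [...translation].length;
  return Math.round(((translationLength - sourceLength) / sourceLength) * 100);
}

const english = '3 photos were deleted from the album “New York”.';
const spanish = 'Se eliminaron 3 fotos del álbum “Nueva York”.';
const swedish = '3 bilder har tagits bort från albumet ”New York”.';

console.log(expansionPercent(english, spanish)); // -6
console.log(expansionPercent(english, swedish)); // 2
```

Running a check like this over your string files is a cheap way to spot translations that may overflow the space allowed in the UI.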

Another thing to note from the example phrases here is how the word order can vary from language to language. Notice how in Spanish, the verb comes at the start of the sentence, and our photo count is pushed further down. In Finnish, the album name is pushed up to the top of the sentence, directly following albumista ‘from the album’.

Also, note how the punctuation varies from language to language. Each has slightly different conventions for things like speech marks. English uses “ ”, whereas Icelandic uses „ “ and French uses guillemets « » (with a space on either side of the enclosed word).

For this reason, we should avoid syntax like this:

var string = photoCount + ' '
             + photosWereDeletedFromAlbumString
             + ' “' + albumName + '”.';

The preferred syntax would contain placeholders that the linguist is free to move at will, which you can then replace with variables at runtime:

// English
'{photoCount} photos were deleted from the album “{albumName}”.'
// Finnish
'Albumista ”{albumName}” poistettiin {photoCount} valokuvaa.'

Note that the above examples don’t account for singular–plural distinctions; further logic is required to accommodate those.
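
Replacing the placeholders at runtime could be as simple as the sketch below. The fillPlaceholders function and the MESSAGES table are hypothetical names, and plural handling is left out here as well. Most localisation libraries ship an equivalent, so this only illustrates the idea.

```javascript
// Replace each {name} placeholder with the matching value, wherever it appears.
function fillPlaceholders(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, name) =>
    // Leave unknown placeholders untouched so missing values are easy to spot.
    Object.prototype.hasOwnProperty.call(values, name) ? String(values[name]) : match
  );
}

const MESSAGES = {
  en: '{photoCount} photos were deleted from the album “{albumName}”.',
  fi: 'Albumista ”{albumName}” poistettiin {photoCount} valokuvaa.',
};

const messageValues = { photoCount: 3, albumName: 'New York' };
console.log(fillPlaceholders(MESSAGES.en, messageValues));
// 3 photos were deleted from the album “New York”.
console.log(fillPlaceholders(MESSAGES.fi, messageValues));
// Albumista ”New York” poistettiin 3 valokuvaa.
```

Because the values are looked up by name, the linguist can reorder the placeholders freely without any change to the code.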

Context is key for software localisation

The thing that perhaps best equips a linguist to be able to translate your strings successfully is adequate context. Knowing when and where a string appears enables the translator to make a whole range of linguistic decisions and ultimately provide a correct, high-quality and consistent localisation of your software.

We recommend sticking to these guiding principles:

1. Get your product into the hands of your translators

It’s crucial to loop translators into your development process early. Even if you’ve not yet delivered your first public release, it’s vital that linguists understand your app’s purpose and how your UI is laid out. Giving them access to pre-release versions means you save yourself from future headaches and endless rounds of feedback and feedback implementation.

2. Provide local context

Software strings can be as short as one word. They might consist of a single verb: ‘delete’, for example. But is this verb functioning as an imperative (giving a command) or just as an infinitive (the dictionary form of the verb)? In English, they look the same, but that’s not necessarily the case in other languages. To enable the translator to make the right choice, give them access to view surrounding strings even if they’ve already been translated, or even better, provide screenshots. Some tools can automate this process for you.

3. Give your translators access to other translations

If you’ve already localised into several languages or variants, giving translators access to those can make a world of difference, especially for closely related languages. For example, if you’ve already localised into Swedish and are now adding Danish and Norwegian, giving your translators access to the Swedish strings in a translation memory will help answer a lot of questions they’ll have and may even allow them to recycle some existing translation solutions.

4. Keep an open line of communication

Translators are used to surmising the meaning of a text from the context they have available, but sometimes they just don’t have the key information to hand that would allow them to choose the right translation. Be receptive to translator queries and respond with as much information and context as you can.

5. Be open to adapting your product

It’s impossible for any one developer to account for all of the nuances of every language variant they might want to localise into. Leverage the linguistic expertise of your translators to improve how you write, segment and concatenate your strings. For example, you might need to account for a different word order than you anticipated, or you might need to adapt your logic to account for different word endings. Linguists can advise you on what works and what doesn’t for their language.

We’ve covered a lot of ground in this post, but there’s always more that could be said. The main thing to take away is to approach software localisation with an open mind. Be prepared to give and receive feedback, adapt and iterate as you go, and take advantage of your translators’ linguistic expertise to deliver the best UX in your target locale.

Many developers are rightly wary about the software localisation process. After all, you’re essentially entrusting somebody else to deliver your core user experience in a specific market. You want to make sure that you deliver on tone of voice, brand values and naturalness, not just a grammatically correct translation. The key to this is a collaborative partnership and close, regular communication.

If you and your translators are all aligned around the same end goal of delivering a fantastic experience, and they’re armed with the tools to make that happen, you’ll reap the many benefits that software localisation has to offer.

This article was initially published in 2020 by Max Naylor, a former Sandberg team member, and has since been revised with updated data.
