Easy localization in games
Hello there!
I am working on a project for a client that requires localization. I don't know anything about it or how to do it (I googled), so I just went for it and came up with the following solution.
In my mind it's really simple, and it lets you easily add other languages in a .txt format. That makes it easy to send the English version to someone who speaks French, for example, and then have him/her send the file back to you to drop into the project.
I am a bit worried about how it would scale though.
Here is the gist: https://gist.github.com/Kobusvdwalt/c14796f04d5486ab93e9
How do you handle localization? Do you have any resources that explain possible solutions for localization?
Feedback on my solution would also be great.
Comments
My first tip: let your client explicitly limit the languages your solution should support. It also helps to put responsibility for the specifications in their hands rather than your own, because many of the intricacies are hard to establish if you are not a native speaker. Also establish clear bounds on the quality of the localization: getting things good enough is much easier than getting things perfect, provided you or they can determine what is good enough for a given language.
The remainder is a list of things to watch out for:
- The UI may break if you do not design it to handle longer words. (This is the most obvious.)
- Right-to-left languages are very difficult to support. Some Unicode characters will "flip" the text direction if whatever you use to display the text supports them. Find out whether the translation service will use those. IDEs such as VS and MonoDevelop display these correctly, and I think Unity displays them correctly in the editor, but it does not display them correctly in the engine.
- You may need different fonts to display characters for languages that don't use the Latin alphabet. Only a handful of fonts have full Unicode support, so you may need to load different fonts too. Also, some very low-quality fonts hardly support things such as characters with diacritics, so even languages that do use the Latin alphabet may not display correctly with arbitrary fonts. Gamy, open source fonts are typically in this category (you can find good ones, but you must know to look for this in the first place). Good fonts are often expensive, so make sure your client is aware of this before you build all the tech.
- Some languages have very big character sets. And if you use many languages with small character sets, the total number of characters you use may still be big. If your engine uses bitmaps to render fonts, you may also need a system to deal with this. By default, for instance, Unity renders (or used to; maybe it's better now) everything to a single texture, and at some point you lose quality if the sets become too big.
- Beware any text processing you have to do. C# provides some culture-specific methods for ToString, ToUpper and so on which can help with this. (Not all languages treat all characters the same when it comes to upper and lower case, or the way floating point numbers are formatted, etc.) But you should also be on the lookout for variable strings (strings constructed from partial strings, or using the Format function, incorporating game data), which will not always work out the way you think. And things like line-break algorithms, if you happen to have something like this, are extremely difficult to implement correctly for right-to-left languages, and you would need a way to switch the algorithm you use.
- If you have any grammar tricks to incorporate numbers into language, this could be another source of problems. For example: string.Format("You collected {0} red {1}", n, n == 1 ? "ruby" : "rubies") will not work for many European languages (where "red" inflects with the number).
- The UI may also need changes that are not text dependent. For example, leaderboards where the score is typically displayed to the right of the name may need to change, as may the order of logically ordered horizontal buttons, etc.
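To make the number pitfall above concrete, here is a minimal sketch in Python (not the C# used in this thread) that picks a complete sentence per plural category instead of gluing fragments together. All names in it are made up for illustration, the Polish category rule is the usual one/few/many scheme for whole numbers, and the Polish strings are only illustrative, so have a native speaker check them.

```python
def plural_category_en(n: int) -> str:
    """English has two forms: singular for exactly 1, plural otherwise."""
    return "one" if n == 1 else "other"


def plural_category_pl(n: int) -> str:
    """Polish whole numbers: 1, then numbers ending in 2-4 (except 12-14), then the rest."""
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


# Hypothetical message table: one complete sentence per plural category,
# so the translator controls how the adjective and noun change together.
MESSAGES = {
    "en": {
        "one": "You collected {n} red ruby",
        "other": "You collected {n} red rubies",
    },
    "pl": {
        "one": "Zebrałeś {n} czerwony rubin",
        "few": "Zebrałeś {n} czerwone rubiny",
        "many": "Zebrałeś {n} czerwonych rubinów",
    },
}

PLURAL_RULES = {"en": plural_category_en, "pl": plural_category_pl}


def rubies_message(lang: str, n: int) -> str:
    """Look up the sentence for this language and count, then fill in the number."""
    category = PLURAL_RULES[lang](n)
    return MESSAGES[lang][category].format(n=n)
```

The point of the design is that game code only passes the number; which words change with it is decided entirely by the per-language data.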
Of course, none of these pitfalls may apply to your project, in which case your solution seems fine until testing shows otherwise. Just wrap it in an interface, and code to the interface, so that it is easy to replace with a more efficient system if you need to.
Edit 1: Three more tips regarding the process that I just thought of:
- Google the biggest translation service, and use one of the formats they provide. Even if you don't plan to use the service, it will be useful to be compatible with it for the future.
- Translators may need context to provide correct translations. For this purpose, make it possible to run the program in a mode that also displays, next to each text, the number of the line in the file it came from (this can all be automated). This is especially helpful if you have instances of the next tip. You can then provide screenshots, or the program running in this mode, for context.
- Be aware that some words that are identical in English translate to different words in other languages. So carefully check multiple occurrences of the same word to decide whether it needs a single entry or multiple entries in your file.
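As a rough illustration of "wrap it in an interface" and of the line-number mode for translators, here is a small Python sketch (Python rather than the thread's C#). It assumes the line-indexed .txt approach described in the question; `Localizer`, `LineFileLocalizer`, `show_ids` and the sample `FILES` are hypothetical names, not taken from the gist.

```python
from typing import Dict, List, Protocol


class Localizer(Protocol):
    """What game code depends on; the storage behind it can change later."""

    def text(self, line_number: int) -> str: ...


class LineFileLocalizer:
    """Line-indexed lookup: line N of every language file is the same string."""

    def __init__(self, files: Dict[str, str], language: str, show_ids: bool = False):
        # `files` maps a language code to the raw contents of its .txt file.
        self._lines: List[str] = files[language].splitlines()
        self._show_ids = show_ids

    def text(self, line_number: int) -> str:
        # Line numbers are 1-based to match what a translator sees in an editor.
        value = self._lines[line_number - 1]
        # In "translator mode", prefix the line number so screenshots show
        # which line of the file produced each piece of on-screen text.
        return f"[{line_number}] {value}" if self._show_ids else value


# Sample data standing in for two language files.
FILES = {"en": "Play\nQuit\n", "fr": "Jouer\nQuitter\n"}
```

With this, `LineFileLocalizer(FILES, "fr").text(2)` gives "Quitter", and the same call with `show_ids=True` gives "[2] Quitter". Code that only knows about `Localizer` would not need to change if the files were later swapped for a keyed format.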
Edit 2: And one more set of pitfalls: If you do any character processing (breaking up a word into characters, for example), also be aware that
- some languages combine Unicode characters to form a single letter (I only discovered this recently, and am not 100% sure how it works. It seems there are meta Unicode characters that make these combinations.),
- some languages treat letters with diacritics as separate letters with their own place in the alphabet (while others don't), and
- some languages treat certain letter groups as individual letters as well (for example, ll in Spanish, or ij in Dutch).
The above things can make sorting strings alphabetically a non-trivial problem to localize.
And one final tip, if you do your own research: Although Wikipedia often gives officially correct information, I am not always convinced that average native speakers will agree. For example, it gives a few special letters that use character groups for Spanish (ch, ll, rr), which is officially the position of the Spanish Academy, but at least here in Chile I see no evidence that speakers actually treat them as such (their one soccer song, for example, spells out C and H separately). Now this may be a regional thing, or it may be a general thing. But it is really hard to know for sure without asking relevant native speakers.
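To show why alphabetical sorting needs per-language rules, here is a minimal Python sketch using Swedish, where å, ä and ö are separate letters that come after z (German dictionaries, by contrast, sort ä together with a). The helper name and the hand-written alphabet table are assumptions for illustration; a real project would more likely lean on a collation library.

```python
import unicodedata

# Swedish alphabet order: å, ä, ö are separate letters that come after z.
SWEDISH_ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")
SWEDISH_RANK = {letter: index for index, letter in enumerate(SWEDISH_ALPHABET)}


def swedish_sort_key(word: str) -> list:
    """Map each letter to its position in the Swedish alphabet.

    Characters outside the table sort after all known letters. This ignores
    many real collation details; it only illustrates the ordering problem.
    """
    unknown = len(SWEDISH_ALPHABET)
    composed = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_RANK.get(ch, unknown) for ch in composed]


# Normalize up front so each accented letter is a single code point.
words = [unicodedata.normalize("NFC", w) for w in ["zebra", "äpple", "apa", "öl", "ål"]]

by_code_point = sorted(words)                          # plain string comparison
by_swedish_rules = sorted(words, key=swedish_sort_key)
```

Plain comparison puts "äpple" before "ål" because ä has the lower code point, while the Swedish key puts "ål" first.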
1) Unity does not render Arabic diacritics correctly - we had to modify UI.Text (at least in 4.6) to render them correctly - we only realized this toward the end of the development cycle, and it was a major pain point.
2) RTL languages also have specific formatting for menus and charts. In our case we had to design our in-game charts to flip, so that the 0% was on the right and 100% was on the left. I am pretty sure each culture has its own conventions - so watch out for that.
3) As a general development note, getting quality translations often takes time - the client will probably deliver at the last minute, and you will only find out that your text is rendering wrong just when you thought you were about to wrap up.
To mitigate this, we wrote a little app to allow us to load up some test localization samples and compare the in-game rendering to an image sample known to be correct, to catch any issues before we start.
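The post does not say how that comparison app works, so the following is only one possible approach, sketched in Python: treat both the in-game capture and the known-good sample as grids of grayscale values and measure how many pixels differ noticeably. The function names, the tolerance and the 1% threshold are all assumptions.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values, 0-255


def mismatch_ratio(rendered: Image, reference: Image, tolerance: int = 16) -> float:
    """Fraction of pixels whose brightness differs by more than `tolerance`."""
    if len(rendered) != len(reference) or any(
        len(a) != len(b) for a, b in zip(rendered, reference)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in reference)
    if total == 0:
        return 0.0
    differing = sum(
        1
        for row_a, row_b in zip(rendered, reference)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return differing / total


def rendering_matches(rendered: Image, reference: Image, max_ratio: float = 0.01) -> bool:
    """True when at most `max_ratio` of the pixels differ noticeably."""
    return mismatch_ratio(rendered, reference) <= max_ratio
```

A small tolerance keeps anti-aliasing noise from failing the check, while a missing or misplaced diacritic changes enough pixels to show up.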
To expand upon a few things from Herman's excellent answer: the Unicode characters he describes, which combine with others to form a single letter, are called combining characters. Their use isn't strictly a language-specific thing; it is more a matter of preference or of the application. The consequence, however, is that there are regularly multiple ways, behind the scenes, to represent the same characters. Usually, at a presentation level, you don't have to worry about the difference, because most applications will render either sequence of code points the same way. But it can be a problem when testing strings for equivalence, which leads to a (really complicated) practice known as Unicode normalization. This, of course, is a non-trivial, culture-specific problem, for reasons like the points in Herman's second edit. A few simple examples show why both of these things can cause problems.
Firstly, consider these two characters. They look the same (or they should if your font renderer is working correctly). However, in strictly binary terms, they are not. You can test this by copying the two characters into a text file, saving it, and opening that in a hex editor. You will see that the first character is 2 bytes (UTF-8 C3 A9, latin small letter e acute), and the second is 3 bytes (UTF-8 65, latin small letter e, followed by UTF-8 CC 81, combining acute).
é
é
Ligatures can have the same problem, although many fonts have stylised versions of ligatures that do look subtly different, such as ae vs æ.
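The two forms of é described above can be checked in a few lines of Python using the standard `unicodedata` module. This is a minimal sketch; `same_text` is a made-up helper name, and NFC is just one of the normalization forms you could pick.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Byte lengths when saved as UTF-8, matching what a hex editor would show.
utf8_lengths = (len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))


def same_text(a: str, b: str) -> bool:
    """Compare strings after normalizing both to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


naive_equal = precomposed == decomposed             # False: different code points
normalized_equal = same_text(precomposed, decomposed)  # True after normalization
```

Note that this kind of normalization does not fold a ligature letter into its parts: `same_text("æ", "ae")` is still False.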
The second problem is an issue with case. In languages where letter groups like that are considered to be single letters, they often have their own unique code point, such as the ĳ (U+0133), which, again, looks like the two separate letters ij. (Highlight those characters with the cursor to see the difference.) This is made even more complicated by initial capitals. In Dutch specifically, and in most other languages that have multiple characters that are considered to be single letters (I'm not sure how many others there are, but Dutch is an example I'm quite familiar with), when ij is an initial, both characters are capitalised, as in 'IJsvrij'.
And let's not even start on the Turkish dotless I.
Long story short, characters which look the same might not actually be, and nothing is simple.
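For the two case problems above, here is a small Python sketch. Python's built-in `str.upper()` and `str.capitalize()` are not locale-aware, so they get both cases wrong; `turkish_upper` and `dutch_capitalize` are hypothetical helpers written only to show the difference, not a complete treatment of either language.

```python
def turkish_upper(text: str) -> str:
    """Uppercase using the Turkish dotted/dotless pairs (i -> İ, ı -> I)."""
    # Map dotted i to İ (U+0130) first; str.upper() already maps dotless ı to I.
    return text.replace("i", "\u0130").upper()


def dutch_capitalize(word: str) -> str:
    """Capitalize a word, treating a leading 'ij' as one letter (IJ)."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]


default_upper = "istanbul".upper()            # "ISTANBUL", wrong for Turkish
turkish_result = turkish_upper("istanbul")    # starts with dotted capital İ

default_capital = "ijsvrij".capitalize()      # "Ijsvrij", wrong for Dutch
dutch_result = dutch_capitalize("ijsvrij")    # "IJsvrij"
```

In C#, the culture-specific overloads mentioned earlier in the thread are the more natural route for the Turkish case; the sketch only shows what goes wrong without them.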
I've decided to limit the number of "supported" languages to just those that use Latin script, to keep things simple for now. I never knew this was such a complicated thing. Great idea there. The way it works atm is that if the Spanish and French words for left both happen to be "LUFT", it doesn't matter, because both the Spanish and French language files will contain the word "LUFT" at that line number. So every language is independent of the others and of the index language.
@shanemarks Thanks, I will keep those in mind going forward but again I LUCKILY only need to do Eng and Afr.
Thanks again for the replies. This turned out to be way more complicated than I initially thought.