[SOLVED] Storage of non-standard characters in MSS driver
Author: i.sharp@pcisystems.co.uk (Iain Sharp)
We've had two customers comment recently that whilst we can store 'odd' characters in fields in Uniface correctly, they print out badly. This is due to our using Crystal Reports to do all printing within the system.

The characters (such as ė, č and others) are accepted by Uniface, and stored and retrieved in the MSS database in char(X) fields. Within Uniface they display correctly; within SQL Server and Crystal Reports they show as "ä", "c etc.

I believe that switching the affected fields to Unicode storage would probably help, but I am a little unclear from the documentation which packing code I should use to replace a C40 with a Unicode field, and whether this would in fact help. Alternatively, is it the MSS collation I should be looking at? The database has been migrated up from SQL 2000 and the collation is currently SQL_Latin1_General_CP1_CI_AS.

I'd appreciate someone with experience letting me know the doubtless simple trick I'm missing. Thanks in advance, Iain
7 Comments
Local Administrator
Hi Iain,

Which $def_charset are you using? The character ė is part of ISO 8859-13 or Windows-1257, and č is part of ISO 8859-2 or Windows-1250. If you want to store both of these characters correctly then you have to use Unicode. So instead of using a C40 (in MSS: 'char(40)') you should use W40 (in MSS: 'nchar(40)').

I don't think that it makes a difference if you change the MSS collation. The MSS documentation says that you should use the Unicode data types (such as nchar, nvarchar or ntext) in case you would like to store characters from different languages (i.e. character sets). For details see (e.g.):

> Collation and Unicode Support

Hope this helps. Daniel
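A quick way to see why the single-byte code page is the problem (a sketch in Python rather than Uniface or T-SQL, purely as an illustration): CP1252 simply has no code point for either character, while UTF-8 encodes both.

```python
# Illustration: ė and č cannot be represented in CP1252 at all,
# but encode fine as UTF-8 (two bytes each).
for ch in ("\u0117", "\u010d"):          # ė, č
    try:
        ch.encode("cp1252")
        print(f"{ch}: representable in CP1252")
    except UnicodeEncodeError:
        print(f"{ch}: NOT representable in CP1252")
    print(f"{ch}: UTF-8 bytes = {ch.encode('utf-8').hex()}")  # c497 / c48d
```

Storing such a character through a CP1252 `char(X)` column therefore forces a lossy substitution somewhere along the chain, which is what shows up in Crystal Reports.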
Author: diseli (daniel.iseli@uniface.com)
Local Administrator
Apparently we are using CP1252 (although I'm not sure where it is reading this from). I will try altering the Uniface data to W40 and the MSS data to nchar(40) and see what happens. Although, to store 40 chars, shouldn't it be W80 (and nvarchar(80))?
Author: Iain Sharp (i.sharp@pcisystems.co.uk)
Local Administrator
CP1252 is the default on an English Windows.
I think you are confusing the number of bytes required to store the data with the number of characters that can be stored. When using W40 (nchar(40)), you can store 40 Unicode characters (at least when using MSS). Daniel
Author: diseli (daniel.iseli@uniface.com)
Local Administrator
So an nvarchar(40) may take up to 80(+4) bytes to store? Got it.
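The character-count vs. byte-count distinction can be sketched in a few lines of Python (an illustration only; SQL Server's own storage details may differ in the overhead bytes):

```python
# Character count vs. byte count for Unicode text.
s = "\u0117\u010dabc"                 # "ėčabc": 5 characters
print(len(s))                         # 5 characters
print(len(s.encode("utf-8")))         # 7 bytes: ė and č take 2 bytes each
print(len(s.encode("utf-16-le")))     # 10 bytes: nchar/nvarchar store UTF-16,
                                      # 2 bytes per BMP character
```

So an nchar(40) holds 40 characters but occupies 80 bytes on disk, which is where the "doubling" comes from.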
Author: Iain Sharp (i.sharp@pcisystems.co.uk)
Local Administrator
Hi Iain

Both characters are in the CP1257 code page:

Unicode  UTF-8   CP1257  Character
U+010D   0xC48D  0xE8    LATIN SMALL LETTER C WITH CARON
U+0117   0xC497  0xEB    LATIN SMALL LETTER E WITH DOT ABOVE

-- About the length --

With ASCII/CPnnnn, one character equals one byte in storage, so a char(40) can hold 40 characters. Unicode, on the other hand, can need several bytes per character. Don't mix up the number of characters with the bytes needed for storage!

Unicode defines code points 0x000000 up to 0x10FFFF, where the "Basic Multilingual Plane" (BMP) is located at 0x0000 to 0xFFFF. Unless you want to store hieroglyphs, the BMP is more than enough for normal use cases :-)

Now it comes to the storage: using four bytes for every single character/code point would blow up the database, as 90% of the characters are ASCII ones. To avoid this massive waste of space, there are a few encoding formats, called UTF-nn (Unicode Transformation Format), where nn denotes the length in bits of the - let's say - "particle". Any code point fitting into a single particle without the high bit set needs only one of them. So in UTF-8, ASCII (7-bit) uses only one byte; the upper half of the CPxxx code pages and most other code points need two bytes; and higher code points need three or four bytes.

Back to Uniface. Using W40 [W for wide] as the interface defines Unicode storage in the database, which will use nchar(40) [n for national]. This field can hold up to 40 Unicode code points and uses a storage of 80 bytes. BUT: careful, if one is using rare characters from the high end of Unicode, fewer than 40 characters may fit. So if you really need the upper Unicode planes or even hieroglyphs, double the size :-)

Ingo
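The variable-width behaviour above can be verified directly (a Python sketch; the sample characters beyond ė are mine, chosen to show each UTF-8 length class):

```python
# UTF-8 length varies per code point; CP1257 stores these in a single byte.
samples = [
    ("A",        "ASCII (7-bit)"),
    ("\u0117",   "e with dot above, Latin Extended (BMP)"),
    ("\u20ac",   "euro sign, 3-byte BMP character"),
    ("\U00010348", "Gothic letter, outside the BMP"),
]
for ch, desc in samples:
    print(f"U+{ord(ch):04X} {desc}: {len(ch.encode('utf-8'))} UTF-8 byte(s)")

# The single-byte CP1257 values from the table:
print("\u010d".encode("cp1257").hex())   # e8  (č)
print("\u0117".encode("cp1257").hex())   # eb  (ė)
```

The last class (4-byte UTF-8, i.e. outside the BMP) is also what costs two UTF-16 code units in an nchar column, which is why very "high" characters can reduce the effective capacity below 40.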
Author: istiller (i2stiller@gmx.de)
Local Administrator
Thanks Ingo. We're 15 years into this project and it's only just cropped up. I've done a sample modification to one of the fields in our development system, and it's now displaying consistently in Uniface, SQL Server and Crystal Reports, so that will hold us for now. Thanks all. Iain
Author: Iain Sharp (i.sharp@pcisystems.co.uk)
Local Administrator
Hi, just a small note about SQL Server collations. I'll say nothing about which one to use, but we started using Unicode for some tables in one of our applications, an application where searching data is really important. For that database we nowadays use the "Finnish_Swedish_CI_AI" collation. Note the trailing "AI" (accent insensitive) instead of "AS" (accent sensitive), which means that the database handles the search logic for you, i.e. accented characters are also found when searching on the base character. E.g. ė is also found when searching for e. Regards RogerW.
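To illustrate the idea of accent-insensitive matching (a rough Python sketch of the concept only; SQL Server's actual collation algorithm is more involved than this):

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Decompose to NFD, then drop combining marks (category Mn)."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

# Under an AI collation, a search for "e" also matches "ė":
print(strip_accents("\u0117"))          # e
print(strip_accents("\u010d"))          # c
```

An AI collation effectively applies this kind of folding on both sides of the comparison, so application code doesn't have to normalise search terms itself.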
Author: rogerw (roger.wallin@abilita.fi)