[SOLVED] Storage of non-standard characters in MSS driver

Author: i.sharp@pcisystems.co.uk (Iain Sharp)

We've had two customers comment recently that whilst we can store 'odd' characters in fields in Uniface correctly, they print out badly. This is because we use Crystal Reports to do all printing within the system. The characters (such as ė, č and others) are accepted by Uniface, and stored and retrieved in the MSS database in char(X) fields. Within Uniface they display correctly; within SQL Server and Crystal Reports they show as garbled sequences such as "ä, "c, etc.

I believe that switching the affected fields to Unicode storage would probably help, but I'm a little unclear from the documentation which packing code I should use to replace a C40 with a Unicode field, and whether this would in fact help. Alternatively, is it the MSS collation I should be looking at? The database has been migrated up from SQL 2000 and the collation is currently SQL_Latin1_General_CP1_CI_AS.

I'd appreciate someone with experience letting me know the doubtless simple trick I'm missing. Thanks in advance, Iain

7 Comments

  1. Hi Iain, Which $def_charset are you using? The character ė is part of ISO 8859-13 or Windows-1257, and č is part of ISO 8859-2 or Windows-1250. If you want to store both of the mentioned characters correctly, then you have to use Unicode (e.g. UTF-8). So instead of using a C40 (in MSS: 'char(40)') you should use a W40 (in MSS: 'nchar(40)'). I don't think that changing the MSS collation would make a difference. The MSS documentation says that you should use the Unicode data types (nchar, nvarchar or ntext) in case you would like to store characters from different languages (i.e. character sets). For details see (e.g.): > Collation and Unicode Support Hope this helps. Daniel
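A quick way to see why these characters need Unicode storage (a standalone Python sketch, not Uniface or MSS API): neither ė nor č has a slot in CP1252, so a single-byte char(X) column under that code page cannot hold them losslessly, while UTF-8 encodes each in two bytes.

```python
# Check whether ė and č are representable in CP1252, and show their UTF-8 bytes.
for ch in "ėč":
    try:
        ch.encode("cp1252")
        print(ch, "fits in CP1252")
    except UnicodeEncodeError:
        print(ch, "not representable in CP1252; UTF-8 bytes:",
              ch.encode("utf-8").hex())
```

Reading those UTF-8 byte pairs back through a CP1252 decoder is also what produces the "Ä…"-style garbage seen in Crystal Reports.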


    Author: diseli (daniel.iseli@uniface.com)
  2. Apparently we are using CP1252 (although I'm not sure where it is reading this from). I will try altering the Uniface data to W40 and the MSS data to nchar(40) and see what happens. Although, to store 40 characters, shouldn't it be W80 (and nvarchar(80))?


    Author: Iain Sharp (i.sharp@pcisystems.co.uk)
  3. Iain Sharp said Apparently we are using CP1252 (although I'm not sure where it is reading this from). 

    CP1252 is the default on an English Windows.

    Iain Sharp said I will try altering the uniface data to W40 and the mss data to nchar(40) and see what happens. Although, to store 40 chars, shouldn't it be W80? (and nvarchar(80))

    I think you are confusing the number of bytes required to store the data with the number of characters that can be stored. With W40 (nchar(40)) you can store 40 Unicode characters (at least when using MSS). Daniel
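The distinction can be made concrete with a small Python sketch, assuming (as SQL Server's documentation states) that nchar/nvarchar store UTF-16 code units at 2 bytes each, so nchar(40) holds 40 BMP characters in 80 bytes of storage:

```python
# 40 characters, each in the Basic Multilingual Plane
sample = "ė" * 40

print(len(sample))                      # character count: 40
print(len(sample.encode("utf-16-le")))  # storage bytes:   80
```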


    Author: diseli (daniel.iseli@uniface.com)
  4. So an nvarchar(40) may take up to 80 (+4) bytes to store? Got it.


    Author: Iain Sharp (i.sharp@pcisystems.co.uk)
  5. Iain Sharp said Apparently we are using CP1252 (although I'm not sure where it is reading this from).  I will try altering the uniface data to W40 and the mss data to nchar(40) and see what happens.  Although, to store 40 chars, shouldn't it be W80? (and nvarchar(80))  

    Hi Iain

    Both characters are from the CP1257 code page:

      Unicode   UTF-8    CP1257   Character
      U+010D    0xC48D   0xE8     LATIN SMALL LETTER C WITH CARON
      U+0117    0xC497   0xEB     LATIN SMALL LETTER E WITH DOT ABOVE

    -- About the length --

    With ASCII/CPnnnn, one character equals one byte in storage, so a char(40) can hold 40 characters.

    Unicode, on the other hand, needs up to four bytes per character. Don't mix up the number of characters with the bytes needed for storage! Unicode defines code points 0x000000 up to 0x10FFFF, where the "Basic Multilingual Plane" (BMP) is located at 0x0000 to 0xFFFF. Unless you want to store hieroglyphs, the BMP is more than enough for normal use cases :-)

    Now it comes to the storage: using four bytes for every single character/code point would blow up the database, as 90% of the characters are ASCII ones. To avoid this massive waste of space, there are a few encoding formats, called UTF-nn (Unicode Transformation Format), where nn denotes the length in bits of the - let's say - "particle". All code points fitting into a particle without the high bit set need only one of them. So in UTF-8, ASCII (7-bit) uses only one byte; the upper half of the legacy code pages (CPxxxx) needs two bytes; and higher code points need three or four bytes.

    Back to Uniface: using W40 [W for wide] as the interface defines Unicode coding on the database side, which uses nchar(40) [n for national]. This field can hold up to 40 Unicode code points and uses a storage of 80 bytes.

    BUT: careful, if one uses rare characters from the high end of Unicode, fewer than 40 characters may fit. So if you really need the upper Unicode planes or even hieroglyphs, double the size :-)

    Ingo
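The UTF-8 byte counts described above can be verified with a short Python sketch (the U+13000 example, an Egyptian hieroglyph, is my own illustration of a supplementary-plane character):

```python
# UTF-8 length per code point: 1 byte for ASCII, 2 for ė,
# 3 for most of the rest of the BMP, 4 for supplementary planes.
for ch in ("a", "ė", "€", "\U00013000"):
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```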


    Author: istiller (i2stiller@gmx.de)
  6. Thanks Ingo. We're 15 years into this project and this has only just cropped up. I've done a sample modification to one of the fields in our development system, and it now displays consistently in Uniface, SQL Server and Crystal Reports, so that will hold us for now. Thanks all. Iain


    Author: Iain Sharp (i.sharp@pcisystems.co.uk)
  7. Hi, just a small note about SQL Server collations. I'll say nothing about which one to use, but we started to use Unicode for some tables in one of our applications, an application where searching data is really important. For that database we nowadays use the "Finnish_Swedish_CI_AI" collation. Note the trailing "AI" (accent insensitive) instead of "AS", which means that the database handles the search logic for you, i.e. special characters are also found when searching on the base character. E.g. ė is also found when searching for e. Regards RogerW.
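For readers outside SQL Server, the effect of an accent-insensitive collation can be approximated application-side. This is a hypothetical Python sketch (not what SQL Server does internally, and `strip_accents` is my own helper name): decompose to NFD and drop combining marks, so ė compares equal to e.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Reduce characters to their base form, roughly like an AI collation."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("ė"))  # prints "e", so a search for "e" would match
```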


    Author: rogerw (roger.wallin@abilita.fi)