View Issue Details

ID: 0000117
Project: LDMud 3.3
Category: Efuns
View Status: public
Last Update: 2011-09-21 11:27
Reporter: warp
Assigned To: zesstra
Priority: normal
Severity: feature
Reproducibility: always
Status: closed
Resolution: no change required
Target Version: 3.3.721
Summary: 0000117: Lossy mode for convert_charset()
Description: Would it be possible to extend convert_charset() so that it optionally runs in a lossy mode, where characters that cannot be converted to the target charset are replaced by "?" or a specifiable string, instead of the function aborting completely?

Otherwise we cannot use the function for, e.g., UTF-8 to ISO-8859-1 conversions of text entered by users (e.g. say).
Tags: No tags attached.

Activities

lars

2004-09-20 20:46

reporter   ~0000170

The tricky part is to find out how many of the input characters mess up the conversion.

Various modes are imaginable:

  convert_charset(in, from_cs, to_cs, 1): if the conversion aborts on an unexpected sequence, the
first input character is removed from in, a '?' is appended to the result, and the conversion begins again. Repeat ad nauseam.

  convert_charset(in, from_cs, to_cs, fun): The function fun() receives the remaining in-string as argument, and returns an array consisting of ({ "string to add to the result", "new remaining in-string" }).
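
The first mode above can be sketched in plain C on top of iconv(). This is only an illustration under stated assumptions, not driver code: the helper name lossy_convert(), its signature and its return convention are hypothetical, and "character" is taken to mean a single input byte.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of LDMud: convert in_len bytes of `in`
 * from charset from_cs to to_cs. At most out_cap - 1 bytes plus a
 * terminating NUL are written to `out`. Every input byte that iconv()
 * rejects is dropped and replaced by '?'.
 * Returns the number of bytes written (without the NUL), or -1 if the
 * converter cannot be opened, `out` is too small, or iconv() fails
 * for another reason. */
long lossy_convert(const char *in, size_t in_len,
                   const char *from_cs, const char *to_cs,
                   char *out, size_t out_cap)
{
    assert(out != NULL && out_cap > 0);

    iconv_t cd = iconv_open(to_cs, from_cs);
    if (cd == (iconv_t)-1)
        return -1;

    char *p_in = (char *)in;        /* iconv() wants char **, input is not modified */
    size_t in_left = in_len;
    char *p_out = out;
    size_t out_left = out_cap - 1;  /* keep one byte for the NUL */

    while (in_left > 0)
    {
        size_t rc = iconv(cd, &p_in, &in_left, &p_out, &out_left);
        if (rc != (size_t)-1)
            continue;               /* all input consumed, loop ends */

        /* Only invalid (EILSEQ) or incomplete (EINVAL) sequences are
         * handled lossily; everything else, e.g. E2BIG, is an error. */
        if ((errno != EILSEQ && errno != EINVAL) || out_left == 0)
        {
            iconv_close(cd);
            return -1;
        }

        /* Drop the first remaining input byte, append '?', reset the
         * conversion state and begin again. */
        p_in++;
        in_left--;
        *p_out++ = '?';
        out_left--;
        iconv(cd, NULL, NULL, NULL, NULL);
    }

    *p_out = '\0';
    iconv_close(cd);
    return (long)(p_out - out);
}
```

Because the retry drops one byte at a time, a multi-byte UTF-8 character that has no equivalent in the target charset produces one '?' per byte. Skipping a whole character instead would need extra decoding logic, which is the "how many input characters" problem mentioned above.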

fippo

2004-10-14 12:54

reporter   ~0000202

Last edited: 2004-10-14 13:05

We had this problem with IRC users switching between latin1 and utf8 charsets. We now return -1 instead of throwing an exception and handle the problem at the 'mudlib' level.

brutal diff of strfuns.c (adjust return type in func_spec alike)... see next entry (edit does not seem to like attaching files)

--- projects/psyc/ldmud/3-3/src/strfuns.c 2004-04-28 05:57:59.000000000 +0200
+++ 3-3/src/strfuns.c 2004-10-14 21:49:11.000000000 +0200
@@ -483,22 +483,42 @@

             if (errno == EILSEQ)
             {
+#if 0
                 error("convert_charset(): Invalid character sequence at index %ld\n", (long)(pIn - get_txt(in_str)));
                 /* NOTREACHED */
+#endif
+                free_string_svalue(sp--);
+                free_string_svalue(sp--);
+                free_string_svalue(sp);
+
+                put_number(sp, -1);
                 return sp;
             }

             if (errno == EINVAL)
             {
+#if 0
                 error("convert_charset(): Incomplete character sequence at index %ld\n", (long)(pIn - get_txt(in_str)));
                 /* NOTREACHED */
+#endif
+                free_string_svalue(sp--);
+                free_string_svalue(sp--);
+                free_string_svalue(sp);
+
+                put_number(sp, -1);
                 return sp;
             }
-
+#if 0
             error("convert_charset(): Error %d at index %ld\n"
                  , errno, (long)(pIn - get_txt(in_str))
                  );
             /* NOTREACHED */
+#endif
+            free_string_svalue(sp--);
+            free_string_svalue(sp--);
+            free_string_svalue(sp);
+
+            put_number(sp, -1);
             return sp;
         } /* if (rc < 0) */
     } /* while (in_left) */

edited on: 10-14-04 15:05

warp

2005-04-01 04:13

reporter   ~0000358

Just wondering about the status of this enhancement request. Are you waiting for feedback, did you forget it, or did you decide not to change the behaviour?

Regarding the various modes you suggested: Both are ok, but the first one would definitely already solve my problem, while probably being simpler to implement.

fippo

2005-05-04 06:54

reporter   ~0000361

note: looking at the implementation of the iconv program (iconv_prog.c) that comes with glibc may be helpful, as it has an -c 'omit invalid characters from output' switch

szalicil

2005-06-26 15:48

reporter   ~0000380

You can use convert_charset(your_string, "UTF-8", "ISO-8859-1//TRANSLIT") instead; iconv will replace unconvertible characters with "?" or something else, depending on the iconv implementation.

lynx

2006-03-06 18:40

reporter   ~0000492

Hi.. I have tried //TRANSLIT and it didn't help at all. Sigh. :(
Now I'm using catch(), but since these failures happen rather often,
it is a costly solution. I'm using catch() with the nolog flag. Does
it skip line number calculation in that case?

zesstra

2011-02-19 19:52

administrator   ~0001999

Out of curiosity: is it still the case that appending "//TRANSLIT" does not work? And on which platforms/libiconv versions is that the case? At least on my system, it does work.
Another possibility might be "//IGNORE".

zesstra

2011-09-21 11:27

administrator   ~0002061

Since there was no other feedback: I believe that with current iconv() the desired effect can be achieved with //TRANSLIT and/or //IGNORE, and we therefore don't need to change anything.
If not, please re-open or tell me.
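
For readers who want to check how their own iconv handles these suffixes, a minimal C sketch follows. It assumes the suffix is simply part of the target charset name given to iconv_open(). The helper convert_once() and its return codes are hypothetical, and what is substituted or dropped depends on the iconv implementation.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` in one
 * iconv() call. to_cs may carry a suffix such as "//TRANSLIT" or
 * "//IGNORE". Returns the number of bytes written to `out` (without
 * the NUL), -1 if input was left unconverted, or -2 if iconv_open()
 * does not accept the charset names. */
long convert_once(const char *in, const char *from_cs, const char *to_cs,
                  char *out, size_t out_cap)
{
    assert(out != NULL && out_cap > 0);

    iconv_t cd = iconv_open(to_cs, from_cs);
    if (cd == (iconv_t)-1)
        return -2;                  /* suffix or charset not supported */

    char *p_in = (char *)in;        /* iconv() wants char **, input is not modified */
    size_t in_left = strlen(in);
    char *p_out = out;
    size_t out_left = out_cap - 1;  /* keep one byte for the NUL */

    size_t rc = iconv(cd, &p_in, &in_left, &p_out, &out_left);
    iconv_close(cd);

    /* Fail only if input remains. Assumption: an implementation may
     * report an error for characters it skipped under //IGNORE even
     * though it consumed the whole input. */
    if (rc == (size_t)-1 && in_left > 0)
        return -1;

    *p_out = '\0';
    return (long)(p_out - out);
}
```

Calling convert_once(text, "UTF-8", "ISO-8859-1//TRANSLIT", buf, sizeof buf) with text containing a character outside ISO-8859-1 shows what the local iconv substitutes for it, if anything. A result of -1 means the suffix did not prevent the abort on that platform.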

Issue History

Date Modified     Username  Field                Change
2004-08-24 03:11  warp      New Issue
2004-09-20 20:46  lars      Note Added: 0000170
2004-10-14 12:54  fippo     Note Added: 0000202
2004-10-14 13:01  fippo     Note Edited: 0000202
2004-10-14 13:05  fippo     Note Edited: 0000202
2005-04-01 04:13  warp      Note Added: 0000358
2005-05-04 06:54  fippo     Note Added: 0000361
2005-06-26 15:48  szalicil  Note Added: 0000380
2006-03-06 18:40  lynx      Note Added: 0000492
2011-02-19 19:52  zesstra   Note Added: 0001999
2011-02-19 19:52  zesstra   Assigned To          => zesstra
2011-02-19 19:52  zesstra   Status               new => feedback
2011-02-23 22:02  zesstra   Target Version       => 3.3.721
2011-09-21 11:27  zesstra   Note Added: 0002061
2011-09-21 11:27  zesstra   Status               feedback => closed
2011-09-21 11:27  zesstra   Resolution           open => no change required