0000117: Lossy mode for convert_charset() - MantisBT

ID	Project	Category	View Status	Date Submitted	Last Update

0000117	LDMud 3.3	Efuns	public	2004-08-24 03:11	2011-09-21 11:27

Reporter	warp	Assigned To	zesstra
Priority	normal	Severity	feature	Reproducibility	always
Status	closed	Resolution	no change required
Target Version	3.3.721

Summary	0000117: Lossy mode for convert_charset()
Description	Would it be possible to extend convert_charset() so that it can optionally run in a lossy mode, in which characters that cannot be converted to the target charset are replaced by "?" or a specifiable string, instead of the function aborting completely? Otherwise we cannot use the function for, e.g., UTF-8 to ISO-8859-1 conversions of text entered by users (e.g. say).
Tags	No tags attached.

~~lars~~ 2004-09-20 20:46 reporter ~0000170	The tricky part is to find out how many of the input characters break the conversion. Various modes are imaginable:
convert_charset(in, from_cs, to_cs, 1): if the conversion aborts on an unexpected sequence, the first input character is removed from in, a '?' is appended to the result, and the conversion begins again. Repeat ad nauseam.
convert_charset(in, from_cs, to_cs, fun): the function fun() receives the remaining in-string as argument and returns an array consisting of ({ "string to add to the result", "new remaining in-string" }).
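
The first mode described above can be sketched directly against iconv(). This is only an illustration under stated assumptions, not LDMud code: convert_lossy() and grow() are hypothetical names, the function works on plain C strings rather than driver string values, and it skips one input byte (not one character) per failure, so a multi-byte character that cannot be converted may yield several '?'. A stateful target encoding would additionally need a final flush call, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: double the size of the output buffer. */
static int grow(char **buf, size_t *cap)
{
    char *bigger = realloc(*buf, *cap * 2);
    if (bigger == NULL)
        return -1;
    *buf = bigger;
    *cap *= 2;
    return 0;
}

/* Hypothetical sketch of a lossy conversion (not part of LDMud).
 * On an invalid or incomplete input sequence, one input byte is dropped,
 * a '?' is appended to the result, and the conversion resumes.
 * Returns a malloc()ed NUL-terminated string, or NULL on failure. */
char *convert_lossy(const char *in, const char *from_cs, const char *to_cs)
{
    assert(in != NULL && from_cs != NULL && to_cs != NULL);

    iconv_t cd = iconv_open(to_cs, from_cs);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(in);
    size_t cap = in_left * 4 + 16;      /* initial guess, grown on demand */
    size_t used = 0;
    char *out = malloc(cap);
    char *p_in = (char *)in;            /* iconv() does not modify the input */

    if (out == NULL)
        goto fail;

    while (in_left > 0) {
        if (cap - used < 8 && grow(&out, &cap) != 0)
            goto fail;

        char *p_out = out + used;
        size_t out_left = cap - used - 2;   /* keep room for '?' and NUL */
        size_t rc = iconv(cd, &p_in, &in_left, &p_out, &out_left);
        int err = errno;
        used = (size_t)(p_out - out);

        if (rc != (size_t)-1)
            break;                          /* all input consumed */
        if (err == E2BIG) {
            if (grow(&out, &cap) != 0)
                goto fail;
        } else if (err == EILSEQ || err == EINVAL) {
            p_in++;                         /* drop the offending byte */
            in_left--;
            out[used++] = '?';
            iconv(cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
        } else {
            goto fail;
        }
    }

    out[used] = '\0';
    iconv_close(cd);
    return out;

fail:
    free(out);
    iconv_close(cd);
    return NULL;
}
```

The second mode (a callback deciding what to append and how much input to skip) would replace the fixed "drop one byte, append '?'" branch with a call to the supplied function.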

fippo 2004-10-14 12:54 reporter ~0000202 Last edited: 2004-10-14 13:05	We had this problem with IRC users switching between latin1 and utf8 charsets. We now return -1 instead of throwing an exception and handle the problem at the 'mudlib' level. Brutal diff of strfuns.c (adjust the return type in func_spec accordingly)... see next entry (edit does not seem to like attaching files)

--- projects/psyc/ldmud/3-3/src/strfuns.c 2004-04-28 05:57:59.000000000 +0200
+++ 3-3/src/strfuns.c 2004-10-14 21:49:11.000000000 +0200
@@ -483,22 +483,42 @@
             if (errno == EILSEQ)
             {
+#if 0
                 error("convert_charset(): Invalid character sequence at index %ld\n", (long)(pIn - get_txt(in_str)));
                 /* NOTREACHED */
+#endif
+                free_string_svalue(sp--);
+                free_string_svalue(sp--);
+                free_string_svalue(sp);
+
+                put_number(sp, -1);
                 return sp;
             }

             if (errno == EINVAL)
             {
+#if 0
                 error("convert_charset(): Incomplete character sequence at index %ld\n", (long)(pIn - get_txt(in_str)));
                 /* NOTREACHED */
+#endif
+                free_string_svalue(sp--);
+                free_string_svalue(sp--);
+                free_string_svalue(sp);
+
+                put_number(sp, -1);
                 return sp;
             }
-
+#if 0
             error("convert_charset(): Error %d at index %ld\n"
                  , errno, (long)(pIn - get_txt(in_str))
                  );
             /* NOTREACHED */
+#endif
+            free_string_svalue(sp--);
+            free_string_svalue(sp--);
+            free_string_svalue(sp);
+
+            put_number(sp, -1);
             return sp;
         } /* if (rc < 0) */
     } /* while (in_left) */

edited on: 10-14-04 15:05

warp 2005-04-01 04:13 reporter ~0000358	Just wondering about the status of this enhancement request. Are you waiting for feedback, did you forget it, or did you decide not to change the behaviour? Regarding the various modes you suggested: both are OK, but the first one would definitely already solve my problem, while probably being simpler to implement.

fippo 2005-05-04 06:54 reporter ~0000361	Note: looking at the implementation of the iconv program (iconv_prog.c) that comes with glibc may be helpful, as it has a -c switch ('omit invalid characters from output').

szalicil 2005-06-26 15:48 reporter ~0000380	You can use convert_charset(your_string, "UTF-8", "ISO-8859-1//TRANSLIT") instead; iconv will replace unconvertible characters with "?" or something else, depending on the iconv implementation.

lynx 2006-03-06 18:40 reporter ~0000492	Hi.. I have tried //TRANSLIT and it didn't help at all. Sigh. :( Now I'm using catch(), but since these failures happen rather often, it is a costly solution. I'm using catch() with the nolog flag. Does it skip line number calculation in that case?

zesstra 2011-02-19 19:52 administrator ~0001999	Out of curiosity: is it still the case that appending "//TRANSLIT" does not work? And on which platforms/libiconv versions is that the case? At least on my system, it does work. Another possibility might be "//IGNORE".

zesstra 2011-09-21 11:27 administrator ~0002061	Since there was no other feedback: I believe that with current iconv() the desired effect can be achieved with //TRANSLIT and/or //IGNORE, and we therefore don't need to change anything. If not, please re-open or tell me.
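
For reference, the suffix approach maps onto iconv_open() roughly as follows. This is a minimal sketch, assuming an iconv implementation that accepts the "//TRANSLIT" and "//IGNORE" suffixes on the target charset; convert_with_suffix() is a hypothetical helper and not part of LDMud. As the notes above show, whether the suffixes are honoured, and what replacement is produced, depends on the platform's iconv, so the sketch simply returns NULL when opening or converting fails.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: one-shot conversion with a suffix such as
 * "//TRANSLIT" or "//IGNORE" appended to the target charset name.
 * Returns a malloc()ed NUL-terminated string, or NULL if the converter
 * cannot be opened or the conversion reports an error. */
char *convert_with_suffix(const char *in, const char *from_cs,
                          const char *to_cs, const char *suffix)
{
    assert(in != NULL && from_cs != NULL && to_cs != NULL && suffix != NULL);

    char target[128];
    int n = snprintf(target, sizeof target, "%s%s", to_cs, suffix);
    if (n < 0 || (size_t)n >= sizeof target)
        return NULL;

    iconv_t cd = iconv_open(target, from_cs);
    if (cd == (iconv_t)-1)
        return NULL;            /* charset or suffix not supported here */

    size_t in_left = strlen(in);
    size_t cap = in_left * 8 + 16;  /* generous guess: transliteration may expand */
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *p_in = (char *)in;    /* iconv() does not modify the input */
    char *p_out = out;
    size_t out_left = cap - 1;  /* keep room for the NUL terminator */
    size_t rc = iconv(cd, &p_in, &in_left, &p_out, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *p_out = '\0';
    return out;
}
```

A call such as convert_with_suffix(text, "UTF-8", "ISO-8859-1", "//TRANSLIT") corresponds to the convert_charset(your_string, "UTF-8", "ISO-8859-1//TRANSLIT") call suggested earlier in the thread.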

Date Modified	Username	Field	Change
2004-08-24 03:11	warp	New Issue
2004-09-20 20:46	~~lars~~	Note Added: 0000170
2004-10-14 12:54	fippo	Note Added: 0000202
2004-10-14 13:01	fippo	Note Edited: 0000202
2004-10-14 13:05	fippo	Note Edited: 0000202
2005-04-01 04:13	warp	Note Added: 0000358
2005-05-04 06:54	fippo	Note Added: 0000361
2005-06-26 15:48	szalicil	Note Added: 0000380
2006-03-06 18:40	lynx	Note Added: 0000492
2011-02-19 19:52	zesstra	Note Added: 0001999
2011-02-19 19:52	zesstra	Assigned To	=> zesstra
2011-02-19 19:52	zesstra	Status	new => feedback
2011-02-23 22:02	zesstra	Target Version	=> 3.3.721
2011-09-21 11:27	zesstra	Note Added: 0002061
2011-09-21 11:27	zesstra	Status	feedback => closed
2011-09-21 11:27	zesstra	Resolution	open => no change required