2014-07-09

Unicode character set

I'm implementing SRFI-115 on top of the builtin regular expression engine, adding the features it lacks, and I'm struggling with the SRFI's Unicode requirement. Strictly speaking, the SRFI says Unicode support is optional, and the Unicode-related SREs have no effect if it's not supported.

However, I don't like taking the conservative way. It's always better to implement the full feature set so that users don't have to think about the implementation's limitations. The problem is how to do it while staying compatible with the builtin engine. If Unicode is supported, the SRFI requires, for example, alphabetic to match L, Nl and Other_Alphabetic characters. The current builtin engine considers only ASCII for named character sets (e.g. [[:alpha:]] or even \w). There are a bunch of options to resolve this, but I think the following two are the rational ones:
  1. Support these only in this SRFI
  2. Builtin engine should support Unicode as well
Option #1 is the easier way to do it. Option #2 would break backward compatibility, so I need to be very careful, especially with \w. I believe I've already written a bunch of code that depends on the fact that it matches only ASCII characters.

Let's think about #1 first. This should be relatively easy: I just need to convert each named character set to either its Unicode version or its ASCII version, depending on the expression. The problem, if I dare to call it one, is that I need to prepare two versions of each named character set, though the ASCII one is already there. So all I need to do is add the full Unicode character sets and switch between them according to the context. Easy, isn't it?
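The idea of switching a named set by context can be sketched roughly like this. It's in Python rather than Scheme, just to illustrate; `is_alphabetic` and its `unicode_context` parameter are names made up for this sketch, not part of the builtin engine or the SRFI. Note that Python's `unicodedata` module doesn't expose the Other_Alphabetic property, so the Unicode branch here only approximates what the SRFI asks for.

```python
import unicodedata


def is_alphabetic(ch, unicode_context=False):
    """Hypothetical membership test for the 'alphabetic' named set."""
    if not unicode_context:
        # ASCII context: only A-Z and a-z belong to the set.
        return ("a" <= ch <= "z") or ("A" <= ch <= "Z")
    # Unicode context: general category L* (letters) or Nl (letter numbers).
    # Other_Alphabetic is NOT covered, since unicodedata does not expose it.
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nl"
```

With this, `is_alphabetic("\u00e9")` is false in the default ASCII context but true with `unicode_context=True`, which is exactly the kind of switch option #1 needs.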

Now option #2. The problem is backward compatibility. There are two main places where compatibility could break: one is the regular expression engine itself and the other is SRFI-14. The whole point of this option is to make my life easier at a later stage, when I merge the Unicode character sets into predefined ones such as char-set:letter. For the regular expression engine, I probably just need to make sure the default is the ASCII context. However, I'm not even sure whether I've written code with SRFI-14 that depends on the ASCII character sets. (A quick grep showed that I have, so if I merge them I need to check all of that code...)
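To show what "ASCII context by default" could look like, here is a small sketch using Python's `re` module as a stand-in for the builtin engine. The wrapper `compile_regex` and its `unicode_context` parameter are hypothetical names for this example; only `re.compile` and the `re.ASCII` flag are real Python APIs.

```python
import re


def compile_regex(pattern, unicode_context=False):
    """Hypothetical wrapper: ASCII context unless Unicode is requested."""
    # re.ASCII restricts \w, \d, \s and friends to ASCII-only matching.
    flags = 0 if unicode_context else re.ASCII
    return re.compile(pattern, flags)


text = "na\u00efve caf\u00e9"

# Default (ASCII) context: accented letters are not word characters.
ascii_words = compile_regex(r"\w+").findall(text)

# Explicit Unicode context: accented letters are word characters.
unicode_words = compile_regex(r"\w+", unicode_context=True).findall(text)
```

Existing code that relies on \w being ASCII-only keeps working because it never passes the flag, while new code can opt in to the Unicode behaviour.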

So if I take #2, the following things need to be done:
  • Separate the Unicode and ASCII contexts in the builtin regular expression engine
  • Add the full Unicode sets to the builtin character sets (including SRFI-14)
  • Check whether there is any code that depends on the ASCII-only character sets
I feel it's somewhat not worth the effort, but it's cleaner than #1. Well, it's always good to have some challenges, so I should take the harder way.
