2015-04-10

Improving string->utf8 and utf8->string

The procedures string->utf8 and utf8->string are, at least in my mind, among the most frequently used procedures. Until 0.6.2, the conversion was done via an allocated textual port. However, this approach allocates lots of chunks of memory (one conversion allocates at least 32 bytes for the binary port and 128 bytes for the textual port) and calls a bunch of C functions which may not be inlined because they are invoked through function pointers.

If we didn't know which encoding to use, this approach would not be bad. But here we know it's UTF-8, so I've changed the procedures to do the UCS4->UTF8 and UTF8->UCS4 conversions directly, making sure memory allocation happens only once per conversion. Now, since these changes are meant to improve performance, I need to measure how much they actually improve it. So I used the following piece of code as a benchmark.
#!r6rs
(import (rnrs) (time))

(define t)

(define (file->string file) (call-with-input-file file get-string-all))

(define-syntax dotimes
  (syntax-rules ()
    ((_ (i count) body ...)
     (do ((i 0 (+ i 1)))
         ((= i count) #t)
       body ...))))

;; CMakeLists.txt has approx 20KB of content.
(define bv (string->utf8 (file->string "CMakeLists.txt")))
(define s (file->string "CMakeLists.txt"))

(time
 (dotimes (i 10000)
   (set! t (utf8->string bv))))

(time
 (dotimes (i 10000)
   (set! t (string->utf8 s))))
With a small string or bytevector I probably wouldn't see much improvement, so I used a somewhat bigger one. (CMakeLists.txt now contains 20KB of data... I feel like I need to refactor it...)

And the results:
$ sash test2.scm

;;  (dotimes (i 10000) (set! t (utf8->string bv)))
;;  4.170478 real    4.163981 user    2.8e-500 sys

;;  (dotimes (i 10000) (set! t (string->utf8 s)))
;;  2.646136 real    2.638321 user    0.003989 sys

$ ./build/sagittarius test2.scm

;;  (dotimes (i 10000) (set! t (utf8->string bv)))
;;  1.437317 real    1.431425 user    0.003978 sys

;;  (dotimes (i 10000) (set! t (string->utf8 s)))
;;  1.210154 real    1.208894 user    0.000000 sys
It's now roughly twice as fast as before. The numbers don't show it, but GC behaviour has also improved (it occurs approximately half as often). This is rather obvious, since the conversion no longer allocates any unnecessary memory at all.

I've also compared with other implementations: Vicare, Mosh and Ypsilon. To run the benchmark on Vicare and Mosh, I needed to create a compatibility layer for the (time) library, like this:
;; time.vicare.sls
(library (time)
    (export time)
    (import (only (vicare) time)))

;; time.mosh.sls
(library (time)
    (export time)
    (import (only (mosh) time)))
I couldn't figure out how to create such a library on Racket, so I just skipped it.
Here are the results:
$ vicare -L . test2.scm
running stats for (dotimes (i 10000) (set! t (utf8->string bv))):
    103 collections
    1818 ms elapsed cpu time, including 98 ms collecting
    1821 ms elapsed real time, including 100 ms collecting
    864268560 bytes allocated
running stats for (dotimes (i 10000) (set! t (string->utf8 s))):
    26 collections
    1539 ms elapsed cpu time, including 7 ms collecting
    1539 ms elapsed real time, including 7 ms collecting
    216186272 bytes allocated
$ mosh --loadpath=. test2.scm

;;3.1819961071014404 real 3.178515 user 0.0 sys

;;6.042346000671387 real 6.038440000000001 user 0.0 sys

$ ypsilon test2.scm

;;  0.198673 real    0.211882 user    0.021317 sys

;;  0.087744 real    0.108816 user    0.012085 sys
Ypsilon is the fastest. That's because Ypsilon uses UTF-8 as its internal string representation, so it only needs to copy the bytes without any conversion. Whenever I see this kind of difference caused by a design choice, I always wonder whether I made the wrong decision. Well, it's too late now anyway...

For now, this only applies to the UTF-8 conversion procedures, but if I start using other encodings (e.g. UTF-16) a lot, I might apply the same trick to them as well.
