Dynamic compilation (failure)

I don't have much problem with current performance of Sagittarius but it's always better to be faster. And I thought (yes, just thought), a good idea has come up. Well, conclusion first, it wasn't at all so if you want to read successful story, don't waste your time.

The idea was like this: if I can compile procedures to C, then it would be faster. Sounds fine, right? Of course there is always caveats:
  1. Calling Scheme procedure from C is expensive
    • I've learnt this from JIT experiment
  2. Macro can not be simply mapped to C procedure
  3. Which compiler should use for this?
2 and 3 are later stage issue, so I've checked item nr 1 first.

To check it, I've created a Gist: https://gist.github.com/ktakashi/8d9271600348db136877 There are 3 revisions but the most important ones are first and second one. The last one is just squeezing a bit of performance. The first revision is simply mapped procedure call to internal apply function. This is super slow as I expected. So I've switched to the second revision which uses CPS to do things. Well, as you can see it, it's not bad but not faster than pure Scheme. (NOTE: hashtable-map is implemented in Scheme world.)

Now, I didn't get satisfied result so I've also created very simple tail recursive fact in C. It's like this:
static SgObject fact(SgObject *SG_FP, int SG_ARGC, void *data_)
  SgObject m = SG_MAKE_INT(1), r = SG_MAKE_INT(1);
  SgObject n = SG_FP[0];
  while (TRUE) {
    if (Sg_NumEq(m, SG_FP[0])) {
      return Sg_Mul(m, r);
    } else {
      SgObject t1 = Sg_Add(m, SG_MAKE_INT(1));
      SgObject t2 = Sg_Mul(m, r);
      m = t1;
      r = t2;
  return SG_UNDEF;  /* dummy */
static SG_DEFINE_SUBR(fact__STUB, 1, 0, fact, SG_FALSE, NULL);
Tail recursive call can be a mere loop, so this must be much much much (times 100) faster than pure Scheme (this was my hope). So I've compared.
;; Scheme implementation
(define (fact n)
  (let loop ((m 1) (r 1))
    (if (= m n)
        (* m r)
        (loop (+ m 1) (* m r)))))

;; Expression to compare
(time (dotimes (i 10000) (fact 1000)))
Then I've got the incredible result!!
Scheme implementation

;;  (dotimes (i 10000) (fact 1000))
;;  0.000000 real    0.000000 user    0.000000 sys
C implementation

;;  (dotimes (i 10000) (fact 1000))
;;  4.692019 real    2.483000 user    1.622000 sys
Hooray!!! It's incomparable! WHAAAATTT???!!!

Well, if I just close my computer without evaluating this result, I would do the same thing in future. It's painful but I need to digest it... I guess there are couple of reasons but the biggest thing is that the VM is extremely turned especially those basic arithmetic operations. For example, addition of fixnum doesn't call C function but just do it on VM. If I disassemble the fact, then I can only see 21 VM instructions and only MUL uses C function call.
(disasm fact)

;; size: 21
;;    0: CONSTI_PUSH(1)
;;    1: CONSTI_PUSH(1)
;;    2: LREF_PUSH(1)
;;    3: LREF(0)
;;    4: BNNUME 5                  ; (if (= m n) (* m r) (loop (+ m ...
;;    6: LREF_PUSH(1)
;;    7: LREF(2)
;;    8: MUL
;;    9: RET
;;   10: LREF(1)
;;   11: ADDI(1)
;;   12: PUSH
;;   13: LREF_PUSH(1)
;;   14: LREF(2)
;;   15: MUL
;;   16: PUSH
;;   17: SHIFTJ(2 1)
;;   18: JUMP -17
;;   20: RET
The rest is simply done on VM. Thus, there is no overhead of calling C function.The C code version, on the other hand, it requires 3 C function calls for each iteration, plus inside of the C function. 3 times difference can be a huge difference (well, it actually is).

If I couldn't get any performance improved in this micro benchmark, then it wouldn't be any improvement in practice either. So, I just record this result and move forward...

No comments:

Post a Comment