The idea was like this: if I can compile procedures to C, then it would be faster. Sounds fine, right? Of course there is always caveats:
- Calling Scheme procedure from C is expensive
- I've learnt this from JIT experiment
- Macro can not be simply mapped to C procedure
- Which compiler should use for this?
To check it, I've created a Gist: https://gist.github.com/ktakashi/8d9271600348db136877 There are 3 revisions but the most important ones are first and second one. The last one is just squeezing a bit of performance. The first revision is simply mapped procedure call to internal apply function. This is super slow as I expected. So I've switched to the second revision which uses CPS to do things. Well, as you can see it, it's not bad but not faster than pure Scheme. (NOTE:
hashtable-map
is implemented in Scheme world.)Now, I didn't get satisfied result so I've also created very simple tail recursive fact in C. It's like this:
static SgObject fact(SgObject *SG_FP, int SG_ARGC, void *data_) { SgObject m = SG_MAKE_INT(1), r = SG_MAKE_INT(1); SgObject n = SG_FP[0]; while (TRUE) { if (Sg_NumEq(m, SG_FP[0])) { return Sg_Mul(m, r); } else { SgObject t1 = Sg_Add(m, SG_MAKE_INT(1)); SgObject t2 = Sg_Mul(m, r); m = t1; r = t2; } } return SG_UNDEF; /* dummy */ } static SG_DEFINE_SUBR(fact__STUB, 1, 0, fact, SG_FALSE, NULL);Tail recursive call can be a mere loop, so this must be much much much (times 100) faster than pure Scheme (this was my hope). So I've compared.
;; Scheme implementation (define (fact n) (let loop ((m 1) (r 1)) (if (= m n) (* m r) (loop (+ m 1) (* m r))))) ;; Expression to compare (time (dotimes (i 10000) (fact 1000)))Then I've got the incredible result!!
Scheme implementation ;; (dotimes (i 10000) (fact 1000)) ;; 0.000000 real 0.000000 user 0.000000 sys C implementation ;; (dotimes (i 10000) (fact 1000)) ;; 4.692019 real 2.483000 user 1.622000 sysHooray!!! It's incomparable! WHAAAATTT???!!!
Well, if I just close my computer without evaluating this result, I would do the same thing in future. It's painful but I need to digest it... I guess there are couple of reasons but the biggest thing is that the VM is extremely turned especially those basic arithmetic operations. For example, addition of fixnum doesn't call C function but just do it on VM. If I disassemble the
fact
, then I can only see 21 VM instructions and only MUL
uses C function call.(disasm fact) ;; size: 21 ;; 0: CONSTI_PUSH(1) ;; 1: CONSTI_PUSH(1) ;; 2: LREF_PUSH(1) ;; 3: LREF(0) ;; 4: BNNUME 5 ; (if (= m n) (* m r) (loop (+ m ... ;; 6: LREF_PUSH(1) ;; 7: LREF(2) ;; 8: MUL ;; 9: RET ;; 10: LREF(1) ;; 11: ADDI(1) ;; 12: PUSH ;; 13: LREF_PUSH(1) ;; 14: LREF(2) ;; 15: MUL ;; 16: PUSH ;; 17: SHIFTJ(2 1) ;; 18: JUMP -17 ;; 20: RETThe rest is simply done on VM. Thus, there is no overhead of calling C function.The C code version, on the other hand, it requires 3 C function calls for each iteration, plus inside of the C function. 3 times difference can be a huge difference (well, it actually is).
If I couldn't get any performance improved in this micro benchmark, then it wouldn't be any improvement in practice either. So, I just record this result and move forward...
No comments:
Post a Comment