Let's start Scheme

2015-01-10

Shallow string copy

Since Sagittarius 0.5.10, symbols and keywords are also target of GC. And both are using string as it's real name. The strings were copied and made as literal string before so that symbol->string could just return the given symbols name. It's not the same story now. The whole purpose of GCable symbols and keywords is saving memory. For example, what would happen if symbols were not GCed:
(dotimes (i 10000000)
  (string->symbol (format "symbol.~a" i)))
;; -> 10000000 symbols will be created
You would think nobody writes this type of code. But what if the input was user dependent? To avoid these type of out of memory, I've decided to introduce GCable symbols and keywords.

Now, I didn't make literal string GCable. I don't remember the exact reason but I think it was too naive implementation to do it and caused a lot of problem (maybe I should try again but feels like it would be very tricky to do it since symbols' and keywords' names are string). So whenever symbol->string is called and the name is not a literal string then the procedure copies the string. Thus it can be O(n) process. Stupid isn't it? So I'm thinking to introduce shallow string copy that only takes O(1).

To do it, I need to reconstruct C world string structure. It's currently defined like this:
struct SgStringRec
{
  SG_HEADER;
  unsigned int literalp: 1;
  int size             : (SIZEOF_INT*CHAR_BIT-1);
  SgChar value[1];
};
This is because of GC efficiency. Strings are allocated as atomic memory means there is no pointer inside. This makes GC impact a bit lighter because collector won't traverse inside of allocated pointer (according to Boehm GC document). If I make shallow string copy takes O(1) then the structure should be something like this:
struct SgStringRec
{
  SG_HEADER;
  unsigned int literalp: 1;
  int size             : (SIZEOF_INT*CHAR_BIT-1);
  SgChar *start;
  SgChar *end; /* or add one more flag if this is shallow copied */
};
Then copy string just puts the source pointer. If modification happens, it just make sure the pointer will be deep copied. Seems fine, the consideration is how much GC performance impact it would cause.

The GC impact can be 2 parts:
  • Making string non atomic pointer
  • Limited range shallow copy
The first one is Boehm GC thing so I just need to profile it. The second one can be a problem. Lets say you create a huge string and substring it like this:
(define (split-string s)
  (values (substring s 0 10)
          (substring s 10 20)
          (substring s 20 30)))
The number of character you want is only 30. However if the input string is, let's say, 10000, then the original source pointer would remain until all 3 strings are GCed (or modified). On the other hand, if string copy happens so many times with small string, then actual memory consumption would be smaller. Well, it's not about memory consumption but processing order so this might be kinda trade off.

So to estimate how much impact we would have, I've counted the number of possibly affected procedures.
string-copy103
substring178
symbol->string167
string-set!158
Not so many. This result even includes test cases. symbol->string is not called in many place. So the overhead of O(n) process may not be an issue. (Well, I know this is rubbish because it doesn't show how many it's actually called.)

Hmmmm, should I?

No comments:

Post a Comment