Speed up text character lookup
Created by: dvkt
These are two little "optimizations" that, under some circumstances, make LDPL a lot faster when working with text.
I was trying to parse a big file using LDPL, one character at a time, but it was going kinda slow. So I started tracking down where the time was spent, and it looked like it was mainly in two functions: `join()` and `utf8_substr()`.
The first thing I did was run some benchmarks. It looked like `a += b` was a heck of a lot faster than `join(a, b, a)`, so I made the compiler special-case any concat-style joins.
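
To see why, here's a rough sketch of the difference (my own names, with `std::string` standing in for LDPL's internal text type; the real `join()` signature may differ):

```cpp
#include <string>

// What a plain JOIN a AND b IN a compiles to, more or less: join() builds
// the concatenation in a temporary and assigns it back, copying all of `a`
// even though only `b` is new.
void join(const std::string &a, const std::string &b, std::string &result) {
    result = a + b;
}

// The special case: when the destination is also the first operand, emit
// an in-place append instead. += extends the existing buffer, so there's
// no temporary and no full copy of `a`.
void concat_in_place(std::string &a, const std::string &b) {
    a += b;
}
```

In a loop that appends one piece per iteration, the `+=` form is linear overall, while the `join()` form is quadratic, since every pass re-copies everything accumulated so far.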
The second one is kinda nutty: every time you call `GET CHARACTER AT - FROM - IN`, the whole string has to be searched from the beginning and the positions of all the characters have to be re-discovered, because characters in UTF-8 have variable widths. Kind of a necessary evil, since it's so nice that LDPL supports UTF-8! But it means if you call `get character at 1000 from $filetext in var`, then `get character at 1001 from $filetext in var`, and so on in a loop, each call walks the string again, going through all 1000+ characters to find the one you asked for. It's like that movie Memento - everyone is always getting re-introduced! For big files, this can really get sluggish.
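
Here's roughly what that linear scan looks like (an illustrative sketch, not LDPL's actual `utf8_substr()`):

```cpp
#include <string>

// To find character number `pos` in a UTF-8 string you have to walk past
// every character before it, because each one is 1-4 bytes wide.
// Continuation bytes look like 10xxxxxx; every other byte starts a
// new character.
std::string utf8_char_at(const std::string &s, size_t pos) {
    size_t byte = 0;
    for (size_t seen = 0; seen < pos && byte < s.size(); ++seen) {
        ++byte; // step past the lead byte of the current character...
        while (byte < s.size() &&
               (static_cast<unsigned char>(s[byte]) & 0xC0) == 0x80)
            ++byte; // ...and past its continuation bytes
    }
    if (byte >= s.size()) return ""; // pos is past the end
    size_t len = 1;
    while (byte + len < s.size() &&
           (static_cast<unsigned char>(s[byte + len]) & 0xC0) == 0x80)
        ++len; // gather the continuation bytes of the character at pos
    return s.substr(byte, len);
}
```

Calling something like this with pos = 0, 1, 2, … in a loop makes the total work quadratic in the length of the file.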
So, I added a little `utf8cache` inside `charat()` that maps character positions to their byte offsets the first time a search is done on a string. This means the `get` in the loop basically becomes a constant-time lookup, instead of an O(n) scan on every call (which made the whole loop O(n²)). There might still be some issues with how I compute the string id, but it seems to work okay. All the tests pass, all the examples seem to work, and so do my projects.
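
This is the shape of the idea; the name `utf8cache` matches the patch, but the keying (a pointer as the string id) and the rest here are my own stand-ins, not the actual code:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// The first lookup on a string records the byte offset where every
// character starts; later lookups just index that vector in O(1).
std::unordered_map<const std::string *, std::vector<size_t>> utf8cache;

std::string cached_char_at(const std::string &s, size_t pos) {
    std::vector<size_t> &offsets = utf8cache[&s];
    if (offsets.empty() && !s.empty()) {
        // One O(n) pass: every byte that is not a continuation byte
        // (10xxxxxx) starts a character.
        for (size_t i = 0; i < s.size(); ++i)
            if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                offsets.push_back(i);
    }
    if (pos >= offsets.size()) return "";
    size_t end = (pos + 1 < offsets.size()) ? offsets[pos + 1] : s.size();
    return s.substr(offsets[pos], end - offsets[pos]);
}
```

The sketchy part is exactly the string id question: the cached offsets also have to be thrown away whenever the string is modified, or the cache will hand back characters from the old contents.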
I put together a benchmark of the slowness, but I had to bail after 30 minutes:
LDPL 3.0.4 -- 10 runs

```
$ time ./slow-bin
*** STARTING TEST ***************************************************
> performing 10 runs
^C

real    30m29.856s
user    30m26.815s
sys     0m1.417s
```
So I ran it again with just 1 run, and it took about a minute:
LDPL 3.0.4 -- 1 run

```
$ ./slow-bin
*** STARTING TEST ***************************************************
> performing 1 runs
> looped 146720 times (1 runs of 146720 chars)
> took 67 seconds
*************************************************** TEST COMPLETE ***
```
LDPL w/ character cache

Now here's the benchmark with this patch, starting out with 10 runs:

```
$ ./slow-bin 10
*** STARTING TEST ***************************************************
> performing 10 runs
> looped 3408240 times (10 runs of 340824 chars)
> took 1 seconds
*************************************************** TEST COMPLETE ***
```
So yes, much faster, but a crazy approach. And maybe it'll be slower for small strings? But it's made gild a lot faster, for one, which has been nice.
You can run the benchmark by cloning this gist and running `slow.ldpl`: https://gist.github.com/dvkt/443f5ddebe24e0fe406c051bee19cf3e
Any other ideas for how to speed up the lookup? Seems like most people use a big utf8 library, which I didn't want to bring in. The cache could be improved too, though.