Speed up text character lookup
Created by: dvkt
These are two little "optimizations" that, under some circumstances, make LDPL a lot faster when working with text.
I was trying to parse a big file using LDPL, one character at a time, but it was going kinda slow. So I started tracking down where the time was spent, and it looked like it was mainly in two functions: `join()` and `utf8_substr()`.
The first thing I did was run some benchmarks. It looked like `a += b` was a heck of a lot faster than `join(a, b, a)`, so I made the compiler special-case any concat-style joins.
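
To see why, here's a rough sketch of the difference (my own names, with `std::string` standing in for LDPL's internal text type; the real `join()` signature may differ):

```cpp
#include <string>

// What a plain JOIN a AND b IN a compiles to, more or less: join() builds
// the concatenation in a temporary and assigns it back, copying all of `a`
// even though only `b` is new.
void join(const std::string &a, const std::string &b, std::string &result) {
    result = a + b;
}

// The special case: when the destination is also the first operand, emit
// an in-place append instead. += extends the existing buffer, so there's
// no temporary and no full copy of `a`.
void concat_in_place(std::string &a, const std::string &b) {
    a += b;
}
```

In a loop that appends one piece per iteration, the `+=` form is linear overall, while the `join()` form is quadratic, since every pass re-copies everything accumulated so far.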
The second one is kinda nutty: every time you call `GET CHARACTER AT - FROM - IN`, the whole string has to be searched from the beginning and the positions of all the characters have to be re-discovered, because characters in UTF-8 have variable widths. Kind of a necessary evil, since it's so nice that LDPL supports UTF-8! But it means if you call `get character at 1000 from $filetext in var`, then `get character at 1001 from $filetext in var`, and so on in a loop, each call walks the string again, going through all 1000+ characters to find the one you asked for. It's like that movie Memento - everyone is always getting re-introduced! For big files, this can really get sluggish.
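
Here's roughly what that linear scan looks like (an illustrative sketch, not LDPL's actual `utf8_substr()`):

```cpp
#include <string>

// To find character number `pos` in a UTF-8 string you have to walk past
// every character before it, because each one is 1-4 bytes wide.
// Continuation bytes look like 10xxxxxx; every other byte starts a
// new character.
std::string utf8_char_at(const std::string &s, size_t pos) {
    size_t byte = 0;
    for (size_t seen = 0; seen < pos && byte < s.size(); ++seen) {
        ++byte; // step past the lead byte of the current character...
        while (byte < s.size() &&
               (static_cast<unsigned char>(s[byte]) & 0xC0) == 0x80)
            ++byte; // ...and past its continuation bytes
    }
    if (byte >= s.size()) return ""; // pos is past the end
    size_t len = 1;
    while (byte + len < s.size() &&
           (static_cast<unsigned char>(s[byte + len]) & 0xC0) == 0x80)
        ++len; // gather the continuation bytes of the character at pos
    return s.substr(byte, len);
}
```

Calling something like this with pos = 0, 1, 2, … in a loop makes the total work quadratic in the length of the file.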
So, I added a little `utf8cache` inside `charat()` that maps character positions to their byte offsets the first time a search is done on a string. This means the `get` in the loop basically becomes a constant-time lookup, instead of an O(n) scan on every call (which made the whole loop O(n²)). There might still be some issues with how I compute the string id, but it seems to work okay. All the tests pass, all the examples seem to work, and so do my projects.
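
This is the shape of the idea; the name `utf8cache` matches the patch, but the keying (a pointer as the string id) and the rest here are my own stand-ins, not the actual code:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// The first lookup on a string records the byte offset where every
// character starts; later lookups just index that vector in O(1).
std::unordered_map<const std::string *, std::vector<size_t>> utf8cache;

std::string cached_char_at(const std::string &s, size_t pos) {
    std::vector<size_t> &offsets = utf8cache[&s];
    if (offsets.empty() && !s.empty()) {
        // One O(n) pass: every byte that is not a continuation byte
        // (10xxxxxx) starts a character.
        for (size_t i = 0; i < s.size(); ++i)
            if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                offsets.push_back(i);
    }
    if (pos >= offsets.size()) return "";
    size_t end = (pos + 1 < offsets.size()) ? offsets[pos + 1] : s.size();
    return s.substr(offsets[pos], end - offsets[pos]);
}
```

The sketchy part is exactly the string id question: the cached offsets also have to be thrown away whenever the string is modified, or the cache will hand back characters from the old contents.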
I put together a benchmark of the slowness, but I had to bail after 30 minutes:
LDPL 3.0.4 -- 10 runs

```
$ time ./slow-bin
*** STARTING TEST ***************************************************
> performing 10 runs
^C

real    30m29.856s
user    30m26.815s
sys     0m1.417s
```
So I ran it again with just 1 run, and it took about a minute:
LDPL 3.0.4 -- 1 run

```
$ ./slow-bin
*** STARTING TEST ***************************************************
> performing 1 runs
> looped 146720 times (1 runs of 146720 chars)
> took 67 seconds
*************************************************** TEST COMPLETE ***
```
LDPL w/ character cache

Now here's the benchmark with this patch, starting out with 10 runs:

```
$ ./slow-bin 10
*** STARTING TEST ***************************************************
> performing 10 runs
> looped 3408240 times (10 runs of 340824 chars)
> took 1 seconds
*************************************************** TEST COMPLETE ***
```
So yes, much faster, but a crazy approach. And maybe it'll be slower for small strings? But it's made gild a lot faster, for one, which has been nice.
You can run the benchmark by cloning this gist and running `slow.ldpl`: https://gist.github.com/dvkt/443f5ddebe24e0fe406c051bee19cf3e
Any other ideas for how to speed up the lookup? Seems like most people use a big utf8 library, which I didn't want to bring in. The cache could be improved too, though.