This commit is contained in:
Justin Ethier 2017-06-03 18:21:37 -04:00
parent aac5240a0b
commit 746a0e3e54

173
tail-blog-post-notes.txt Normal file
View file

@ -0,0 +1,173 @@
Initial speed test for `tail` - Cyclone does very poorly compared to other Schemes:
[justin@justin-pc r7rs-benchmarks]$ ./bench cyclone tail
Testing tail under Cyclone
Including postlude /home/justin/Documents/r7rs-benchmarks/src/Cyclone-postlude.scm
Compiling...
Running...
Running tail:25
Elapsed time: 32.261715 seconds (32) for tail:25
+!CSVLINE!+cyclone-0.5.1,tail:25,32.261715
real 0m32.379s
user 0m31.783s
sys 0m0.513s
Change `Makefile.config` in cyclone-bootstrap to do profiling. The `-O2` option in the lines below are replaced with `-g -pg`:
CFLAGS ?= -g -pg -fPIC -rdynamic -Wall -Iinclude -L.
COMP_CFLAGS ?= -g -pg -fPIC -rdynamic -Wall -I$(PREFIX)/include -L$(PREFIX)/lib
Then Cyclone is rebuilt:
[justin@justin-pc cyclone-bootstrap]$ sudo make clean ; ./install.sh
Once this is done a `gmon.out` file will be generated each time Cyclone or a compiled Cyclone program is executed.
Set up for running the `tail` benchmark directly:
[justin@justin-pc r7rs-benchmarks]$ cd /tmp/larcenous/Cyclone/
[justin@justin-pc Cyclone]$ cp -r ~/Documents/r7rs-benchmarks/inputs/ .
[justin@justin-pc Cyclone]$ mkdir outputs
Recompile `tail.scm` to get a version with profiling, and then run it to generate a `gmon.out` file:
[justin@justin-pc Cyclone]$ cyclone tail.scm
[justin@justin-pc Cyclone]$ ./tail < inputs/tail.input
[justin@justin-pc Cyclone]$ ls
gmon.out inputs outputs tail tail.c tail.o tail.scm
[justin@justin-pc Cyclone]$ gprof ./tail gmon.out > report.txt
Let's examine the start of `report.txt` to see the functions that are taking up the most of the program's runtime:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
99.19 61.09 61.09 2331598 0.00 0.00 Cyc_length
0.10 61.15 0.06 11221 0.00 0.00 gc_minor
0.08 61.20 0.05 777200 0.00 0.00 __lambda_3
0.08 61.25 0.05 777097 0.00 0.00 __lambda_300
0.03 61.27 0.02 4664280 0.00 0.00 Cyc_st_add
0.03 61.29 0.02 1562923 0.00 0.00 gc_thr_add_to_move_buffer
0.03 61.31 0.02 777572 0.00 0.00 __lambda_278
0.03 61.33 0.02 777178 0.00 0.00 dispatch
0.03 61.35 0.02 777147 0.00 0.00 __lambda_297
0.03 61.37 0.02 777135 0.00 0.00 __lambda_368
0.03 61.39 0.02 59 0.00 0.00 gc_empty_collector_stack
0.02 61.41 0.02 5441402 0.00 0.00 Cyc_is_pair
0.02 61.42 0.02 ck_pr_cas_char
0.02 61.43 0.01 3109193 0.00 0.00 Cyc_is_null
0.02 61.44 0.01 3109170 0.00 0.00 Cyc_car
Well that's interesting, `tail` is spending all of its time computing `Cyc_length`.
Run gdb:
[justin@justin-pc Cyclone]$ cat inputs/tail.input
25
"inputs/bib"
"outputs/tail.output"
ignored
A compiled Cyclone program is just a regular C program so we can use `gdb` to debug it:
[justin@justin-pc Cyclone]$ gdb ./tail
GNU gdb (GDB) 7.12.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./tail...done.
(gdb) run
Starting program: /tmp/larcenous/Cyclone/tail
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff6031700 (LWP 1850)]
25
"inputs/bib"
"outputs/tail.output"
^C
Thread 1 "tail" received signal SIGINT, Interrupt.
0x00007ffff6fd44ed in read () from /usr/lib/libc.so.6
(gdb) break Cyc_length
Breakpoint 1 at 0x53b145: file runtime.c, line 1713.
(gdb) c
Continuing.
ignored
After continuing a few times, the code breaks here:
#0 Cyc_length (data=0x7cd4e0, l=0x7ffffffbc660) at runtime.c:1713
#1 0x00000000004af96b in __lambda_368 (data=0x7cd4e0, argc=3, _=0x7ffff6635340, k_734906=0x7ffff63331c0, f_731928=0x7ffffffb2dd0, lis1_731927=0x7ffffffbc660,
lists_731926_raw=0x7ca420 <Cyc_void_symbol>) at scheme/base.c:22680
#2 0x00000000004afde1 in __lambda_367 (data=0x7cd4e0, argc=1, self_736727=0x7ffffff9d4c0, r_734921=0x7ca420 <Cyc_void_symbol>) at scheme/base.c:22703
#3 0x00000000004b7aa7 in __lambda_299 (data=0x7cd4e0, argc=3, _=0x7ffff66350e0, k_735061=0x7ffffff9d4c0, char_731994=0x2a, port_731993_raw=0x7ffff6333200) at scheme/base.c:23706
#4 0x000000000055cf5e in do_dispatch (data=0x7cd4e0, argc=3, func=0x4b7523 <__lambda_299>, clo=0x7ffff66350e0, b=0x7ffffff9ccf0) at dispatch.c:6
Opening the source code for `scheme/base.c` you can see this is in the `for-each` function:
static void __lambda_368(void *data, int argc, closure _,object k_734906, object f_731928, object lis1_731927, object lists_731926_raw, ...) {
load_varargs(lists_731926, lists_731926_raw, argc - 3);
Cyc_st_add(data, "scheme/base.sld:for-each");
if( (boolean_f != Cyc_is_null(lis1_731927)) ){
And in `scheme/base.sld` you can see where `length` is being called:
(define (for-each f lis1 . lists)
(if (not (null? lis1))
(if (pair? lists)
(let recur ((lists (cons lis1 lists)))
(receive (cars cdrs) (%cars+cdrs lists)
(if (pair? cars)
(begin
(apply f cars)
(recur cdrs)))))
;; Fast path.
(if (eq? 1 (length lis1))
(f (car lis1))
(begin (f (car lis1))
(for-each f (cdr lis1)))))))
The code can be simplified to make it more obvious what is going on:
(define (for-each f lis1 . lists)
(if (not (null? lis1))
(if (eq? 1 (length lis1))
(f (car lis1))
(begin (f (car lis1))
(for-each f (cdr lis1))))))
Basically on every iteration of `for-each` the code is calling `length` to see if `f` can be called directly. Well, that's not good - the main `for-each` loop itself has a time complexity of O(N). The runtime depends directly on the length of `lis1`. But each time `length` is called it must examine the entire contents of `lis1`, which is another O(N) operation. Combined with the outer loop this raises the overall time complexity to O(N^2) - which can really add up for large values of N.
This reminds me of an old article from Joel Spolsky that talks about the same issue with respect to strings:
http://global.joelonsoftware.com/English/Articles/Interviewing.html
Is their function fast? Look at how many times they call strlen. I've seen O(n^2) algorithms for strrev when it should be O(n), because they are calling strlen again and again in a loop.
Anyway the solution is to check directly for null, instead of scanning the whole string:
(if (null? (cdr lis1))
(f (car lis1))
(begin (f (car lis1))
After rebuilding with this fix we can re-run the tail benchmark. Remember the older code took over 32 seconds to finish: