One of the things about working on compilers that I find satisfying is that what looks like a completely mundane bug at first can expose an interesting question. (Maybe it goes the other direction sometimes too, but I can't think of any examples offhand.)

I mentioned on Wednesday that I was working on trying to isolate a segfault bug resulting from a combination of my (seemingly trivial) changes to trans (that is, what we call the code generation pass), and Eric's changes to trans. After spending a lot of Friday on this too, I was able to isolate the code in trans that was causing the crash:

ret move_val(bcx, DROP_EXISTING, lhs_res.val,
             {bcx: bcx, val: target, kind: owned},
             dty);

bcx has type block, which is the type I changed to be a class. But actually, because the class has to refer to itself recursively, the type block is a type synonym for a box of a class (a ref-counted pointer to a class): block = @block_, where block_ is a class. And actually, it's not quite that either! The type definition for block looks like:

enum block = @block_

In Rust, this is syntactic sugar for:

enum block { block(@block_) }

which has similar semantics to Haskell's "newtype" declarations: that is, block and @block_ should have the exact same runtime representation, but the type system distinguishes them.
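
(For readers more used to modern Rust than the 2012 dialect above, here's a rough sketch of the same idea. Block, BlockData, and Rc are my stand-ins for block, block_, and @; they're not the compiler's actual names.)

use std::rc::Rc;

struct BlockData {
    // Blocks refer to each other recursively, which is why the
    // definition has to go through a pointer type.
    parent: Option<Block>,
    // ... other fields ...
}

// A one-field tuple struct: same runtime representation as
// Rc<BlockData>, but a distinct type -- like Haskell's "newtype".
#[derive(Clone)]
struct Block(Rc<BlockData>);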

There's nothing wrong with any of that, but it turns out the Rust compiler doesn't always hew to the principle that the left-hand and right-hand sides of a newtype should be treated the same way. In particular, making block a newtype instead of a synonym for a boxed type like it was before -- when it looked something like

type block = @{ ... some record fields ... }

means that block arguments started getting passed by reference everywhere, instead of by value as before, which is a bit silly since block is just a pointer-sized thing.
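
(Continuing the modern-Rust sketch from above, the distinction looks like this; the function names are made up:)

// By reference: the callee gets a pointer to a pointer-sized value,
// an extra indirection that buys nothing.
fn use_block_by_ref(bcx: &Block) { let _ = bcx; }

// By value: the callee gets the pointer-sized value itself, which is
// the sensible calling convention for a type that's just a boxed pointer.
fn use_block_by_value(bcx: Block) { let _ = bcx; }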

While that's silly (and possibly slow), that shouldn't cause a compiled program to segfault either. What caused the segfault was an interaction between everything described above, and the last-use and borrowck analyses. I didn't discover this on my own, by the way; all I really did was isolate the code above (which itself took a long time to do). The rest of this post is the result of discussing the problem on IRC with Niko.

Rust has a last-use analysis as part of its liveness analysis pass, which determines when a pointer-typed variable is "dead" within its scope and thus can be zeroed out. If the code generator actually adds the code to zero out variables at their last uses, that can potentially shrink the program's memory footprint by letting memory get freed sooner (if ref-counting is in effect), or similarly let memory get GCed sooner (if garbage collection is in effect). The Rust compiler currently does this optimization, but that's in tension with Rust's overall philosophy of predictability: in order to know when memory gets freed, a resource-conscious programmer has to think about how the liveness analysis works. Since all interesting static analyses are conservative (which is to say, they miss some opportunities to optimize), this is not intuitive.
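
(Here's a hedged illustration, again in modern Rust, where the move is written explicitly by the programmer rather than inserted by an analysis; the names are mine. The optimization amounts to turning the last read of a variable into a move, so the ref count drops as early as possible:)

use std::rc::Rc;

fn consume(buf: Rc<Vec<u8>>) {
    // If this is the last holder, the buffer is freed when it returns.
    let _ = buf;
}

fn example() {
    let big = Rc::new(vec![0u8; 1_000_000]);
    println!("{} bytes", big.len()); // an ordinary use
    consume(big);                    // last use: moving here (rather than
                                     // keeping the variable alive to the
                                     // end of its scope) lets the buffer
                                     // be freed sooner
}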

Rust also has a separate analysis pass called "borrow checking", which is basically about enforcing memory safety in the presence of Rust's sometimes-subtle memory model.

In the code that I quoted above, there's an implicit move-out (zeroing) of bcx: again,

ret move_val(bcx, DROP_EXISTING, lhs_res.val,
             {bcx: bcx, val: target, kind: owned},
             dty);

The second reference to bcx here -- the one inside the record literal -- is the last use of it within the current block. Since I'm not showing you the whole block, you'll have to trust me that it is the last use -- but it would be pretty silly if there were any code after this ret statement, and naturally, the second use of bcx here would be the last one, since arguments evaluate left to right.

But the problem with zeroing bcx after it's copied into the record literal field is that there's still a reference to it! The first use of bcx here, as the first argument to move_val, creates a reference, because bcx gets passed by reference to the callee, because of its type (as I explained before). When bcx gets zeroed out before the call, the reference move_val receives as its first argument points at a zeroed-out slot -- and since move_val happens to dereference its first argument right off the bat, it reads a null pointer and crashes.

It would normally be the job of the borrow checking pass to flag the move here and say "you can't do that, there's still a reference to bcx" -- but because the move results from an internal optimization and is not visible in the language, borrow checking isn't aware of it. It's conceivable to change borrow checking so it "knows about" the output of last-use analysis and checks the moves resulting from it, too. But it might be easier, and more predictable for the programmer, if this optimization simply went away and the programmer was responsible for inserting their own moves -- with, of course, safety checking from the compiler, so that they can never compile a program that would segfault; we believe in "well-typed programs don't go wrong" here.
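
(Modern Rust's borrow checker does exactly this kind of checking for moves the programmer writes. A sketch of the analogous situation, reusing the Block type from the earlier sketch -- today's compiler rejects it:)

fn move_val(dest: &Block, src: Block) { let _ = (dest, src); }

fn example(bcx: Block) {
    let dest = &bcx; // a reference to bcx, like the by-ref first argument
    let src = bcx;   // error[E0505]: cannot move out of `bcx`
                     // because it is borrowed
    move_val(dest, src);
}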

Fortunately, there's an easy workaround to get Eric's code working and stop the segfault: make move_val take its first argument by value, which is easy to do using argument modes -- in fact, it's a one-character fix! (With a 50-word comment explaining why.) This is just an annotation telling the compiler to do what it should have really "known" to do: pass a pointer by value. However, that just masks the real problem, which is the unsoundness of the combination of last-use analysis, use of the results of last-use analysis in code generation, and borrow checking that's oblivious to last-use information. What to do about that is an open question.

The other bug I worked on on Friday -- also quite fun, and quicker to fix -- was #2631. First, thanks to gdb, I was able to minimize a large code base passed along to me by Erick (not to be confused with Eric) down to a roughly 20-line test case that demonstrated the internal compiler error he was seeing. It turned out that some extra information was being added to a side table for certain types of expressions in the AST -- unary, binary, and index operations (all of which can be user-defined overloaded operators) -- but not all of the code that updated or consulted the table was considering that an index operation might be overloaded, too. Once I saw that, I just had to add all the missing cases and everything worked.
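
(The following is only a hypothetical sketch of the shape of that bug, with made-up names: a match over operator expressions consulting a "this is an overloaded-operator call" side table, where the index arm was originally missing.)

use std::collections::HashSet;

type ExprId = u32;

enum Expr {
    Unary(ExprId),
    Binary(ExprId),
    Index(ExprId),
    Other,
}

// method_map: ids of expressions that an earlier pass resolved to
// user-defined operator methods.
fn is_operator_call(e: &Expr, method_map: &HashSet<ExprId>) -> bool {
    match e {
        Expr::Unary(id) | Expr::Binary(id) => method_map.contains(id),
        // The fix: indexing can be a user-defined overload too, so every
        // place that consults the table has to handle this case as well.
        Expr::Index(id) => method_map.contains(id),
        Expr::Other => false,
    }
}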

And finally, I'm still banging my head against the memory leak that I mentioned in the first paragraph of Wednesday's post, and mysteriously, when I pulled changes and rebased, the memory leak turned into an LLVM assertion failure. *cue Twilight Zone music*. I find those bugs particularly annoying to debug, as the LLVM error messages are usually singularly useless and a stack trace doesn't help... and I suspect I might need to hack LLVM itself to get some useful information in this case. Sad, because it takes forever to recompile.
