Vincent Liu: Software Development

Showing posts with label Software Development. Show all posts

Friday, September 14, 2012

Have you ever reused code?

The term 'Code Reuse' feels like a software developer's cliche that had since fallen into the list with other unfashionable tech lexicons. Nevertheless the terminology still lingers on like a bad smell, never fully ready to die off. These days, code reuse feels more like the definition of a myth - a story everybody has heard of, but nobody has witnessed.

If you are ever geeky enough to have raised code reuse as a conversation piece, you'll probably notice that almost everybody have something good to say about it, from a vague feel-good feeling about how good a thing it is, to how it may have profoundly changed a person's life (ok, I exaggerated about this one). If you had to ask anybody for 5 good examples, I'm sure you'll be hard pressed to find anybody with a sensible answer. How about we start with yourself: when was the last time you've reused your own code in a meaningful, substantive way?

These days, the only visible code reuse I know of, is only when I rely on code from a software library - often an external library written by somebody else. Be it a data structure, a fancy graphical widget, or complex mathematical computations, there is probably a library out there which will cater to your need. Writing from scratch is something you never seem to do anymore.

But relying on software libraries is just not my romanticised version of code reuse, the one where the object-oriented programming paradigm had so promised so long ago. Remember the textbook claims on writing your own well-abstracted objects, and how you'll be rewarded in reusing them for all perpetuity? Personally, that lofty promise has certainly fallen short of my expectations from when I was a starry-eyed kid coding in OOP for the first time, to the more experienced software developer today.

So what went wrong? Nothing actually.

Code that has well-defined purposes, inputs and outputs, which are so often used, are easily defined and hence usually gets 'factorised' into code libraries. These libraries get battle-tested by many other developers over time, ironing out any residual kinks, as well as any lingering bugs. Over time, a well-used library makes more compelling sense to use than to roll your own, since it minimises the risk and uncertainty from newly introduced code.

So whatever's that's left for you to work on, are likely new and unique issues you are solving, making it naturally unfactorisable. And if certain portions of code do become apparent enough for you to find a commonality, that's perhaps when you'll refactor your own code to reuse these commonalities, although I suspect the possibility of such situations are getting less likely. Maybe like me, you're feeling a little cheated as well.

Code reuse today is just an euphemism of relying on other people's code - well, it is still reuse, just not your own code, not unless you happen to be a software library writer. But chances are, you are usually not.

I might as well go one step further and declare that we never reuse our own code anymore - as a corollary to the famous Bikeshed Problem. Not all of us will gain sufficient experience in building our own nuclear reactor (or more efficient data structures and algorithms), so what's left remaining is only to focus on the colour of the bikeshed (or button placements within a HTML form) because that's the only thing that's left to do when other people have done all the heavy lifting for you. And that's how it should be, after all, didn't they tell us not to reinvent the wheel?

It's why any boy and his dog today can write an application with some knowledge of HTML, CSS and Javascript - nobody needs to know how to code a rasteriser for transforming vectors into pixels, write their own graphics routines so that they can display a button, input, or to write their own binary search tree in order to use a hashmap, since they don't have to - the first principles of software systems are all conveniently abstracted into libraries, frameworks, and easy APIs that they can use.

It is not a bad thing, but it is also to no wonder why any arts major can simply write a web application and proclaim themselves to be a software developer these days. While I wouldn't mind them doing a webpage for me, I won't go as far to trust a lay-coder on anything that's of any algorithmic complexity.

On the flip side, it's never been better to be a software developer; we are more productive from the assortment of libraries that are at our disposal, from the myriad software frameworks to numerous tools that we utilise today - all of which has allowed us to write software systems that would be difficult in the past, a relative breeze today.

As software development goes these days, we are indeed standing on the shoulders of giants.

Friday, August 24, 2012

Google Drive does not work if your network is slow

I solely use Google Docs, erm Drive, for working with documents these days. While it's named "Drive" now, I'll refer it as the old incarnation "Docs" as it's the document editor that I'm about to rant here.

Google Docs as an editor, is simple to use, and is very accessible - there is no need to install any specific software for it, all you need to do is to open it up through the web browser. But the best thing I like about it, is that I can edit the document without a care and not have to worry about saving the document somewhere so that I can resume editing elsewhere later. Everything is available as long as I have access to the Internet.

Now, that's all fine and dandy, except if you have a "slow" connection. And when I say "slow", I don't mean the archaic 56kbps speeds back in the heyday where people still dial-up a modem connected to a copper phone line. Slow in Google's context, apparently meant anything at mobile broadband speeds (@1mbps).

Google Docs had been working fine, prior to the fairly recently change they've introduced the "we'll save as you type" feature. The old Google Docs wasn't that bandwidth hungry since saving the document was in coarser time blocks instead of the consistent synching that they are doing now.

With the recent changes, Google Docs appear to either suck up more bandwidth, or have lower latency requirements that my humble mobile broadband dongle does not appear to satisfy anymore. For whatever I type in, after 2 minutes working into the document, Google Docs will just hang at "Saving..." and then produce this screen:

This error is consistently reproducible, and it's not even a complex document we're talking about here - it's essentially a text file editable by vim that I copy and paste into sometimes. I don't get how Google gets this so wrong - we're talking about a document editor for a simple file, for god's sake, what kind of network requirements do you need in order to make it work?!

Monday, July 11, 2011

The x86_64 Calling Convention

I suppose I can consider myself an 'old-school' developer now; even though I have been reading the AMD64 ABI documentation, I still haven't fully absorbed it into my head yet, which is evidenced by the recent two situations I had today where RTFM-ing would have had saved me hours of GDB debugging pain.

I have been coding some assembly instructions to make C-calls at runtime to a debugging routine, but the call seems to always ends up mysteriously trampling the JIT-ed routines, making the VM take unexpected execution paths and causing some unlikely assertions to be fired.

The situation is confounded by a number of issues:

the code generated is dynamic, and therefore there are no debugging symbols associated with them compared to code typically generated by the assembler/compiler;
there are different types of call-frames for a given method; 1 for a pre-compiled stub, 1 for a frame that's crossed-over from JIT-ed code to native code, and 1 for the JIT-ed code itself;
when the eventual assertion does manifest, the code is already far away in the rabbit-hole from where the original problem manifested. And because some of the JIT-ed code actually makes a "JMP", unlike a "CALL", you can't actually figure out where the code originated from, since %rip is never saved on the call stack.

While situations 1 and 2 make debugging difficult by having the need to keep a lot of contextual information in order to figure out what's going on, situation 3 is just impossible to debug if the bug is non-deterministic in nature. For example, each compiled method in the VM generates a small assembly stub that replaces the actual code to be executed; when the stub gets executed for the first time, it triggers of the JIT compiler at runtime to compile the real method from its intermediate representation. The compiled method then replaces the stub, hence subsequent invocations will simply call the already JIT-generated method, thereby executing at native-speed, like just as you would get on compiled code.

To optimise on space, the stubs are made as small as possible (~20 bytes), and the common execution body shared by all stubs is factored into a common block. All stubs will eventually perform a global "JMP" instruction to this common block. In order to faciliate communication, all shared data between the stub and the common code block is passed on the thread stack, where the common offset to the method handle is agreed upon.

While the design is elegant, it is also impossible to debug when it breaks; the non-deterministic-ness of the bug seems to surface from time-to-time, where it seems to suggest that the thread stack got corrupted or that it's not passing the method handle correctly. Even when GDB is left running, by the time the assertion triggers, it's already past the fact, and therefore it is unable to trace back to the originating path.

I thought it might be a good idea to inject some debugging calls to trace the execution and stack pointer at runtime, so that I can figure out which stub was last called and the stack height when the call was made; the two information combined should give me sufficient hints on where the problem might lie. However, my injected code has introduced two other issues that I had overlooked, which brings me back into the discussion of the x86_64 ABI again; if you ever wanted to template any assembly instructions into your code that relies on an external library call, do keep these 2 points from the ABI specification in mind:

Save ALL caller-saves registers, not just only the ones that you are using.
(§3.2.2) The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.

I have to say that I've dismissed (1) since I've gotten use to the style of only documenting and saving the registers that was used; the convention was something that I had picked up from Peter Norton's 1992 book, "Assembly Language for the PC". For those who don't know, he's the "Norton" that Symantec's Norton Antivirus is named after. I still have the out-of-print book on my desk as a keepsake; it reminds me of the the memories of reading it and scribbling code on a piece of paper at my local library. Remarkably, that was how I learnt assembly, since I didn't have a computer back then. Thumbing through the book today, I still have an incredible respect for Peter's coding prowess. He had a way of organising his procedures so elegantly such that each of them all fitted perfectly together from chapter to chapter.

Sorry, got sidetracked. So yes, point (1) - to save ALL registers; this is necessary because all caller-saved registers can actually be occupied by the JIT routines as input arguments to the callee; while this typically means the 6 defined registers (%rdi, %rsi, %rdx, %rcx, %r8, %r9) for general input (see §3.2.3), other registers can also be trashed upon a call return, so as a rule-of-thumb save everything, except the callee-saved registers (%rbx, %rbp, %r12 to %r15), which are guaranteed to be preserved.

Point (2) - I haven't observed a reproducible side effect from this; however the failure points between adhering to it and not actually causes a visible difference in the JIT-ed code's path; therefore there is a need to be on the side of caution. I seem to have observed that some memory faults from not following this directive, but I can't ascertain this for a fact yet.

Finally, a self-inflicted bug that I'd like to remind myself of; remember make sure to deduct from %rsp if any memory has been written onto the thread stack; otherwise any function calls may unknowingly overwrite it!

For all the trouble with debugging that I've gotten myself into, there is at least a silver-lining to it; I had made the problem deterministic, or if it isn't the same problem, it was a similar class of problem that I can consistently reproduce to analyse its behaviour and learn from the mistakes I have been making. Because of the determinism, I was able to use GDB's reversible debugging feature to record the execution from the stub to the common code to gain a better understanding of how the generated code actually works. It's a really nifty feature, and I'm glad to have it as my first useful case of applied reversible debugging in practice.

Saturday, June 04, 2011

Page Faults

While going through some code emitted by a Just-In-Time compiler (JIT), I’ve encountered a curious piece of code which suggested that if access to the process' utilisable stack space isn’t done incrementally, it will cause an “access violation”.

Normally, I wouldn’t have bothered with the problem. But in this case, the JIT-ed code makes uninitialised reads to the process stack, causing valgrind to generate a huge amount of spurious warnings in its log. This makes it difficult to sieve through relevant details, and makes it impossible to generate a static suppression for because the JIT-ed code's call frames are dynamically generated and arbitrary in nature.

I didn’t see what’s wrong with touching memory that's already accessible by the application, so I didn’t really grok what the “access violation” exactly implies. A little sleuthing is required, so I wrote a little code to test out the “access violation”:

#.equ INCREMENT, 0x1000
.equ INCREMENT, 0x1800

.section .text

FORMAT_STR:
 .string "%d\n"

.globl _start
_start:
 push %rbp
 movq %rsp, %rbp

 # loop and keep touching stack space
 movq $640, %rcx
 movq $-0x1000, %rbx
again:
 movq (%rbp, %rbx), %rax
 subq $INCREMENT, %rbx

 movq %rbx, %r12   # use callee-save to prevent push to stack   
 movq %rbx, %rsi
 movq $FORMAT_STR, %rdi
 movq $0, %rax
 call printf       # but call pushes to stack too :(
 movq %r12, %rbx
 loop again

 movq $1, %rbx
 movq $1, %rax
 int $0x80

Note the commented code at the first line; this represents the original page size boundary in which the JIT emitted that’s causing the offending uninitialised memory access; if the size isn’t extended, the code will segmentation fault at around the 8MB mark. The corresponds to what ‘ulimit -s’ reports on the OS.

However, if the size gets incremented to 0x1800 bytes for example, the code will segmentation fault way much earlier, at around the 140k mark, which puzzled me. Looking at ‘dmesg’ shows something interesting:

[806331.042666] incremental[28910]: segfault at 7fff983ad8a8 ip 0000000000400256 sp 00007fff983cf8a8 error 4 in incremental[400000+1000]

I’m surprised that the kernel actually reports this error, so I started searching for the error string on Google code search. The likely matches came from cygwin which does mention “access violation” and xen-source where it indicates a page fault.

Reading through the definition suggests that I’m causing a hard page fault, but I wanted to make sure that the error code 4 is exactly meaning this. Some cursory research led me to a pretty helpful CS page explaining page faults, with excerpts from Linux 1.0’s sources; scanning through it which showed that "error 4" means that the error comes from user space (as opposed to kernel space).

The code also indicates that if an allocation exceeds the OS page size, the kernel is free to abort the program, which explains the error. Further research also led me to the getpagesize() system call, which verified that the page size for Linux is set at 4KB.

So mystery solved. I suppose the next thing I can do, is to make a nasty hack in the JIT to make spurious writes instead of reads instead; that should get rid of all the valgrind false positives, but I can’t say it’s the most elegant way of resolving the issue.

Saturday, April 02, 2011

No Technical Support Provided

Readers,

If you're came here through the links from my other blog posts, please take time to read what I have to say.

The solutions to the problems I solve, are large done to "scratch my own itch". I like to share these solutions through my blog with the hope that it'll be useful to others facing the same issues. However, I have a pretty intense full-time job in my own company, which leaves me very little time to be providing any specific guidance in any related problems that you may face.

You may want to comment on the blog post, and hopefully, some good Samaritans may give out more wisdom or advice. Or you may be the one who's giving help to others, and good on you if that's the case. That is what community spirit is all about.

Unfortunately, I cannot be here to provide technical support for you and if I don't have further insight or time to the problems you have, you are largely on your own.

Thank you for understanding.

Saturday, June 05, 2010

Why a volatile is different in C/C++ and Java

The volatile keyword is highly subjective to the language and the platform it is implemented on. While Java provides a consistent behaviour of volatile across all architectures, this is not the case for languages that are directly compiled into the native machine platform, such as in the case of C/C++. Let's try to understand why this is the case.

Let a, b

be member of a set of program actions

, and $v_{n} (a)$ be function that applies the volatility requirement to an action, where the subscript

denotes the

-th iteration in which a volatile action is applied, and $\rightarrow$ be the precedes operator, which is explained previously. For all program actions, the following rules holds:

$v_n(a) \rightarrow v_{n+1}(a)$

$a \rightarrow v_n(b) \Rightarrow a \rightarrow v_{n+i}(b)$ where

$i \in \mathbb{N}$

Rule 1 says that all volatile functions enforce a total order, where the function $v_{n} (a)$ always precedes $v_{n+1} (a)$ , where rule 2 says that if an action

precedes a volatile function on an action

on the

-th iteration, then the action

must necessarily precede all subsequent volatile functions applied to

.

This is a very strong memory requirement in Java, in fact it is much stronger than compared to C/C++. The C/C++ language specification has no such restriction on memory ordering, and leaves it to the compiler implementation to decide how non-volatile actions are ordered around volatile actions.

Let's consider how the these rules affect program execution with a simple code sample:


int a = 0;
int b = 0;
volatile int count = 0;

a  = 1;
count = 1;
b = 2;
count = 2;

In C/C++, the volatile keyword only guarantees that the count variable cannot be reordered against each other, ie. if count == 2, then count = 1 must necessarily precede it. However, there is neither a guarantee that a == 1, nor that b == 2.

In Java, given the stronger guarantee defined above, then if count == 1, then the assertion a == 1 must be true. Similarly, if count == 2 then assertion that a == 1 && b == 2 must be true. This is what is means by the strict memory guarantee that Java offers that C/C++ does not.

However this does not mean that C/C++ will not behave the same way Java does. Whether if it does so depends on (1) whether if the compiler performs any code reordering that may be in a surprising order, but legit order, and (2) whether if the underlying machine architecture upholds the same strict memory order, provided that the compiler does not perform any surprising code reordering.

For instance, compiling code on gcc with -O0 set on all x86 platforms will conform to (and is stricter than) Java's memory model, but other architectures such as the PowerPC, Alpha, Itanium all uphold a weaker memory model which can exhibit surprising program behaviours that a programmer may not expect. Caveat Lector!

Anyhow, if you are interested in more memory model consistency rules, you might want to watch Intel's explanation of the x86 memory model where the nuances of memory ordering is explained in good detail. Enjoy!

Saturday, May 22, 2010

Google Docs's Equation Editor does not Create Permalink Images

I initially thought I could get away with it by linking Latex images for my equations directly from Google Docs to Blogger, but it seems that the image URLs are not permalinks, causing the images in my blog posts to be missing after a while.

Wordpress has a plugin to support this directly, but there's a nary of solution for Blogger.

Still trying to look for possible workarounds to this problem. Hopefully it's not one that involves manually copying and linking the images directly. :(

Saturday, May 15, 2010

Difference between Atomic and Volatile?

There seems to be a tendency for people to confuse an atomic action with a volatile memory access. This is quite natural, since they exhibit similar properties and makes it hard for people to distinguish between the two.

So what is the common property between an atomic action and a volatile memory? They both exhibit the property of total ordering. Let's assume for $\forall a, b, c \in A$ where

is the set of possible actions a program can execute, and $a \rightarrow b$ denotes that an action

precedes action

, then the property of total order means:

$a \rightarrow b \wedge b \rightarrow a \Rightarrow a = b$

$a \rightarrow b , b \rightarrow c \Rightarrow a \rightarrow c$

$a \rightarrow b \vee b \rightarrow a$

Literally, the set of rules mean (1) that for any action

that precedes

and any action

that precedes

, then the action

must necessarily be the same action as

. (2) If

precedes

, and

precedes

, then

must precede

. (3)

must precede

. These are just some mathematical (or convoluted) formalisms of saying any action is guaranteed to be in sequential order.

But that's where the similarity ends.

An atomic action is literally what it means: an action that is indivisible. Volatile memory only provides a guarantee of sequential order, but offers no guarantee that an interleaving of actions cannot occur. For an illustration, lets take a look at the following code:

a = 0;
a++;

Assume that 'a' is a volatile variable. From the code, we naturally think that the increment operator (++) will atomically increment the value of 'a' to 1, but this assumption is slightly faulty:

[a] = 0;
r0 = [a];
r0 = r0 + 1;
[a] = r0;

The square brackets between [a] denotes a value gotten from memory, while r0 represents a register in which the addition operation actually occurs. An increment operation is actually a composite action, composed of a read, then a write action. (By the way, it is implied that register actions are atomic, despite being a composite action. This is because an operation within a register cannot be directly observed).

So increment is a composite action, compared to being an atomic action, what's the harm? Lets go through the same example, but with 2 concurrent processors executing simultaneously:

CPU1	CPU2
a++;	a++;

Lets assume that a is initially 0 to start. Given that iff the actions from CPU1 and CPU2 are both volatile and atomic, then the only possible transition value that can be seen after execution can only be 2.

However, if the actions from CPU1 and CPU2 are only volatile, then the actions can translate to this:

CPU1	CPU2
r0 = a;	r1 = a;
r0 = r0 + 1;	r1= r1 + 1
a = r0;	a = r1

For all the possible decompositions of

into uniprocessor-

interleavings, one can see that it is possible that both registers r0 and r1 both read a as 0 simultaneously which causes the resultant possible values to be either 1 or 2. Clearly seeing 1 is wrong, since if the operation is atomic, then 1 cannot be a legal value at the end of the execution.

Conversely, a volatile variable can be deemed atomic. Consider the example below:

CPU1	CPU2
a=1;	a=2;

In this case, the only possible set of terminal values are 1 or 2, which is exactly the same as you would get for an atomic action. Therefore, a volatile memory access is atomic iff you perform only a read or write action, which are atomic in nature. But a volatile variable is no guarantee of atomicity if a composite action is executed.

Hence the label atomic actions is a slight misnomer, because it doesn't imply that the action is atomic, but rather it means that it guarantees a composite action is executed as a single atomic action. A well known example of such a composite action is the Compare-and-Swap instruction.

For simplicity, just remember that atomicity is not a property of memory, but a property its actions, to which a volatile memory access may or may not uphold.

Saturday, May 08, 2010

Concurrency vs. Multi-Threading

A lot of people tend to treat concurrency and multi-threading as the same thing. In many ways, they are correct given the similarities between the two. But when trying to reason program behaviour, teasing out the distinction is somewhat important. Concurrency,

, is by definition multi-threading; but multi-threading,

, is not necessarily concurrent.

To make it clear why concurrency is not strictly multi-threading, we start off by defining what concurrency is. Let

be the set of all program actions and that $A, B \subseteq P$ and $\forall a, b, c \in P$ . Also, $a \rightarrow b$ denotes that the action

precedes action

$A \rightarrow B \wedge B \rightarrow A \Rightarrow A = B$

$a \rightarrow b , b \rightarrow c \Rightarrow a \rightarrow c$

$A \rightarrow A$

The explanation is actually straightforward when expressed in plain English.

The set of rules says (1) if all the actions in set

precedes all the actions in set

, and all the actions in set

precedes all the actions in set

, then the set of actions are equivalent; (2) for any action

that precedes action

, and any action

that precedes action

, then action

precedes action

. This is known as a transitive relation, but note that while this transitive relation describes an action, broadly this relationship holds within a set of actions as well. (3) All the set of actions in

precedes all the set of actions in

, i.e. this loosely means that if an action has happened before, then it cannot happen again, which is used to describe how program flows.

To put it simply, concurrency means that for a given quantum of time, there can be one or more program actions happening simultaneously. A set of actions on a given thread T1 will be causally ordered by only on the thread itself, however this set of actions may or may not have an ordering relationship with the set of actions on another thread T2.

Understanding what concurrency is, how is multi-threading different?

Consider multi-threading on a single CPU machine. In a uni-core CPU, all actions are totally ordered, ie no two actions can happen at a single quantum of time, but this is not the case for a multi-core (concurrent) CPU. Obviously, we can easily make a multi-core CPU function like a single core CPU, but certainly not the other way around, which is why $C \geq MT$ .

Even though most times

type programs are what we are interested in, we do usually assign an arbitary program order to these programs to make it functionally behave like a single-core

program, which is much easier to reason with. You see these

to totally ordered

conversions from time to time, such as in my discussion of the Implementation of Volatile in the JVM.

To put it simply, threading is a way of describing program actions for concurrent programs, and sequential programs can also be reasoned in the context of threads, but not the other way round. For a simple illustration, consider the following program:


public class MyLock {
  private volatile int thread_id;
  public void lock ( )
  {
     int my_id = Thread.getID ( );
     thread_id = my_id;
     while ( thread_id == my_id ) { } // wait loop
  }
}

The code is multi-threaded; assume that there are 2 threads executing at the moment, and with thread T1, T2 returning IDs of 1, 2 correspondingly.

There is an interesting property with this piece of code - if the threads are concurrent, then eventually the thread_id value will change, and the code can never halt (and the threads will be fairly-scheduled as an interesting side effect). But if the threads are sequential, then one thread will enter the lock ( ) code, and halts the processor in a wait loop.

For those who may be interested, the code in this example is a part of an implementation of Peterson's algorithm, and a simple and poignant example to show how concurrent programs can be different from multi-threading ones.

Vincent Liu

Have you ever reused code?

Google Drive does not work if your network is slow

The x86_64 Calling Convention

Page Faults

No Technical Support Provided

Why a volatile is different in C/C++ and Java

Google Docs's Equation Editor does not Create Permalink Images

Difference between Atomic and Volatile?

Concurrency vs. Multi-Threading

Categories

Blog Archive