From gorton@spf64.amt.tay1.dec.com Wed Oct 18 15:35:30 1995
Path: news.tuwien.ac.at!newsfeed.ACO.net!Austria.EU.net!EU.net!news.sprintlink.net!in2.uu.net!news1.digital.com!pa.dec.com!sousa.tay1.dec.com!spf64.amt.tay1.dec.com!gorton
From: gorton@spf64.amt.tay1.dec.com (Richard Gorton)
Newsgroups: comp.emulators.misc
Subject: FreePort Express technical info
Date: 18 Oct 1995 14:35:30 GMT
Organization: Digital Equipment Corporation
Lines: 375
Distribution: world
Message-ID: <4633bi$pl0@sousa.tay1.dec.com>
Reply-To: gorton@spf64.amt.tay1.dec.com (Richard Gorton)
NNTP-Posting-Host: spf64.amt.tay1.dec.com
X-Newsreader: mxrn 6.18-8


At SunWorld '95, Digital announced FreePort Express, a SPARC SunOS
to Alpha OSF/1 binary translator.  Here are some of the technical
details about it, and Digital's approach to user-mode binary translation
in general.

What is Binary Translation?
---------------------------
When one needs to be able to run an existing piece of software
on a new platform, there are three basic ways to make this happen,
depending upon source code and compiler availability, and required
performance level of the software.  These are: native compilation,
interpretation, and binary translation.

Native compilation is the traditional approach, and is usually the
first considered.  A native compilation requires the user to have sources,
may require a developer familiar with the source code to do the port,
and may end up taking man-months or even man-years to complete.
However, with highly optimizing compilers, native compilation can produce
code which has a very high performance level.

An emulated solution requires the least amount of user interaction, and
can get an existing application running on a new platform in the shortest
amount of time.  Unfortunately, an emulator also delivers the lowest
performance level, typically running at only a very small fraction of
the possible native performance level.  Emulators with dynamic compilation
capabilities, and emulators with natively compiled libraries, promise
significantly increased performance levels.

Binary translation is a hybrid solution to the migration problem.
It is a compiler based technology, in that it requires an up-front
conversion of the executable from the original form, but unlike a
native port, source code is not required.  Digital's binary
translators are automated to the point of requiring no expert
knowledge of the original application, so the amount of time to
perform a migration is typically less than an hour.

  Because binary translators get to view the entire executable, rather
than individual source files, there are optimizations opportunities
available to a binary translator that a compiler simply cannot perform.
Consequently, the performance of translated executables can be relatively
close to native performance, depending upon architectural, operating
system, and compiler optimization capability differences between the
original and target platforms.
Thus, if the source and target architectures and operating systems are
very similar, and the source platform compilers are mature while the
target platform compilers are immature,
translated code may out-perform natively compiled code.

[Table 1]
Migration		source	relative     	effort to	amount of user
technique		needed	performance	do port	interaction
---------			------	-----------		 ---------	--------------
Native compilation	Yes	High	     	Most	Most
Binary translation		No	Moderate    	Minimal	Minimal
Emulation		No	Low	     	Zero	Zero


The history behind Digitals binary translation efforts
------------------------------------------------------
Digital's efforts in the area of binary translation are being driven
by the Alpha Migration Tools group, which, as a part of Digital
Semiconductor, has the charter to develop tools to aid the migration
of applications to Alpha based platforms.  The group was formed
early into the Alpha program, and initially produced simulators and
emulators for other software groups to use before getting involved
with building binary translators.  The group has extensive compiler,
emulator, and operating system expertise, which are the requisite
ingredients to building a binary translator.  Compiler techniques
are used to convert the raw input instruction sequences from the source
architecture to the target architecture.  As part of every translator
there is an interpreter component which is used as a fallback in the
rare cases where the translator cannot convert some portion of the
source program.  Once the raw bits of the executables have been
converted, there remains the issue of dealing with operating system
differences.  Extensive knowledge of the internals of the target
operating system are key in making the connection between the source
and target environments.

General workings
----------------
The first two binary translator products, VEST and mx,
provide migration support to the installed VAX/OpenVMS and RISC ULTRIX
customer bases respectively.  These two translators have been
shipping as products since the initial releases of OpenVMS and
Digital UNIX on Alpha.*  FreePort Express is the first binary translator
to provide support from a non-Digital platform: SPARC SunOS 4.1.x.
It is also Digital's first binary translator product to convert big-endian
executables into little-endian executables.

* VEST = DECmigrate for OpenVMS - QL-MWMA*-**
  mx = DECmigrate for Digital UNIX - QL-MWNA*-**

[Table 2]
	   Source  Source  Source OS	Target  Target  Target OS
Product    Arch	   OS      size/endian	Arch	OS	size/endian
-------    ------  ------  -----------	------	------	-------
VEST	   VAX     OpenVMS 32/little	Alpha	OpenVMS	32/little
mx	   mips    Ultrix  32/little   	Alpha	UNIX	64/little
FreePort   SPARC   SunOS   32/big	Alpha	UNIX	64/little
  Express

Our approach to date has been to take a static approach towards the
translation process, much like a compiler does, and translate the
executable prior to execution.  Consequently, our translator
products are geared towards software developers, and people whom
are comfortable performing compiles.

Why SPARC SunOS?
----------------
When we started the project, it seemed to us that another RISC UNIX
to RISC UNIX binary translator would be viable; any new RISC UNIX
translator we pursued would have to deal with endian-ness conversion
issues.  Additionally, we would need to pick a source architecture
with a significant enough installed base to maximize the usefulness
of whatever translator we built.  At the time, SUN was the market
share leader in UNIX systems, and SPARC processor implementations
were lagging the rest of the RISC processors in performance.  
In fact, SUN was (and still is) agressively pushing their customers
to adopt Solaris 2.x, and abandon SunOS 4.1.x, in spite of the fact
that the Sun user community was not impressed with the quality
of Solaris 2.0, 2.1, etc.
If we could provide support to the installed SunOS user base
with no loss of performance, it ought to be possible for Digital
to provide a better migration path than Sun for SunOS 4.x users;
it would also provide Digital with the opportunity to sell systems
into sites which have historically been Sun only.

Perceived product requirements
---------------------------------------
Unlike our previous UNIX translator, mx, a SunOS translator would
not only have to handle statically linked executables, but dynamically
linked executables and the associated shared libraries as well
A consequence of this, which we did not fully appreciate, when
we started to implement support these features was the translation
of  SunOS and SPARC specific relocations
into relocations usable by Digital UNIX.
Not only would we need to understand and handle link-time relocations, 
as seen in shared libraries, but load-time relocations as well.
Additionally, we would need to come up with a method to map
SPARC-specific architectural features, such as register windows
into Alphas more conventional register file layout.
Finally, we would need to cause our translated code to perform
at an acceptable performance level.  Our performance
goal  for translated code was to perform at least as well 
on the target Alpha system as it did on the original SPARC based
system, for comparably priced systems.  While this might not
sound particularly difficult, given that Alpha systems perform
at roughly 2x SPARC systems, we didn't believe that such
a goal would be easily achievable, given the architectural
differences between SPARC and Alpha, and the performance
degradation which would occur due to handling the endianness
differences between SunOS and Digital UNIX. 

Architectural issues
-------------------------
In order to be able to make translated code work at all, we needed to
model  the following SPARC features on Alpha:
) SPARC V8
) Byte ordering
) Delay slots
) Condition codes
) Register Windows
) Misc SPARC specific architectural features

) SPARC V8
Both Alpha and SPARC are RISC architectures whose instructions
are 32 bits long.  SPARC V8 is a strictly big-endian architecture,
while Alpha is bi-endian.  Digital's current system offerings are
all little-endian.
Most SPARC based machines running SunOS use the SPARC V7 or
SPARC V8 architecture; while SPARC V9 has existed officially since
Summer 1992 or so, SPARC V9 systems were not in the field
at the start of this project, and were not expected to be mainstream
by the time we completed our translator.  For user-mode code, the most
significant instruction additions to SPARC V8 are integer multiply,
integer divide, and Quad precision floating point operations.

) Byte ordering
The Alpha architecture is Bi-Endian; Digital UNIX systems are
Little-Endian, while SunOS 4.1.x is Big-Endian.  In order to make
translated applications function under Digital UNIX, the programs
data has to exist in memory in its native (Big-Endian) format, but
in Little-Endian format when it is being operated on - i.e. when
it is in a register.  This means that every access to or from memory
has to perform a byte-swap before operations happen.  The fastest
swap sequence we have formulated to swap and sign-extend a 32-bit datum
is 8 Alpha instructions, which comes out to be 12 cycles for a 21064[A]
cpu.  If someone can come up with a faster sequence, we would certainly
like to hear about it.  Clearly, reducing memory accesses is a huge
performance win for translated code.  There are a number of optimizations
which can be usefully applied to eliminating byte swaps:
	1) Simple memory copies need not swap byte order
	2) load/logical op/store need not swap
	3) Load +/- small constant (0-255) can bypass parts of the swap
	4) Unnecessary saves and reloads of registers across calls as
		determined by a usage graph can be optimized away.
	5) Loads of address values from detected jump tables can throw away
		the byte swaps if the address table contents is replace at
		translate time with the appropriately swapped destination
		address.
	6) strictly local storage as determined by usage graph
		can be stored in little-endian order
To date, we have implemented all but the last optimization, and are unlikely
to attempt the last.
 
) Delay slots
Alpha doesn't have delay slots, while SPARC does.  Furthermore, there
are some really unpleasant (and quite legal) delay slot tricks which
we have to deal with.  For example, the following instruction sequences
are particularly tricky:

	call	foo			sets %o7 as return address
	save	%sp, <size>, %sp		return address now in %i7
----------------
	call	foo			sets %o7 as return address
	add	%o7, <adjustment>, %o7	adjust return address to elsewhere
----------------
	call	.+0xc			technique used to load data values
	ld	[o7+8], <reg>		from istream - seen in COBOL compilers
	unimp	<value>			<reg> gets <value>
----------------
	call	foo			tail recursion removal
	restore				effectively a jump to foo
----------------
	ba,a	L_exit			control flow instruction in the
	bne,a	foo			delay slot of an annulled branch.
	ld	[%i1], %i5		seen in strncmp()
L_exit	ret
	restore
----------------

) Condition Codes
SPARC has condition codes, but they require explicit instruction
usage to set them.  We have discovered that in most cases, the
condition code bits are set and tested immediately, after which
they are dead.  These can be replace with branch on register
contents instructions.  Dataflow analysis helps identify when this is
the case.  In the infrequent case when we do have to materialize all
4 condition code bits, we can do so in 4-6 Alpha instructions.  Also,
many compilers put condition code setting instructions in delay slots:
	subcc	%g1, <value>, %g0
	blt	someplace
	addcc	%l0, <value>, %l0
Which means that not having the dataflow analysis is costly in terms
of execution time performance.

) Register Windows
SPARC's register windowing feature (SAVE and RESTORE instructions)
effectively requires us to perform multiple stores/copies for
a SAVE instruction, and multiple copies/loads for a RESTORE instruction.
By analyzing our dataflow graph, we can save/restore only those registers
which are actually necessary.  As an additional optimization, we store
these register values in their homing area on the stack in little-endian
format, which further saves costly byte-swapping computations.

) Miscellaneous SPARC features
Some other nits we have found:
	) SunOS 4.1.x has a 4K page size.  Digital UNIX has an 8K
	  page size.  Fortunately for us, UNIX applications
	  are independant of page size (we have yet to see one
	  which is dependant on the page size being 4K).
	) %y register for mulscc, multiply, and divide.
	  A simplistic translation of anything using mulscc instructions
	  requires more than 15 instructions.  However, the only
	  routines which seem to use mulscc instructions (and the %y
	  register) heavily are multiply and divide routines, which
	  we can recognize and at translation time, and replace with
	  optimal but equivalent Alpha instruction sequences.
	) Tagged add & subtract instructions.
	  The only place we have seen these actually used (so far)
	  is in our own hand-generated test sequences.
	) non-user code accessible registers such as %psr, %wim, %tbr, %asr
	  are not supported.
	) Coprocessor instructions are not supportable
	) Conditional trap instructions.  Some Ada compilers
	  utilized these instructions to perform range checking.
	) SPARC floating point register manipulation.
	  Double precision IEEE math is performed in consequitive
	  even/odd register pairs on SPARC, and loads and stores to
	  these register pairs is usually done via a pair of separate
	  load or store instructions, even though the architecture
	  does support the loading of a register pair with the ldd instruction.
	  Alpha, however, accesses in-memory double precision floating
	  point values as a single 64-bit quantity, and single precision
	  accesses via 32-bit operations.  Our current implementation in the
	  translator is somewhat biased towards double precision; all
	  floating point values are canonicalized (in their corresponding
	  Alpha registers) to be 64 bit quantities.  If we can detect that
	  an operation on a floating point value is going to be a
	  single or double precision operation for sure, we can then
	  generate good code.  Some 'clever' compilers can generate
	  SPARC instruction sequences, presumably as an artifact
	  of the instruction scheduler in those compilers, which require
	  us to generate pessimistic code.  The common cause of this
	  is a double precision load, which has been split into two
	  single precision loads, with the contents of one of the two
	  registers in the pair getting modified between the first and
	  second loads.
	

) 32 vs. 64 bit registers
The fact that Alpha has 64-bit registers is not actually a problem, as
there are 32-bit arithmetic and memory operations in the architecture.
In order to preserve the correct operation of translated applications,
we ensure that the translated code resides in the low 2Gb of memory,
so that pointers can safely be truncated to 32 bits.

Beyond the realm of strict instruction and register mapping between SPARC
and Alpha, there are various semantic idioms which are not easily directly
mapped.  For example,  the CALL instruction put the address of  the CALL
instruction into %o7; as there is a delay slot instruction which gets executed
any return to the calling location must actually return to %o7+8.  By assuming
that we can correctly detect calls to, and returns from subroutines, we can
then dispense with the necessity of having to exactly model the behavior of
SPARC's CALL instruction, and directly utilize Alpha's BSR and RET
instructions without adding in or subtracting 8.
A complication to modeling the call/return sequence with the Digital UNIX
calling sequence arises in the detection and handling of Position Independent
Code ('pic' code).  The SPARC System V ABI supplement specifies that
the method of generating references to data, or external calls in pic code is
to first materialize the current address, usually  (but not always)
via a CALL .+8, and then to add in the relative offset to the appropriate address
in the GOT (Global Offset Table).  On Alpha, one can generate the current
address by performing a BR  <reg>, .+4.  Thus, by modeling the SPARC
call/return instruction sequence directly as the OSF/1 calling 
[Larry should write this stuff, and it needs to be consistent for naming of OS's]
sequence, the translator needs understand a variety of idioms as
being used to establish the current PC as a prelude to addressing
something through the GOT or the PLT (Procedure Linkage Table)
A simple CALL .+8 is easy to recognize as meaning 'establish current PC',
but only if the instruction in the delay slot of  the call does NOT modify %o7.
For example, the sequnce:
	CALL .+8
	RESTORE

Is more acurately a tail recursion optimization causing a jump to .+8, which is
presumably the start of a new routine.  But what about compiler modified
sequences, some of which are not particularly optimal?  For example, at
least one version of GCC emits the following code as 'establish current PC':
	CALL .+12
	SETHI	%hi(value), %l7
	SETHI	%hi(value), %l7
	OR	%l7, %lo(value), %l7

The original instruction sequence was probably:
	CALL .+8
	NOP
	SETHI	%hi(value), %l7
	OR	%l7, %lo(value), %l7

but modified a control-flow peephole optimization.
Likewise, annulled branch optimizations need to be understood,
most effectively by undoing them internally to the translator , so that
we can determine the intended control flow.  One common optimization
is to change NOPs in delay slots by incrementing the branch destination
by 4, converting the conditional branch into an annulled conditional branch,
and copying the instruction at the original branch target into the delay slot
of the conditional branch:

	Original			Optimized
	----------			-------------
	blt	L_next		blt,a	L_next+4
	nop			add	%g1, 0x123, %o2
	...			...
next:	add	%g1, 0x123, %o2	add	%g1, 0x123, %o2
	subcc	%o2, 0x123, %g0	subcc	%o2, 0x123, %g0


-- 
Richard Gorton			Project Leader: mx & FreePort Express
Alpha Migration Tools		Digital Equipment Corp.
Reply-to: gorton@tallis.enet.dec.com
All standard disclaimers apply.