From gorton@spf64.amt.tay1.dec.com Wed Oct 18 15:35:30 1995 Path: news.tuwien.ac.at!newsfeed.ACO.net!Austria.EU.net!EU.net!news.sprintlink.net!in2.uu.net!news1.digital.com!pa.dec.com!sousa.tay1.dec.com!spf64.amt.tay1.dec.com!gorton From: gorton@spf64.amt.tay1.dec.com (Richard Gorton) Newsgroups: comp.emulators.misc Subject: FreePort Express technical info Date: 18 Oct 1995 14:35:30 GMT Organization: Digital Equipment Corporation Lines: 375 Distribution: world Message-ID: <4633bi$pl0@sousa.tay1.dec.com> Reply-To: gorton@spf64.amt.tay1.dec.com (Richard Gorton) NNTP-Posting-Host: spf64.amt.tay1.dec.com X-Newsreader: mxrn 6.18-8 At SunWorld '95, Digital announced FreePort Express, a SPARC SunOS to Alpha OSF/1 binary translator. Here are some of the technical details about it, and Digital's approach to user-mode binary translation in general. What is Binary Translation? --------------------------- When one needs to be able to run an existing piece of software on a new platform, there are three basic ways to make this happen, depending upon source code and compiler availability, and required performance level of the software. These are: native compilation, interpretation, and binary translation. Native compilation is the traditional approach, and is usually the first considered. A native compilation requires the user to have sources, may require a developer familiar with the source code to do the port, and may end up taking man-months or even man-years to complete. However, with highly optimizing compilers, native compilation can produce code which has a very high performance level. An emulated solution requires the least amount of user interaction, and can get an existing application running on a new platform in the shortest amount of time. Unfortunately, an emulator also delivers the lowest performance level, typically running at only a very small fraction of the possible native performance level. Emulators with dynamic compilation capabilities, and emulators with natively compiled libraries, promise significantly increased performance levels. Binary translation is a hybrid solution to the migration problem. It is a compiler based technology, in that it requires an up-front conversion of the executable from the original form, but unlike a native port, source code is not required. Digital's binary translators are automated to the point of requiring no expert knowledge of the original application, so the amount of time to perform a migration is typically less than an hour. Because binary translators get to view the entire executable, rather than individual source files, there are optimizations opportunities available to a binary translator that a compiler simply cannot perform. Consequently, the performance of translated executables can be relatively close to native performance, depending upon architectural, operating system, and compiler optimization capability differences between the original and target platforms. Thus, if the source and target architectures and operating systems are very similar, and the source platform compilers are mature while the target platform compilers are immature, translated code may out-perform natively compiled code. [Table 1] Migration source relative effort to amount of user technique needed performance do port interaction --------- ------ ----------- --------- -------------- Native compilation Yes High Most Most Binary translation No Moderate Minimal Minimal Emulation No Low Zero Zero The history behind Digitals binary translation efforts ------------------------------------------------------ Digital's efforts in the area of binary translation are being driven by the Alpha Migration Tools group, which, as a part of Digital Semiconductor, has the charter to develop tools to aid the migration of applications to Alpha based platforms. The group was formed early into the Alpha program, and initially produced simulators and emulators for other software groups to use before getting involved with building binary translators. The group has extensive compiler, emulator, and operating system expertise, which are the requisite ingredients to building a binary translator. Compiler techniques are used to convert the raw input instruction sequences from the source architecture to the target architecture. As part of every translator there is an interpreter component which is used as a fallback in the rare cases where the translator cannot convert some portion of the source program. Once the raw bits of the executables have been converted, there remains the issue of dealing with operating system differences. Extensive knowledge of the internals of the target operating system are key in making the connection between the source and target environments. General workings ---------------- The first two binary translator products, VEST and mx, provide migration support to the installed VAX/OpenVMS and RISC ULTRIX customer bases respectively. These two translators have been shipping as products since the initial releases of OpenVMS and Digital UNIX on Alpha.* FreePort Express is the first binary translator to provide support from a non-Digital platform: SPARC SunOS 4.1.x. It is also Digital's first binary translator product to convert big-endian executables into little-endian executables. * VEST = DECmigrate for OpenVMS - QL-MWMA*-** mx = DECmigrate for Digital UNIX - QL-MWNA*-** [Table 2] Source Source Source OS Target Target Target OS Product Arch OS size/endian Arch OS size/endian ------- ------ ------ ----------- ------ ------ ------- VEST VAX OpenVMS 32/little Alpha OpenVMS 32/little mx mips Ultrix 32/little Alpha UNIX 64/little FreePort SPARC SunOS 32/big Alpha UNIX 64/little Express Our approach to date has been to take a static approach towards the translation process, much like a compiler does, and translate the executable prior to execution. Consequently, our translator products are geared towards software developers, and people whom are comfortable performing compiles. Why SPARC SunOS? ---------------- When we started the project, it seemed to us that another RISC UNIX to RISC UNIX binary translator would be viable; any new RISC UNIX translator we pursued would have to deal with endian-ness conversion issues. Additionally, we would need to pick a source architecture with a significant enough installed base to maximize the usefulness of whatever translator we built. At the time, SUN was the market share leader in UNIX systems, and SPARC processor implementations were lagging the rest of the RISC processors in performance. In fact, SUN was (and still is) agressively pushing their customers to adopt Solaris 2.x, and abandon SunOS 4.1.x, in spite of the fact that the Sun user community was not impressed with the quality of Solaris 2.0, 2.1, etc. If we could provide support to the installed SunOS user base with no loss of performance, it ought to be possible for Digital to provide a better migration path than Sun for SunOS 4.x users; it would also provide Digital with the opportunity to sell systems into sites which have historically been Sun only. Perceived product requirements --------------------------------------- Unlike our previous UNIX translator, mx, a SunOS translator would not only have to handle statically linked executables, but dynamically linked executables and the associated shared libraries as well A consequence of this, which we did not fully appreciate, when we started to implement support these features was the translation of SunOS and SPARC specific relocations into relocations usable by Digital UNIX. Not only would we need to understand and handle link-time relocations, as seen in shared libraries, but load-time relocations as well. Additionally, we would need to come up with a method to map SPARC-specific architectural features, such as register windows into Alphas more conventional register file layout. Finally, we would need to cause our translated code to perform at an acceptable performance level. Our performance goal for translated code was to perform at least as well on the target Alpha system as it did on the original SPARC based system, for comparably priced systems. While this might not sound particularly difficult, given that Alpha systems perform at roughly 2x SPARC systems, we didn't believe that such a goal would be easily achievable, given the architectural differences between SPARC and Alpha, and the performance degradation which would occur due to handling the endianness differences between SunOS and Digital UNIX. Architectural issues ------------------------- In order to be able to make translated code work at all, we needed to model the following SPARC features on Alpha: ) SPARC V8 ) Byte ordering ) Delay slots ) Condition codes ) Register Windows ) Misc SPARC specific architectural features ) SPARC V8 Both Alpha and SPARC are RISC architectures whose instructions are 32 bits long. SPARC V8 is a strictly big-endian architecture, while Alpha is bi-endian. Digital's current system offerings are all little-endian. Most SPARC based machines running SunOS use the SPARC V7 or SPARC V8 architecture; while SPARC V9 has existed officially since Summer 1992 or so, SPARC V9 systems were not in the field at the start of this project, and were not expected to be mainstream by the time we completed our translator. For user-mode code, the most significant instruction additions to SPARC V8 are integer multiply, integer divide, and Quad precision floating point operations. ) Byte ordering The Alpha architecture is Bi-Endian; Digital UNIX systems are Little-Endian, while SunOS 4.1.x is Big-Endian. In order to make translated applications function under Digital UNIX, the programs data has to exist in memory in its native (Big-Endian) format, but in Little-Endian format when it is being operated on - i.e. when it is in a register. This means that every access to or from memory has to perform a byte-swap before operations happen. The fastest swap sequence we have formulated to swap and sign-extend a 32-bit datum is 8 Alpha instructions, which comes out to be 12 cycles for a 21064[A] cpu. If someone can come up with a faster sequence, we would certainly like to hear about it. Clearly, reducing memory accesses is a huge performance win for translated code. There are a number of optimizations which can be usefully applied to eliminating byte swaps: 1) Simple memory copies need not swap byte order 2) load/logical op/store need not swap 3) Load +/- small constant (0-255) can bypass parts of the swap 4) Unnecessary saves and reloads of registers across calls as determined by a usage graph can be optimized away. 5) Loads of address values from detected jump tables can throw away the byte swaps if the address table contents is replace at translate time with the appropriately swapped destination address. 6) strictly local storage as determined by usage graph can be stored in little-endian order To date, we have implemented all but the last optimization, and are unlikely to attempt the last. ) Delay slots Alpha doesn't have delay slots, while SPARC does. Furthermore, there are some really unpleasant (and quite legal) delay slot tricks which we have to deal with. For example, the following instruction sequences are particularly tricky: call foo sets %o7 as return address save %sp, , %sp return address now in %i7 ---------------- call foo sets %o7 as return address add %o7, , %o7 adjust return address to elsewhere ---------------- call .+0xc technique used to load data values ld [o7+8], from istream - seen in COBOL compilers unimp gets ---------------- call foo tail recursion removal restore effectively a jump to foo ---------------- ba,a L_exit control flow instruction in the bne,a foo delay slot of an annulled branch. ld [%i1], %i5 seen in strncmp() L_exit ret restore ---------------- ) Condition Codes SPARC has condition codes, but they require explicit instruction usage to set them. We have discovered that in most cases, the condition code bits are set and tested immediately, after which they are dead. These can be replace with branch on register contents instructions. Dataflow analysis helps identify when this is the case. In the infrequent case when we do have to materialize all 4 condition code bits, we can do so in 4-6 Alpha instructions. Also, many compilers put condition code setting instructions in delay slots: subcc %g1, , %g0 blt someplace addcc %l0, , %l0 Which means that not having the dataflow analysis is costly in terms of execution time performance. ) Register Windows SPARC's register windowing feature (SAVE and RESTORE instructions) effectively requires us to perform multiple stores/copies for a SAVE instruction, and multiple copies/loads for a RESTORE instruction. By analyzing our dataflow graph, we can save/restore only those registers which are actually necessary. As an additional optimization, we store these register values in their homing area on the stack in little-endian format, which further saves costly byte-swapping computations. ) Miscellaneous SPARC features Some other nits we have found: ) SunOS 4.1.x has a 4K page size. Digital UNIX has an 8K page size. Fortunately for us, UNIX applications are independant of page size (we have yet to see one which is dependant on the page size being 4K). ) %y register for mulscc, multiply, and divide. A simplistic translation of anything using mulscc instructions requires more than 15 instructions. However, the only routines which seem to use mulscc instructions (and the %y register) heavily are multiply and divide routines, which we can recognize and at translation time, and replace with optimal but equivalent Alpha instruction sequences. ) Tagged add & subtract instructions. The only place we have seen these actually used (so far) is in our own hand-generated test sequences. ) non-user code accessible registers such as %psr, %wim, %tbr, %asr are not supported. ) Coprocessor instructions are not supportable ) Conditional trap instructions. Some Ada compilers utilized these instructions to perform range checking. ) SPARC floating point register manipulation. Double precision IEEE math is performed in consequitive even/odd register pairs on SPARC, and loads and stores to these register pairs is usually done via a pair of separate load or store instructions, even though the architecture does support the loading of a register pair with the ldd instruction. Alpha, however, accesses in-memory double precision floating point values as a single 64-bit quantity, and single precision accesses via 32-bit operations. Our current implementation in the translator is somewhat biased towards double precision; all floating point values are canonicalized (in their corresponding Alpha registers) to be 64 bit quantities. If we can detect that an operation on a floating point value is going to be a single or double precision operation for sure, we can then generate good code. Some 'clever' compilers can generate SPARC instruction sequences, presumably as an artifact of the instruction scheduler in those compilers, which require us to generate pessimistic code. The common cause of this is a double precision load, which has been split into two single precision loads, with the contents of one of the two registers in the pair getting modified between the first and second loads. ) 32 vs. 64 bit registers The fact that Alpha has 64-bit registers is not actually a problem, as there are 32-bit arithmetic and memory operations in the architecture. In order to preserve the correct operation of translated applications, we ensure that the translated code resides in the low 2Gb of memory, so that pointers can safely be truncated to 32 bits. Beyond the realm of strict instruction and register mapping between SPARC and Alpha, there are various semantic idioms which are not easily directly mapped. For example, the CALL instruction put the address of the CALL instruction into %o7; as there is a delay slot instruction which gets executed any return to the calling location must actually return to %o7+8. By assuming that we can correctly detect calls to, and returns from subroutines, we can then dispense with the necessity of having to exactly model the behavior of SPARC's CALL instruction, and directly utilize Alpha's BSR and RET instructions without adding in or subtracting 8. A complication to modeling the call/return sequence with the Digital UNIX calling sequence arises in the detection and handling of Position Independent Code ('pic' code). The SPARC System V ABI supplement specifies that the method of generating references to data, or external calls in pic code is to first materialize the current address, usually (but not always) via a CALL .+8, and then to add in the relative offset to the appropriate address in the GOT (Global Offset Table). On Alpha, one can generate the current address by performing a BR , .+4. Thus, by modeling the SPARC call/return instruction sequence directly as the OSF/1 calling [Larry should write this stuff, and it needs to be consistent for naming of OS's] sequence, the translator needs understand a variety of idioms as being used to establish the current PC as a prelude to addressing something through the GOT or the PLT (Procedure Linkage Table) A simple CALL .+8 is easy to recognize as meaning 'establish current PC', but only if the instruction in the delay slot of the call does NOT modify %o7. For example, the sequnce: CALL .+8 RESTORE Is more acurately a tail recursion optimization causing a jump to .+8, which is presumably the start of a new routine. But what about compiler modified sequences, some of which are not particularly optimal? For example, at least one version of GCC emits the following code as 'establish current PC': CALL .+12 SETHI %hi(value), %l7 SETHI %hi(value), %l7 OR %l7, %lo(value), %l7 The original instruction sequence was probably: CALL .+8 NOP SETHI %hi(value), %l7 OR %l7, %lo(value), %l7 but modified a control-flow peephole optimization. Likewise, annulled branch optimizations need to be understood, most effectively by undoing them internally to the translator , so that we can determine the intended control flow. One common optimization is to change NOPs in delay slots by incrementing the branch destination by 4, converting the conditional branch into an annulled conditional branch, and copying the instruction at the original branch target into the delay slot of the conditional branch: Original Optimized ---------- ------------- blt L_next blt,a L_next+4 nop add %g1, 0x123, %o2 ... ... next: add %g1, 0x123, %o2 add %g1, 0x123, %o2 subcc %o2, 0x123, %g0 subcc %o2, 0x123, %g0 -- Richard Gorton Project Leader: mx & FreePort Express Alpha Migration Tools Digital Equipment Corp. Reply-to: gorton@tallis.enet.dec.com All standard disclaimers apply.