X-Git-Url: https://oss.titaniummirror.com/gitweb?a=blobdiff_plain;f=gmp%2Fmpn%2Fia64%2FREADME;fp=gmp%2Fmpn%2Fia64%2FREADME;h=9252271ab7a6487be3733a333e733a6180139f61;hb=6fed43773c9b0ce596dca5686f37ac3fc0fa11c0;hp=0000000000000000000000000000000000000000;hpb=27b11d56b743098deb193d510b337ba22dc52e5c;p=msp430-gcc.git diff --git a/gmp/mpn/ia64/README b/gmp/mpn/ia64/README new file mode 100644 index 00000000..9252271a --- /dev/null +++ b/gmp/mpn/ia64/README @@ -0,0 +1,277 @@ +Copyright 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc. + +This file is part of the GNU MP Library. + +The GNU MP Library is free software; you can redistribute it and/or modify +it under the terms of the GNU Lesser General Public License as published by +the Free Software Foundation; either version 3 of the License, or (at your +option) any later version. + +The GNU MP Library is distributed in the hope that it will be useful, but +WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +License for more details. + +You should have received a copy of the GNU Lesser General Public License +along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. + + + + IA-64 MPN SUBROUTINES + + +This directory contains mpn functions for the IA-64 architecture. + + +CODE ORGANIZATION + + mpn/ia64 itanium-2, and generic ia64 + +The code here has been optimized primarily for Itanium 2. Very few Itanium 1 +chips were ever sold, and Itanium 2 is more powerful, so the latter is what +we concentrate on. + + + +CHIP NOTES + +The IA-64 ISA keeps instructions three and three in 128 bit bundles. +Programmers/compilers need to put explicit breaks `;;' when there are WAW or +RAW dependencies, with some notable exceptions. Such "breaks" are typically +at the end of a bundle, but can be put between operations within some bundle +types too. + +The Itanium 1 and Itanium 2 implementations can under ideal conditions +execute two bundles per cycle. The Itanium 1 allows 4 of these instructions +to do integer operations, while the Itanium 2 allows all 6 to be integer +operations. + +Taken cloop branches seem to insert a bubble into the pipeline most of the +time on Itanium 1. + +Loads to the fp registers bypass the L1 cache and thus get extremely long +latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2. + +The software pipeline stuff using br.ctop instruction causes delays, since +many issue slots are taken up by instructions with zero predicates, and +since many extra instructions are needed to set things up. These features +are clearly designed for code density, not speed. + +Misc pipeline limitations (Itanium 1): +* The getf.sig instruction can only execute in M0. +* At most four integer instructions/cycle. +* Nops take up resources like any plain instructions. + +Misc pipeline limitations (Itanium 2): +* The getf.sig instruction can only execute in M0. +* Nops take up resources like any plain instructions. + + +ASSEMBLY SYNTAX + +.align pads with nops in a text segment, but gas 2.14 and earlier +incorrectly byte-swaps its nop bundle in big endian mode (eg. hpux), making +it come out as break instructions. We use the ALIGN() macro in +mpn/ia64/ia64-defs.m4 when it might be executed across. That macro +suppresses any .align if the problem is detected by configure. Lack of +alignment might hurt performance but will at least be correct. + +foo:: to create a global symbol is not accepted by gas. Use separate +".global foo" and "foo:" instead. + +.global is the standard global directive. gas accepts .globl, but hpux "as" +doesn't. + +.proc / .endp generates the appropriate .type and .size information for ELF, +so the latter directives don't need to be given explicitly. + +.pred.rel "mutex"... is standard for annotating predicate register +relationships. gas also accepts .pred.rel.mutex, but hpux "as" doesn't. + +.pred directives can't be put on a line with a label, like +".Lfoo: .pred ...", the HP assembler on HP-UX 11.23 rejects that. +gas is happy with it, and past versions of HP had seemed ok. + +// is the standard comment sequence, but we prefer "C" since it inhibits m4 +macro expansion. See comments in ia64-defs.m4. + + +REGISTER USAGE + +Special: + r0: constant 0 + r1: global pointer (gp) + r8: return value + r12: stack pointer (sp) + r13: thread pointer (tp) +Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127 +Caller-saves but rotating: r32- + + +================================================================ +mpn_add_n, mpn_sub_n: + +The current code runs at 1.25 c/l on Itanium 2. + +================================================================ +mpn_mul_1: + +The current code runs at 2 c/l on Itanium 2. + +Using a blocked approach, working off of 4 separate places in the operands, +one could make use of the xma accumulation, and approach 1 c/l. + + ldf8 [up] + xma.l + xma.hu + stf8 [wrp] + +================================================================ +mpn_addmul_1: + +The current code runs at 2 c/l on Itanium 2. + +It seems possible to use a blocked approach, as with mpn_mul_1. We should +read rp[] to integer registers, allowing for just one getf.sig per cycle. + + ld8 [rp] + ldf8 [up] + xma.l + xma.hu + getf.sig + add+add+cmp+cmp + st8 [wrp] + +These 10 instructions can be scheduled to approach 1.667 cycles, and with +the 4 cycle latency of xma, this means we need at least 3 blocks. Using +ldfp8 we could approach 1.583 c/l. + +================================================================ +mpn_submul_1: + +The current code runs at 2.25 c/l on Itanium 2. Getting to 2 c/l requires +ldfp8 with all alignment headache that implies. + +================================================================ +mpn_addmul_N + +For best speed, we need to give up using mpn_addmul_1 as the main multiply +building block, and instead take multiple v limbs per loop. For the Itanium +1, we need to take about 8 limbs at a time for full speed. For the Itanium +2, something like mpn_addmul_4 should be enough. + +The add+cmp+cmp+add we use on the other codes is optimal for shortening +recurrencies (1 cycle) but the sequence takes up 4 execution slots. When +recurrency depth is not critical, a more standard 3-cycle add+cmp+add is +better. + +/* First load the 8 values from v */ + ldfp8 v0, v1 = [r35], 16;; + ldfp8 v2, v3 = [r35], 16;; + ldfp8 v4, v5 = [r35], 16;; + ldfp8 v6, v7 = [r35], 16;; + +/* In the inner loop, get a new U limb and store a result limb. */ + mov lc = un +Loop: ldf8 u0 = [r33], 8 + ld8 r0 = [r32] + xma.l lp0 = v0, u0, hp0 + xma.hu hp0 = v0, u0, hp0 + xma.l lp1 = v1, u0, hp1 + xma.hu hp1 = v1, u0, hp1 + xma.l lp2 = v2, u0, hp2 + xma.hu hp2 = v2, u0, hp2 + xma.l lp3 = v3, u0, hp3 + xma.hu hp3 = v3, u0, hp3 + xma.l lp4 = v4, u0, hp4 + xma.hu hp4 = v4, u0, hp4 + xma.l lp5 = v5, u0, hp5 + xma.hu hp5 = v5, u0, hp5 + xma.l lp6 = v6, u0, hp6 + xma.hu hp6 = v6, u0, hp6 + xma.l lp7 = v7, u0, hp7 + xma.hu hp7 = v7, u0, hp7 + getf.sig l0 = lp0 + getf.sig l1 = lp1 + getf.sig l2 = lp2 + getf.sig l3 = lp3 + getf.sig l4 = lp4 + getf.sig l5 = lp5 + getf.sig l6 = lp6 + add+cmp+add xx, l0, r0 + add+cmp+add acc0, acc1, l1 + add+cmp+add acc1, acc2, l2 + add+cmp+add acc2, acc3, l3 + add+cmp+add acc3, acc4, l4 + add+cmp+add acc4, acc5, l5 + add+cmp+add acc5, acc6, l6 + getf.sig acc6 = lp7 + st8 [r32] = xx, 8 + br.cloop Loop + + 49 insn at max 6 insn/cycle: 8.167 cycles/limb8 + 11 memops at max 2 memops/cycle: 5.5 cycles/limb8 + 16 fpops at max 2 fpops/cycle: 8 cycles/limb8 + 21 intops at max 4 intops/cycle: 5.25 cycles/limb8 + 11+21 memops+intops at max 4/cycle 8 cycles/limb8 + +================================================================ +mpn_lshift, mpn_rshift + +The current code runs at 1 cycle/limb on Itanium 2. + +Using 63 separate loops, we could use the double-word shrp instruction. +That instruction has a plain single-cycle latency. We need 63 loops since +this instruction only accept immediate count. That would lead to a somewhat +silly code size, but the speed would be 0.75 c/l on Itanium 2 (by using shrp +each cycle plus shl/shr going down I1 for a further limb every second +cycle). + +================================================================ +mpn_copyi, mpn_copyd + +The current code runs at 0.5 c/l on Itanium 2. But that is just for L1 +cache hit. The 4-way unrolled loop takes just 2 cycles, and thus load-use +scheduling isn't great. It might be best to actually use modulo scheduled +loops, since that will allow us to do better load-use scheduling without too +much unrolling. + +Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium +2, according to tests/devel/try. Cache bank conflicts? + + + +REFERENCES + +Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3, +Intel document 245317-004, 245318-004, 245319-004 October 2002. Volume 1 +includes an Itanium optimization guide. + +Intel Itanium Processor-specific Application Binary Interface (ABI), Intel +document 245370-003, May 2001. Describes C type sizes, dynamic linking, +etc. + +Intel Itanium Architecture Assembly Language Reference Guide, Intel document +248801-004, 2000-2002. Describes assembly instruction syntax and other +directives. + +Itanium Software Conventions and Runtime Architecture Guide, Intel document +245358-003, May 2001. Describes calling conventions, including stack +unwinding requirements. + +Intel Itanium Processor Reference Manual for Software Optimization, Intel +document 245473-003, November 2001. + +Intel Itanium-2 Processor Reference Manual for Software Development and +Optimization, Intel document 251110-003, May 2004. + +All the above documents can be found online at + + http://developer.intel.com/design/itanium/manuals.htm + + +---------------- +Local variables: +mode: text +fill-column: 76 +End: