SSE and AVX Mutation Idea (xlate)

All Streaming SIMD Extensions (SSE) instructions using the legacy encoding can be translated to the Advanced Vector Extensions (AVX) encoding. This is already something most compilers offer these days when the correct compilation flag is used (e.g., /arch:AVX with MSVC) to compile all legacy SSE instructions into their AVX counterparts.

Compilers mostly do this for optimisation reasons; however, in our case, that is not necessarily what we are looking for. What is more interesting is being able to modify the encoding of instructions without modifying their operands or the result of their execution. For this very reason, a module for a mutation engine could be developed to translate all legacy SSE instructions to the AVX format, or from the AVX format back to legacy SSE, in order to modify the signature of a piece of code.

Therefore, the objective of this paper is to briefly discuss what SSE and AVX are and how it is possible to switch from one format to the other without too many difficulties.

Streaming SIMD Extensions (SSE)

Before SSE there were the MMX facilities. MMX was the first to implement the concept of Single Instruction, Multiple Data (SIMD): instructions that perform arithmetic and logical operations on multiple data elements (i.e., bytes, words, or dwords) at once – hence the name SIMD. The goal is to reduce the number of memory access operations, which are expensive. This is quite useful for modern media, communication and graphics applications.

Example of a SIMD (packed) operation: addps xmm1, xmm2

However, the MMX data registers are aliases for the low 64-bit part of the x87 FP data registers (i.e., ST(0)–ST(7)), which brings limitations and edge cases when a routine uses both instruction sets (e.g., data loss and performance issues). Additionally, the MMX facilities do not support operations on floating-point values.

This is what the first version of the SSE facilities tries to address via new SIMD and non-SIMD instructions. With it also come completely new 128-bit data registers (i.e., XMM0 – XMM7) which can be used to operate on scalar (a single integer/FP value, for non-SIMD instructions) or packed (for SIMD instructions) operands, such as (a short example follows the list below):

  • 16 packed byte integers
  • 8 packed word integers
  • 4 packed dword integers
  • 2 packed qword integers
  • 4 single-precision (SP) floating-point values
  • 2 double-precision (DP) floating-point values
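
The difference between scalar and packed operands can be illustrated with a minimal (assumed) MASM snippet; the register choice is arbitrary:

    addss xmm0, xmm1    ; scalar: adds only the lowest single-precision element
    addps xmm0, xmm1    ; packed (SIMD): adds all four single-precision elements in parallel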

Other features have been implemented such as:

  • Enhancement of specific types of memory write operations on cacheable memory via non-temporal store (streaming store) instructions, for example: MOVNTPS, MOVNTQ, or MASKMOVQ;
  • Video and media-specific instructions, for example: PAVGB, LDDQU;
  • Thread synchronisation instructions, for example: PAUSE, MONITOR, MWAIT; and
  • Legacy prefix branch hints to help with misprediction when dealing with conditional branches.

The feature flags returned by the CPUID instruction can be used to identify whether a specific SSE instruction set is available.

| Instruction Set | Introduced With | Year | Number of Instructions | CPUID Flag | Register |
| SSE | Pentium III | 1999 | 70 | 002000000h | EDX |
| SSE2 | Pentium 4 (130 nm) | 2002 | 114 | 004000000h | EDX |
| SSE3 | Pentium 4 (90 nm) | 2004 | 13 | 000000001h | ECX |
| SSSE3 | Core 2 Duo | 2006 | 32 | 000000200h | ECX |
| SSE4.1 | Core 2 Duo (Penryn, 45 nm) | 2008 | 47 | 000080000h | ECX |
| SSE4.2 | Core i7 (Nehalem) | 2008 | 7 | 000100000h | ECX |

Simple MASM code that can be used in conjunction with the above table; this example checks for SSE support:

SSESupport PROC
    ; CPUID leaf 1 returns the feature flags in ECX and EDX
    xor eax, eax
    inc eax
    cpuid

    ; Isolate EDX bit 25 (SSE) and move it down to bit 0
    and edx, 002000000h
    shr edx, 19h

    ; Return 1 in EAX if SSE is supported, 0 otherwise
    xchg eax, edx
    ret
SSESupport ENDP

Advanced Vector Extensions (AVX)

Introduced with the Sandy Bridge microarchitecture (Q1 2011), the AVX facilities continue to enhance the SIMD functionality and offer new features such as:

  • New 256-bit data registers (i.e., YMM0 – YMM7) whose lower 128 bits are aliases of the XMM data registers;
  • A new encoding using a 2- or 3-byte VEX prefix, for both legacy SSE instructions and the new AVX instructions;
  • Non-destructive operand operations to reduce the number of copy and load operations; and
  • Up to three source operands (for four-operand instructions), the fourth operand being selected by the upper 4 bits of an 8-bit immediate data value (i.e., VEX[vvvv] + ModRM[rm] + ModRM[reg] + imm8[7:4]).

Note that this paper will not go into AVX2 in much depth and will not cover the AVX-512 and Fused Multiply-Add (FMA) extensions, which bring even more functionality. Also, with AVX-512 another encoding is possible via the EVEX prefix (an extension of the VEX concept).

Before being able to use AVX it is important to check that both the processor and the operating system support the AVX instruction set and the 256- and 128-bit data registers. The following MASM code can be used to perform this check from either User Mode (UM) or Kernel Mode (KM):

AVXSupport PROC
    ; Get features flags 
    xor eax, eax 
    inc eax 
    cpuid 

    mov eax, ecx 
    mov ebx, ecx 

    ; Check for OSXSAVE and AVX support 
    and eax, 008000000h
    shr eax, 1Bh
    and ebx, 010000000H
    shr ebx, 1Ch

 
    ; Get return value 
    and al, bl
    jz return

get_xcr0: 
    xor ecx, ecx 
    XGETBV 
    mov ebx, eax 

    ; Check for XMM and YMM registers 
    and eax, 04h
    shr eax, 02h
    and ebx, 02h
    shr ebx, 01h

    ; Get return value
    and al, bl 
    return:
    ret 
AVXSupport ENDP

AVX and the new VEX

As mentioned in the previous section, AVX offers a new way to encode instructions (including legacy SSE) with a compact 2- or 3-byte Vector Extension (VEX) prefix, starting with the C5h and C4h bytes respectively. The full layout can be found in the Intel documentation; in short, the 2-byte form is C5h followed by one byte laid out as R (bit 7), vvvv (bits 6:3), L (bit 2) and pp (bits 1:0), while the 3-byte form is C4h followed by a byte laid out as R (bit 7), X (bit 6), B (bit 5) and m-mmmm (bits 4:0), then a byte laid out as W (bit 7), vvvv (bits 6:3), L (bit 2) and pp (bits 1:0). A worked byte-level decode is given after the field list below.

The fields are as follows:

  • R: Like REX[R] in 1’s complement (inverted) form:
    • 1b: Same as REX[R] = 0b (must be set to this in 32-bit mode, otherwise the byte decodes as LES/LDS)
    • 0b: Same as REX[R] = 1b (64-bit mode only)
  • X: Like REX[X] in 1’s complement (inverted) form:
    • 1b: Same as REX[X] = 0b (must be set to this in 32-bit mode, otherwise the byte decodes as LES/LDS)
    • 0b: Same as REX[X] = 1b (64-bit mode only)
  • B: Like REX[B] in 1’s complement (inverted) form:
    • 1b: Same as REX[B] = 0b (ignored in 32-bit mode)
    • 0b: Same as REX[B] = 1b (64-bit mode only)
  • W: This can either be used like REX[W] or as an additional escape extension. This will be opcode specific.
  • m-mmmm: Used to specify opcode escape sequence (will always be 0Fh when using 2-byte VEX prefix):
    • 00000b: Reserved and will #UD
    • 00001b: Implied 0Fh escape opcode (Table 2)
    • 00010b: Implied 0F38h escape opcodes (Table 3)
    • 00011b: Implied 0F3Ah escape opcodes (Table 4)
    • 00100b–11111b: Reserved and will #UD
  • vvvv: Used in conjunction with a ModRM byte to specify an additional register as source or destination. This is encoded in 1’s complement form (inverted) or 1111b if unused:
    • 1111b: XMM0/YMM0
    • 1110b: XMM1/YMM1
    • 1101b: XMM2/YMM2
    • 1100b: XMM3/YMM3
    • 1011b: XMM4/YMM4
    • 1010b: XMM5/YMM5
    • 1001b: XMM6/YMM6
    • 1000b: XMM7/YMM7
    • 0111b: XMM8/YMM8
    • 0110b: XMM9/YMM9
    • 0101b: XMM10/YMM10
    • 0100b: XMM11/YMM11
    • 0011b: XMM12/YMM12
    • 0010b: XMM13/YMM13
    • 0001b: XMM14/YMM14
    • 0000b: XMM15/YMM15
  • L: Vector length bit used to promote operands to 256-bit:
    • 0b: scalar or 128-bit vector operand
    • 1b: 256-bit vector operand
  • pp: Specify a SIMD prefix used as an additional escape opcode:
    • 00b: None
    • 01b: 66h
    • 10b: F3h
    • 11b: F2h
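
To illustrate how these fields fit together, here is a worked decode of an assumed example instruction, vpshufb xmm0, xmm1, xmm2 (VEX.NDS.128.66.0F38.WIG 00 /r; the register choice is arbitrary):

    db 0C4h    ; 3-byte VEX escape byte
    db 0E2h    ; R=1b, X=1b, B=1b (all REX bits clear), m-mmmm=00010b -> implied 0F38h escape
    db 071h    ; W=0b, vvvv=1110b -> XMM1 (first source), L=0b (128-bit), pp=01b -> 66h SIMD prefix
    db 000h    ; primary opcode (PSHUFB)
    db 0C2h    ; ModRM: mod=11b, reg=000b -> XMM0 (destination), rm=010b -> XMM2 (second source)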

Additionally, for a very small subset of AVX2 instructions, a new Vector SIB (VSIB) byte can be used for memory addressing. This is a special case that will not be discussed here. The list of VEX-encoded AVX instructions that use a VSIB byte can be found later in this paper.

Finally, like all the other prefixes (e.g., Operand Size Override or REX), the VEX prefix must be positioned before the primary opcode. Additionally, as described above, VEX has bit fields equivalent to the REX prefix, to the escape opcodes (i.e., 0Fh, 0F38h and 0F3Ah) and to the SIMD prefixes (i.e., 66h, F2h and F3h). Therefore, if any of these legacy prefixes is combined with a VEX prefix, an Undefined Instruction (i.e., #UD) exception will be raised.

Translation Between Legacy SSE and AVX

As mentioned above, all legacy SSE instructions can be converted into the AVX format. However, this does not mean that all AVX instructions can be encoded in the legacy SSE format. Over the years SSE has effectively been frozen, and newer instructions can only be encoded using VEX (or EVEX if we consider AVX-512).

Additionally, there are three things to be careful about when translating instructions:

  • First, some instructions, when encoded via VEX, use an additional non-destructive (ND) operand to limit the number of read/write accesses to/from registers and memory addresses. It means that the ND operand, which is encoded in VEX[vvvv], needs to be interpreted. Note that this ND operand can be either a source or a destination; however, only a small subset of AVX instructions (13, listed in Appendix D) use VEX[vvvv] as a destination operand (1st operand), and only the AVX2 instructions using a VSIB byte plus a few others (13, listed in Appendix E) use VEX[vvvv] as a second source operand (3rd operand).
  • Second, some AVX instructions do not exist in a legacy encoding format (many AVX2 instructions and all of AVX-512). Therefore, attempting to encode them in a legacy format will produce nonsense at best, or a #UD exception.
  • Thirdly, any AVX instruction operating on 256-bit operands (i.e., VEX[L] = 1b) cannot be encoded in the legacy SSE format because of the size limitation of the XMM data registers (i.e., 128 bits).

Later in this paper, the lists of instructions using an additional non-destructive operand and of AVX-only instructions are provided, together with the list of AVX2 instructions using a VSIB byte for vector memory addressing.

Example 1: Basic 2-byte VEX Encoded Instruction

| Mnemonic | Operands | Encoding |
| MOVD | r/m32, xmm | 66 0F 7E /r |
| VMOVD | r/m32, xmm | VEX.128.66.0F.W0 7E /r |

A 2-byte VEX prefix can be used here because only the implied 0Fh escape opcode is needed, the general-purpose register does not need to be promoted to 64-bit (so VEX[W] is not required), no non-destructive additional register is required, and the instruction operates on a 128-bit XMM register. Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Escape opcode 0Fh: implied with a 2-byte VEX prefix
  • 64-bit register promotion W0: VEX[W] = 0b (not used in 2-byte VEX version)
  • 128-bit only operands: VEX[L] = 0b
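
A minimal byte-level sketch of this translation, for an assumed instance movd eax, xmm1 (ModRM C8h = mod 11b, reg 001b = XMM1, rm 000b = EAX):

    db 66h, 0Fh, 7Eh, 0C8h      ; legacy SSE:  movd eax, xmm1
    db 0C5h, 0F9h, 7Eh, 0C8h    ; 2-byte VEX:  vmovd eax, xmm1 (F9h = R=1b, vvvv=1111b, L=0b, pp=01b)

Note that only the prefix and escape bytes change; the primary opcode (7Eh) and the ModRM byte are identical in both encodings.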

Example 2: Basic 3-byte VEX Encoded Instruction with 64-bit

| Mnemonic | Operands | Encoding |
| MOVQ | r64/m64, xmm1 | 66 REX.W 0F 7E /r |
| VMOVQ | r64/m64, xmm1 | VEX.128.66.0F.W1 7E /r |

A 3-byte VEX prefix must be used because the general-purpose register DOES need to be promoted to 64-bit, and only the 3-byte form carries the VEX[W] bit. This is the case even though no non-destructive additional register is required and the instruction still operates on a 128-bit XMM register. Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Escape opcode 0Fh: VEX[m-mmmm] = 00001b
  • 64-bit register promotion W1: VEX[W] = 1b
  • 128-bit only operands: VEX[L] = 0b
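
A minimal byte-level sketch for an assumed instance movq rax, xmm1 (REX.W prefix = 48h, ModRM C8h as in the previous example):

    db 66h, 48h, 0Fh, 7Eh, 0C8h          ; legacy SSE:  movq rax, xmm1
    db 0C4h, 0E1h, 0F9h, 7Eh, 0C8h       ; 3-byte VEX:  vmovq rax, xmm1 (E1h = R/X/B=1b, m-mmmm=00001b; F9h = W=1b, vvvv=1111b, L=0b, pp=01b)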

Example 3: 3-byte VEX Encoded Instruction with SIB

| Mnemonic | Operands | Encoding |
| MOVQ | xmm2/m64, xmm1 | 66 0F D6 /r |
| VMOVQ | xmm2/m64, xmm1 | VEX.128.66.0F.WIG D6 /r |

Similar to example two, but with a SIB byte and a displacement; the 3-byte form is used here because VEX[B] must be cleared to select an extended base register (e.g., R10). This is also a special case where the AVX instruction does not care about the VEX[W] bit field (i.e., WIG). Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Escape opcode 0Fh: VEX[m-mmmm] = 00001b
  • 64-bit base register promotion: VEX[B] = 0b
  • 128-bit only operands: VEX[L] = 0b
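
A minimal byte-level sketch for an assumed instance storing XMM1 to [r10+rcx*2+1] (ModRM 4Ch, SIB 4Ah, disp8 01h; the VEX form is the same encoding shown again in the WIG section below):

    db 66h, 41h, 0Fh, 0D6h, 4Ch, 4Ah, 01h        ; legacy SSE:  movq mmword ptr [r10+rcx*2+1], xmm1 (REX.B = 41h)
    db 0C4h, 0C1h, 79h, 0D6h, 4Ch, 4Ah, 01h      ; 3-byte VEX:  vmovq mmword ptr [r10+rcx*2+1], xmm1 (C1h: B=0b selects R10 as base)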

Example 4: With Non-Destructive Operand

| Mnemonic | Operands | Encoding |
| PADDQ | xmm1, xmm2/m128 | 66 0F D4 /r |
| VPADDQ | xmm1, xmm2, xmm3/m128 | VEX.NDS.128.66.0F.WIG D4 /r |

In this example the instruction uses an additional non-destructive operand, which is encoded in the VEX[vvvv] bit field. The complete list of such instructions can be found in the tables below. It is important to note that, when translating from the destructive two-operand legacy form, the destination register needs to be encoded twice (once in ModRM[reg] and once in VEX[vvvv]) so that the read/write behaviour of the original instruction is preserved. Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Source operand XMM4: VEX[vvvv] = 1011b
  • Escape opcode 0Fh: implied with a 2-byte VEX prefix
  • 128-bit only operands: VEX[L] = 0b
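
A minimal byte-level sketch for an assumed instance PADDQ xmm4, xmm2 translated to VPADDQ xmm4, xmm4, xmm2, so that XMM4 lands in VEX[vvvv] = 1011b as listed above (ModRM E2h = mod 11b, reg 100b = XMM4, rm 010b = XMM2):

    db 66h, 0Fh, 0D4h, 0E2h      ; legacy SSE:  paddq xmm4, xmm2
    db 0C5h, 0D9h, 0D4h, 0E2h    ; 2-byte VEX:  vpaddq xmm4, xmm4, xmm2 (D9h = R=1b, vvvv=1011b, L=0b, pp=01b)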

WIG and Synonymous Mutation

Some AVX instructions do not care about the state of the VEX[W] bit – it is simply ignored (noted WIG in the encodings above). This means that, when encoding such an instruction, the bit can be set to either 0 or 1, thereby generating yet another valid encoding. The modification is small, but it is enough to change a byte and thus break a signature in some cases.

Let’s take, for example, the following two encodings of the same instruction:

C4 C1 79 D6 4C 4A 01		vmovq mmword ptr [r10+rcx*2+1],xmm1
C4 C1 F9 D6 4C 4A 01 		vmovq mmword ptr [r10+rcx*2+1],xmm1

In the first version VEX[W] = 0b while in the second VEX[W] = 1b. Both are valid ways to encode this instruction; both will execute identically and neither will #UD. Obviously, this only works when the instruction is encoded using the 3-byte VEX prefix, since the 2-byte form has no W bit.

Final Notes

First, looking at all the examples, it is possible to see that when translating from SSE to AVX or from AVX to SSE encoding, the ModRM byte (and potential SIB byte), the memory displacement and the immediate data value are not affected at all. Only the legacy prefixes used as SIMD prefixes, the REX prefix, the escape opcodes and the primary opcode are modified, which makes the whole process easier.

It should be noted that mixing legacy SSE code and AVX code can badly impact CPU performance. AVX instructions modify the upper bits of the YMM data registers, while legacy SSE instructions cannot. As a result, the upper bits can be in a clean, modified-and-unsaved (also known as dirty), or preserved/non-INIT state, and when executing an SSE instruction after an AVX instruction (and vice versa) the processor may need to save the state of the registers (roughly equivalent to an XSAVE operation).

Therefore, if the different code blocks can be identified, the VZEROUPPER instruction should be executed before and after a block of AVX instructions to zero the upper bits of the YMM registers and put them back into a clean state, as in the sketch below.
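
A minimal (assumed) layout isolating a VEX-encoded block from the surrounding legacy SSE code:

    vzeroupper                    ; put the upper YMM bits into a clean state
    vmulps ymm0, ymm1, ymm2       ; 256-bit AVX block that dirties the upper YMM bits
    vaddps ymm0, ymm0, ymm3
    vzeroupper                    ; clean the upper bits again before returning to legacy SSE code
    mulps  xmm4, xmm5             ; legacy SSE code continues without a transition penalty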

To assist in the understanding of this paper, the Intel 64 and IA-32 Architectures Software Developer's Manual can be consulted.

Appendix A : Non-Destructive Operands Instructions

The table below lists all the instructions that can be directly translated between the legacy SSE encoding and the AVX VEX encoding.

Type Mnemonic Operands Legacy Encoding VEX Encoding Description
SSE COMISS xmm1, xmm2/m32 0F 2F /r VEX.128.0F.WIG 2F /r Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.
SSE CVTSS2SI r32, xmm1/m32 F3 0F 2D /r VEX.128.F3.0F.W0 2D /r Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32.
SSE CVTSS2SI r64, xmm1/m32 F3 REX.W 0F 2D /r VEX.128.F3.0F.W1 2D /r Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64.
SSE CVTTSS2SI r32, xmm1/m32 F3 0F 2C /r VEX.128.F3.0F.W0 2C /r Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32 using truncation.
SSE CVTTSS2SI r64, xmm1/m32 F3 REX.W 0F 2C /r VEX.128.F3.0F.W1 2C /r Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64 using truncation.
SSE LDMXCSR m32 0F AE /2 VEX.LZ.0F.WIG AE /2 Load MXCSR register from m32.
SSE MOVAPS xmm1, xmm2/m128 0F 28 /r VEX.128.0F.WIG 28 /r Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.
SSE MOVAPS xmm2/m128, xmm1 0F 29 /r VEX.128.0F.WIG 29 /r Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem.
SSE MOVHPS m64, xmm1 0F 17 /r VEX.128.0F.WIG 17 /r Move two packed single-precision floating-point values from high quadword of xmm to m64.
SSE MOVLPS m64, xmm1 0F 13 /r VEX.128.0F.WIG 13 /r Move two packed single-precision floating-point values from low quadword of xmm1 to m64.
SSE MOVMSKPS reg, xmm 0F 50 /r VEX.128.0F.WIG 50 /r Extract 4-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.
SSE MOVUPS xmm1, xmm2/m128 0F 10 /r VEX.128.0F.WIG 10 /r Move unaligned packed single-precision floating-point from xmm2/mem to xmm1.
SSE MOVUPS xmm2/m128, xmm1 0F 11 /r VEX.128.0F.WIG 11 /r Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem.
SSE RCPPS xmm1, xmm2/m128 0F 53 /r VEX.128.0F.WIG 53 /r Computes the approximate reciprocals of packed single-precision values in xmm2/mem and stores the results in xmm1.
SSE RSQRTPS xmm1, xmm2/m128 0F 52 /r VEX.128.0F.WIG 52 /r Computes the approximate reciprocals of the square roots of packed single-precision values in xmm2/mem and stores the results in xmm1.
SSE SQRTPS xmm1, xmm2/m128 0F 51 /r VEX.128.0F.WIG 51 /r Computes Square Roots of the packed single-precision floating-point values in xmm2/m128 and stores the result in xmm1.
SSE STMXCSR m32 0F AE /3 VEX.LZ.0F.WIG AE /3 Store contents of MXCSR register to m32.
SSE UCOMISS xmm1, xmm2/m32 0F 2E /r VEX.128.0F.WIG 2E /r Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.
SSE PEXTRW reg, xmm, imm8 66 0F C5 /r ib VEX.128.66.0F.W0 C5 /r ib Extract the word specified by imm8 from xmm and move it to reg, bits 15:0. Zero-extend the result. The upper bits of r64/r32 is filled with zeros.
SSE PEXTRW reg/m16, xmm, imm8 66 0F 3A 15 /r ib VEX.128.66.0F3A.W0 15 /r ib Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper bits of r64/r32 is filled with zeros.
SSE PMOVMSKB reg, xmm 66 0F D7 /r VEX.128.66.0F.WIG D7 /r Move a byte mask of xmm to reg. The upper bits of r32 or r64 are zeroed
SSE MOVNTPS m128, xmm1 0F 2B /r VEX.128.0F.WIG 2B /r Move packed single-precision values xmm1 to mem using non-temporal hint.
SSE2 COMISD xmm1, xmm2/m64 66 0F 2F /r VEX.128.66.0F.WIG 2F /r Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.
SSE2 CVTSD2SI r32, xmm1/m64 F2 0F 2D /r VEX.128.F2.0F.W0 2D /r Convert one double-precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
SSE2 CVTSD2SI r64, xmm1/m64 F2 REX.W 0F 2D /r VEX.128.F2.0F.W1 2D /r Convert one double-precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.
SSE2 CVTTSD2SI r32, xmm1/m64 F2 0F 2C /r VEX.128.F2.0F.W0 2C /r Convert one double-precision floating-point value from xmm1/m64 to one signed doubleword integer in r32 using truncation.
SSE2 CVTTSD2SI r64, xmm1/m64 F2 REX.W 0F 2C /r VEX.128.F2.0F.W1 2C /r Convert one double-precision floating-point value from xmm1/m64 to one signed quadword integer in r64 using truncation.
SSE2 CVTPD2PS xmm1, xmm2/m128 66 0F 5A /r VEX.128.66.0F.WIG 5A /r Convert two packed double-precision floating-point values in xmm2/mem to two single-precision floating-point values in xmm1.
SSE2 CVTPS2PD xmm1, xmm2/m64 0F 5A /r VEX.128.0F.WIG 5A /r Convert two packed single-precision floating-point values in xmm2/m64 to two packed double-precision floating-point values in xmm1.
SSE2 CVTPD2DQ xmm1, xmm2/m128 F2 0F E6 /r VEX.128.F2.0F.WIG E6 /r Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1.
SSE2 CVTTPD2DQ xmm1, xmm2/m128 66 0F E6 /r VEX.128.66.0F.WIG E6 /r Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1 using truncation.
SSE2 CVTDQ2PD xmm1, xmm2/m64 F3 0F E6 /r VEX.128.F3.0F.WIG E6 /r Convert two packed signed doubleword integers from xmm2/mem to two packed double-precision floating-point values in xmm1.
SSE2 CVTPS2DQ xmm1, xmm2/m128 66 0F 5B /r VEX.128.66.0F.WIG 5B /r Convert four packed single-precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1.
SSE2 CVTTPS2DQ xmm1, xmm2/m128 F3 0F 5B /r VEX.128.F3.0F.WIG 5B /r Convert four packed single-precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1 using truncation.
SSE2 CVTDQ2PS xmm1, xmm2/m128 0F 5B /r VEX.128.0F.WIG 5B /r Convert four packed signed doubleword integers from xmm2/mem to four packed single-precision floating-point values in xmm1.
SSE2 MOVAPD xmm1, xmm2/m128 66 0F 28 /r VEX.128.66.0F.WIG 28 /r Move aligned packed double-precision floating-point values from xmm2/mem to xmm
SSE2 MOVAPD xmm2/m128, xmm1 66 0F 29 /r VEX.128.66.0F.WIG 29 /r Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem.
SSE2 MOVHPD m64, xmm1 66 0F 17 /r VEX.128.66.0F.WIG 17 /r Move double-precision floating-point value from high quadword of xmm1 to m64.
SSE2 MOVLPD m64, xmm1 66 0F 13/r VEX.128.66.0F.WIG 13/r Move double-precision floating-point value from low quadword of xmm1 to m64.
SSE2 MOVMSKPD reg, xmm 66 0F 50 /r VEX.128.66.0F.WIG 50 /r Extract 2-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.
SSE2 MOVUPD xmm1, xmm2/m128 66 0F 10 /r VEX.128.66.0F.WIG 10 /r Move unaligned packed double-precision floating-point from xmm2/mem to xmm
SSE2 MOVUPD xmm2/m128, xmm1 66 0F 11 /r VEX.128.66.0F.WIG 11 /r Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem.
SSE2 SQRTPD xmm1, xmm2/m128 66 0F 51 /r VEX.128.66.0F.WIG 51 /r Computes Square Roots of the packed double-precision floating-point values in xmm2/m128 and stores the result in xmm1.
SSE2 UCOMISD xmm1, xmm2/m64 66 0F 2E /r VEX.128.66.0F.WIG 2E /r Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.
SSE2 MOVD xmm, r/m32 66 0F 6E /r VEX.128.66.0F.W0 6E /r Move doubleword from r/m32 to xmm.
SSE2 MOVD r/m32, xmm 66 0F 7E /r VEX.128.66.0F.W0 7E /r Move doubleword from xmm to r/m32.
SSE2 MOVQ xmm, r/m64 66 REX.W 0F 6E /r VEX.128.66.0F.W1 6E /r Move quadword from r/m64 to xmm.
SSE2 MOVQ r/m64, xmm 66 REX.W 0F 7E /r VEX.128.66.0F.W1 7E /r Move quadword from xmm register to r/m64.
SSE2 MOVDQA xmm1, xmm2/m128 66 0F 6F /r VEX.128.66.0F.WIG 6F /r Move aligned packed integer values from xmm2/mem to xmm1.
SSE2 MOVDQA xmm2/m128, xmm1 66 0F 7F /r VEX.128.66.0F.WIG 7F /r Move aligned packed integer values from xmm1 to xmm2/mem.
SSE2 MOVDQU xmm1, xmm2/m128 F3 0F 6F /r VEX.128.F3.0F.WIG 6F /r Move unaligned packed integer values from xmm2/m128 to xmm1.
SSE2 MOVDQU xmm2/m128, xmm1 F3 0F 7F /r VEX.128.F3.0F.WIG 7F /r Move unaligned packed integer values from xmm1 to xmm2/m128.
SSE2 MOVQ xmm1, xmm2/m64 F3 0F 7E /r VEX.128.F3.0F.WIG 7E /r Move quadword from xmm2/mem64 to xmm1.
SSE2 MOVQ xmm2/m64, xmm1 66 0F D6 /r VEX.128.66.0F.WIG D6 /r Move quadword from xmm1 to xmm2/mem64.
SSE2 PEXTRW reg, xmm, imm8 66 0F C5 /r ib VEX.128.66.0F.W0 C5 /r ib Extract the word specified by imm8 from xmm and move it to reg, bits 15-0. The upper bits of r32 or r64 is zeroed.
SSE2 PEXTRW reg/m16, xmm, imm8 66 0F 3A 15 /r ib VEX.128.66.0F3A.W0 15 /r ib Extract the word specified by imm8 from xmm and copy it to lowest 16 bits of reg or m16. Zero-extend the result in the destination, r32 or r64.
SSE2 PMOVMSKB reg, xmm 66 0F D7 /r VEX.128.66.0F.WIG D7 /r Move a byte mask of xmm to reg. The upper bits of r32 or r64 are zeroed.
SSE2 PSHUFLW xmm1, xmm2/m128, imm8 F2 0F 70 /r ib VEX.128.F2.0F.WIG 70 /r ib Shuffle the low words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
SSE2 PSHUFHW xmm1, xmm2/m128, imm8 F3 0F 70 /r ib VEX.128.F3.0F.WIG 70 /r ib Shuffle the high words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
SSE2 PSHUFD xmm1, xmm2/m128, imm8 66 0F 70 /r ib VEX.128.66.0F.WIG 70 /r ib Shuffle the doublewords in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
SSE2 MASKMOVDQU xmm1, xmm2 66 0F F7 /r VEX.128.66.0F.WIG F7 /r Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI.
SSE2 MOVNTPD m128, xmm1 66 0F 2B /r VEX.128.66.0F.WIG 2B /r Move packed double-precision values in xmm1 to m128 using non-temporal hint.
SSE2 MOVNTDQ m128, xmm1 66 0F E7 /r VEX.128.66.0F.WIG E7 /r Move packed integer values in xmm1 to m128 using nontemporal hint.
SSE3 LDDQU xmm1, m128 F2 0F F0 /r VEX.128.F2.0F.WIG F0 /r Load unaligned data from mem and return double quadword in xmm1.
SSE3 MOVDDUP xmm1, xmm2/m64 F2 0F 12 /r VEX.128.F2.0F.WIG 12 /r Move double-precision floating-point value from xmm2/m64 and duplicate into xmm1.
SSE3 MOVSHDUP xmm1, xmm2/m128 F3 0F 16 /r VEX.128.F3.0F.WIG 16 /r Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
SSE3 MOVSLDUP xmm1, xmm2/m128 F3 0F 12 /r VEX.128.F3.0F.WIG 12 /r Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
SSSE3 PABSB xmm1, xmm2/m128 66 0F 38 1C /r VEX.128.66.0F38.WIG 1C /r Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.
SSSE3 PABSD xmm1, xmm2/m128 66 0F 38 1E /r VEX.128.66.0F38.WIG 1E /r Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
SSSE3 PABSW xmm1, xmm2/m128 66 0F 38 1D /r VEX.128.66.0F38.WIG 1D /r Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
AESNI AESIMC xmm1, xmm2/m128 66 0F 38 DB /r VEX.128.66.0F38.WIG DB /r Perform the InvMixColumn transformation on a 128-bit round key from xmm2/m128 and store the result in xmm1.
AESNI AESKEYGENASSIST xmm1, xmm2/m128, imm8 66 0F 3A DF /r ib VEX.128.66.0F3A.WIG DF /r ib Assist in AES round key generation using an 8 bits Round Constant (RCON) specified in the immediate byte, operating on 128 bits of data specified in xmm2/m128 and stores the result in xmm1.
SSE4.1 EXTRACTPS reg/m32, xmm1, imm8 66 0F 3A 17 /r ib VEX.128.66.0F3A.WIG 17 /r ib Extract one single-precision floating-point value from xmm1 at the offset specified by imm8 and store the result in reg or m32. Zero extend the results in 64-bit register if applicable.
SSE4.1 MOVNTDQA xmm1, m128 66 0F 38 2A /r VEX.128.66.0F38.WIG 2A /r Move double quadword from m128 to xmm1 using nontemporal hint if WC memory type.
SSE4.1 PEXTRB r/m8, xmm2, imm8 66 0F 3A 14 /r ib VEX.128.66.0F3A.W0 14 /r ib Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper bits of r32 or r64 are zeroed.
SSE4.1 PEXTRD r/m32, xmm2, imm8 66 0F 3A 16 /r ib VEX.128.66.0F3A.W0 16 /r ib Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r/m32.
SSE4.1 PEXTRQ r/m64, xmm2, imm8 66 REX.W 0F 3A 16 /r ib VEX.128.66.0F3A.W1 16 /r ib Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r/m64.
SSE4.1 PEXTRW reg, xmm, imm8 66 0F C5 /r ib VEX.128.66.0F.W0 C5 /r ib Extract the word specified by imm8 from xmm and move it to reg, bits 15-0. The upper bits of r32 or r64 is zeroed.
SSE4.1 PEXTRW reg/m16, xmm, imm8 66 0F 3A 15 /r ib VEX.128.66.0F3A.W0 15 /r ib Extract the word specified by imm8 from xmm and copy it to lowest 16 bits of reg or m16. Zero-extend the result in the destination, r32 or r64.
SSE4.1 PHMINPOSUW xmm1, xmm2/m128 66 0F 38 41 /r VEX.128.66.0F38.WIG 41 /r Find the minimum unsigned word in xmm2/m128 and place its value in the low word of xmm1 and its index in the second-lowest word of xmm1.
SSE4.1 PMOVSXBD xmm1, xmm2/m32 66 0F 38 21 /r VEX.128.66.0F38.WIG 21 /r Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVSXBQ xmm1, xmm2/m16 66 0F 38 22 /r VEX.128.66.0F38.WIG 22 /r Sign extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVSXBW xmm1, xmm2/m64 66 0F 38 20 /r VEX.128.66.0F38.WIG 20 /r Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
SSE4.1 PMOVSXWD xmm1, xmm2/m64 66 0F 38 23 /r VEX.128.66.0F38.WIG 23 /r Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVSXWQ xmm1, xmm2/m32 66 0F 38 24 /r VEX.128.66.0F38.WIG 24 /r Sign extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVSXDQ xmm1, xmm2/m64 66 0F 38 25 /r VEX.128.66.0F38.WIG 25 /r Sign extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVZXBD xmm1, xmm2/m32 66 0F 38 31 /r VEX.128.66.0F38.WIG 31 /r Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVZXBQ xmm1, xmm2/m16 66 0F 38 32 /r VEX.128.66.0F38.WIG 32 /r Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVZXBW xmm1, xmm2/m64 66 0F 38 30 /r VEX.128.66.0F38.WIG 30 /r Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
SSE4.1 PMOVZXWD xmm1, xmm2/m64 66 0F 38 33 /r VEX.128.66.0F38.WIG 33 /r Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVZXWQ xmm1, xmm2/m32 66 0F 38 34 /r VEX.128.66.0F38.WIG 34 /r Zero extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVZXDQ xmm1, xmm2/m64 66 0F 38 35 /r VEX.128.66.0F38.WIG 35 /r Zero extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
SSE4.1 PTEST xmm1, xmm2/m128 66 0F 38 17 /r VEX.128.66.0F38.WIG 17 /r Set ZF if xmm2/m128 AND xmm1 result is all 0s. Set CF if xmm2/m128 AND NOT xmm1 result is all 0s.
SSE4.1 ROUNDPD xmm1, xmm2/m128, imm8 66 0F 3A 09 /r ib VEX.128.66.0F3A.WIG 09 /r ib Round packed double precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
SSE4.1 ROUNDPS xmm1, xmm2/m128, imm8 66 0F 3A 08 /r ib VEX.128.66.0F3A.WIG 08 /r ib Round packed single precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
SSE4.2 PCMPESTRI xmm1, xmm2/m128, imm8 66 0F 3A 61 /r ib VEX.128.66.0F3A 61 /r ib Perform a packed comparison of string data with explicit lengths, generating an index, and storing the result in ECX.
SSE4.2 PCMPESTRM xmm1, xmm2/m128, imm8 66 0F 3A 60 /r ib VEX.128.66.0F3A 60 /r ib Perform a packed comparison of string data with explicit lengths, generating a mask, and storing the result in XMM0.
SSE4.2 PCMPISTRI xmm1, xmm2/m128, imm8 66 0F 3A 63 /r ib VEX.128.66.0F3A.WIG 63 /r ib Perform a packed comparison of string data with implicit lengths, generating an index, and storing the result in ECX.
SSE4.2 PCMPISTRM xmm1, xmm2/m128, imm8 66 0F 3A 62 /r ib VEX.128.66.0F3A.WIG 62 /r ib Perform a packed comparison of string data with implicit lengths, generating a mask, and storing the result in XMM0.

Appendix B : AVX Only Instructions

The table below lists all the instructions (mostly AVX2) that can only be encoded with a VEX prefix.

Type Mnemonic Operands VEX Encoding Description
AVX vzeroupper VEX.128.0F.WIG 77 Zero upper 128 bits of all YMM registers.
AVX vzeroall VEX.256.0F.WIG 77 Zero all bits of all YMM registers.
AVX vcvtph2ps xmm1, xmm2/m64 VEX.128.66.0F38.W0 13 /r Convert four packed half precision (16-bit) floatingpoint values in xmm2/m64 to packed single-precision floating-point value in xmm1.
AVX vpermd ymm1, ymm2, ymm3/m256 VEX.NDS.256.66.0F38.W0 36 /r Permute doublewords in ymm3/m256 using indices in ymm2 and store the result in ymm1.
AVX vpsrlvd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 45 /r Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in 0s.
AVX vpsravd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 46 /r Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in sign bits.
AVX vpsllvd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 47 /r Shift doublewords in xmm2 left by amount specified in the corresponding element of xmm3/m128 while shifting in 0s.
AVX vgatherdps xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 92 /r Using dword indices specified in vm32x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
AVX vgatherqps xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 93 /r Using qword indices specified in vm64x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
AVX ANDN r32a, r32b, r/m32 VEX.NDS.LZ.0F38.W0 F2 /r Bitwise AND of inverted r32b with r/m32, store result in r32a.
AVX BZHI r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F5 /r Zero bits in r/m32 starting with the position in r32b, write result to r32a.
AVX BEXTR r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F7 /r Contiguous bitwise extract from r/m32 using r32b as control; store result in r32a.
AVX SHLX r32a, r/m32, r32b VEX.NDS.LZ.66.0F38.W0 F7 /r Shift r/m32 logically left with count specified in r32b.
AVX PEXT r32a, r32b, r/m32 VEX.NDS.LZ.F3.0F38.W0 F5 /r Parallel extract of bits from r32b using mask in r/m32, result is written to r32a
AVX SARX r32a, r/m32, r32b VEX.NDS.LZ.F3.0F38.W0 F7 /r Shift r/m32 arithmetically right with count specified in r32b.
AVX PDEP r32a, r32b, r/m32 VEX.NDS.LZ.F2.0F38.W0 F5 /r Parallel deposit of bits from r32b using mask in r/m32, result is written to r32a.
AVX MULX r32a, r32b, r/m32 VEX.NDD.LZ.F2.0F38.W0 F6 /r Unsigned multiply of r/m32 with EDX without affecting arithmetic flags.
AVX SHRX r32a, r/m32, r32b VEX.NDS.LZ.F2.0F38.W0 F7 /r Shift r/m32 logically right with count specified in r32b.
AVX vpermilps xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 0C /r Permute single-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.
AVX vpermilps xmm1, xmm2/m128, imm8 VEX.128.66.0F3A.W0 04 /r ib Permute single-precision floating-point values in xmm2/m128 using controls from imm8 and store result in xmm1.
AVX vpermilpd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 0D /r Permute double-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.
AVX vpermilpd xmm1, xmm2/m128, imm8 VEX.128.66.0F3A.W0 05 /r ib Permute double-precision floating-point values in xmm2/m128 using controls from imm8.
AVX vtestps xmm1, xmm2/m128 VEX.128.66.0F38.W0 0E /r Set ZF and CF depending on sign bit AND and ANDN of packed single-precision floating-point sources.
AVX vtestpd xmm1, xmm2/m128 VEX.128.66.0F38.W0 0F /r Set ZF and CF depending on sign bit AND and ANDN of packed double-precision floating-point sources.
AVX vbroadcastss xmm1, m32 VEX.128.66.0F38.W0 18 /r Broadcast single-precision floating-point element in mem to four locations in xmm1.
AVX vbroadcastsd ymm1, m64 VEX.256.66.0F38.W0 19 /r Broadcast double-precision floating-point element in mem to four locations in ymm1.
AVX vbroadcastf128 ymm1, m128 VEX.256.66.0F38.W0 1A /r Broadcast 128 bits of floating-point data in mem to low and high 128-bits in ymm1.
AVX vmaskmovps xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W0 2C /r Conditionally load packed single-precision values from m128 using mask in xmm2 and store in xmm1.
AVX vmaskmovpd xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W0 2D /r Conditionally load packed double-precision values from m128 using mask in xmm2 and store in xmm1.
AVX vmaskmovps m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W0 2E /r Conditionally store packed single-precision values from xmm2 using mask in xmm1.
AVX vmaskmovpd m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W0 2F /r Conditionally store packed double-precision values from xmm2 using mask in xmm1.
AVX vpbroadcastd xmm1, xmm2/m32 VEX.128.66.0F38.W0 58 /r Broadcast a dword integer in the source operand to four locations in xmm1.
AVX vpbroadcastq xmm1, xmm2/m64 VEX.128.66.0F38.W0 59 /r Broadcast a qword element in source operand to two locations in xmm1.
AVX vbroadcasti128 ymm1, m128 VEX.256.66.0F38.W0 5A /r Broadcast 128 bits of integer data in mem to low and high 128-bits in ymm1.
AVX vpbroadcastb xmm1, xmm2/m8 VEX.128.66.0F38.W0 78 /r Broadcast a byte integer in the source operand to sixteen locations in xmm1.
AVX vpbroadcastw xmm1, xmm2/m16 VEX.128.66.0F38.W0 79 /r Broadcast a word integer in the source operand to eight locations in xmm1.
AVX vpmaskmovd xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W0 8C /r Conditionally load dword values from m128 using mask in xmm2 and store in xmm1.
AVX vpmaskmovd m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W0 8E /r Conditionally store dword values from xmm2 using mask in xmm1.
AVX vpmaskmovq xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W1 8C /r Conditionally load qword values from m128 using mask in xmm2 and store in xmm1.
AVX vpmaskmovq m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W1 8E /r Conditionally store qword values from xmm2 using mask in xmm1.
AVX vpermq ymm1, ymm2/m256, imm8 VEX.256.66.0F3A.W1 00 /r ib Permute qwords in ymm2/m256 using indices in imm8 and store the result in ymm1.
AVX vpermpd ymm1, ymm2/m256, imm8 VEX.256.66.0F3A.W1 01 /r ib Permute double-precision floating-point elements in ymm2/m256 using indices in imm8 and store the result in ymm1.
AVX vpblendd xmm1, xmm2, xmm3/m128, imm8 VEX.NDS.128.66.0F3A.W0 02 /r ib Select dwords from xmm2 and xmm3/m128 from mask specified in imm8 and store the values into xmm1.
AVX vperm2f128 ymm1, ymm2, ymm3/m256, imm8 VEX.NDS.256.66.0F3A.W0 06 /r ib Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1.
AVX vperm2i128 ymm1, ymm2, ymm3/m256, imm8 VEX.NDS.256.66.0F3A.W0 46 /r ib Permute 128-bit integer data in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1.
AVX RORX r32, r/m32, imm8 VEX.LZ.F2.0F3A.W0 F0 /r ib Rotate 32-bit r/m32 right imm8 times without affecting arithmetic flags.
AVX vinsertf128 ymm1, ymm2, xmm3/m128, imm8 VEX.NDS.256.66.0F3A.W0 18 /r ib Insert 128 bits of packed floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1.
AVX vextractf128 xmm1/m128, ymm2, imm8 VEX.256.66.0F3A.W0 19 /r ib Extract 128 bits of packed floating-point values from ymm2 and store results in xmm1/m128.
AVX vcvtps2ph xmm1/m64, xmm2, imm8 VEX.128.66.0F3A.W0 1D /r ib Convert four packed single-precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.
AVX vinserti128 ymm1, ymm2, xmm3/m128, imm8 VEX.NDS.256.66.0F3A.W0 38 /r ib Insert 128 bits of integer data from xmm3/m128 and the remaining values from ymm2 into ymm1.
AVX vextracti128 xmm1/m128, ymm2, imm8 VEX.256.66.0F3A.W0 39 /r ib Extract 128 bits of integer data from ymm2 and store results in xmm1/m128.
AVX vblendvps xmm1, xmm2, xmm3/m128, xmm4 VEX.NDS.128.66.0F3A.W0 4A /r /is4 Conditionally copy single-precision floating-point values from xmm2 or xmm3/m128 to xmm1, based on mask bits in the specified mask operand, xmm4.
AVX vblendvpd xmm1, xmm2, xmm3/m128, xmm4 VEX.NDS.128.66.0F3A.W0 4B /r /is4 Conditionally copy double-precision floating-point values from xmm2 or xmm3/m128 to xmm1, based on mask bits in the mask operand, xmm4.
AVX vpblendvb xmm1, xmm2, xmm3/m128, xmm4 VEX.NDS.128.66.0F3A.W0 4C /r /is4 Select byte values from xmm2 and xmm3/m128 using mask bits in the specified mask register, xmm4, and store the values into xmm1.

Appendix C : VSIB Instructions

The table below lists all the VEX-encoded AVX instructions that use a VSIB byte.

Mnemonic Operands VEX Encoding Description
VGATHERDPD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 92 /r Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 93 /r Using qword indices specified in vm64x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERDPS xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 92 /r Using dword indices specified in vm32x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPS xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 93 /r Using qword indices specified in vm64x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 90 /r Using dword indices specified in vm32x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 91 /r Using qword indices specified in vm64x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDQ xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 90 /r Using dword indices specified in vm32x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQQ xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 91 /r Using qword indices specified in vm64x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

Appendix D : VEX.vvvv Has Destination Operand Instructions

The table below lists all the AVX instructions that use the VEX[vvvv] bit field to encode a destination operand.

Mnemonic Operands VEX Encoding Description
BLSI r32, r/m32 VEX.NDD.LZ.0F38.W0 F3 /3 Extract lowest set bit from r/m32 and set that bit in r32.
BLSMSK r32, r/m32 VEX.NDD.LZ.0F38.W0 F3 /2 Set all lower bits in r32 to “1” starting from bit 0 to lowest set bit in r/m32.
BLSR r32, r/m32 VEX.NDD.LZ.0F38.W0 F3 /1 Reset lowest set bit of r/m32, keep all other bits of r/m32 and write result to r32.
VPSLLDQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /7 ib Shift xmm2 left by imm8 bytes while shifting in 0s and store result in xmm1.
VPSLLW xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 71 /6 ib Shift words in xmm2 left by imm8 while shifting in 0s.
VPSLLD xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 72 /6 ib Shift doublewords in xmm2 left by imm8 while shifting in 0s.
VPSLLQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /6 ib Shift quadwords in xmm2 left by imm8 while shifting in 0s.
VPSRAW xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 71 /4 ib Shift words in xmm2 right by imm8 while shifting in sign bits.
VPSRAD xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 72 /4 ib Shift doublewords in xmm2 right by imm8 while shifting in sign bits.
VPSRLDQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /3 ib Shift xmm2 right by imm8 bytes while shifting in 0s.
VPSRLW xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 71 /2 ib Shift words in xmm2 right by imm8 while shifting in 0s.
VPSRLD xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 72 /2 ib Shift doublewords in xmm2 right by imm8 while shifting in 0s.
VPSRLQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /2 ib Shift quadwords in xmm2 right by imm8 while shifting in 0s.

Appendix E : VEX.vvvv Has Third Operand

The table below lists all the AVX instructions that use the VEX[vvvv] bit field to encode a second source operand (3rd operand).

Mnemonic Operands VEX Encoding Description
BEXTR r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F7 /r Contiguous bitwise extract from r/m32 using r32b as control; store result in r32a.
BZHI r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F5 /r Zero bits in r/m32 starting with the position in r32b, write result to r32a.
SARX r32a, r/m32, r32b VEX.NDS.LZ.F3.0F38.W0 F7 /r Shift r/m32 arithmetically right with count specified in r32b.
SHLX r32a, r/m32, r32b VEX.NDS.LZ.66.0F38.W0 F7 /r Shift r/m32 logically left with count specified in r32b.
SHRX r32a, r/m32, r32b VEX.NDS.LZ.F2.0F38.W0 F7 /r Shift r/m32 logically right with count specified in r32b.
VGATHERDPD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 92 /r Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 93 /r Using qword indices specified in vm64x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERDPS xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 92 /r Using dword indices specified in vm32x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPS xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 93 /r Using qword indices specified in vm64x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 90 /r Using dword indices specified in vm32x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 91 /r Using qword indices specified in vm64x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDQ xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 90 /r Using dword indices specified in vm32x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQQ xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 91 /r Using qword indices specified in vm64x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

Appendix F: WIG Instructions

The table below lists all the AVX instructions that silently ignore the VEX[W] bit when encoded with a 3-byte VEX prefix.