SSE and AVX Mutation Idea (xlate)

All Streaming SIMD Extensions (SSE) instructions using the legacy encoding can be translated to the Advanced Vector Extensions (AVX) encoding. This is already something most compilers offer these days when the correct compilation flag is used (e.g., /arch:AVX with MSVC) to compile all legacy SSE instructions into their AVX counterparts.

Compilers mostly do this for optimisation reasons; however, in our case, that is not necessarily what we are looking for. What is more interesting is being able to modify the encoding of instructions without modifying their operands or the result of their execution. For this very reason, a module for a mutation engine could be developed to translate all legacy SSE instructions to the AVX format, or from the AVX format back to legacy SSE, in order to modify the signature of a piece of code.

Therefore, the objective of this paper is to briefly discuss what SSE and AVX are and how it is possible to switch from one format to the other without too many difficulties.

Streaming SIMD Extensions (SSE)

Before SSE there were the MMX facilities. MMX was the first to implement the concept of Single Instruction, Multiple Data (SIMD): instructions that perform arithmetic and logical operations on multiple data elements (i.e., bytes, words, or dwords) at once – hence the name SIMD. The goal is to reduce the number of memory access operations, which are expensive. This is quite useful for modern media, communication and graphics applications.

Example of a SIMD (packed) operation: addps xmm1, xmm2

However, the MMX data registers are aliases for the low 64-bit part of the x87 FP data registers (i.e., ST(0)–ST(7)), which brings limitations and edge cases when a routine uses both instruction sets (e.g., data loss and performance issues). Additionally, the MMX facilities do not support operations on floating-point values.

This is what the first version of the SSE facilities tries to address via new SIMD and non-SIMD instructions. With it also come completely new 128-bit data registers (i.e., XMM0 – XMM7) which can be used to operate on scalar (a single integer/FP value, for non-SIMD instructions) or packed (for SIMD instructions) operands, such as (a short example follows the list below):

  • 16 packed byte integers
  • 8 packed word integers
  • 4 packed dword integers
  • 2 packed qword integers
  • 4 single-precision (SP) floating-point values
  • 2 double-precision (DP) floating-point values
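
The difference between scalar and packed operands can be illustrated with a minimal (assumed) MASM snippet; the register choice is arbitrary:

    addss xmm0, xmm1    ; scalar: adds only the lowest single-precision element
    addps xmm0, xmm1    ; packed (SIMD): adds all four single-precision elements in parallel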

Other features have been implemented such as:

  • Enhancement of specific types of memory write operations on cacheable memory via non-temporal store (streaming store) instructions, for example: MOVNTPS, MOVNTQ, or MASKMOVQ;
  • Video and media-specific instructions, for example: PAVGB, LDDQU;
  • Thread synchronisation instructions, for example: PAUSE, MONITOR, MWAIT; and
  • Legacy prefix branch hints to help with misprediction when dealing with conditional branches.

The feature flags returned by the CPUID instruction can be used to identify whether a specific SSE instruction set is available.

| Instruction Set | Introduced With | Year | Number of Instructions | CPUID Flag | Register |
| SSE | Pentium III | 1999 | 70 | 002000000h | EDX |
| SSE2 | Pentium 4 (130 nm) | 2002 | 114 | 004000000h | EDX |
| SSE3 | Pentium 4 (90 nm) | 2004 | 13 | 000000001h | ECX |
| SSSE3 | Core 2 Duo | 2006 | 32 | 000000200h | ECX |
| SSE4.1 | Core 2 Duo (Penryn, 45 nm) | 2008 | 47 | 000080000h | ECX |
| SSE4.2 | Core i7 (Nehalem) | 2008 | 7 | 000100000h | ECX |

Simple MASM code that can be used in conjunction with the above table; this example checks for SSE support:

SSESupport PROC
    ; CPUID leaf 1 returns the feature flags in ECX and EDX
    xor eax, eax
    inc eax
    cpuid

    ; Isolate EDX bit 25 (SSE) and move it down to bit 0
    and edx, 002000000h
    shr edx, 19h

    ; Return 1 in EAX if SSE is supported, 0 otherwise
    xchg eax, edx
    ret
SSESupport ENDP

Advanced Vector Extensions (AVX)

Introduced with the Sandy Bridge microarchitecture (Q1 2011), the AVX facilities continue to enhance the SIMD functionality and offer new features such as:

  • New 256-bit data registers (i.e., YMM0 – YMM7) whose lower 128 bits are aliases of the XMM data registers;
  • A new encoding using a 2- or 3-byte VEX prefix, for both legacy SSE instructions and the new AVX instructions;
  • Non-destructive operand operations to reduce the number of copy and load operations; and
  • Up to three source operands (for four-operand instructions), the fourth operand being selected by the upper 4 bits of an 8-bit immediate data value (i.e., VEX[vvvv] + ModRM[rm] + ModRM[reg] + imm8[7:4]).

Note that this paper will not go into AVX2 in much depth and will not cover the AVX-512 and Fused Multiply-Add (FMA) extensions, which bring even more functionality. Also, with AVX-512 another encoding is possible via the EVEX prefix (an extension of the VEX concept).

Before being able to use AVX it is important to check that both the processor and the operating system support the AVX instruction set and the 256- and 128-bit data registers. The following MASM code can be used to perform this check from either User Mode (UM) or Kernel Mode (KM):

AVXSupport PROC
    ; Get features flags 
    xor eax, eax 
    inc eax 
    cpuid 

    mov eax, ecx 
    mov ebx, ecx 

    ; Check for OSXSAVE and AVX support 
    and eax, 008000000h
    shr eax, 1Bh
    and ebx, 010000000H
    shr ebx, 1Ch

 
    ; Get return value 
    and al, bl
    jz return

get_xcr0: 
    xor ecx, ecx 
    XGETBV 
    mov ebx, eax 

    ; Check for XMM and YMM registers 
    and eax, 04h
    shr eax, 02h
    and ebx, 02h
    shr ebx, 01h

    ; Get return value
    and al, bl 
    return:
    ret 
AVXSupport ENDP

AVX and the new VEX

As mentioned in the previous section, AVX offers a new way to encode instructions (including legacy SSE) with a compact 2- or 3-byte Vector Extension (VEX) prefix, starting with the C5h and C4h bytes respectively. The full layout can be found in the Intel documentation; in short, the 2-byte form is C5h followed by one byte laid out as R (bit 7), vvvv (bits 6:3), L (bit 2) and pp (bits 1:0), while the 3-byte form is C4h followed by a byte laid out as R (bit 7), X (bit 6), B (bit 5) and m-mmmm (bits 4:0), then a byte laid out as W (bit 7), vvvv (bits 6:3), L (bit 2) and pp (bits 1:0). A worked byte-level decode is given after the field list below.

The fields are as follows:

  • R: Like REX[R] in 1’s complement (inverted) form:
    • 1b: Same as REX[R] = 0b (must be set to this in 32-bit mode, otherwise the byte decodes as LES/LDS)
    • 0b: Same as REX[R] = 1b (64-bit mode only)
  • X: Like REX[X] in 1’s complement (inverted) form:
    • 1b: Same as REX[X] = 0b (must be set to this in 32-bit mode, otherwise the byte decodes as LES/LDS)
    • 0b: Same as REX[X] = 1b (64-bit mode only)
  • B: Like REX[B] in 1’s complement (inverted) form:
    • 1b: Same as REX[B] = 0b (ignored in 32-bit mode)
    • 0b: Same as REX[B] = 1b (64-bit mode only)
  • W: This can either be used like REX[W] or as an additional escape extension. This will be opcode specific.
  • m-mmmm: Used to specify opcode escape sequence (will always be 0Fh when using 2-byte VEX prefix):
    • 00000b: Reserved and will #UD
    • 00001b: Implied 0Fh escape opcode (Table 2)
    • 00010b: Implied 0F38h escape opcodes (Table 3)
    • 00011b: Implied 0F3Ah escape opcodes (Table 4)
    • 00100b–11111b: Reserved and will #UD
  • vvvv: Used in conjunction with a ModRM byte to specify an additional register as source or destination. This is encoded in 1’s complement form (inverted) or 1111b if unused:
    • 1111b: XMM0/YMM0
    • 1110b: XMM1/YMM1
    • 1101b: XMM2/YMM2
    • 1100b: XMM3/YMM3
    • 1011b: XMM4/YMM4
    • 1010b: XMM5/YMM5
    • 1001b: XMM6/YMM6
    • 1000b: XMM7/YMM7
    • 0111b: XMM8/YMM8
    • 0110b: XMM9/YMM9
    • 0101b: XMM10/YMM10
    • 0100b: XMM11/YMM11
    • 0011b: XMM12/YMM12
    • 0010b: XMM13/YMM13
    • 0001b: XMM14/YMM14
    • 0000b: XMM15/YMM15
  • L: Vector length bit used to promote operands to 256-bit:
    • 0b: scalar or 128-bit vector operand
    • 1b: 256-bit vector operand
  • pp: Specify a SIMD prefix used as an additional escape opcode:
    • 00b: None
    • 01b: 66h
    • 10b: F3h
    • 11b: F2h
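
To illustrate how these fields fit together, here is a worked decode of an assumed example instruction, vpshufb xmm0, xmm1, xmm2 (VEX.NDS.128.66.0F38.WIG 00 /r; the register choice is arbitrary):

    db 0C4h    ; 3-byte VEX escape byte
    db 0E2h    ; R=1b, X=1b, B=1b (all REX bits clear), m-mmmm=00010b -> implied 0F38h escape
    db 071h    ; W=0b, vvvv=1110b -> XMM1 (first source), L=0b (128-bit), pp=01b -> 66h SIMD prefix
    db 000h    ; primary opcode (PSHUFB)
    db 0C2h    ; ModRM: mod=11b, reg=000b -> XMM0 (destination), rm=010b -> XMM2 (second source)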

Additionally, for a very small subset of AVX2 instructions, a new Vector SIB (VSIB) byte can be used for memory addressing. This is a special case that will not be discussed here. The list of VEX-encoded AVX instructions that use a VSIB byte can be found later in this paper.

Finally, like all the other prefixes (e.g., Operand Size Override or REX), the VEX prefix must be positioned before the primary opcode. Additionally, as described above, VEX has bit fields equivalent to the REX prefix, to the escape opcodes (i.e., 0Fh, 0F38h and 0F3Ah) and to the SIMD prefixes (i.e., 66h, F2h and F3h). Therefore, if any of these legacy prefixes is combined with a VEX prefix, an Undefined Instruction (i.e., #UD) exception will be raised.

Translation Between Legacy SSE and AVX

As mentioned above, all legacy SSE instructions can be converted into the AVX format. However, this does not mean that all AVX instructions can be encoded in the legacy SSE format. Over the years SSE has effectively been frozen, and newer instructions can only be encoded using VEX (or EVEX if we consider AVX-512).

Additionally, there are three things to be careful about when translating instructions:

  • First, some instructions, when encoded via VEX, use an additional non-destructive (ND) operand to limit the number of read/write accesses to/from registers and memory addresses. It means that the ND operand, which is encoded in VEX[vvvv], needs to be interpreted. Note that this ND operand can be either a source or a destination; however, only a small subset of AVX instructions (13, listed in Appendix D) use VEX[vvvv] as a destination operand (1st operand), and only the AVX2 instructions using a VSIB byte plus a few others (13, listed in Appendix E) use VEX[vvvv] as a second source operand (3rd operand).
  • Second, some AVX instructions do not exist in a legacy encoding format (many AVX2 instructions and all of AVX-512). Therefore, attempting to encode them in a legacy format will produce nonsense at best, or a #UD exception.
  • Thirdly, any AVX instruction operating on 256-bit operands (i.e., VEX[L] = 1b) cannot be encoded in the legacy SSE format because of the size limitation of the XMM data registers (i.e., 128 bits).

Later in this paper, the lists of instructions using an additional non-destructive operand and of AVX-only instructions are provided, together with the list of AVX2 instructions using a VSIB byte for vector memory addressing.

Example 1: Basic 2-byte VEX Encoded Instruction

| Mnemonic | Operands | Encoding |
| MOVD | r/m32, xmm | 66 0F 7E /r |
| VMOVD | r/m32, xmm | VEX.128.66.0F.W0 7E /r |

A 2-byte VEX prefix can be used here because only the implied 0Fh escape opcode is needed, the general-purpose register does not need to be promoted to 64-bit (so VEX[W] is not required), no non-destructive additional register is required, and the instruction operates on a 128-bit XMM register. Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Escape opcode 0Fh: implied with a 2-byte VEX prefix
  • 64-bit register promotion W0: VEX[W] = 0b (not used in 2-byte VEX version)
  • 128-bit only operands: VEX[L] = 0b
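
A minimal byte-level sketch of this translation, for an assumed instance movd eax, xmm1 (ModRM C8h = mod 11b, reg 001b = XMM1, rm 000b = EAX):

    db 66h, 0Fh, 7Eh, 0C8h      ; legacy SSE:  movd eax, xmm1
    db 0C5h, 0F9h, 7Eh, 0C8h    ; 2-byte VEX:  vmovd eax, xmm1 (F9h = R=1b, vvvv=1111b, L=0b, pp=01b)

Note that only the prefix and escape bytes change; the primary opcode (7Eh) and the ModRM byte are identical in both encodings.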

Example 2: Basic 3-byte VEX Encoded Instruction with 64-bit

| Mnemonic | Operands | Encoding |
| MOVQ | r64/m64, xmm1 | 66 REX.W 0F 7E /r |
| VMOVQ | r64/m64, xmm1 | VEX.128.66.0F.W1 7E /r |

A 3-byte VEX prefix must be used because the general-purpose register DOES need to be promoted to 64-bit, and only the 3-byte form carries the VEX[W] bit. This is the case even though no non-destructive additional register is required and the instruction still operates on a 128-bit XMM register. Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Escape opcode 0Fh: VEX[m-mmmm] = 00001b
  • 64-bit register promotion W1: VEX[W] = 1b
  • 128-bit only operands: VEX[L] = 0b
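
A minimal byte-level sketch for an assumed instance movq rax, xmm1 (REX.W prefix = 48h, ModRM C8h as in the previous example):

    db 66h, 48h, 0Fh, 7Eh, 0C8h          ; legacy SSE:  movq rax, xmm1
    db 0C4h, 0E1h, 0F9h, 7Eh, 0C8h       ; 3-byte VEX:  vmovq rax, xmm1 (E1h = R/X/B=1b, m-mmmm=00001b; F9h = W=1b, vvvv=1111b, L=0b, pp=01b)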

Example 3: 3-byte VEX Encoded Instruction with SIB

| Mnemonic | Operands | Encoding |
| MOVQ | xmm2/m64, xmm1 | 66 0F D6 /r |
| VMOVQ | xmm2/m64, xmm1 | VEX.128.66.0F.WIG D6 /r |

Similar to example two, but with a SIB byte and a displacement; the 3-byte form is used here because VEX[B] must be cleared to select an extended base register (e.g., R10). This is also a special case where the AVX instruction does not care about the VEX[W] bit field (i.e., WIG). Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Escape opcode 0Fh: VEX[m-mmmm] = 00001b
  • 64-bit base register promotion: VEX[B] = 0b
  • 128-bit only operands: VEX[L] = 0b
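
A minimal byte-level sketch for an assumed instance storing XMM1 to [r10+rcx*2+1] (ModRM 4Ch, SIB 4Ah, disp8 01h; the VEX form is the same encoding shown again in the WIG section below):

    db 66h, 41h, 0Fh, 0D6h, 4Ch, 4Ah, 01h        ; legacy SSE:  movq mmword ptr [r10+rcx*2+1], xmm1 (REX.B = 41h)
    db 0C4h, 0C1h, 79h, 0D6h, 4Ch, 4Ah, 01h      ; 3-byte VEX:  vmovq mmword ptr [r10+rcx*2+1], xmm1 (C1h: B=0b selects R10 as base)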

Example 4: With Non-Destructive Operand

| Mnemonic | Operands | Encoding |
| PADDQ | xmm1, xmm2/m128 | 66 0F D4 /r |
| VPADDQ | xmm1, xmm2, xmm3/m128 | VEX.NDS.128.66.0F.WIG D4 /r |

In this example the instruction uses an additional non-destructive operand, which is encoded in the VEX[vvvv] bit field. The complete list of such instructions can be found in the tables below. It is important to note that, when translating from the destructive two-operand legacy form, the destination register needs to be encoded twice (once in ModRM[reg] and once in VEX[vvvv]) so that the read/write behaviour of the original instruction is preserved. Elements to consider (a byte-level listing follows the list below):

  • SIMD prefix 66h: VEX[pp] = 01b
  • Source operand XMM4: VEX[vvvv] = 1011b
  • Escape opcode 0Fh: implied with a 2-byte VEX prefix
  • 128-bit only operands: VEX[L] = 0b
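
A minimal byte-level sketch for an assumed instance PADDQ xmm4, xmm2 translated to VPADDQ xmm4, xmm4, xmm2, so that XMM4 lands in VEX[vvvv] = 1011b as listed above (ModRM E2h = mod 11b, reg 100b = XMM4, rm 010b = XMM2):

    db 66h, 0Fh, 0D4h, 0E2h      ; legacy SSE:  paddq xmm4, xmm2
    db 0C5h, 0D9h, 0D4h, 0E2h    ; 2-byte VEX:  vpaddq xmm4, xmm4, xmm2 (D9h = R=1b, vvvv=1011b, L=0b, pp=01b)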

WIG and Synonymous Mutation

Some AVX instructions do not care about the state of the VEX[W] bit – it is simply ignored (noted WIG in the encodings above). This means that, when encoding such an instruction, the bit can be set to either 0 or 1, thereby generating yet another valid encoding. The modification is small, but it is enough to change a byte and thus break a signature in some cases.

Let’s take, for example, the following two encodings of the same instruction:

C4 C1 79 D6 4C 4A 01		vmovq mmword ptr [r10+rcx*2+1],xmm1
C4 C1 F9 D6 4C 4A 01 		vmovq mmword ptr [r10+rcx*2+1],xmm1

In the first version VEX[W] = 0b while in the second VEX[W] = 1b. Both are valid ways to encode this instruction; both will execute identically and neither will #UD. Obviously, this only works when the instruction is encoded using the 3-byte VEX prefix, since the 2-byte form has no W bit.

Final Notes

First, looking at all the examples, it is possible to see that when translating from SSE to AVX or from AVX to SSE encoding, the ModRM byte (and potential SIB byte), the memory displacement and the immediate data value are not affected at all. Only the legacy prefixes used as SIMD prefixes, the REX prefix, the escape opcodes and the primary opcode are modified, which makes the whole process easier.

It should be noted that mixing legacy SSE code and AVX code can badly impact CPU performance. AVX instructions modify the upper bits of the YMM data registers, while legacy SSE instructions cannot. As a result, the upper bits can be in a clean, modified-and-unsaved (also known as dirty), or preserved/non-INIT state, and when executing an SSE instruction after an AVX instruction (and vice versa) the processor may need to save the state of the registers (roughly equivalent to an XSAVE operation).

Therefore, if the different code blocks can be identified, the VZEROUPPER instruction should be executed before and after a block of AVX instructions to zero the upper bits of the YMM registers and put them back into a clean state, as in the sketch below.
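
A minimal (assumed) layout isolating a VEX-encoded block from the surrounding legacy SSE code:

    vzeroupper                    ; put the upper YMM bits into a clean state
    vmulps ymm0, ymm1, ymm2       ; 256-bit AVX block that dirties the upper YMM bits
    vaddps ymm0, ymm0, ymm3
    vzeroupper                    ; clean the upper bits again before returning to legacy SSE code
    mulps  xmm4, xmm5             ; legacy SSE code continues without a transition penalty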

To assist in the understanding of this paper, the Intel 64 and IA-32 Architectures Software Developer's Manual can be consulted.

Appendix A : Non-Destructive Operands Instructions

The table below lists all the instructions that can be directly translated between the legacy SSE encoding and the AVX VEX encoding.

Type Mnemonic Operands Legacy Encoding VEX Encoding Description
SSE COMISS xmm1, xmm2/m32 0F 2F /r VEX.128.0F.WIG 2F /r Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.
SSE CVTSS2SI r32, xmm1/m32 F3 0F 2D /r VEX.128.F3.0F.W0 2D /r Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32.
SSE CVTSS2SI r64, xmm1/m32 F3 REX.W 0F 2D /r VEX.128.F3.0F.W1 2D /r Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64.
SSE CVTTSS2SI r32, xmm1/m32 F3 0F 2C /r VEX.128.F3.0F.W0 2C /r Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32 using truncation.
SSE CVTTSS2SI r64, xmm1/m32 F3 REX.W 0F 2C /r VEX.128.F3.0F.W1 2C /r Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64 using truncation.
SSE LDMXCSR m32 0F AE /2 VEX.LZ.0F.WIG AE /2 Load MXCSR register from m32.
SSE MOVAPS xmm1, xmm2/m128 0F 28 /r VEX.128.0F.WIG 28 /r Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.
SSE MOVAPS xmm2/m128, xmm1 0F 29 /r VEX.128.0F.WIG 29 /r Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem.
SSE MOVHPS m64, xmm1 0F 17 /r VEX.128.0F.WIG 17 /r Move two packed single-precision floating-point values from high quadword of xmm to m64.
SSE MOVLPS m64, xmm1 0F 13 /r VEX.128.0F.WIG 13 /r Move two packed single-precision floating-point values from low quadword of xmm1 to m64.
SSE MOVMSKPS reg, xmm 0F 50 /r VEX.128.0F.WIG 50 /r Extract 4-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.
SSE MOVUPS xmm1, xmm2/m128 0F 10 /r VEX.128.0F.WIG 10 /r Move unaligned packed single-precision floating-point from xmm2/mem to xmm1.
SSE MOVUPS xmm2/m128, xmm1 0F 11 /r VEX.128.0F.WIG 11 /r Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem.
SSE RCPPS xmm1, xmm2/m128 0F 53 /r VEX.128.0F.WIG 53 /r Computes the approximate reciprocals of packed single-precision values in xmm2/mem and stores the results in xmm1.
SSE RSQRTPS xmm1, xmm2/m128 0F 52 /r VEX.128.0F.WIG 52 /r Computes the approximate reciprocals of the square roots of packed single-precision values in xmm2/mem and stores the results in xmm1.
SSE SQRTPS xmm1, xmm2/m128 0F 51 /r VEX.128.0F.WIG 51 /r Computes Square Roots of the packed single-precision floating-point values in xmm2/m128 and stores the result in xmm1.
SSE STMXCSR m32 0F AE /3 VEX.LZ.0F.WIG AE /3 Store contents of MXCSR register to m32.
SSE UCOMISS xmm1, xmm2/m32 0F 2E /r VEX.128.0F.WIG 2E /r Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.
SSE PEXTRW reg, xmm, imm8 66 0F C5 /r ib VEX.128.66.0F.W0 C5 /r ib Extract the word specified by imm8 from xmm and move it to reg, bits 15:0. Zero-extend the result. The upper bits of r64/r32 is filled with zeros.
SSE PEXTRW reg/m16, xmm, imm8 66 0F 3A 15 /r ib VEX.128.66.0F3A.W0 15 /r ib Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper bits of r64/r32 is filled with zeros.
SSE PMOVMSKB reg, xmm 66 0F D7 /r VEX.128.66.0F.WIG D7 /r Move a byte mask of xmm to reg. The upper bits of r32 or r64 are zeroed
SSE MOVNTPS m128, xmm1 0F 2B /r VEX.128.0F.WIG 2B /r Move packed single-precision values xmm1 to mem using non-temporal hint.
SSE2 COMISD xmm1, xmm2/m64 66 0F 2F /r VEX.128.66.0F.WIG 2F /r Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.
SSE2 CVTSD2SI r32, xmm1/m64 F2 0F 2D /r VEX.128.F2.0F.W0 2D /r Convert one double-precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
SSE2 CVTSD2SI r64, xmm1/m64 F2 REX.W 0F 2D /r VEX.128.F2.0F.W1 2D /r Convert one double-precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.
SSE2 CVTTSD2SI r32, xmm1/m64 F2 0F 2C /r VEX.128.F2.0F.W0 2C /r Convert one double-precision floating-point value from xmm1/m64 to one signed doubleword integer in r32 using truncation.
SSE2 CVTTSD2SI r64, xmm1/m64 F2 REX.W 0F 2C /r VEX.128.F2.0F.W1 2C /r Convert one double-precision floating-point value from xmm1/m64 to one signed quadword integer in r64 using truncation.
SSE2 CVTPD2PS xmm1, xmm2/m128 66 0F 5A /r VEX.128.66.0F.WIG 5A /r Convert two packed double-precision floating-point values in xmm2/mem to two single-precision floating-point values in xmm1.
SSE2 CVTPS2PD xmm1, xmm2/m64 0F 5A /r VEX.128.0F.WIG 5A /r Convert two packed single-precision floating-point values in xmm2/m64 to two packed double-precision floating-point values in xmm1.
SSE2 CVTPD2DQ xmm1, xmm2/m128 F2 0F E6 /r VEX.128.F2.0F.WIG E6 /r Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1.
SSE2 CVTTPD2DQ xmm1, xmm2/m128 66 0F E6 /r VEX.128.66.0F.WIG E6 /r Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1 using truncation.
SSE2 CVTDQ2PD xmm1, xmm2/m64 F3 0F E6 /r VEX.128.F3.0F.WIG E6 /r Convert two packed signed doubleword integers from xmm2/mem to two packed double-precision floating-point values in xmm1.
SSE2 CVTPS2DQ xmm1, xmm2/m128 66 0F 5B /r VEX.128.66.0F.WIG 5B /r Convert four packed single-precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1.
SSE2 CVTTPS2DQ xmm1, xmm2/m128 F3 0F 5B /r VEX.128.F3.0F.WIG 5B /r Convert four packed single-precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1 using truncation.
SSE2 CVTDQ2PS xmm1, xmm2/m128 0F 5B /r VEX.128.0F.WIG 5B /r Convert four packed signed doubleword integers from xmm2/mem to four packed single-precision floating-point values in xmm1.
SSE2 MOVAPD xmm1, xmm2/m128 66 0F 28 /r VEX.128.66.0F.WIG 28 /r Move aligned packed double-precision floating-point values from xmm2/mem to xmm
SSE2 MOVAPD xmm2/m128, xmm1 66 0F 29 /r VEX.128.66.0F.WIG 29 /r Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem.
SSE2 MOVHPD m64, xmm1 66 0F 17 /r VEX.128.66.0F.WIG 17 /r Move double-precision floating-point value from high quadword of xmm1 to m64.
SSE2 MOVLPD m64, xmm1 66 0F 13/r VEX.128.66.0F.WIG 13/r Move double-precision floating-point value from low quadword of xmm1 to m64.
SSE2 MOVMSKPD reg, xmm 66 0F 50 /r VEX.128.66.0F.WIG 50 /r Extract 2-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.
SSE2 MOVUPD xmm1, xmm2/m128 66 0F 10 /r VEX.128.66.0F.WIG 10 /r Move unaligned packed double-precision floating-point from xmm2/mem to xmm
SSE2 MOVUPD xmm2/m128, xmm1 66 0F 11 /r VEX.128.66.0F.WIG 11 /r Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem.
SSE2 SQRTPD xmm1, xmm2/m128 66 0F 51 /r VEX.128.66.0F.WIG 51 /r Computes Square Roots of the packed double-precision floating-point values in xmm2/m128 and stores the result in xmm1.
SSE2 UCOMISD xmm1, xmm2/m64 66 0F 2E /r VEX.128.66.0F.WIG 2E /r Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.
SSE2 MOVD xmm, r/m32 66 0F 6E /r VEX.128.66.0F.W0 6E /r Move doubleword from r/m32 to xmm.
SSE2 MOVD r/m32, xmm 66 0F 7E /r VEX.128.66.0F.W0 7E /r Move doubleword from xmm to r/m32.
SSE2 MOVQ xmm, r/m64 66 REX.W 0F 6E /r VEX.128.66.0F.W1 6E /r Move quadword from r/m64 to xmm.
SSE2 MOVQ r/m64, xmm 66 REX.W 0F 7E /r VEX.128.66.0F.W1 7E /r Move quadword from xmm register to r/m64.
SSE2 MOVDQA xmm1, xmm2/m128 66 0F 6F /r VEX.128.66.0F.WIG 6F /r Move aligned packed integer values from xmm2/mem to xmm1.
SSE2 MOVDQA xmm2/m128, xmm1 66 0F 7F /r VEX.128.66.0F.WIG 7F /r Move aligned packed integer values from xmm1 to xmm2/mem.
SSE2 MOVDQU xmm1, xmm2/m128 F3 0F 6F /r VEX.128.F3.0F.WIG 6F /r Move unaligned packed integer values from xmm2/m128 to xmm1.
SSE2 MOVDQU xmm2/m128, xmm1 F3 0F 7F /r VEX.128.F3.0F.WIG 7F /r Move unaligned packed integer values from xmm1 to xmm2/m128.
SSE2 MOVQ xmm1, xmm2/m64 F3 0F 7E /r VEX.128.F3.0F.WIG 7E /r Move quadword from xmm2/mem64 to xmm1.
SSE2 MOVQ xmm2/m64, xmm1 66 0F D6 /r VEX.128.66.0F.WIG D6 /r Move quadword from xmm1 to xmm2/mem64.
SSE2 PEXTRW reg, xmm, imm8 66 0F C5 /r ib VEX.128.66.0F.W0 C5 /r ib Extract the word specified by imm8 from xmm and move it to reg, bits 15-0. The upper bits of r32 or r64 is zeroed.
SSE2 PEXTRW reg/m16, xmm, imm8 66 0F 3A 15 /r ib VEX.128.66.0F3A.W0 15 /r ib Extract the word specified by imm8 from xmm and copy it to lowest 16 bits of reg or m16. Zero-extend the result in the destination, r32 or r64.
SSE2 PMOVMSKB reg, xmm 66 0F D7 /r VEX.128.66.0F.WIG D7 /r Move a byte mask of xmm to reg. The upper bits of r32 or r64 are zeroed.
SSE2 PSHUFLW xmm1, xmm2/m128, imm8 F2 0F 70 /r ib VEX.128.F2.0F.WIG 70 /r ib Shuffle the low words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
SSE2 PSHUFHW xmm1, xmm2/m128, imm8 F3 0F 70 /r ib VEX.128.F3.0F.WIG 70 /r ib Shuffle the high words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
SSE2 PSHUFD xmm1, xmm2/m128, imm8 66 0F 70 /r ib VEX.128.66.0F.WIG 70 /r ib Shuffle the doublewords in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
SSE2 MASKMOVDQU xmm1, xmm2 66 0F F7 /r VEX.128.66.0F.WIG F7 /r Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI.
SSE2 MOVNTPD m128, xmm1 66 0F 2B /r VEX.128.66.0F.WIG 2B /r Move packed double-precision values in xmm1 to m128 using non-temporal hint.
SSE2 MOVNTDQ m128, xmm1 66 0F E7 /r VEX.128.66.0F.WIG E7 /r Move packed integer values in xmm1 to m128 using nontemporal hint.
SSE3 LDDQU xmm1, m128 F2 0F F0 /r VEX.128.F2.0F.WIG F0 /r Load unaligned data from mem and return double quadword in xmm1.
SSE3 MOVDDUP xmm1, xmm2/m64 F2 0F 12 /r VEX.128.F2.0F.WIG 12 /r Move double-precision floating-point value from xmm2/m64 and duplicate into xmm1.
SSE3 MOVSHDUP xmm1, xmm2/m128 F3 0F 16 /r VEX.128.F3.0F.WIG 16 /r Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
SSE3 MOVSLDUP xmm1, xmm2/m128 F3 0F 12 /r VEX.128.F3.0F.WIG 12 /r Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
SSSE3 PABSB xmm1, xmm2/m128 66 0F 38 1C /r VEX.128.66.0F38.WIG 1C /r Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.
SSSE3 PABSD xmm1, xmm2/m128 66 0F 38 1E /r VEX.128.66.0F38.WIG 1E /r Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
SSSE3 PABSW xmm1, xmm2/m128 66 0F 38 1D /r VEX.128.66.0F38.WIG 1D /r Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
AESNI AESIMC xmm1, xmm2/m128 66 0F 38 DB /r VEX.128.66.0F38.WIG DB /r Perform the InvMixColumn transformation on a 128-bit round key from xmm2/m128 and store the result in xmm1.
AESNI AESKEYGENASSIST xmm1, xmm2/m128, imm8 66 0F 3A DF /r ib VEX.128.66.0F3A.WIG DF /r ib Assist in AES round key generation using an 8 bits Round Constant (RCON) specified in the immediate byte, operating on 128 bits of data specified in xmm2/m128 and stores the result in xmm1.
SSE4.1 EXTRACTPS reg/m32, xmm1, imm8 66 0F 3A 17 /r ib VEX.128.66.0F3A.WIG 17 /r ib Extract one single-precision floating-point value from xmm1 at the offset specified by imm8 and store the result in reg or m32. Zero extend the results in 64-bit register if applicable.
SSE4.1 MOVNTDQA xmm1, m128 66 0F 38 2A /r VEX.128.66.0F38.WIG 2A /r Move double quadword from m128 to xmm1 using nontemporal hint if WC memory type.
SSE4.1 PEXTRB r/m8, xmm2, imm8 66 0F 3A 14 /r ib VEX.128.66.0F3A.W0 14 /r ib Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper bits of r32 or r64 are zeroed.
SSE4.1 PEXTRD r/m32, xmm2, imm8 66 0F 3A 16 /r ib VEX.128.66.0F3A.W0 16 /r ib Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r/m32.
SSE4.1 PEXTRQ r/m64, xmm2, imm8 66 REX.W 0F 3A 16 /r ib VEX.128.66.0F3A.W1 16 /r ib Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r/m64.
SSE4.1 PEXTRW reg, xmm, imm8 66 0F C5 /r ib VEX.128.66.0F.W0 C5 /r ib Extract the word specified by imm8 from xmm and move it to reg, bits 15-0. The upper bits of r32 or r64 is zeroed.
SSE4.1 PEXTRW reg/m16, xmm, imm8 66 0F 3A 15 /r ib VEX.128.66.0F3A.W0 15 /r ib Extract the word specified by imm8 from xmm and copy it to lowest 16 bits of reg or m16. Zero-extend the result in the destination, r32 or r64.
SSE4.1 PHMINPOSUW xmm1, xmm2/m128 66 0F 38 41 /r VEX.128.66.0F38.WIG 41 /r Find the minimum unsigned word in xmm2/m128 and place its value in the low word of xmm1 and its index in the second-lowest word of xmm1.
SSE4.1 PMOVSXBD xmm1, xmm2/m32 66 0F 38 21 /r VEX.128.66.0F38.WIG 21 /r Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVSXBQ xmm1, xmm2/m16 66 0F 38 22 /r VEX.128.66.0F38.WIG 22 /r Sign extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVSXBW xmm1, xmm2/m64 66 0F 38 20 /r VEX.128.66.0F38.WIG 20 /r Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
SSE4.1 PMOVSXWD xmm1, xmm2/m64 66 0F 38 23 /r VEX.128.66.0F38.WIG 23 /r Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVSXWQ xmm1, xmm2/m32 66 0F 38 24 /r VEX.128.66.0F38.WIG 24 /r Sign extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVSXDQ xmm1, xmm2/m64 66 0F 38 25 /r VEX.128.66.0F38.WIG 25 /r Sign extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVZXBD xmm1, xmm2/m32 66 0F 38 31 /r VEX.128.66.0F38.WIG 31 /r Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVZXBQ xmm1, xmm2/m16 66 0F 38 32 /r VEX.128.66.0F38.WIG 32 /r Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVZXBW xmm1, xmm2/m64 66 0F 38 30 /r VEX.128.66.0F38.WIG 30 /r Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
SSE4.1 PMOVZXWD xmm1, xmm2/m64 66 0F 38 33 /r VEX.128.66.0F38.WIG 33 /r Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
SSE4.1 PMOVZXWQ xmm1, xmm2/m32 66 0F 38 34 /r VEX.128.66.0F38.WIG 34 /r Zero extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
SSE4.1 PMOVZXDQ xmm1, xmm2/m64 66 0F 38 35 /r VEX.128.66.0F38.WIG 35 /r Zero extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
SSE4.1 PTEST xmm1, xmm2/m128 66 0F 38 17 /r VEX.128.66.0F38.WIG 17 /r Set ZF if xmm2/m128 AND xmm1 result is all 0s. Set CF if xmm2/m128 AND NOT xmm1 result is all 0s.
SSE4.1 ROUNDPD xmm1, xmm2/m128, imm8 66 0F 3A 09 /r ib VEX.128.66.0F3A.WIG 09 /r ib Round packed double precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
SSE4.1 ROUNDPS xmm1, xmm2/m128, imm8 66 0F 3A 08 /r ib VEX.128.66.0F3A.WIG 08 /r ib Round packed single precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
SSE4.2 PCMPESTRI xmm1, xmm2/m128, imm8 66 0F 3A 61 /r ib VEX.128.66.0F3A 61 /r ib Perform a packed comparison of string data with explicit lengths, generating an index, and storing the result in ECX.
SSE4.2 PCMPESTRM xmm1, xmm2/m128, imm8 66 0F 3A 60 /r ib VEX.128.66.0F3A 60 /r ib Perform a packed comparison of string data with explicit lengths, generating a mask, and storing the result in XMM0.
SSE4.2 PCMPISTRI xmm1, xmm2/m128, imm8 66 0F 3A 63 /r ib VEX.128.66.0F3A.WIG 63 /r ib Perform a packed comparison of string data with implicit lengths, generating an index, and storing the result in ECX.
SSE4.2 PCMPISTRM xmm1, xmm2/m128, imm8 66 0F 3A 62 /r ib VEX.128.66.0F3A.WIG 62 /r ib Perform a packed comparison of string data with implicit lengths, generating a mask, and storing the result in XMM0.

Appendix B : AVX Only Instructions

The table below lists all the instructions (mostly AVX2) that can only be encoded with a VEX prefix.

Type Mnemonic Operands VEX Encoding Description
AVX vzeroupper VEX.128.0F.WIG 77 Zero upper 128 bits of all YMM registers.
AVX vzeroall VEX.256.0F.WIG 77 Zero all bits of all YMM registers.
AVX vcvtph2ps xmm1, xmm2/m64 VEX.128.66.0F38.W0 13 /r Convert four packed half precision (16-bit) floatingpoint values in xmm2/m64 to packed single-precision floating-point value in xmm1.
AVX vpermd ymm1, ymm2, ymm3/m256 VEX.NDS.256.66.0F38.W0 36 /r Permute doublewords in ymm3/m256 using indices in ymm2 and store the result in ymm1.
AVX vpsrlvd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 45 /r Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in 0s.
AVX vpsravd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 46 /r Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in sign bits.
AVX vpsllvd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 47 /r Shift doublewords in xmm2 left by amount specified in the corresponding element of xmm3/m128 while shifting in 0s.
AVX vgatherdps xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 92 /r Using dword indices specified in vm32x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
AVX vgatherqps xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 93 /r Using qword indices specified in vm64x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
AVX ANDN r32a, r32b, r/m32 VEX.NDS.LZ.0F38.W0 F2 /r Bitwise AND of inverted r32b with r/m32, store result in r32a.
AVX BZHI r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F5 /r Zero bits in r/m32 starting with the position in r32b, write result to r32a.
AVX BEXTR r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F7 /r Contiguous bitwise extract from r/m32 using r32b as control; store result in r32a.
AVX SHLX r32a, r/m32, r32b VEX.NDS.LZ.66.0F38.W0 F7 /r Shift r/m32 logically left with count specified in r32b.
AVX PEXT r32a, r32b, r/m32 VEX.NDS.LZ.F3.0F38.W0 F5 /r Parallel extract of bits from r32b using mask in r/m32, result is written to r32a
AVX SARX r32a, r/m32, r32b VEX.NDS.LZ.F3.0F38.W0 F7 /r Shift r/m32 arithmetically right with count specified in r32b.
AVX PDEP r32a, r32b, r/m32 VEX.NDS.LZ.F2.0F38.W0 F5 /r Parallel deposit of bits from r32b using mask in r/m32, result is written to r32a.
AVX MULX r32a, r32b, r/m32 VEX.NDD.LZ.F2.0F38.W0 F6 /r Unsigned multiply of r/m32 with EDX without affecting arithmetic flags.
AVX SHRX r32a, r/m32, r32b VEX.NDS.LZ.F2.0F38.W0 F7 /r Shift r/m32 logically right with count specified in r32b.
AVX vpermilps xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 0C /r Permute single-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.
AVX vpermilps xmm1, xmm2/m128, imm8 VEX.128.66.0F3A.W0 04 /r ib Permute single-precision floating-point values in xmm2/m128 using controls from imm8 and store result in xmm1.
AVX vpermilpd xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W0 0D /r Permute double-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.
AVX vpermilpd xmm1, xmm2/m128, imm8 VEX.128.66.0F3A.W0 05 /r ib Permute double-precision floating-point values in xmm2/m128 using controls from imm8.
AVX vtestps xmm1, xmm2/m128 VEX.128.66.0F38.W0 0E /r Set ZF and CF depending on sign bit AND and ANDN of packed single-precision floating-point sources.
AVX vtestpd xmm1, xmm2/m128 VEX.128.66.0F38.W0 0F /r Set ZF and CF depending on sign bit AND and ANDN of packed double-precision floating-point sources.
AVX vbroadcastss xmm1, m32 VEX.128.66.0F38.W0 18 /r Broadcast single-precision floating-point element in mem to four locations in xmm1.
AVX vbroadcastsd ymm1, m64 VEX.256.66.0F38.W0 19 /r Broadcast double-precision floating-point element in mem to four locations in ymm1.
AVX vbroadcastf128 ymm1, m128 VEX.256.66.0F38.W0 1A /r Broadcast 128 bits of floating-point data in mem to low and high 128-bits in ymm1.
AVX vmaskmovps xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W0 2C /r Conditionally load packed single-precision values from m128 using mask in xmm2 and store in xmm1.
AVX vmaskmovpd xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W0 2D /r Conditionally load packed double-precision values from m128 using mask in xmm2 and store in xmm1.
AVX vmaskmovps m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W0 2E /r Conditionally store packed single-precision values from xmm2 using mask in xmm1.
AVX vmaskmovpd m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W0 2F /r Conditionally store packed double-precision values from xmm2 using mask in xmm1.
AVX vpbroadcastd xmm1, xmm2/m32 VEX.128.66.0F38.W0 58 /r Broadcast a dword integer in the source operand to four locations in xmm1.
AVX vpbroadcastq xmm1, xmm2/m64 VEX.128.66.0F38.W0 59 /r Broadcast a qword element in source operand to two locations in xmm1.
AVX vbroadcasti128 ymm1, m128 VEX.256.66.0F38.W0 5A /r Broadcast 128 bits of integer data in mem to low and high 128-bits in ymm1.
AVX vpbroadcastb xmm1, xmm2/m8 VEX.128.66.0F38.W0 78 /r Broadcast a byte integer in the source operand to sixteen locations in xmm1.
AVX vpbroadcastw xmm1, xmm2/m16 VEX.128.66.0F38.W0 79 /r Broadcast a word integer in the source operand to eight locations in xmm1.
AVX vpmaskmovd xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W0 8C /r Conditionally load dword values from m128 using mask in xmm2 and store in xmm1.
AVX vpmaskmovd m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W0 8E /r Conditionally store dword values from xmm2 using mask in xmm1.
AVX vpmaskmovq xmm1, xmm2, m128 VEX.NDS.128.66.0F38.W1 8C /r Conditionally load qword values from m128 using mask in xmm2 and store in xmm1.
AVX vpmaskmovq m128, xmm1, xmm2 VEX.NDS.128.66.0F38.W1 8E /r Conditionally store qword values from xmm2 using mask in xmm1.
AVX vpermq ymm1, ymm2/m256, imm8 VEX.256.66.0F3A.W1 00 /r ib Permute qwords in ymm2/m256 using indices in imm8 and store the result in ymm1.
AVX vpermpd ymm1, ymm2/m256, imm8 VEX.256.66.0F3A.W1 01 /r ib Permute double-precision floating-point elements in ymm2/m256 using indices in imm8 and store the result in ymm1.
AVX vpblendd xmm1, xmm2, xmm3/m128, imm8 VEX.NDS.128.66.0F3A.W0 02 /r ib Select dwords from xmm2 and xmm3/m128 from mask specified in imm8 and store the values into xmm1.
AVX vperm2f128 ymm1, ymm2, ymm3/m256, imm8 VEX.NDS.256.66.0F3A.W0 06 /r ib Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1.
AVX vperm2i128 ymm1, ymm2, ymm3/m256, imm8 VEX.NDS.256.66.0F3A.W0 46 /r ib Permute 128-bit integer data in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1.
AVX RORX r32, r/m32, imm8 VEX.LZ.F2.0F3A.W0 F0 /r ib Rotate 32-bit r/m32 right imm8 times without affecting arithmetic flags.
AVX vinsertf128 ymm1, ymm2, xmm3/m128, imm8 VEX.NDS.256.66.0F3A.W0 18 /r ib Insert 128 bits of packed floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1.
AVX vextractf128 xmm1/m128, ymm2, imm8 VEX.256.66.0F3A.W0 19 /r ib Extract 128 bits of packed floating-point values from ymm2 and store results in xmm1/m128.
AVX vcvtps2ph xmm1/m64, xmm2, imm8 VEX.128.66.0F3A.W0 1D /r ib Convert four packed single-precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.
AVX vinserti128 ymm1, ymm2, xmm3/m128, imm8 VEX.NDS.256.66.0F3A.W0 38 /r ib Insert 128 bits of integer data from xmm3/m128 and the remaining values from ymm2 into ymm1.
AVX vextracti128 xmm1/m128, ymm2, imm8 VEX.256.66.0F3A.W0 39 /r ib Extract 128 bits of integer data from ymm2 and store results in xmm1/m128.
AVX vblendvps xmm1, xmm2, xmm3/m128, xmm4 VEX.NDS.128.66.0F3A.W0 4A /r /is4 Conditionally copy single-precision floating-point values from xmm2 or xmm3/m128 to xmm1, based on mask bits in the specified mask operand, xmm4.
AVX vblendvpd xmm1, xmm2, xmm3/m128, xmm4 VEX.NDS.128.66.0F3A.W0 4B /r /is4 Conditionally copy double-precision floating-point values from xmm2 or xmm3/m128 to xmm1, based on mask bits in the mask operand, xmm4.
AVX vpblendvb xmm1, xmm2, xmm3/m128, xmm4 VEX.NDS.128.66.0F3A.W0 4C /r /is4 Select byte values from xmm2 and xmm3/m128 using mask bits in the specified mask register, xmm4, and store the values into xmm1.

Appendix C : VSIB Instructions

The table below lists all the VEX-encoded AVX instructions that use a VSIB byte.

Mnemonic Operands VEX Encoding Description
VGATHERDPD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 92 /r Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 93 /r Using qword indices specified in vm64x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERDPS xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 92 /r Using dword indices specified in vm32x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPS xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 93 /r Using qword indices specified in vm64x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 90 /r Using dword indices specified in vm32x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 91 /r Using qword indices specified in vm64x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDQ xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 90 /r Using dword indices specified in vm32x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQQ xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 91 /r Using qword indices specified in vm64x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

Appendix D : VEX.vvvv Has Destination Operand Instructions

The table below lists all the AVX instructions that use the VEX[vvvv] bit field to encode a destination operand.

Mnemonic Operands VEX Encoding Description
BLSI r32, r/m32 VEX.NDD.LZ.0F38.W0 F3 /3 Extract lowest set bit from r/m32 and set that bit in r32.
BLSMSK r32, r/m32 VEX.NDD.LZ.0F38.W0 F3 /2 Set all lower bits in r32 to “1” starting from bit 0 to lowest set bit in r/m32.
BLSR r32, r/m32 VEX.NDD.LZ.0F38.W0 F3 /1 Reset lowest set bit of r/m32, keep all other bits of r/m32 and write result to r32.
VPSLLDQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /7 ib Shift xmm2 left by imm8 bytes while shifting in 0s and store result in xmm1.
VPSLLW xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 71 /6 ib Shift words in xmm2 left by imm8 while shifting in 0s.
VPSLLD xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 72 /6 ib Shift doublewords in xmm2 left by imm8 while shifting in 0s.
VPSLLQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /6 ib Shift quadwords in xmm2 left by imm8 while shifting in 0s.
VPSRAW xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 71 /4 ib Shift words in xmm2 right by imm8 while shifting in sign bits.
VPSRAD xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 72 /4 ib Shift doublewords in xmm2 right by imm8 while shifting in sign bits.
VPSRLDQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /3 ib Shift xmm2 right by imm8 bytes while shifting in 0s.
VPSRLW xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 71 /2 ib Shift words in xmm2 right by imm8 while shifting in 0s.
VPSRLD xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 72 /2 ib Shift doublewords in xmm2 right by imm8 while shifting in 0s.
VPSRLQ xmm1, xmm2, imm8 VEX.NDD.128.66.0F.WIG 73 /2 ib Shift quadwords in xmm2 right by imm8 while shifting in 0s.

Appendix E : VEX.vvvv Has Third Operand

The table below lists all the AVX instructions that use the VEX[vvvv] bit field to encode a second source operand (3rd operand).

Mnemonic Operands VEX Encoding Description
BEXTR r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F7 /r Contiguous bitwise extract from r/m32 using r32b as control; store result in r32a.
BZHI r32a, r/m32, r32b VEX.NDS.LZ.0F38.W0 F5 /r Zero bits in r/m32 starting with the position in r32b, write result to r32a.
SARX r32a, r/m32, r32b VEX.NDS.LZ.F3.0F38.W0 F7 /r Shift r/m32 arithmetically right with count specified in r32b.
SHLX r32a, r/m32, r32b VEX.NDS.LZ.66.0F38.W0 F7 /r Shift r/m32 logically left with count specified in r32b.
SHRX r32a, r/m32, r32b VEX.NDS.LZ.F2.0F38.W0 F7 /r Shift r/m32 logically right with count specified in r32b.
VGATHERDPD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 92 /r Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 93 /r Using qword indices specified in vm64x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERDPS xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 92 /r Using dword indices specified in vm32x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VGATHERQPS xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 93 /r Using qword indices specified in vm64x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDD xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W0 90 /r Using dword indices specified in vm32x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQD xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W0 91 /r Using qword indices specified in vm64x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERDQ xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 90 /r Using dword indices specified in vm32x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VPGATHERQQ xmm1, vm64x, xmm2 VEX.DDS.128.66.0F38.W1 91 /r Using qword indices specified in vm64x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

Appendix F: WIG Instructions

The table below lists all the AVX instructions that silently ignore the VEX[W] bit when encoded with a 3-byte VEX prefix.