Another trick for classifying bytes is to use pshufb. That instruction works like:

    fn pshufb(a: [i8; 16], b: [i8; 16]) -> [i8; 16] {
        let mut result = [0i8; 16];
        for j in 0..16 {
            if b[j] >= 0 {
                result[j] = a[(b[j] & 15) as usize];
            }
        }
        result
    }

So you can often construct values m1, m2 and compute:

    a = pshufb(m1, input & 0xf)
    b = pshufb(m2, unsigned_input >> 4)
    x = f(a, b)

where f is something like bitwise AND. Roughly, m1 and m2 each assign a set of 8 potential properties based on the low or high 4 bits of each byte, and then we AND the two property sets together to make sure the upper and lower bits 'agree'.

This technique can sometimes be used to mark a few special characters.
Pshufb is an old trick. An oldie but a goodie from the SSSE3 days.

All hail modern improvements!! pdep, pext, vpcompressb, vpexpandb, and especially vpermi2b.

Forget the 16-byte lookup table with pshufb. Try a 128-byte lookup with vpermi2b on AVX-512.
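To make the 128-byte-lookup idea concrete, here is a minimal C sketch (my own illustration, not from the article) using the AVX-512 VBMI intrinsic behind vpermi2b, _mm512_permutex2var_epi8. The helper name classify64 and the table layout are assumptions: the table is expected to hold one property byte per ASCII code, 0 meaning "not an identifier character".

    #include <immintrin.h>  // AVX-512 VBMI + BW: compile with -mavx512vbmi -mavx512bw

    // Look up 64 input bytes at once in a 128-entry table held in two registers.
    // lut_lo holds table entries 0..63, lut_hi holds entries 64..127.
    static __mmask64 classify64(__m512i input, __m512i lut_lo, __m512i lut_hi) {
        // vpermi2b selects from the concatenation lut_lo:lut_hi by the low 7 bits of each byte.
        __m512i props = _mm512_permutex2var_epi8(lut_lo, input, lut_hi);
        // Bytes >= 0x80 alias back into the table, so also require the top bit to be clear.
        __mmask64 is_ascii = _mm512_testn_epi8_mask(input, _mm512_set1_epi8((char)0x80));
        return is_ascii & _mm512_test_epi8_mask(props, props);  // one mask bit per matching byte
    }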
--------
I find that pext/pdep (for 64 bits) and vpcompressb/vpexpandb (for 512 bits) are also more flexible than pshufb (!!!!) in practice, if a bit slower since the work spreads over multiple instructions.

Supercomputer programs are written with a gather-compute-scatter pattern. Gather your data, compute on the data, and finally place the data where needed.
Pext and vpcompressb are gather operations.
Pdep and vpexpandb are scatter.
Pshufb is only a gather and thus fails to match the supercomputer pattern. GPUs provide a bpermute (backwards-permute) that scatters results back to a table and is really the missing link IMO.
Hopefully Intel will provide bpermute of their own eventually....
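For a scalar taste of that gather-compute-scatter shape, here is a small hedged BMI2 sketch (the mask and the "+1" compute step are arbitrary choices of mine, not anything from the thread):

    #include <immintrin.h>  // BMI2 intrinsics: compile with -mbmi2
    #include <stdint.h>

    // Gather-compute-scatter on bits: pext pulls the bits selected by mask down
    // into a dense value, we operate on it, and pdep places the result back.
    static inline uint64_t bump_masked_field(uint64_t x, uint64_t mask) {
        uint64_t dense = _pext_u64(x, mask);              // gather
        dense += 1;                                       // compute (any op on the packed field)
        return (x & ~mask) | _pdep_u64(dense, mask);      // scatter back into place
    }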
I'll try it in the follow-up post and will probably include some graphics to make clear what it does. It's indeed the real SIMD way of comparing multiple characters per iteration instead of one character per iteration.

purplesyringa's post below is ~7 assembly instructions without any need for lookup tables.

And the instructions purplesyringa uses are ADD, AND, OR, and compare. Modern CPUs can execute ~4 ADD/AND/OR instructions per clock tick (even if they are SIMD or even AVX-512 instructions), so you're looking at under ~4 clock cycles for purplesyringa's solution.

It's likely the best solution in the whole thread. The main advantage of pshufb is its extreme flexibility due to the lookup table, so there are advantages in learning that methodology.
But in general, sequences of simpler instructions are much faster.
---------
The "hard part" of purplesyringa's code is that its branchless.
Learning to write branchless (especially using the pcmpeq* line of instructions) is a bit of a hassle and worth practicing for sure. AVX512 added explicit mask registers that do it better but 128-bit SSE / 256-bit AVX is still more widespread today.
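As one concrete branchless idiom to practice with (a sketch of mine, not code from the article): SSE2 has no unsigned byte compare, but a subtract plus the min/cmpeq trick gives a per-byte "lo <= x <= hi" test. OR three of these together for [0-9], [A-Z], [a-z] and you have the identifier mask with no branches.

    #include <emmintrin.h>  // SSE2

    // Branchless per-byte range check: lanes where lo <= v[i] <= hi become 0xFF, others 0.
    static __m128i bytes_in_range(__m128i v, unsigned char lo, unsigned char hi) {
        __m128i t    = _mm_sub_epi8(v, _mm_set1_epi8((char)lo));
        __m128i span = _mm_set1_epi8((char)(hi - lo));
        // Unsigned t <= span exactly when min(t, span) == t.
        return _mm_cmpeq_epi8(_mm_min_epu8(t, span), t);
    }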
> So you can often construct values m1, m2, and compute:
And what would m1 and m2 be for the specific problem¹ described in TFA?
¹ Namely, to find out if the input belongs to [0-9A-Za-z]
The original poster got lucky and m1 and m2 can in fact be constructed.
But your overall post and skepticism are warranted: the general claim the original post made is definitely wrong.

To solve this easily, you truly need the 128-byte table from vpermi2b / AVX-512 (note that ASCII is a 7-bit code, and thus any function of ASCII fits inside a 128-byte table).
The reason the original post got lucky is that the top 2 bits of ASCII determine special vs numbers vs lowercase vs uppercase.
The bottom 5 bits determine the specific character. ASCII is a table of tables.
0x30-0x39 are '0'-'9' respectively, so a 0x3 in the 'top' (high-nibble) table says you have a number.

The 0x40 block is uppercase and the 0x60 block is lowercase letters.

The bottom table is harder to construct, but we don't actually need to know the top bits. We just need 0x0-0x9 (valid for numbers), 0x1-0xF (valid under top nibbles 0x4 and 0x6, since 0x40 and 0x60 themselves are not letters), and finally 0x0-0xA (valid for P-Z / p-z, the top half of the alphabet in ASCII).

This harder bottom table needs only 3 of the 8 bits in each entry, meaning we have 5 spare bits for future extensions.
So valid for numbers is bit 0x1. Valid for lower-half alphabet is 0x2 and valid for upper-half alphabet is 0x4.
Locations like 5 are valid for all three (0x7) while locations like 0 are only valid for numbers and upper half alphabet (aka 0x5).
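Putting the two nibble tables together, here is a hedged C/SSSE3 sketch of the classification for [0-9A-Za-z]. The function name classify_ident16 is mine, and the bit assignments follow the scheme above (0x1 = digit, 0x2 = lower half of a letter block, 0x4 = upper half), though that encoding is just one workable choice:

    #include <tmmintrin.h>  // SSSE3 for _mm_shuffle_epi8; compile with -mssse3

    // Classify 16 bytes at once as [0-9A-Za-z] using two 16-entry nibble tables.
    static __m128i classify_ident16(__m128i v) {
        const __m128i lo_table = _mm_setr_epi8(
            0x5, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7,   // low nibble 0x0-0x7
            0x7, 0x7, 0x6, 0x2, 0x2, 0x2, 0x2, 0x2);  // low nibble 0x8-0xF
        const __m128i hi_table = _mm_setr_epi8(
            0x0, 0x0, 0x0, 0x1, 0x2, 0x4, 0x2, 0x4,   // high nibble 0x0-0x7
            0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0);  // bytes >= 0x80: never an identifier
        __m128i lo = _mm_and_si128(v, _mm_set1_epi8(0x0F));
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0F));
        __m128i a  = _mm_shuffle_epi8(lo_table, lo);
        __m128i b  = _mm_shuffle_epi8(hi_table, hi);
        // Non-zero exactly where the low- and high-nibble property sets agree.
        return _mm_and_si128(a, b);
    }

A byte is in [0-9A-Za-z] exactly when its lane in the result is non-zero, which _mm_movemask_epi8 (after a compare against zero) turns into a scalar bitmask.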
----------
In the general case, the [0-9a-zA-Z] language is a regular language (even in an arbitrary non-ASCII binary encoding), which means it can always be solved with a state-transition table and a finite automaton. And yes, pshufb can be a very useful tool for making such a table (and a bigger table like vpermi2b's is even better).
But it's not clear that the pshufb sequence in the original post would always work for an arbitrary jump table. Indeed, for best results you likely need to use pext and other complex bit instructions to efficiently extract the most info out of the bits available.
Or, you know... use the 128-byte table from vpermi2b.
------
Note: FPGAs are composed of 4-LUTs which share many similarities with these 4-bit / 16-byte tables. One can argue that an FPGA is nothing more than a whole bunch of parallel pshufb units.
But that doesn't mean that two such tables solve all problems, lol. It's often surprising how some problems blow up in your face and suddenly eat up thousands of LUTs.

Interesting, I never thought of ASCII as a table of tables. I'll have another look at the ASCII table's number sequences to better understand it. This thread is really full of gold nuggets.
Under certain conditions GCC will actually generate instructions that use the same bitvector approach described here: https://godbolt.org/z/6Gfjv6PGd (notice the funny 130159 value in there)

The only thing I did was to adjust the range of characters so that the bitvector fits in 64 bits. Sadly, I do not think it would work for the author's bigger range.
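To see the shape of that trick outside the compiler, here is a small C sketch of a 64-bit bitvector membership test. Note it deliberately uses [A-Za-z_] (all within 'A'..'z', 58 codes) instead of the article's [0-9A-Za-z], which spans more than 64 codes and therefore doesn't fit in one 64-bit mask:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    // Bit i of the mask is set iff the character 'A' + i is in [A-Za-z_].
    static uint64_t word_mask(void) {
        uint64_t m = 0;
        for (int c = 'A'; c <= 'z'; c++)
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_')
                m |= 1ULL << (c - 'A');
        return m;
    }

    static bool is_word_char(unsigned char c, uint64_t mask) {
        unsigned idx = (unsigned)c - 'A';          // wraps to a huge value for chars below 'A'
        return idx <= 'z' - 'A' && ((mask >> idx) & 1);
    }

    int main(void) {
        uint64_t m = word_mask();
        printf("%d %d %d\n", is_word_char('_', m), is_word_char('0', m), is_word_char('q', m));
        return 0;  // prints "1 0 1"
    }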
I know the post only uses SIMD as an attempt to do bitmask lookup for each byte, but a good alternative use of SIMD is for parallel data processing, i.e. operating on more than one character. Being able to identify a range of characters making up the identifier is much faster than checking one character at a time, especially for avoiding branching.
Did a bit of xmas hacking to come up with the function to find identifiers in parallel in 32-byte chunks. I used Zig since it makes vector/SIMD programming easy.
The code is in: https://github.com/williamw520/misc_zig/blob/main/identifier...
The generated code of the nextBitmask32() function has about 38 instructions. https://godbolt.org/z/7ME3sW55e
That's 38 instructions for processing every 32 bytes. With no branching!
Slightly optimized code to get down to 32 asm instructions.
https://godbolt.org/z/ccE3r1bjc
Godbolt shows 40 asm instructions, or am I misreading it?
Ahh ok. The above link used whatever was the latest trunk of the Zig compiler to generate code. It looks like the current trunk has just regressed and produces more instructions.
The link below uses the previous stable version and produces fewer instructions.
https://godbolt.org/z/Ez4YGbdqn
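In the same spirit as the Zig function above (not a translation of it, just a hedged C/SSE2 sketch of the chunk-at-a-time idea): classify 16 bytes branchlessly, then use movemask plus a trailing-ones count to find where the identifier stops.

    #include <emmintrin.h>  // SSE2
    #include <stdint.h>

    // Per-byte unsigned range check: lanes with lo <= v[i] <= hi become 0xFF.
    static __m128i in_range16(__m128i v, unsigned char lo, unsigned char hi) {
        __m128i t    = _mm_sub_epi8(v, _mm_set1_epi8((char)lo));
        __m128i span = _mm_set1_epi8((char)(hi - lo));
        return _mm_cmpeq_epi8(_mm_min_epu8(t, span), t);
    }

    // Count how many leading bytes of a 16-byte chunk are in [0-9A-Za-z].
    static int ident_prefix_len16(const unsigned char *p) {
        __m128i v  = _mm_loadu_si128((const __m128i *)p);
        __m128i ok = _mm_or_si128(_mm_or_si128(in_range16(v, '0', '9'),
                                               in_range16(v, 'A', 'Z')),
                                  in_range16(v, 'a', 'z'));
        unsigned mask = (unsigned)_mm_movemask_epi8(ok);   // bit i set if byte i matched
        return __builtin_ctz(~mask & 0x1FFFF);             // GCC/Clang builtin; 16 if all match
    }

If the function returns 16, the identifier continues into the next chunk, so the caller loops.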
Hey, I just discovered that my blog post was posted here. I indeed have little experience with assembly; the idea was simply: could I use a 'larger register', the way you would use a 64-bit register, to keep the table hot in the CPU? But sadly I failed, kinda gave up, and published the post as is.

The good news is that I will take a 'second' step into assembly programming, where I will try to implement `dan-robertson`'s algorithm. There will also be a switch statement and the other tricks suggested in this thread.

During that experience I noticed that I'm missing breakpoints and unit tests to better understand what each assembly step does, which things read from memory, etc. What also doesn't help is that x86 feels a bit like an 'arbitrary patchwork of instructions', where some instructions exist for floats and doubles but not for ints, etc.

It seems a lot of implicit knowledge is needed to do assembly programming, for example `xor A A` to zero a register. So I probably have to read Hacker's Delight cover to cover.

The only assembly I have done in the past was AVR, which is very limited and easy to understand. I could just attach a JTAG debugger and see the registers directly.

Go is used because it is a nice balance between high and low level, unlike the JVM/Python (JIT compilers), where you can't disassemble and see the native code.

Ideally there would be a 'modern Hacker's Delight' which would also include some simple SIMD tricks.

My goal is to eventually reimplement 'Write a Compiler/Interpreter in Go' in Zig and Rust to learn those languages. After an hour of trying, it seems Zig makes it 'quite' easy to include assembly as well as to do SIMD. But currently I'm on page 40 of 'Write a Compiler in Go', so I'm still a long way off. For the Zig/Rust implementation I probably need to include a GC in this simple VM (https://craftinginterpreters.com/garbage-collection.html), so there is still a lot to learn.

Also, one 'clever' idea I implemented in the lexer was to allow full UTF-8, but only in string-literal parsing mode. This prevents using emojis as variable and function names.
> Since go doesn’t inline assembly i decided to include the search loop into the assembly itself and to return the index of the non matching character. Sadly i didn’t get it to work fully because i’m missing tools to actually debug the assembly itself.
Um... "gdb <my_binary>", "break <my_function>", "run", just like any other tool.
Then "disassemble" to see the instructions around the PC, "stepi" to execute one instruction. "info register" to get current state (or use registers in an expression as e.g. "print $rax" or "print $r11").
And that's it. Debuggers natively operate at the level of machine instructions; the language syntax support is a layer on top.
Assembly is one of those things that is feared by so many engineers, but is actually incredibly accessible. Whenever I find an excuse to drop down to assembly my colleagues have looked at me like I'm a wizard with 3 heads.
The tooling to debug, dump or inspect, or to link or otherwise glue bits of assembly together has existed for decades. It's all right there!
I agree it's unnecessarily feared. At its core, it lacks a lot of complexity.
But actually writing asm is an experience straight from the 80s. Compared to modern languages, the tooling is simply antiquated. I noticed this once I started writing larger sections of asm at a time. It's "fine" in the same sense that writing makefiles is "fine" - it becomes a PITA after a certain size.
I think there's lots to improve on the current state of asm, but it's just too niche. The only power users I know are hobbyists working on operating systems and games for retro hardware.
Also add cryptography people to that list. A friend of mine told me that they create 'constant-time' algorithms.

These have to be written in assembly, otherwise the compiler will try to speed up the algorithm; if it does, you can theoretically measure the execution time of encryption and reverse-engineer the possible keys.
It's one of these areas LLMs really help to get started. Not much invention needed, but quite involved setup if you don't know what you're precisely looking for. Having them spit out boilerplate and needed invocations lowers the barrier substantially.
Assuming you have a C compiler, you should use C to get basic boilerplate and use asm or volatile asm to drop down to assembly level.
This is true for GPUs, CPUs like x86 and ARM, and more. Only on the smallest of embedded systems like AVR with 4kB RAM have I ever found dropping to raw assembly to be useful.
Bonus points: input/output/clobber lists work well with the GCC and Clang optimizers. Though it's not intuitive how compilers can still move your assembly around, the syntax is surprisingly full-featured for getting data into and out of registers.
There are rare times where the C boilerplate is counterproductive. I guess OS level setup is another spot where raw non-C assembly is useful.
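A tiny hedged example of that constraint syntax (GCC/Clang extended asm; the lea trick and the helper name are just for illustration):

    #include <stdint.h>

    // Let the compiler pick registers; we only describe inputs, outputs and clobbers.
    static inline uint64_t add_via_lea(uint64_t a, uint64_t b) {
        uint64_t out;
        __asm__("lea (%1, %2), %0"   // out = a + b, AT&T syntax
                : "=r"(out)          // output: any general-purpose register
                : "r"(a), "r"(b)     // inputs: general-purpose registers
                : /* no clobbers */);
        return out;
    }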
Intrinsics, when available, are a convenient tool. But even then, I think assembly these days has very limited applications and most programmers don't need to know any assembly to be effective.
Preemptive disclaimer: all my jobs were the exception to the rule and I sadly had to work with various assembly languages on a daily basis. Still not a fan.
Things like the condition flags, or GPU execution flags, or other 'implicit' flags are when assembly is needed.
Otherwise, intrinsics are sufficient. And with wavefront voting / ballot instructions (and intrinsics) being so readily available on modern GPUs, there's very little reason to go to those assembly level / machine code flags or registers.
Similarly, back when cmov instructions weren't reliably emitted by compilers, you'd probably want raw assembly. But today I think it's safe to rely upon the optimizers.
You should be able to read assembly, not least because understanding what's generated by the compiler or jit is pretty important to performance optimization.
Reading is indeed the main thing I would like to learn. For my master's thesis I did some GPU programming (8 years ago), and there it was super useful to read the assembly to reduce the number of steps and to understand the 'execution model'. It also lets you make sure your 'optimization' actually optimized anything.
Aren't there many other steps with better bang for your buck to be done before that? Libraries, better algorithms and data structures, parallelism, etc.
I suspect it depends: when you write a YAML/JSON parser you can only change the algorithm up to a point. After that you have to start doing some bit fiddling, and then being able to see the assembly can be really valuable.
How many programmers write a YAML/JSON parser vs. use an existing library?
How many of the ones who write their own parser would benefit from using a library more than from reading assembly?
If your answer is that: "well, the ones writing the library benefit from learning assembly"... Think about what percentage of programmers they represent. Not to mention that source-level profiling will still give them better bang for their buck.
As somebody who has read a ton of assembly in their career because those marginal gains mattered: 99% of programmers are better off investing their time elsewhere.
Yes, I agree with that; most people don't need it and should first use a profiler. Then they can easily improve performance by 10x.

For example, I optimized a calculation with Python dataframes by monkeypatching the 'unique' method so it would skip the sort, since my data was already sorted. This gained me a 5% performance improvement. (There were a few other tricks which reduced the calculation time from 3h to 20m, making iterating faster.)

So I guess the assembly part is just a personal interest, and it is only useful for the innermost loop of a program, the one you can't avoid.
It seems that, in general, using SIMD/intrinsics is already in the very advanced playbook of a developer, just like reflection, classpath scanning, GPU acceleration, etc.

Ideally the standard library would provide the fastest JSON/YAML/CSV parser, so that nobody else needs to attempt to improve on it.

I suspect your argument could even be extended: if you need performance, it might be easier to just switch languages. Somebody was super excited to tell me that he used a JavaScript lib which generated optimized code for SQL query deserialization at runtime. I bluntly said: well, shouldn't you just use another language to avoid this complexity?

Curious question: why did you read assembly often in your career?
Reading assembly, e.g. in Compiler Explorer or in a debugger.
Yes, I understand. But the only way that gives you good bang for your buck is if you have already exhausted a number of other areas earlier. I.e. it is marginal gains.
How many programmers out there would be better served by spending time learning more about those other areas before they even start thinking about whether the compiler is generating slightly suboptimal assembly?
I also like "disp /3i $pc", which tells gdb to disassemble the next three instructions whenever it stops (including after stepi). Though I'm sure you can configure it to give you more sophisticated disassembly display and context, especially with the tui interface, this is usually "good enough" for me when I run into something I want to step through at the assembly level in the middle of debugging a program that's primarily C.
On a related note, https://hugsy.github.io/gef/ provides a TUI interface like that.
I’m surprised he didn’t use a switch statement, as that is the canonical way to ask the compiler to make the jump table for you.
I just tried the switch statement:

```
func BenchmarkSwitch(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for i := 0; i < len(randomBytes); i++ {
        loop:
            for i, x := range randomBytes[i:] {
                switch x {
                case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
                    result = i != 0
                    break loop
                case 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z':
                    result = i != 0
                    break loop
                case 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z':
                    result = i != 0
                    break loop
                }
            }
        }
    }
}
```
It isn't any faster than the || comparison, so I suspect the Go compiler doesn't really treat it as a special case.
Without SIMD, that would probably be the optimal way to implement it. And this bit-vector idea really isn't doing SIMD at all; it's just (ab)using the vector registers to store 128 bits.
I think the Zig code would also compile to better output, and be a lot simpler, if it was using an array of u1, instead of doing the bit shifting manually.
Bit processing is always SIMD.
A 64-bit register being processed 64-bits at a time is, from a mental and programmers perspective, a 64-wide SIMD computer over 1-bit AND, OR, and NOT instructions.
If you aren't using SIMD / parallel programming concepts while thinking about 64-parallel bits (or even in this case, parallel processing of 8-bit bytes or 7-bits of ASCII), then you are making life much harder for yourself.
-------
I believe in the 90s, some people called 32-bit parallelism "SIMD Within A Register" (SWAR). Some guy wrote 3DES encryption with 64-bit AND/OR/NOTs and progressed bit by bit (but in parallel), achieving absurdly efficient encryption back in the day.
And of course, Chess Programmers with all their 64-bit boards for their 8x8 game constantly think in terms of SIMD inside of 64-bits.
-------
Once you learn to think in SIMD, you can extend these concepts out to 128-bits (SSE), 256-bits (AVX) or 512-bit (AVX512) or beyond (GPU 1024-bits and greater).
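For the flavor of SWAR in C, here is the classic has-zero-byte test (a textbook trick, not something from this thread) that checks all 8 bytes of a word at once:

    #include <stdint.h>

    // Non-zero iff any of the 8 bytes in x is 0x00.
    static inline uint64_t has_zero_byte(uint64_t x) {
        return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
    }

    // Non-zero iff any byte of x equals c (XOR maps matching bytes to zero).
    static inline uint64_t has_byte(uint64_t x, uint8_t c) {
        return has_zero_byte(x ^ (0x0101010101010101ULL * c));
    }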
http://lamport.azurewebsites.net/pubs/multiple-byte.pdf
The bit vector does give a nice compression compared to the jump table. But I believe the main thing to do for speeding up string parsing is to consume more than one character at a time. I personally found the techniques in https://tia.mat.br/posts/2018/02/01/more_on_string_switch_in... very helpful to explain these issues.
EDIT: I especially like this statement from the conclusion:
> The good thing about the switch statement in C is that it is maybe the highest level statement in the language: the compiler can get really creative in how its code is generated. It's not uncommon for it to generate jump tables or even binary searches.
This article is kind of fine overall but gets so many specifics wrong. I guess this is to be expected from someone just beginning to verse in assembly, but still.
    if '0' <= x && x <= '9' || 'a' <= x && x <= 'z' || 'A' <= x && x <= 'Z' {
        ...
    }

Go is garbage, but GCC and LLVM compile this to branchless code. If the goal was to redo the book in Zig and Rust, perhaps it would've been a good idea to profile the implementations under Zig and Rust.

If you really want to optimize code like `a <= x && x <= b` by hand, you can rewrite it to `x - a <= b - a`, where the comparison is unsigned; GCC and LLVM do that for you. Next, you can optimize the case-insensitive check by masking out the case bit, which compiles to just:
    lea eax, [rdi - 48]
    cmp al, 10
    jb f
    and dil, 95
    add dil, -65
    cmp dil, 25
    jbe f
    ret

This is still not as fast as a memory read, but IMO this should've been the first thing to try. However, a) GCC and LLVM can autovectorize this, whereas memory reads can't be (although I admit the resulting codegen makes me uneasy), b) you can go on and vectorize this by hand, parsing an identifier in just one iteration of a loop, which you can't do with a LUT. I might have made a bug, but something along those lines should work. It has 5x more throughput than a LUT and should have better branch-prediction characteristics, although it consumes 16 bytes at a time, slightly limiting the applications.

The section about bitvectors is fine.
x86 processors don't have 128-bit MMX registers and 256-bit MMY registers. They have legacy 64-bit MMX registers, modern 128-bit XMM registers, and modern 256-bit YMM registers. This MMX/XMM duality sure does confuse people, so I don't blame the author. But come on, at least read the article before publishing; broken GitHub URLs without `blob/master` and broken ()[] Markdown syntax are not funny.
To those trying to make sense of Zig output, don't forget `-O ReleaseFast`. The codegen here is very simple, just a 128-bit integer simulated with two 64-bit ones.
I'll have another try in Zig later on. And indeed the MMX/XMM/YMM naming really is confusing; couldn't they just call them V128, V256 and V512...
Excellent nitpicks, hopefully the author reads this and does a follow up :)
How good is Hacker’s Delight on this stuff?
The main problem with Hacker's Delight is its age. It has lots of bit tricks and optimization guidelines you'll find useful, but you have to understand that a big part of it doesn't apply on modern devices. But as long as you treat it as a collection of snippets and benchmark the code on modern devices (or, if you have built some intuition, that works too), it's an invaluable resource.
Thanks, by happenstance I stumbled across a copy in a local coffee shop that had been remaindered out of a University of Alaska library.