pizlonator 4 days ago

(I designed JavaScriptCore's optimizing JITs and its garbage collector and a bunch of the runtime. And I often benchmark stuff.)

Here's my advice for how to run benchmarks and be happy with the results.

- Any experiment you perform has the risk of producing an outcome that misleads you. You have to viscerally and spiritually accept this fact if you run any benchmarks. Don't rely on the outcome of a benchmark as if it's some kind of Truth. Even if you do everything right, there's something like a 1/10 risk that you're fooling yourself. This is true for any experiment, not just ones involving JavaScript, or JITs, or benchmarking.

- Benchmark large code. Language implementations (including ahead of time compilers for C!) have a lot of "winning in the average" kind of optimizations that will kick in or not based on heuristics, and those heuristics have broad visibility into large chunks of your code. AOTs get there by looking at the entire compilation unit, or sometimes even your whole program. JITs get to see a random subset of the whole program. So, if you have a small snippet of code then the performance of that snippet will vary wildly depending on how it's used. Therefore, putting some small operation in a loop and seeing how long it runs tells you almost nothing about what will happen when you use that snippet in anger as part of a larger program.

How do you benchmark large code? Build end-to-end benchmarks that measure how your whole application is doing perf-wise. This is sometimes easy (if you're writing a database you can easily benchmark TPS, and then you're running the whole DB impl and not just some small snippet of the DB). This is sometimes very hard (if you're building UX then it can be hard to measure what it means for your UX to be responsive, but it is possible). Then, if you want to know whether some function should be implemented one way or another way, run an A:B test where you benchmark your whole app with one implementation versus the other.
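
For example, a minimal sketch of that kind of A:B harness, assuming a Node-style environment (runAppWith, implementationA, and implementationB are placeholders for your own app and candidates):

  // Measure a whole user-visible scenario, not the snippet in isolation.
  async function measureEndToEnd(runWholeScenario, rounds = 30) {
    const samples = [];
    for (let i = 0; i < rounds; i++) {
      const start = performance.now();
      await runWholeScenario();            // e.g. handle a request, render a view
      samples.push(performance.now() - start);
    }
    samples.sort((a, b) => a - b);
    return { median: samples[Math.floor(samples.length / 2)], samples };
  }

  // const a = await measureEndToEnd(() => runAppWith(implementationA));
  // const b = await measureEndToEnd(() => runAppWith(implementationB));
  // Compare a.samples and b.samples as distributions, not as single numbers.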

Why is that better? Because then, you're measuring how your snippet of code is performing in the context of how it's used, rather than in isolation. So, your measurement will account for how your choices impact the language implementation's heuristics.

Even then, you might end up fooling yourself, but it's much less likely.

  • kmiller68 4 days ago

    I completely agree with this advice. Micro-benchmarking can work well as long as you already have an understanding of what's happening behind the scenes. Without that it greatly increases the chance that you'll get information unrelated to how your code would perform in the real world. Even worse, I've found a lot of the performance micro-benchmarking websites can actually induce performance issues. Here's an example of a recent performance bug that appears to have been entirely driven by the website's harness. https://bugs.webkit.org/show_bug.cgi?id=283118

  • rgbrgb 4 days ago

    Love this. I have done a fair amount of UI performance optimization and agree with the end-to-end strategy.

    For UX stuff, 2 steps I’d add if you're expecting a big improvement:

    1) Ship some way of doing a sampled measurement in production before the optimization goes out. Networks and the spec of the client devices may be really important to the UX thing you're trying to improve. Likely user devices are different from your local benchmarking environment.

    2) Try to tie it to a higher level metric (e.g. time on site, view count) that should move if the UI thing is faster. You probably don't just want it to be faster, you want the user to have an easier time doing their thing, so you want something that ties to that. At the very least this will build your intuition about your product and users.

  • leeoniya 4 days ago

    great points! i do a lot of JS benchmarking + optimization and whole-program measurement is key. sometimes fixing one hotspot changes the whole profile, not just shifts the bottleneck to the next biggest thing in the original profile. GC behaves differently in different JS vms. sometimes if you benchmark something like CSV parsers which can stress the GC, Benchmark.js does a poor job by not letting the GC collect properly between cycles. there's a lengthy discussion about why i use a custom benchmark runner for this purpose [1]. i can recommend js-framework-benchmark [2] as a good example of one that is done well, also WebKit's speedometer [3].

    [1] https://github.com/leeoniya/uDSV/issues/2

    [2] https://github.com/krausest/js-framework-benchmark

    [3] https://github.com/WebKit/Speedometer

  • hinkley 4 days ago

    > there's something like a 1/10 risk that you're fooling yourself.

    You’re being generous or a touch ironic. It’s at least 1/10 and probably more like 1/5 on average and 1/3 for people who don’t take advice.

    Beyond testing changes in a larger test fixture, I also find that sometimes multiplying the call count for the code under examination can help clear things up. Putting a loop in to run the offending code 10 times instead of once is a clearer signal. Of course it still may end up being a false signal.
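
    For instance, something as simple as this inside the larger fixture (suspectFunction and input are placeholders for whatever is under examination):

      const REPEAT = 10;                 // amplify the code under test
      for (let i = 0; i < REPEAT; i++) {
        suspectFunction(input);          // placeholder for the offending code
      }
      // Divide whatever the profiler or benchmark reports for this block by REPEAT.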

    I like a two-phase approach, where you use a small-scale benchmark while iterating on optimization ideas, then check the larger context once you feel you’ve made progress, and again before you file a PR.

    At the end of the day, eliminating accidental duplication of work is the most reliable form of improvement, and one that current and previous generation analysis tools don’t do well. Make your test cases deterministic and look at invocation counts to verify that you expect n calls of a certain shape to call the code in question exactly kn times. Then figure out why it’s mn instead. (This is why I say caching is the death of perf analysis. Once it’s added this signal disappears)

    • daelon 3 days ago

      The first half of the sentence you quoted is "even if you do everything right". What is the point of selectively quoting like that and then responding to something you know they didn't mean?

      • hinkley 2 days ago

        With respect, the first half of the sentence sounds more like waffling than a clear warning to your peers. I bond with other engineers over how badly the industry as a whole handles the scientific method. Too many of us can’t test a thesis to the satisfaction of others. Hunches, speculation, and confidently incorrect. Every goddamn day.

        Feynman said: The most important thing is not to fool yourself, and you’re the easiest person to fool.

        That’s a lot more than 1/10, and he’s talking mostly to future scientists, not developers.

  • aardvark179 4 days ago

    Excellent advice. It’s also very important to know what any micro benchmarks you do have are really measuring. I’ve seen plenty that actually measured the time to set up or parse something, because that dominated and wasn’t cached correctly. Conversely, I’ve seen cases where the JIT correctly optimised away almost everything because there was a check on the final value.

    Oh, and if each op takes under a nanosecond then your benchmark is almost certainly completely broken.

  • thayne 4 days ago

    I don't really disagree with anything you said, but having to run end-to-end tests for any benchmarking is far from ideal. For one thing, they are often slow, and to get reliable results you have to run them multiple times, which makes it even slower. That makes it more difficult, and more expensive, to try a lot of things, iterate, and keep a short feedback loop. IME writing good end-to-end tests is also just generally more difficult than writing unit tests of smaller benchmarking code.

  • mroche 4 days ago

    ThePrimeTime posted a livestream recording a few days ago where he and his guest dove into language comparison benchmarks. Even the first 10 minutes touches on things I hadn't thought of before beyond the obvious "these are not representative of real world workloads." It's an interesting discussion if you have the time.

    https://www.youtube.com/watch?v=RrHGX1wwSYM

spankalee 4 days ago

My old team at Google created a tool to help do better browser benchmarking called Tachometer: https://github.com/google/tachometer

It tries to deal with the uncertainties of different browsers, JITs, GCs, CPU throttling, varying hardware, etc., several ways:

- Runs benchmarks round-robin to hopefully subject each implementation to varying CPU load and thermal properties evenly.

- It reports the confidence interval for an implementation, not the mean. Doesn't throw out outlier samples.

- For multiple implementations, compares the distributions of samples, de-emphasizing the mean.

- For comparisons, reports an NxM difference table, showing how each impl compares to the other.

- Can auto-run until confidence intervals for different implementations no longer overlap, giving high confidence that there is an actual difference. (A rough sketch of this idea follows the list.)

- Uses WebDriver to run benchmarks in multiple browsers, also round-robin, and compares results.

- Can manage npm dependencies, so you can run the same benchmark with different dependencies and see how different versions change the result.
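
A very rough sketch of that confidence-interval comparison (this is not Tachometer's actual code, just the statistical gist):

  // Approximate 95% confidence interval for the mean of a set of samples.
  function confidenceInterval(samples) {
    const n = samples.length;
    const mean = samples.reduce((a, b) => a + b, 0) / n;
    const variance = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
    const halfWidth = 1.96 * Math.sqrt(variance / n);
    return { low: mean - halfWidth, high: mean + halfWidth };
  }

  // Keep sampling until the intervals stop overlapping (or you hit a time budget).
  function clearlyDifferent(samplesA, samplesB) {
    const a = confidenceInterval(samplesA);
    const b = confidenceInterval(samplesB);
    return a.high < b.low || b.high < a.low;
  }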

Lit and Preact use Tachometer to tease out performance changes of PRs, even on unreliable GitHub Action hardware. We needed the advanced statistical comparisons exactly because certain things could be faster or slower in different JIT tiers, different browsers, or different code paths.

We wanted to be able to test changes that might have small but reliable overall perf impact, in the context of a non-micro-benchmark, and get reliable results.

Tachometer is browser-focused, but we made it before there were so many server runtimes. It'd be really interesting to make it run benchmarks against Node, Bun, Deno, etc. too.

  • nemomarx 4 days ago

    how relevant is browser benchmarking now that chrome owns most of the space?

dan-robertson 4 days ago

Re VM warmup, see https://tratt.net/laurie/blog/2022/more_evidence_for_problem... and the linked earlier research for some interesting discussion. Roughly, there is a belief when benchmarking that one can work around not having the most-optimised JIT-compiled version by running your benchmark a number of times and then throwing away the result before doing ‘real’ runs. But it turns out that:

(a) sometimes the jit doesn’t run

(b) sometimes it makes performance worse

(c) sometimes you don’t even get to a steady state with performance (see the sketch after this list for one way to spot this)

(d) and obviously in the real world you may not end up with the same jitted version that you get in your benchmarks
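
A cheap, hedged way to at least notice (c) in your own harness is to keep every iteration's timing instead of discarding "warmup" runs, and compare early and late windows (the 50/50 split here is arbitrary):

  function steadyStateCheck(fn, iterations = 1000) {
    const times = [];
    for (let i = 0; i < iterations; i++) {
      const start = performance.now();
      fn();
      times.push(performance.now() - start);
    }
    const avg = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
    const early = avg(times.slice(0, Math.floor(iterations / 2)));
    const late = avg(times.slice(Math.floor(iterations / 2)));
    // A drift far from 1 means the benchmark never settled into a steady state.
    return { early, late, drift: late / early };
  }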

  • vitus 4 days ago

    In general, this isn't even a JS problem, or a JIT problem. You have similar issues even in a lower-level language like C++: branch prediction, cache warming, heck, even power state transitions if you're using AVX512 instructions on an older CPU. Stop-the-world GC causing pauses? Variation in memory management exists in C, too -- malloc and free are not fixed-cost, especially under high churn.

    Benchmarks can be a useful tool, but they should not be mistaken for real-world performance.

    • dan-robertson 4 days ago

      I think that sort of thing is a bit different because you can have more control over it. If you’re serious about benchmarking or running particularly performance sensitive code, you’ll be able to get reasonably consistent benchmark results run-to-run, and you’ll have a big checklist of things like hugepages, pgo, tickless mode, writing code a specific way, and so on to get good and consistent performance. I think you end up being less at the mercy of choices made by the VM.

      • marcosdumay 4 days ago

        The amount of control you have varies along a continuum from hand-written assembly to SQL queries. But there isn't really a difference of kind here; it's just a continuum.

        If there's anything unique about JavaScript, it's that it has an unusually high ratio of "unpredictability" to "abstraction level". But again, it has pretty normal values of both of those; it's just the relation between them that is away from the norm.

        • dan-robertson 4 days ago

          I’m quite confused by your comment. I think this subthread is about reasons one might see variance on modern hardware in your ‘most control’ end of the continuum.

          The start of the thread was about some ways where VM warmup (and benchmarking) may behave differently from how many reasonably experienced people expect.

          I claim it’s reasonably different because one cannot tame parts of the VM warmup issues whereas one can tame many of the sources of variability one sees in high-performance systems outside of VMs, eg by cutting out the OS/scheduler, by disabling power saving and properly cooling the chip, by using CAT to limit interference with the L3 cache, and so on.

          • marcosdumay 4 days ago

            > eg by cutting out the OS/scheduler, by disabling power saving and properly cooling the chip, by using CAT to limit interference with the L3 cache, and so on

            That's not very different from targeting a single JS interpreter.

            In fact, you get much more predictability by targeting a single JS interpreter than by trying to guess your hardware limitations... Except, of course if you target a single hardware specification.

        • hinkley 4 days ago

          When we were upgrading to ES6 I was surprised/relieved to find that moving some leaf-node code in the call graph from prototypes to classes did help. The common wisdom at the time was that classes were still relatively expensive. But they force strict mode, which we were using inconsistently, and they flatten the object representation (I discovered these issues in the heap dump, rather than the flame graph). Reducing memory pressure can overcome the cost of otherwise suboptimal code.

          OpenTelemetry having not been invented yet, someone implemented server side HAR reports and the data collection on that was a substantial bottleneck. Particularly the original implementation.

    • hinkley 4 days ago

      Running benchmarks on the old Intel MacBook a previous job gave me was like pulling teeth. Thermal throttling all the time. Anything less than at least a 2x speed up was just noise and I’d have to push my changes to CI/CD to test, which is how our build process sprouted a benchmark pass. And a grafana dashboard showing the trend lines over time.

      My wheelhouse is making lots of 4-15% improvements and laptops are no good for those.

  • __alexs 4 days ago

    Benchmarking that isn't based on communicating the distribution of execution times is fundamentally wrong on almost any platform.

  • hyperpape 4 days ago

    I think Tratt’s work is great, but most of the effects that article highlights seem small enough that I think they’re most relevant to VM implementors measuring their own internal optimizations.

    Iirc, the effects on long running benchmarks in that paper are usually < 1%, which is a big deal for runtime optimizations, but typically dwarfed by the differences between two methods you might measure.

    • hinkley 4 days ago

      Cross-cutting concerns run into these sorts of problems. And they can sneak up on you: as you add these calls to your coding conventions, they get added incrementally in new code and in substantial edits to old code, so what added a few tenths of a ms at the beginning may be tens of milliseconds a few years later. Someone put me on a trail of this sort last year and I managed to find about 75 ms of improvement (and another 50 ms in stupid mistakes adjacent to the search).

      And since I didn’t eliminate the logic, just halved its cost, that means we were spending about twice that much. But I did lower the slope of the regression line quite a lot, and I believe by enough that new Node.js versions now improve response time faster than it organically decays. There were times it took EC2 instance type updates to see forward progress.

      • hyperpape 4 days ago

        I think you might be responding to a different point than the one I made. The 1% I'm referring to is a 1% variation between subsequent runs of the same code. This is a measurement error, and it inhibits your ability to accurately compare the performance of two pieces of code that differ by a very small amount.

        Now, it reads like you think I'm saying you shouldn't care about a method if it's only 1% of your runtime. I definitely don't believe that. Sure, start with the big pieces, but once you've reached the point of diminishing returns, you're often left optimizing methods that individually are very small.

        It sounds like you're describing a case where some method starts off taking < 1% of the runtime of your overall program, and grows over time. It's true that if you do a full-program benchmark, you might be unable to detect the difference between that method and an alternate implementation (even that's not guaranteed; you can often use statistics to overcome the variance in the runtime).

        However, you often still will be able to use micro-benchmarks to measure the difference between implementation A and implementation B, because odds are they differ not by 1% in their own performance, but 10% or 50% or something.

        That's why I say that Tratt's work is great, but I think the variance it describes is a modest obstacle to most application developers, even if they're very performance minded.

        • hinkley 2 days ago

          > Now, it reads like you think I'm saying you shouldn't care about a method if it's only 1% of your runtime. I definitely don't believe that.

          Nor do I. I make a lot of progress down in the 3% code. You have to go file by file and module by module rather than hotspot by hotspot down here, and you have to be very good at refactoring and writing unit/pinning tests to get away with it. But at the place where I developed a lot of my theories, I kept this up for over two years, making significant perf improvements every quarter before I had to start looking under previously visited rocks. It also taught me the importance of telegraphing competence to your customers. They will put up with a lot of problems if they trust you to solve them soon.

          No I’m saying the jitter is a lot more than one percent, and rounding errors and externalities can be multiples of the jitter. Especially for leaf node functions - which are a fruitful target because it turns out your peers mostly don’t have opinions about “clever” optimizations if they’re in leaf node functions, in commits they could always roll back if they get sick of looking at them. Which is both a technical and a social problem.

          One of the seminal cases for my leaving the conventional wisdom: I’d clocked a function call as 10% of an expensive route, via the profiler. I noticed that the call count was way off. A little more than twice what I would have guessed. Turned out two functions were making the call with the same input, but they weren’t far apart in the call tree, so I rearranged the code a bit to eliminate the second call, expecting the e2e time to be 5% faster. But it was repeatably 20% faster, which made no sense at all, except for memory pressure.

          About five years later I saw the most egregious case of this during a teachable moment when I was trying to explain to coworkers that fixing dumb code >> caching. I eliminated a duplicate database request in sibling function calls, and we saw the entire response time drop by 10x instead of the 2-3x I was expecting.

vitus 4 days ago

> This effort, along with a move to prevent timing attacks, led to JavaScript engines intentionally making timing inaccurate, so hackers can’t get precise measurements of the current computers performance or how expensive a certain operation is.

The primary motivation for limiting timer resolution was the rise of speculative execution attacks (Spectre / Meltdown), where high-resolution timers are integral for differentiating between timings within the memory hierarchy.

https://github.com/google/security-research-pocs/tree/master...

If you look at when various browsers changed their timer resolutions, it's entirely a response to Spectre.

https://blog.mozilla.org/security/2018/01/03/mitigations-lan...

https://issues.chromium.org/issues/40556716 (SSCA -> "speculative side channel attacks")

blacklion 4 days ago

Very strange take on "JIT introduces a lot of error into the results". I'm from the JVM/Java world, but it is a JITted VM too, and in our world the question is: why would you want to benchmark interpreted code at all!?

Only final-stage, fully-JITted and profile-optimized code is what matters.

Short-lived interpreted / level-1 JITted code is not interesting at all from a benchmarking perspective, because it will be compiled fast enough that it doesn't matter in the grand scheme of things.

  • dzaima 4 days ago

    JIT can be very unpredictable. I've seen cases on the JVM where running the exact same benchmark in the same VM twice had the second run be 2x slower than the first, cases where running one benchmark before another made the latter 5x slower, and similar.

    Sure, if you make a 100% consistent environment of a VM running just the single microbenchmark you may get a consistent result on one system, but is a consistent result in any way meaningful if it may be a massive factor away from what you'd get in a real environment? And even then I've had cases of like 1.5x-2x differences for the exact same benchmark run-to-run.

    Granted, this may be less of a benchmarking issue, more just a JIT performance issue, but it's nevertheless also a benchmarking issue.

    Also, for JS, in browser specifically, pre-JIT performance is actually a pretty meaningful measurement, as each website load starts anew.

    • gmokki 4 days ago

      How long did you run the benchmark if you got such large variation?

      For simple methods I usually run the benchmarked method 100k times; 10k is the minimum for full JIT.

      For large programs I have noticed the performance keeps getting better for the first 24 hours, after which I take a profiling dump.

      • dzaima 4 days ago

        Most of the simple benches I do run for ~1 second. The order-dependent things definitely were reproducible (something along the lines of rerunning resulting in some rare virtual-method case finally being invoked enough times/with enough cases to heavily penalize the vastly more frequent case). And the case of very different results was C2 deciding to compile the code differently (looking at the assembly was problematic, as adding the print-assembly flags skewed the path it took), and it stayed stable for tens of seconds after the first ~second IIRC (though, granted, it was preview jdk.incubator.vector code).

  • the_mitsuhiko 4 days ago

    > I'm from the JVM/Java world, but it is a JITted VM too, and in our world the question is: why would you want to benchmark interpreted code at all!?

    Java gives you exceptional control over the JVM allowing you to create really good benchmark harnesses. That today is not the case with JavaScript and the proliferation of different runtimes makes that also harder. To the best of my knowledge there is no JMH equivalent for JavaScript today.

  • pizlonator 4 days ago

    When JITing Java, the main profiling inputs are for call devirtualization. That has a lot of randomness, but it's confined to just those callsites where the JIT would need profiling to devirtualize.

    When JITing JavaScript, every single fundamental operation has profiling. Adding stuff has multiple bits of profiling. Every field access. Every array access. Like, basically everything, including also callsites. And without that profiling, the JS JIT can't do squat, so it depends entirely on that profiling. So the randomness due to profiling has a much more extreme effect on what the compiler can even do.

  • ufo 4 days ago

    JavaScript code is often short-lived and doesn't have enough time to wait for the JIT to warm up.

  • munificent 4 days ago

    > Short-lived interpreted / level-1 JITted code is not interesting at all from a benchmarking perspective, because it will be compiled fast enough that it doesn't matter in the grand scheme of things.

    This is true for servers but extremely not true for client-side GUI applications and web apps. Often, the entire process of [ user starts app > user performs a few tasks > user exits app ] can be done in a second. Often, the JIT never has a chance to warm up.

    • blacklion 3 days ago

      If it is done in a literal second, why would you benchmark it?

      In such a case you need a "binary" benchmark: does the user need to wait or not? You don't need some fancy graphics, percentiles, etc.

      And in such a case your worst enemy is not the JIT but the variance of users' hardware, from an old Atom netbook to a high-end workstation with tens of 5GHz cores. Same for RAM and screen size.

      • munificent 3 days ago

        > If it is done in a literal second, why would you benchmark it?

        The difference between one second and two seconds can be the difference between a happy user and an unhappy user.

        > You don't need some fancy graphics, percentiles, etc.

        You don't need those to tell you if your app is slow; you need them to tell you why your app is slow. The point of a profiler isn't to identify the existence of a performance problem. You should know you have a performance problem before you ever bother to start your profiler. The point is to give you enough information so that you can solve your performance problem.

        > your worst enemy is not the JIT but the variance of users' hardware, from an old Atom netbook to a high-end workstation with tens of 5GHz cores. Same for RAM and screen size.

        Yes, these are real problems that client-side developers have to deal with. It's hard.

  • Etheryte 4 days ago

    Agreed, comparing functions in isolation can give you drastically different results from the real world, where your application can have vastly different memory access patterns.

    • natdempk 4 days ago

      Does anyone know how well the JIT/cache in the browser works, e.g. how useful it is to profile JIT'd vs non-JIT'd code and what those different scenarios might represent in practice? For example, is it just JIT-ing as the page loads/executes, or are there cached functions that persist across page loads, etc.?

pygy_ 4 days ago

I have been sleeping on this for quite a while (long covid is a bitch), but I have built a benchmarking lib that sidesteps quite a few of these problems, by

- running the benchmark in thin slices, interspersed and shuffled, rather than in one big batch per item (which also avoids having one scenario penalized by transient noise); see the sketch after this list

- displaying graphs that show possible multi-modal distributions when the JIT gets in the way

- varying the lengths of the thin slices between runs to work around the poor timer resolution in browsers

- assigning the results of the benchmark to a global (or a variable in the parent scope, as in the WEB demo below) to avoid dead code elimination
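
A hedged sketch of the interleaved thin-slice idea (this is not bunchmark's actual code; the task shape and numbers are made up for illustration):

  function interleavedSlices(tasks, slices = 200, opsPerSlice = 1000) {
    const samples = new Map(tasks.map(t => [t.name, []]));
    for (let s = 0; s < slices; s++) {
      const order = tasks.slice();
      for (let i = order.length - 1; i > 0; i--) {   // Fisher-Yates shuffle
        const j = Math.floor(Math.random() * (i + 1));
        [order[i], order[j]] = [order[j], order[i]];
      }
      for (const task of order) {
        const start = performance.now();
        for (let i = 0; i < opsPerSlice; i++) globalThis.__sink = task.fn();
        samples.get(task.name).push((performance.now() - start) / opsPerSlice);
      }
    }
    return samples;   // per-task arrays of slice timings, ready for histograms
  }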

This isn't a panacea, but it is better than the existing solutions as far as I'm aware.

There are still issues because, sometimes, even if the task order is shuffled for each slice, the literal source order can influence how/if a bit of code is compiled, resulting in unreliable results. The "thin slice" approach can also dilute the GC runtime between scenarios if the amount of garbage isn't identical between scenarios.

I think it is, however, a step in the right direction.

- CLI runner for NODE: https://github.com/pygy/bunchmark.js/tree/main/packages/cli

- WIP WEB UI: https://flems.io/https://gist.github.com/pygy/3de7a5193989e0...

In both cases, if you've used JSPerf you should feel right at home in the WEB UI. The CLI UI is meant to replicate the WEB UI as closely as possible (see the example file).

  • pygy_ 4 days ago

    I hadn't run these in a while, but in the current Chrome version, you can clearly see the multi-modality of the results with the dummy Math.random() benchmark.

ericyd 4 days ago

Maybe I'm doing it wrong, but when I benchmark code, my goal is to compare two implementations of the same function and see which is faster. This article seems to be concerned with finding some absolute metric of performance, but to me that isn't what benchmarking is for. Performance will vary based on hardware and runtime which often aren't in your control. The limitations described in this article are interesting notes, but I don't see how they would stop me from getting a reasonable assessment of which implementation is faster for a single benchmark.

  • epolanski 4 days ago

    Well, the issue is that micro benchmarking in JS is borderline useless.

    You can have some function that iterates over something and benchmark two different implementations and draw conclusions that one is better than the other.

    Then, in real world, when it's in the context of some other code, you just can't draw conclusions because different engines will optimize the very same paths differently in different contexts.

    Also, your micro benchmark may tell you that A is faster than B... when it's a hot function that has been optimized due to being used frequently. But then you find that B, which is used only a few times and doesn't get optimized, will run faster by default.

    It is really neither easy nor obvious to benchmark different implementations. Let alone the fact that you have differences across engines, browsers, devices and OSs (which will use different OS calls and compiler behaviors).

    • ericyd 4 days ago

      I guess I've just never seen any alternative to microbenchmarking in the JS world. Do you know of any projects that do "macrobenchmarking" to a significant degree so I could see that approach?

      • epolanski 4 days ago

        The RealWorld example app and some other projects focus on entire-app benchmarks.

  • hyperpape 4 days ago

    The basic problem is that if the compiler handles your code in an unusual way in the benchmark, you haven't really measured the two implementations against each other, you've measured something different.

    Dead code elimination is the most obvious way this happens, but you can also have issues where you give the branch predictor "help", or you can use a different number of implementations of a method so you get different inlining behavior (this can make a benchmark better or worse than reality), and many others.
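
    A hedged JS illustration of the dead-code-elimination trap (the function names and sink variable are made up):

      function badBench(fn, n) {
        const start = performance.now();
        for (let i = 0; i < n; i++) fn(i);          // result unused: may be elided
        return performance.now() - start;
      }

      let sink = 0;                                  // keeps the result observable
      function betterBench(fn, n) {
        const start = performance.now();
        for (let i = 0; i < n; i++) sink += fn(i);   // result stays live
        return performance.now() - start;
      }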

    As for runtime, if you're creating a library, you probably care at least a little bit about alternate runtimes, though you may well just target node/V8 (on the JVM, I've done limited benchmarking on runtimes other than HotSpot, though if any of my projects get more traction, I'd anticipate needing to do more).

  • hinkley 4 days ago

    It’s some of both because the same things that make benchmarking inaccurate make profiling less accurate. How do you know you should even be looking at that function in the first place?

    You might get there eventually but a lot of people don’t persevere where perf is concerned. They give up early which leaves either a lot of perf on the table or room for peers to show them up.

  • wpollock 4 days ago

    You're not wrong, but there are cases where "absolute" performance matters. For example, when your app must meet a performance SLA.

  • dizhn 4 days ago

    Isn't that more profiling than benchmarking?

    • mort96 4 days ago

      No? Profiling tells you which parts of your code take time.

skybrian 4 days ago

Performance is inherently non-portable. In fact, ignoring performance differences is what enables portability.

Not knowing what performance to expect is what allows you to build a website and expect it to run properly years later, on browsers that haven’t been released yet, running on future mobile phones that use chips that haven’t been designed yet, over a half-working WiFi connection in some cafe somewhere.

Being ignorant of performance is what allows you to create Docker images that work on random servers in arbitrary datacenters, at the same time that perfect strangers are running their jobs and arbitrarily changing what hardware is available for your code to use.

It’s also what allows you to depend on a zillion packages written by others and available for free, and upgrade those packages without things horribly breaking due to performance differences, at least most of the time.

If you want fixed performance, you have to deploy on fixed, dedicated hardware, like video game consoles or embedded devices, and test on the same hardware that you’ll use in production. And then you drastically limit your audience. It’s sometimes useful, but it’s not what the web is about.

But faster is better than slower, so we try anyway. Understanding the performance of portable code is a messy business because it’s mostly not the code, it’s our assumptions about the environment.

We run tests that don’t generalize. For scientific studies, this is called the “external validity” problem. We’re often doing the equivalent of testing on mice and assuming the results are relevant for humans.

  • Max-q 4 days ago

    Ignoring performance is what gives you slow code, which costs you a lot if the code you write turns out to be a success, because you have to throw a lot more hardware at it. Think back to early Twitter, which crashed and often went down for hours each day.

    Most optimization will improve on all or some VMs. Most will not make it slower on others.

    If you write code that will be scaled up, optimization can save a lot of money and give better uptime, and it’s not a bad thing; the better code is not less portable in most cases.

    • skybrian 4 days ago

      Sorry, I phrased that badly. By “ignoring performance,” I meant something more like “writing resilient code that can handle swings in performance.” For example, having generous deadlines where possible. Performance is fuzzy, but fast code that has lots of headroom will be more resilient than code that barely keeps up.

    • wavemode 4 days ago

      To play devil's advocate: I don't think, in hindsight, Twitter would choose to have delayed their product's launch in order to engineer it for better performance. Given how wildly successful it became purely by being in the right place at the right time.

CalChris 4 days ago

Laurence Tratt's paper "Virtual machine warmup blows hot and cold" [1] has been posted several times and never really discussed. It covers this problem for Java VMs and also presents a benchmarking methodology.

[1] https://dl.acm.org/doi/10.1145/3133876

diggan 4 days ago

> Essentially, these differences just mean you should benchmark across all engines that you expect to run your code to ensure code that is fast in one isn’t slow in another.

In short, the JavaScript backend people now need to do what we JavaScript frontend people have been doing since SPAs became a thing: run benchmarks across multiple engines instead of just one.

hyperpape 4 days ago

For anyone interested in this subject, I’d recommend reading about JMH. The JVM isn’t 100% the same as JS VMs, but as a benchmarking environment it shares the same constraint of JIT compilation.

The right design is probably one that:

1) runs different tests in different forked processes, to avoid variance from the order in which tests are run changing the JIT’s decisions (see the sketch after this list)

2) runs tests for a long time (seconds or more per test) to ensure full JIT compilation and statistically meaningful results
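
A rough sketch of point 1, assuming Node (fork and process.send are real Node APIs; the file layout and message shape are made up):

  const { fork } = require('node:child_process');

  // Run each benchmark file in its own child process, so one test's JIT state
  // and compilation order can't influence another's results.
  function runIsolated(benchmarkFile) {
    return new Promise((resolve, reject) => {
      const child = fork(benchmarkFile);   // fresh process, fresh JIT
      child.on('message', resolve);        // each benchmark posts its result
      child.on('error', reject);
    });
  }

  // Inside each benchmark file, after its long timed run:
  //   process.send({ name: 'myBenchmark', nsPerOp: result });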

Then you need to realize that your micro benchmarks give you information and help you understand, but the acid test is improving the performance of actual code.

henning 4 days ago

While there may be challenges, caring about frontend performance is still worth it. When I click the Create button in JIRA and start typing, the text field lags behind my typing. I use a 2019 MacBook Pro. Unforgivable. Whether one alternate implementation that lets me type normally is 10% faster than another or not or whatever may be harder to answer. If I measure how bad the UI is and it's actually 60x slower than vanilla JS rather than 70x because of measurement error, the app is still a piece of shit.

austin-cheney 4 days ago

There is a common sentiment I see there that I see regularly repeated in software. Here is my sarcastic take:

I hate measuring things because accuracy is hard. I wish I could just make up my own numbers to make myself feel better.

It is surprising to me how many developers cannot measure things, do so incorrectly, and then look for things to blame for their emotional turmoil.

Here is quick guide to solve for this:

1. Know what you are measuring and what its relevance is to your product. It is never about big or small because numerous small things make big things.

2. Measuring things means generating numbers and comparing those numbers against other numbers from a different but similar measure. The numbers are meaningless if there is no comparison.

3. If precision is important, use the high-performance timing tools provided by the browser and Node. You can get nanosecond-level precision in Node, and then account for the variance, that plus/minus range, in your results. If you are measuring real-world usage and your numbers get smaller due to performance refactoring, expect variance to increase. It’s ok, I promise. (A sketch follows this list.)

4. Measure a whole bunch of different shit. The point of measuring things isn’t about speed. It’s about identifying bias. The only way to get faster is to know what’s really happening and just how off base your assumptions are.

5. Never ever trust performance indicators from people lacking objectivity. Expect to have your results challenged and be glad when they are. Rest on the strength of your evidence and ease of reproduction that you provide.
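
As a sketch of point 3, assuming Node (fn stands in for whatever you are measuring):

  function timeWithVariance(fn, runs = 1000) {
    const ns = [];
    for (let i = 0; i < runs; i++) {
      const start = process.hrtime.bigint();   // nanosecond-resolution clock
      fn();
      ns.push(Number(process.hrtime.bigint() - start));
    }
    const mean = ns.reduce((a, b) => a + b, 0) / runs;
    const sd = Math.sqrt(ns.reduce((a, b) => a + (b - mean) ** 2, 0) / (runs - 1));
    return { meanNs: mean, plusMinusNs: sd };   // report the spread, not just a number
  }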

notnullorvoid 3 days ago

Something I see many JS benchmarks struggle with is GC. Benchmarks run in tight loops with no GC in between, leading to results not representative of real world use. Rarely do they even run GC between the different test cases, so earlier cases build up GC pressure negatively impacting the later cases and invalidating all results.
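
One hedged way to handle this in Node is to expose the collector to the harness and collect between cases (the cases object shape is made up for illustration):

  // Run with: node --expose-gc bench.js
  function runCases(cases) {
    const results = {};
    for (const [name, fn] of Object.entries(cases)) {
      if (global.gc) global.gc();        // settle the heap so earlier garbage
                                         // isn't billed to later cases
      const start = performance.now();
      fn();
      results[name] = performance.now() - start;
    }
    return results;
  }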

evnwashere 4 days ago

That’s why I created mitata; it greatly improves on JavaScript (micro-)benchmarking tooling.

It provides a bunch of features to help avoid JIT optimization foot-guns during benchmarking, and dips into more advanced stuff like hardware CPU counters to see what the end result of the JIT is on the CPU.

croes 4 days ago

Do the users care?

I think they are used to waiting because they no longer know the speed of desktop applications.

1oooqooq 4 days ago

Kids don't recall when Chrome was cheating left and right to be faster than Firefox (after being honestly faster for a couple of months).

You'd have to run benchmarks for all sorts of little things because no browser would leave things be. If they thought one popular benchmark was using string+string, it was all or nothing to optimize that, harming everything else. Next week, if that benchmark changed to string[].join... you get the idea. Your code was all over the place in performance. Flying today, molasses next week... sometimes Chrome and FF would switch the optimizations, so you'd serve string+string to one and array.join to the other. Sigh.

sylware 4 days ago

If you use javascript, use a lean engine coded in a lean SDK, certainly not the c++ abominations in Big Tech web engines.

Look at quickjs, and use your own very lean OS interfaces.

  • joseneca 4 days ago

    QuickJS is great for cases where you are more limited by startup and executable size than anything else but it tends to perform quite terribly (https://bellard.org/quickjs/bench.html) compared to V8 and anything else with JIT compilation.

    More code does not inherently mean worse performance.

    • sylware 4 days ago

      But its SDK is not the C++ abominations like V8, and that alone is enough to choose quickjs or similar, since we all know here that C++ (and similar, namely Rust/Java/etc.) is a definitive no-no (when it is not forced down our throats as users; don't forget to thank the guys doing that, usually well hidden behind the internet...).

      For performance, don't use javascript anyway...

      That said, a much less worse middleground would be to have performance critical block written in assembly (RISC-V Now!) orchestrated by javascript.

      • surajrmal 4 days ago

        I'm not sure I understand the point of mentioning the language. If it was written in a language like c but still "shoved down your throat" would you still have qualms with it? Do you just like things written by corporate entities and those languages tend to be popular as they scale to larger teams well? Or do you dislike software that is too large to understand and optimized for the needs of larger teams? Because it doesn't matter if it's software or some grouping of distinct software - at some point there will be a point at which it becomes challenging to understand the full set of software.

        If I were to create an analogy, it feels like you're complaining about civil engineers who design skyscrapers to be built out of steel and concrete instead of wood and brick like we use for houses. Sure the former is not really maintainable by a single person but it's also built for higher capacity occupancy and teams of folks to maintain.

        • sylware 4 days ago

          The "need of larger teams" does not justify to delegate the core of the technical interfaces to a grotesquely and absurdely complex and gigantic computer language with its compiler (probably very few real life ones).

          This is an acute lack of perspective, borderline fraud.

          • surajrmal 2 days ago

            Not all problems are technical. Some are social. Velocity of a large team should absolutely be taken into account when designing software. Does the concept of an assembly line also upset you? Not every thing can be done by a single craftsperson working in isolation or in a small group.

            I'm not suggesting where we ended up is ideal, but it has utility. I hope for a language reset akin to Kotlin for the C++ community at some point, to help resolve the complexity we've managed to accumulate over decades. There is a decent amount of effort from various parts of the ecosystem to accomplish this.

      • ramon156 4 days ago

        I could be reading your comment wrong, but what do you mean with "c++ is a definitive nono"? Also how is a complicated repository enough (reason) to choose quickjs?

        • sylware 4 days ago

          quickjs or similar. Namely, something small that depends on a reasonable SDK (which does include the computer language).

sroussey 4 days ago

For the love of god, please do not do this example:

  for (let i = 0; i < 1000; i++) {
    console.time()
    // do some expensive work
    console.timeEnd()
  }
Take your timing before and after the loop and divide by the count. Too much jitter otherwise.
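
Something like this instead, where doSomeExpensiveWork stands in for the work being measured:

  const iterations = 1000;
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    doSomeExpensiveWork();
  }
  const perIteration = (performance.now() - start) / iterations;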

d8 and node have many options for benchmarking and if you really care, go command line. JSC is what is behind Bun so you can go that direction as well.

And BTW: console.time et al. do a bunch of stuff themselves. You will get the JIT looking to optimize them as well in that loop above, lol.

  • igouy 4 days ago

    > and divide by the count

    Which gives an average rather than a time?

    • gmokki 4 days ago

      I usually do something like this:

        var innerCount = 2000; // sized so the whole thing runs about 2 seconds
        var best = Infinity;
        for (var i = 0; i < 1000; i++) {
          var start = Date.now();
          for (var j = 0; j < innerCount; j++) {
            benchmarkMethod(); // the code under test
          }
          best = Math.min(best, (Date.now() - start) / innerCount);
        }

      That way I can both get enough precision from the millisecond resolution and run the whole thing enough times to get the best result without JIT/GC pauses. The result is usually very stable, even when benchmarking calls to a database (running locally).

      • sroussey a day ago

        That’s great!

        My point was to minimize stuff that will get JITed—like console functions. People don’t realize how much code is in there that is not native.

thecodrr 4 days ago

Benchmarking is a mess everywhere. Sure you can get some level of accuracy but reproducing any kind of benchmark results across machines is impossible. That's why perf people focus on things like CPU cycles, heap size, cache access etc instead of time. Even with multiple runs and averaged out results you can only get a surface level idea of how your code is actually performing.

gred 4 days ago

If you find yourself benchmarking JavaScript, you chose the wrong language.