3,200% CPU Utilization

josephmate.github.io

493 points by atomlib 8 months ago

macspoofing 8 months ago

Ooof. The core collections in Java are well understood to not be thread-safe by design, and this should have been noticed.

OP should go through the rest of the code and see if there are other places where collections are potentially operated by multiple threads.

>The easiest way to fix this was to wrap the TreeMap with Collections.synchronizedMap or switch to ConcurrentHashMap and sort on demand.

That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe. I wouldn't bet on it.

>Controversial Fix: Track visited nodes

Don't do that! The collection will still not be thread-safe and you'll just fail in some other more subtle way .. either now, or in the future (if/when the implementation changes in the next Java release).

>Sometimes, a detail oriented developer will notice the combination of threads and TreeMap, or even suggest to not use a TreeMap if ordered elements are not needed. Unfortunately, that didn’t happen in this case.

That's not a good take-away! OP, the problem is you violating the contract of the collection, which is clear that it isn't thread-safe. The problem ISN'T the side-effect of what happens when you violate the contract. If you change TreeMap to HashMap, it's still wrong (!!!), even though you may not get the side-effect of a high CPU utilization.

---------

When working with code that is operated on by multiple threads, the only surefire strategy I found was to to make every possible object immutable and limiting any object that could not be made immutable to small, self-contained and highly controlled sections. We rewrote one of the core modules following these principles, and it went from being a constant source of issues, to one of the most resilient sections of our codebase. Having these guidelines in place, also made code reviews much easier.

OskarS 8 months ago
> That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe. I wouldn't bet on it.
This is a very important point! And, in my experience, a common mistake by programmers who aren't great at concurrency.
Lets say you have a concurrent dynamic array. The array class is designed to be thread-safe for the explicit purpose that you want to share it between threads. You want to access element 10, but you want to be sure that you're not out of bounds. So you do this:
```
    if (array.size() > 10) {
        array.element(10).doSomething();
    }
```
It doesn't matter how thread-safe this array class is: these two operations combined are NOT thread-safe, because thread-safety of the class only means that `.size()` and `.element()` are on their own not going to cause any races. But it's entirely possible another thread removes elements in between you checking the size and accessing the element, at which point you'll (at best) get an out-of-bounds crash.
The way to fix it is to either use atomic methods on the class which does both (something like `.element_or_null()` or whatever), or to not bother with a concurrent dynamic array at all and instead just use regular one you guard with a mutex (so the mutex is held during both operations and whatever other operations other threads perform on the array).
- hassleblad23 7 months ago
  
  A similar pattern (and overlooked very often) is with DB updates where you want to increment a number by 1, but split it into 2 roundtrips to the DB to get the existing value and set it to `value + 1`.
  Concurrent transactions can screw it up, and you wouldn't notice until an escalation happens.
- int_19h 7 months ago
  
  This kind of stuff is exactly why the original "thread-safe" collections in Java were deprecated in favor of the current one: locking on every operation means that a lot of trivial code "just works", but then things break without warning once you need consistency between consequent operations. And, at the same time, everybody is paying the locking penalty, even if they don't actually do concurrent access, or if they do their own locking.
  Ironically, .NET repeated the same mistake with the same outcome - it's just less obvious because for .NET the switch from synchronized-by-default to unsync-by-default happened when generics were introduced, so type safety of the new collections was the part that devs generally paid the most attention to.
- Thaxll 7 months ago
  
  This is not a thread safety issue though, this is a transaction problem.
  - masklinn 7 months ago
    
    It’s absolutely a thread safety issue. It’s not a data race, but it’s a race condition and can very much cause unexpected behaviour e.g. out of bounds error even though on its face it has the correct preconditions.
  - hassleblad23 7 months ago
    
    Wrapping these 2 statements in a transaction will not solve this particular issue unless the transactions are run sequentially.
sunshowers 8 months ago

Unsynchronized, shared, mutable state are where data races happen. You need all three.
This means there are three ways to resolve this:
* Add synchronization with locks, etc.
* Don't share mutable access, e.g. single-owner model with channels.
* Make data immutable: an insight originally from purely functional languages. I believe Google invested quite heavily in this model with Guava.
Rust lets you choose which one of the three to give up, and it also statically prevents all three from happening at the same time.
- fc417fc802 7 months ago
  
  > Don't share mutable access, e.g. single-owner model with channels.
  That can still break if a reader makes multiple calls and implicitly assumes that the data structure at large hasn't been mutated between them.
  - sunshowers 7 months ago
    
    This is equivalent to giving up a lock in between though, I think.
    
    fc417fc802 7 months ago
    
    Maybe I misunderstood. I took the first bullet to refer to wrapping operations in a lock and the second to refer to single writer multiple reader.
    At least in the single writer scenario it all breaks down if a lock free reader requires multiple items that are related to one another. It's an easy oversight to make though.
    Fixing it sucks because you usually have to go and add locking to your previously fast and efficient lock free setup.
    
    sunshowers 7 months ago
    
    > At least in the single writer scenario it all breaks down if a lock free reader requires multiple items that are related to one another. It's an easy oversight to make though.
    This is true -- generally the solution to this is to ask for everything in a single message.
    
    pbh101 7 months ago
    
    It can also break if the read implementation has mutable side effects, e.g. rebalancing. Java HashMap used to have this causing an infinite loop. I haven’t looked recently but my guess is it still does.
    https://mailinator.blogspot.com/2009/06/beautiful-race-condi...
    ‘Not thread safe’ really means ‘not thread safe’.
    
    gf000 7 months ago
    
    I mean, that's the "art of the deal". If you have a lock for all eternity, then you have a single owner, and are practically single-threaded.
    The hard part is exactly all the combinations locks can be hold and given up, which will have a non-trivial number for more complex programs/data structures. That's where modeling software can help by iterating over all the possible cases.
zelphirkalt 7 months ago

This is basically what I thought: "Wait a second, is that data structure even thread-safe??" and then "Seems like you are just using the wrong data structure for your purpose.". Shoehorning with mutexes will usually lead to problems elsewhere and bottlenecks.
What you suggest to make objects immutable: Yes, and then one actually uses different data structures, which are made for being used as immutable data structures, making use of structural sharing, to avoid or limit an explosion in memory usage. So again its a case of "use the right data structures for the job". Using purely functional data structures makes one side of the problem simply disappear. The other side might not even come up in many cases: What if the threads depend on other versions of the data structure, which the other threads create? (need intermediate results of other threads). Then one has a real headache coming ones way, and probably needs again other data structures.
One additional idea I had, if one really wants to shoehorn it with the already used mutable data structure is, that one could write something that serializes the access attempts before they hit the data structure. That could also include grouping accesses into transactions and only running complete transactions. Then it makes me think whether it is close to implementing a database ...
somat 7 months ago

So your solution to the the insanity that is the threading model was to reinvent the process model in it...
I don't intend to be mean, I just think the idea of independent execution in a shared memory environment is fundamentally flawed. And the effort in trying to get such follies working should have been put into better process models.
- macspoofing 7 months ago
  
  >So your solution to the the insanity that is the threading model was to reinvent the process model in it...
  Where did you get that?
  Immutability does not mean replicating the 'process model'. It means removing (limiting) an entire class of bugs from your multithreaded code. There will still be shared areas - and those should be controlled.

Someone 8 months ago

FTA: The code can be reduced to simply:

  public void someFunction(SomeType relatedObject,
                           List<SomeOtherType> unrelatedObjects) {
    ...
    treeMap.put(relatedObject.a(), relatedObject.b());
    ...
    // unrelatedObjects is used later on in the function so the
    // parameter cannot be removed
  }

That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.

You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.

TYMorningCoffee 8 months ago

Good point. It should be replaced with an if not empty check.
- sedatk 7 months ago
  
  An empty check wouldn't be thread-safe either.
  - vips7L 7 months ago
    
    Hmm I don’t think we know that from the post. The list of unrelated objects could be immutable or mutable and never modified. An isEmpty check is 100% safe to be called concurrently in both those situations

layer8 8 months ago

Another way to get infinite loops is using a Comparator or Comparable implementation that doesn’t implement a consistent total order: https://stackoverflow.com/questions/62994606/concurrentskips... (This is unrelated to concurrency.)

Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.

TYMorningCoffee 8 months ago

Have you seen this before in person? It would make a great blog post.
I haven't personally encountered a buggy comparator without a total order.
- sunshowers 8 months ago
  
  When Rust's sort algorithm changed to detect total order violations, it caused some breakage. See for example: https://github.com/mitsuhiko/insta/pull/586
  I recently wrote a blog post which (while not the main focus) includes a discussion on randomized testing of sort algorithms in the face of inconsistent comparators: https://sunshowers.io/posts/monads-through-pbt/ section 2.
- layer8 8 months ago
  
  I have seen a lot of incorrect Comparators and Comparable implementations in existing code, but haven’t personally come across the infinite-loop situation yet.
  To give one example, a common error is to compare two int values via subtraction, which can give incorrect results due to overflow and modulo semantics, instead of using Integer::compare (or some equivalent of its implementation).
  - smarks 8 months ago
    
    Interesting. I haven’t seen an infinite loop either, but I can imagine one if a comparator tries to be too “clever” for example if it bases its comparison logic on some external state.
    Another common source of comparator bugs is when people compare floats or doubles and they don’t account for NaN, which is unequal to everything, including itself!
    In Java, the usual symptom of comparator bugs is that sort throws the infamous “Comparison method violates its general contract!” exception.
- hollowcelery 8 months ago
  
  I knew someone who missed out on a gold medal at the International Olympiad of Informatics because his sort comparator didn’t have a total order.
  - TYMorningCoffee 8 months ago
    
    Ouch. Any idea which problem? Those problems are public: https://ioinformatics.org/page/contests/10

lucianbr 8 months ago

> I always thought of race conditions as corrupting the data or deadlocking. I never though it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.

Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.

williamdclt 8 months ago

It's a decent rule of thumb, but it definitely needs some pragmatism. Squashing any error, strangeness and warning can be very expensive in some projects, much more than paying the occasional seemingly-unrelated problem.
But of course it's quasi-impossible to know in advance the likelihood of a given error participating in a future problem, and whether it's cheaper to fix this error ahead or let the problem happen. So it becomes an art more than a science to decide what to focus on.
"fix nothing" is certainly a horrible approach, "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholder (which comes from many places, including good communication and track record of being pragmatic over dogmatic)
- alex_smart 8 months ago
  
  > Squashing any error, strangeness and warning can be very expensive in some projects
  Strongly disagreed. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and you no longer have a mental model of your code that corresponds with reality. That is a very dangerous to be in. Very quickly possible to be stuck in quicksand not too far afterwards.
  - huijzer 8 months ago
    
    Depends a lot on the project, I think, as the parent comment suggests.
  - fc417fc802 7 months ago
    
    I feel like these categories are different. Warnings should generally be treated as errors in my book, and all errors should be corrected. But "strangeness" is much more open ended. Sometimes large systems don't behave quite as expected and it might make sense to delay a "fix" until something is actually in need of it. If none of your tests fail then does it really matter?
    
    alex_smart 7 months ago
    
    > If none of your tests fail then does it really matter?
    Yes. Absolutely.
    You don't believe your software is correct because your tests don't fail. You believe your software is correct because you have a mental model of your code. If your tests are not failing but your software is not behaving correctly, that your mental model of your code is broken.
    
    fc417fc802 7 months ago
    
    I agree for small systems. But as they get larger you often can't keep track of every last piece simultaneously. It can also become quite involved to figure out why a relatively obscure thing happened in a particular case.
    Consider something like Unreal Engine for example. It's not realistic to expect to have a full mental image of the entire system in such a case.
    At least in theory the tests are supposed to cover the observable behavior that matters. So I figure if the tests pass all is well. If I still find something broken then I need to add a test case for it.
    
    alex_smart 7 months ago
    
    > But as they get larger you often can't keep track of every last piece simultaneously
    Sure, but then you divide the larger system into smaller components where each team is responsible for one or few of these individual pieces and the chief architect is responsible for making sure of how the pieces are put together.
    > At least in theory the tests are supposed to cover the observable behavior that matters. So I figure if the tests pass all is well. If I still find something broken then I need to add a test case for it.
    But you sure as hell hope that the engineer working on implementing your database has a decent mental model for the thread safety of his code and not introduces subtle concurrency bugs because his tests are still green. You also hope that he understands that he needs to call fsync to actually flush to data to disk instead of going yolo (systems never crash and disks never fail ). How are you supposed to cover the user observable behavior in this case? You cut off the power supply to your system/plug off your disk while writing to the database and assert that all the statements that got committed actually persisted? And how many times you repeat that test to really convince you that you are not leaving behind a bug that will only happen in production systems say once every three years?
    I am only giving database and multithreading as examples because they are the most obvious, but I think the principle applies more generally. Take the simplest piece of code everyone learns to write first thing in uni, quicksort. If you don't have a sufficient mental model for how that algorithm works, what amount of tests will you write to convince yourself that your implementation is correct?
    
    fmbb 7 months ago
    
    > then you divide the larger system into smaller components where each team is responsible for one or few of these individual pieces and the chief architect is responsible for making sure of how the pieces are put together
    And then you have ravioli code in the large. It is not going to make it easier to understand the bigger system, but it will make it harder to debug.
    https://en.wikipedia.org/wiki/Spaghetti_code#Ravioli_code
    If you can reproduce an error, you can fix it. Do that.
    If you cannot reproduce it after a day of trying and it doesn’t happen often, don’t fix it.
    
    fc417fc802 7 months ago
    
    Off topic but when I first saw ravioli code it was in a positive light, as a contrast to lasagna code. But then somewhere along the line people started using it in a negative manner.
    There is some optimal level of splitting things up so that it's understandable and editable without overdoing it on the abstraction front. We need a term for that.
    
    quinnirill 7 months ago
    
    Pizza code? Central components are mostly clearly distinguishable and glued together in a fairly consistent manner.
    
    fc417fc802 7 months ago
    
    Probably not the best examples.
    Sqlite famously has more test related code than database code.
    Multithreading correctly is difficult enough that multiple specialized modeling languages exist and those are cumbersome enough to use that most people don't. In practice you avoid bugs there by strictly adhering to certain practices regarding how you structure your code.
    You mention fsync but that famously does not always behave as you expect it to. Look up fsyncgate just for starters. It doesn't matter how well you think you understand your code if you have faulty assumptions about the underlying system.
    Generally you come across to me as overconfident in your ability to get things right by carefully reasoning about them. Of course it's important to do that but I guarantee things are still going to break. You will never have a perfect understanding of any moderately large system. If you believe you do then you are simply naive. Plan accordingly (by which I mean, write tests).
    
    Kiro 7 months ago
    
    I highly doubt anyone has a mental model of the all the code they're working with. You very often work with code that you kind of understand but not fully.
    
    alex_smart 7 months ago
    
    I obviously meant the code that you own and are responsible for.
    
    Kiro 7 months ago
    
    Same thing there.
    
    taneq 7 months ago
    
    And/or so are your tests.
    
    alex_smart 7 months ago
    
    No, they are not. They are a cheap way of verifying that something hasn't gone wrong, not a proof of correctness.
    Tests failing implies the code is incorrect. Tests not failing does not imply that the code is correct.
    
    fc417fc802 7 months ago
    
    > Tests not failing does not imply that the code is correct.
    I don't think that's what's being suggested. Tests not failing when your code does implies that you are missing test cases. In other words things are underspecified.
    Haskell is the extreme example of this. If it successfully compiles then it most likely does exactly what you intended but it might be difficult to get it to compile in the first place.
    
    alex_smart 7 months ago
    
    >Tests not failing when your code does implies that you are missing test cases. In other words things are underspecified.
    I am really confused. Have you guys never written any multithreaded code? You can write the most disgusting thread-unsafe code without a single lock and be perfectly green on all your tests. And who in the world can write tests to simulate all possible timing scenarios to test for race conditions?
    I give multithreading as just the most egregiously obvious example that this "tests can prove correctness" idea is fundamentally broken, but I think it applies more generally.
    >Haskell is the extreme example of this. If it successfully compiles then it most likely does exactly what you intended but it might be difficult to get it to compile in the first place.
    Absolutely 100% of the safety of haskell comes from the mental model (functional programming, immutable data structures etc) and none from the test cases (although their community appears to even do testing slightly better than others).
    
    fc417fc802 7 months ago
    
    My Haskell comment was regarding specification of the overall system, not tests specifically. It was a reference to the incredible type system.
    > this "tests can prove correctness" idea
    You are the only one putting forward such an idea. It's not that I think tests passing proves correctness. It's that I know from experience that I don't fully understand the system. The code that breaks is always a surprise to me because if it wasn't then I would have fixed it before it broke.
    So if my code breaks and tests don't catch it then I figure that before I fix it I should add a test for it.
    Of course there are some categories such as multithreading that you generally can't test. So you take extra care with those, and then inevitably end up sinking time into debugging them later regardless.
    
    alex_smart 7 months ago
    
    >You are the only one putting forward such an idea
    This was the very first comment you made that started this thread:
    >> If none of your tests fail then does it really matter?
    
    fc417fc802 7 months ago
    
    "it doesn't matter" != "code is correct"
    Sometimes it's just not cost effective to solve every last bit of jank.
    If you write software that controls safety critical systems then obviously that statement does not apply. But if you write webapps or games or ...
    
    alex_smart 7 months ago
    
    >"it doesn't matter" != "code is correct"
    Fair. Despite the lengthy argument, I don't think our stances are all that drastically different. I am not saying that you shouldn't write tests. Just that the mental model comes first and that the tests are informed by it (in an ideal world the test are an automatically executable specification of your mental model).
    Still I don't understand why you insist that "there exists an automated test for it" has to be the definition of whether something matters.
    > But if you write webapps or games or ...
    In which case it might just be fine to YOLO it without tests.
    People have been delivering valuable software without automated tests for decades. Checkout the code of the linux kernel from circa 2005 and see how many tests there are.
    
    taneq 7 months ago
    
    I'm not saying that passing tests proves the code is correct, I'm saying that if you find a problem with the code that your tests don't pick up, then you should add a test for it.
  - taneq 7 months ago
    
    100% this. If our product does something unexpected, finding out why is top priority. It might be that everything’s fine and this is just a rare edge case where this is the correct behaviour. It might be a silly display bug. Or it might be the first clue about a serious underlying issue.
  - saagarjha 8 months ago
    
    I mean, it is expensive. It’s just that the alternative might be more so.
    
    fmbb 7 months ago
    
    > might be
    Yes.
    Or it might be cheaper.
- Ntrails 8 months ago
  
  I remember a project to get us down from like 1500 build warnings to sub 100. It took a long time, generated plenty of bikeshedding, and was impossible to demonstrate value.
  I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly the number has climbed back up since
  - TYMorningCoffee 8 months ago
    
    Could you propose to fail the build based on the number of warnings to ensure it doesn't go up?
    I did something similar with spotbugs. There were existing warnings I couldn't get time to fix so I configured the maven to fail if it exceed the level at which I enabled it.
    This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
    
    bornfreddy 8 months ago
    
    > This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
    Our tests are often written with a list of known exceptions. However, such tests also fail if an exception is no longer needed - with a congratulatory message and a notice that this exception should be removed from the list. This ensures that the list gets shorter and shorter.
    
    allannienhuis 8 months ago
    
    I did this with a project that was worked on by a large team (50+ devs, many, many, many kloc) when we first added linting to the project (this was very early 2000s) - we automatically tracked the number of errors and warnings at each build, persisted them, and then failed the build if the numbers went up. So it automatically adjusted the threshold.
    It worked really well to incrementally improve things without forcing people to deal with them all the time. People would from time to time make sure they cleaned up a number of issues in the files they happened to be working on, but they didn't have to do them all (which can be a problem with strategies that for example lint only the changed files, but require 0 errors). We threw a small line chart up one one of our dashboards to provide some sense of overall progress. I think we got it down to zero or thereabouts within a year or so.
    
    thingification 8 months ago
    
    If you can instead construct a list of existing instances to grandfather in, that doesn't suffer from this problem. Of course many linting tools do this via "ignore" code comments.
    That feels less arbitrary than a magic number (because it is!) and I've seen it work.
    
    Etheryte 8 months ago
    
    We used this approach to great effect when we migrated a huge legacy project from Javascript to Typescript. It gives you enough flexibility in the in between stages so you're not forced to change weird code you don't know right away, while enforcing enough of a structure to eventually make it out alive in the end.
    
    Ntrails 8 months ago
    
    Absolutely could!
    However, management felt kinda burned because that was a bunch of time and unsurprisingly nobody was measurably more productive afterwards (it turns out those are just shitty code tidbits, but not highly correlated with areas which where it is miserable to make changes. Some of the over-refactorings probably made things harder.
    It was a lovely measurable metric, making it an easy sell in advance. Which maybe was the problem idk.
  - kristianp 7 months ago
    
    It depends on the language. It's so easy to generate warnings in C/C++, other languages they are rare or it's easy to avoid them.
  - pimeys 7 months ago
    
    I guess I've been lucky to work on companies in the past ten or so years, where compiler warnings were treated as errors in the CI. Also been really lucky to use an editor which always highlights warnings and lists them, so I can fix those easily.
  - minutillo 7 months ago
    
    Step 1: get it to zero
    Step 2: -Werror
    Step 3: there is no step 3
- Forricide 8 months ago
  
  Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.
  Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.
  - munk-a 8 months ago
    
    Sometimes it can take a serious effort to understand why a problem is happening and I'll accept an unknown blip that can be corrected by occasionally hitting a reset button occasionally when dealing with third party software. From my experience my opinion aligns with yours though - it's also worth understanding why an error happens in something you've written, the times we've delayed dealing with mysterious errors that nobody in the team can ascribe we've ended up with a much larger problem when we've finally found the resources to deal with it.
    Nobody wants to triage an issue for eight weeks, but one thing to keep in mind is that the more difficult it is to triage an issue the more ignorance about the system that process is revealing - if your most knowledgeable team members are unable to even triage an issue in a modest amount of time it reveals that your most knowledgeable team members have large knowledge gaps when it comes to comprehending your system.
    This, at least, goes for a vague comprehension of the cause - there are times you'll know approximately what's going wrong but may get a question from the executive suite about the problem (i.e. "Precisely how many users were affected by the outage that caused us to lose our access_log") that might take weeks or months or be genuinely nigh-on-impossible to answer - I don't count questions like that as part of issue diagnosis. And if it's a futile question you should be highly defensive about developer time.
    
    Forricide 8 months ago
    
    That's very fair - at least with third party software, it can be nigh impossible to track down a problem.
    With third party libraries, I've too-often found myself reading the code to figure something out, although that's a frustrating enough experience I generally wouldn't wish on other people.
  - relistan 7 months ago
    
    This. Understand it all least to a level where you can make an effort vs risk/impact trade off. Ideally eliminate all low effort issues and mitigate high risk or high impact issues. But eliminating them all is not usually practical. And besides, most of the high impact/high effort application risk resides in design and not in warnings that come from logs or the compiler.
- gamedever 8 months ago
  
  if there are any warnings I'm supposed to ignore then there are effectively no warnings.
  there's nothing pagmatic about it. once I get into the habit of ignoring a few warnings that effectively means all warnings will be ignored
connicpu 8 months ago

At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
- berkes 8 months ago
  
  The last point is the key.
  It then creates immense value by avoiding a lot of risk and uncertainty for little effort.
  Getting from "thousands of warnings" to zero isn't a good ROI in many cases, certainly not on a shortish term. But staying at zero is nearly free.
  This is even more so with those "fifteen flickering tests" these 23 tests that have been failing and ignored or skipped for years.
  It's also why I commonly set up a CI, testing systems, linters, continuous deployment before anything else. I'll most often have an entire CI and guidelines and build automation to deploy something that will only say "hello world". Because it's much easier to keep it green, clean and automated than to move there later on
  - Calamitous 8 months ago
    
    That's because it moves from being a project to being a process. I've tried to express this at my current job.
    They want to take time out to write a lot of unit tests, but they're not willing to change the process to allow/expect devs to add unit tests along with each feature they write.
    I'll be surprised if all the tests are still passing two months after this project, since nobody runs them.
    
    temporallobe 8 months ago
    
    That’s why TDD (Test-Driven Development) has become a trend. I personally don’t like TDD’s philosophy of writing tests first, then the code (probably because I prefer to think of a solutions more linearly), but I do absolutely embrace the idea and practice of writing tests along side of the code, and having minimum coverage thresholds. If you build that into your pipeline from the very beginning, you can blame the “process” when there aren’t enough tests.
    
    ptmcc 7 months ago
    
    The flip that switched for me to make me practice something TDD-adjacent is to replace most manual verification with writing a test. Once I got in the habit I find it so much faster, more consistent, and then I have lasting tests to check in!
    I don't typically write tests first so it's not true TDD but it's been a big personal process improvement and quality boost.
    
    berkes 7 months ago
    
    > to allow/expect devs to add unit tests
    For me such gigs are a red flag and immediate turn down (I'm freelancer with enough opportunities, luxury position, I know).
    I would consider it really weird if management dictates exactly what tools and steps a carpenter must take to repair a chair. Or when the owner of a hotel tells the chef what steps are allowed when preparing fish. We trust the carpenter or chef to know this best. To know best how to employ their skills given the context.
    If management doesn't trust the experts they hire to make the right choice in how they work, what tools they use, what steps they take, etc. that's a red flag: either they are hiring the wrong people (and the micromanaging is an attempt to fix that) or they don't think the experts are expert enough to make decisions on their own.
    For me, this goes for tools (e.g. management dictates I must work on their windows machine with their IDE and other software) for processes (management forbids tests, or requires certain rituals around merges etc) and for internals (management forbidding or requiring certain abstractions, design patterns etc)
    To be clear: a team, through, or via a management, should have common values and structures and such. And it makes perfect sense for management to define the context (e.g. this is a proof of concept, no need for the rigid quality here. Or we must get these features out of the door before thursday, nothing else matters.) It's when the management dictates how teams or experts must achieve this that it becomes a red flag to me.
    I haven't been wrong in this. These red-flags almost always turned out to hint at underlying, deeply rooted cultural problems that caused all the technical troubles.
    
    noisy_boy 7 months ago
    
    > I'll be surprised if all the tests are still passing two months after this project, since nobody runs them.
    Wouldn't they just run as part of the build? At least for Java, Junit tests run as part of the build by default.
- reaperducer 8 months ago
  
  At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
  Sounds like what we used to call "professionalism." That was before "move fast, break things and blame the user" became the norm.
  - LeifCarrotson 8 months ago
    
    It very much depends on the nature of your work.
    If manual input can generate undefined behavior, you depend on a human making a correct decision, or you're dealing with real-world behavior using incomplete sensors to generate a model...sometimes, the only reasonable target is "fail gracefully". You cannot expect to generate right outputs with wrong inputs. It's not wrong to blame the user when economics, not just laziness, say that you need to trust the user to not do something unimagineable.
    I think this is the kind of situation where a little professionalism would have prevented the issue: Handling uncaught exceptions in your threadpool/treemap combo would have prevented the problem from happening.
  - dheera 8 months ago
    
    > That was before "move fast, break things and blame the user" became the norm.
    When VCs only give you effectively 9 months of runway (3 months of coffees, 9 months of actual getting work done, 3 months more coffees to get the next round, 3 more months because all the VCs backed out because your demo wasn't good enough), move fast and break things is how things are done.
    If giving startups 5 years of runway was the norm, then yeah, we could all be professional.
  - mardifoufs 8 months ago
    
    There's basically no proof that software used to be more "professional". Sure the process was more formal, but I've not seen any proof (and I'm not talking about peer reviewed stuff here, but even just anecdotal examples) of the "end result" of those processes being better, more robust or even less buggy than what we get out of what some may call "move fast and break stuff" development.
  - robocat 8 months ago
    
    > professionalism." That was before "move fast, break things
    I think you're professing a false dichotomy. Is it unprofessional to "move fast, break things"?
    I'm a slow moving yak shaver partly due to concious intention. I admire some outcomes from engineers that break things like big rockets.
    I definitely think we learn fast by breaking things: assuming we are scientific enough to design to learn without too much harm/cost.
    
    tonyarkles 7 months ago
    
    > I admire some outcomes from engineers that break things like big rockets.
    I work in unmanned aerospace. It started with 55lb quadcopters and got… significantly bigger from there.
    I’ve thought a ton about what you’re saying over the last 5-6 years. I have broken things. My coworkers and underlings have broken things. We’ve also spent a bunch of time doing design review, code review, FEA review, big upfront design, and all those expensive slow things.
    Here’s, for me, the dividing line: did we learn something new and novel? Or did we destroy kilobux worth of hardware to learn something we could have learned from a textbook, or doing a bit of math, or getting a coworker to spend a few hours reviewing our work before flying it?
    And I 100% agree with your last statement: you can “move fast and break things for science” professionally. But… if something breaks when you weren’t expecting it to, the outcome of the post-mortem really should be surprising new knowledge and not “this made it to prod without review and was a stupid mistake”
    
    int_19h 7 months ago
    
    > Is it unprofessional to "move fast, break things"?
    Most of the time it is. The sorry state of software in general is a testament to that.
- dataflow 8 months ago
  
  > At my job we treat all warnings as errors
  Pretty sure what you actually mean is that you treat some warnings as errors, and disable the others. I would find it hard to believe you're building with literally all the compiler warnings enabled all the time.
  - connicpu 7 months ago
    
    We do have a couple disabled by default in the config, but it's still a lot of warnings:
    -Wall,-Wextra,-Werror,-Wno-type-limits,-Wno-attributes,-Wno-deprecated-declarations
    And of course some are suppressed with a pragma for a short section of specific segments of the code where you can make the argument that it's appropriate during code review, but those pragmas stick out like a sore thumb.
    
    dataflow 7 months ago
    
    > We do have a couple disabled by default in the config, but it's still a lot of warnings
    A lot, yes, but definitely a lot more than 2 that you still don't have enabled. Might be worth looking into them if you haven't already -- you will definitely disable some of them in the process.
    (And I assume you already have clang-tidy, but if not, look into that too.)
    
    connicpu 7 months ago
    
    Yep, plus static analysis in CI. We also run ubasan, tsan, and valgrind on our unit tests.
    
    kelnos 7 months ago
    
    -Wall + -Wextra doesn't actually enable all warning, though. There are quite a few others that you still have to enable manually.
stickfigure 8 months ago

The other thing is don't catch and ignore exceptions. Even "catch and log" is a bad idea unless you specifically know that program execution is safe to continue. Just let the exception propagate up to where something useful can be done, like return 500 or displaying an error dialog.
- wpollock 7 months ago
  
  I agree, and I think the OP does as well. Really, the Executor framework used here was to blame by not setting the uncaught exception handler to catch and propagate those exceptions by default.
  But having shared, mutable state without locks or other protection should never have passed code review in the first place. I would look at that commit to try to determine how it happened, then try to change the team's processes to prevent that.
  - necovek 7 months ago
    
    Most commonly you would have a non-concurrent use of TreeMap, and then for performance reasons, someone else would come in and introduce threads in a couple of places, without ensuring that all data access (esp write) is properly guarded.
    The way people structure code, it might even be non-obvious that there's a use of something like TreeMap, as it will be abstracted away into "addNode" method.
    Still a red flag, since the process when introducing threads should be "ensure data structures allow for concurrent write and read, or guard them otherwise", but when the task says "performance improvement" and one's got 14x or 28x improvement already, one might skip that other "bit" due to sheer enthusiasm.
temporallobe 8 months ago

Yes, but…I suppose you have to pick your battles. There was recently a problem vexing me about a Rails project I maintain where the logs were filled with complaints about “unsupported parameters”, even though we painstakingly went through all the controllers and allowed them. It’s /probably/ benign, but it adds a lot of noise to the logs. Several of us took a stab at resolving it, but in the end we always had much higher priorities to address. Also it’s hard to justify spending so many hours on something that has little “business value”, especially when there is currently no functional impact.
It’s a nuisance issue sorta like hemorrhoids. Do you get the surgery and suffer weeks of extreme pain, or do you just deal with it? Hemorrhoids are mostly benign, but certainly have the potential to become more severe and very problematic. Maybe this phenomenon should be called digital hemorroids?
- brirec 8 months ago
  
  As someone with pretty bad hemorrhoids, I’m hesitant to ask my doctor about surgery because I’ve been told the hemorrhoids will come back, without question. So it’s even still just a temporary fix…
  - imtringued 7 months ago
    
    Surgery for hemorrhoids sounds like trying to cure the symptom.
    There is something going wrong in your body that has hemorrhoids as a downstream effect. Surgery can't fix the root cause.
    If you have constipation then consider the following: the large intestine has bacteria that process your undigested food and this can have many nasty consequences. What is going wrong in the small intestine that leads to this?
- colechristensen 8 months ago
  
  Couldn’t you just run a debugger to find all of the incidents of that issue?
  - temporallobe 8 months ago
    
    We’ve been down many paths on this. In some cases we know exactly where it’s happening, but despite configuring everything correctly, it still complains. It might just be a bug in the Rails code or a fault in the way parameters are passed in (some of the endpoints take a lot of parameters, some of them optional). We could “fix” the issue by simply allowing all parameters, but of course this opens a security risk. This is a 10+ year old code base and I am told it has been a thorn in their side for a long time. It’s one of those battles thar I suppose we are not going to try fighting unless we get really bored and have nothing else to work on.
  - bornfreddy 8 months ago
    
    Also, stack trace should show you everything you need to know to fix this, or am I missing something? (no experience with Ruby)
    Otherwise, I see the cleanups and refactoring as part of normal development process. There is no need to put such tasks in Jira - they must be done as preparation for the regular tasks. I can imagine that some companies take agile too seriously and want to micromanage every little task, but I guess lack of time for refactoring is not the biggest problem.
    
    colechristensen 8 months ago
    
    "why don't you just" comments are easy :) (I made one)
    debugging in codebases with a lot of magic (rails) is hard. it can be very difficult to follow calls around unless you're quite the expert. certain styles of programming really frustrate me, but then again I program like a scientist so the kinds of things I'm prone to do frustrate software engineers (for loops nested 8 deep, preference for single character variables, etc.)
    
    temporallobe 7 months ago
    
    You’re spot-on with the “magic” of Rails. While it can be very powerful and feature-rich, it can feel like a black box at times and stack traces aren’t always accurate or helpful. To be fair, this happens with a lot of library-heavy frameworks though, Spring Boot being a prime example of this.
dspillett 8 months ago

> > race conditions […] I never though it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.
Even without corruption a race condition can cause significant performance issues by causing the same work to be done many times with only one result being kept.
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle
For warnings: at least explained in a comment, where it has been decided irrelevant (preferably with a pragma to turn off the warning as locally as possible).
Strange behaviour I prefer to get rid of. I've found code marked (by me at least once!) "not sure why this works, but" which very much no longer works, and had to rewrite in a rush where if it had been addressed earlier there would have been more time to be careful.
- szundi 8 months ago
  
  There are so many of these in some projects that starting to fix them kills it.
  Only time to fix is just after it is discovered - or mostly never ever because it becomes expensive to build up the context in mind again.
swatcoder 8 months ago

> Rarely is this accepted by whoever chooses what we should work on.
You need to find more disciplined teams.
There are still people out there who care about correctness and understand how to achieve it without it being an expensive distraction. It a team culture factor that mostly just involves staying on top of these concerns as soon as they're encountered so there's not some insurmountable and inscrutable backlog that makes it feel daunting and hopeless or that makes prioritization difficult.
- saulpw 8 months ago
  
  Most teams are less disciplined than they should be. Also, job/team mobility is very low right now. So the question becomes, how do you increase discipline on the team you're on?
  - rapind 8 months ago
    
    For very small teams, exploring new platforms and / or languages that compliment correctness is an option. Using a statically typed language with explicit managed side effects has made a huge difference for me. Super disruptive the larger the team though of course.
    
    necovek 7 months ago
    
    It's also a function of the project size.
    Switching that 500k LOCs Ruby or JavaScript project over to Rust ain't happening quickly in neither small nor big team.
    
    rapind 7 months ago
    
    I was thinking Haskell not Rust :) But yes, for sure, you'd have to tackle it bit by bit with some sort of bridging if you decided it was worth it.
TYMorningCoffee 8 months ago

> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
I agree. I hate lingering warnings. Unfortunately at the at time of this bug I did not have static analysis tools to detect these code smells.
- berkes 8 months ago
  
  Another problem with lingering warnings is that it's easy to overlook that one new warning that's actually important amongst floods of older warnings.
corytheboyd 8 months ago

> Rarely is this accepted by whoever chooses what we should work on.
I get that YMMV based on the org, but I find that more often than not, it’s expected that you are properly maintaining the software that you build, including the things like fixing warnings as you notice them. I can already feel the pushback coming, which is “no but really, we are NOT allowed to do ANYTHING but feature work!!!!” and… okay, I’m sorry, that really sucks… but maybe, MAYBE, some of that is you being defeatist, which is only adding to this culture problem. Oh well, now’s not the time to get fired for something silly like going against the status quo, so… I get it either way.
Retr0id 8 months ago

And from a security perspective, the "might cause a problem 0.000001% of the time" flaws can often be manipulated into becoming a problem 100% of the time.
- sitkack 8 months ago
  
  All security issues are subclass of bugs. Security is a niche version of QA.
chanux 7 months ago

I've also felt lonely on this hill I'm dying on.
But for some solace https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...
closeparen 7 months ago

>Rarely is this accepted by whoever chooses what we should work on.
This is the kind of thing you should probably just do, not seek buy-in for from whoever chooses what you should work on. Unless it is going to take some extreme amount of time.
phamilton 7 months ago

A coworker once told me: "Undefined behavior means the code could theoretically order a pizza"
Hyperbole sure, but made me chuckle and is a nice reminder to check all assumptions.
cryptonector 7 months ago

Not everyone agrees. The SQLite team, for example, famously refuses to fix warnings reported by users.
I myself do try to fix most warnings -- some are just false positives.

hn_acc1 8 months ago

The mention of "could barely ssh in" reminds me of a situation in grad school where our group had a Sun UltraSparc 170 (IIRC) with 1GB HD and 128 or 256 MB of RAM, shared by maybe 8 people in a small research group relating to parallel and distributed processing. Keep in mind, Sun machines were rarely rebooted, ever.

So I guess the new user / student was trying to do things in parallel to speed things up when they chopped up their large text file into N (32 or 64) sections based on line number (not separate files), and then ran N copies of perl in parallel, each processing its own set of lines from that one file.

Not only did you have a large amount (for back then) of RAM used by N copies of the perl interpreter (separate processes, not threads, mind you!) processing its data, as well, any attempt to swap was interleaved with frantic seeking to a different section of the same file to read a few more lines for one of N processes stalled on IO. Also, probably the Jth process had to read J/N of the entire file to get to its section. So the first section of the file was read N times, the next N-1, then N-2, etc.

We (me and the fellow-student-trusted-as-sysadmin who had the root password) couldn't even get a login prompt on the console. Luckily, I had a logged-in session (ssh from an actual x-terminal - a large-screen "dumb" terminal), and "su" prompted for a password after 20-30 minutes of running it. After another 5-10 minutes, we had a root session and were able to run top and figure out what was going on. Killing the offending processes (after trying to contact the user) restored the system back to normal.

Edit: forgot to say: had the right idea, but totally didn't understand the system's limitations. It was SEVERELY I/O limited with that hard drive and relatively low RAM, so just processing the data linearly would have easily been the best approach unless the amount of data to be kept would have gotten too large.

kachapopopow 8 months ago

Yep, ran into this way too many times. Performing concurrent operations on non thread-safe objects in java or generally in any language produces the most interesting bugs in the world.

Espressosaurus 8 months ago

Which is why you manage atomic access to non-thread-safe objects yourself, or use a thread-safe version of them when using them across threads.
Multithreading errors are the worst to debug. In this case it's dead simple to identify at design time and warning flags should have gone up as soon as he started thinking about using any of the normal containers in a multithreaded environment.
- BobaFloutist 8 months ago
  
  Every time I think I'm sorta getting somewhere in my understanding of how to write code I see a comment like this that reminds me that the rabbithole is functionally infinite in both breadth and depth.
  There's simply no straightforward default approach that won't have you running into and thinking through the most esoteric sounding problems. I guess that's half the fun!
  - mrkeen 8 months ago
    
    It's not that bad. We just don't have the equivalent of GC for multi-threading yet, so the advice necessarily needs to be "just remember to take and release locks" (same as remembering to malloc and free).
    Hopefully someone will invent something like STM [1] in the distant year of 2007 or so [2]. It has actual thread-safe data structures. Not just the current choice between wrong-answer-if-you-dont-lock and insane-crashing-if-you-dont-lock.
    [1] https://www.adit.io/posts/2013-05-15-Locks,-Actors,-And-STM-...
    [2] https://youtu.be/4caDLTfSa2Q?feature=shared
    
    LegionMammal978 8 months ago
    
    Rust takes pride in its 'fearless concurrency' (strict compile-time checks to ensure that locks or similar constructs are used for cross-thread data, alongside the usual channels and whatnot), while Go takes pride in its use of channels and goroutines for most tasks. Not everything is like the C/C++/C#/Java situation where synchronization constructs are divorced from the data they're responsible for.
    
    neonsunset 8 months ago
    
    Synchronization primitives in Go are just as divorced as elsewhere, sometimes even more so - it does have channels, but Goroutines cannot yield a value, forcing you to employ a separate storage location together with WaitGroup/Mutex/RWMutex (which, unlike Rust's RWLock, is separate too, although C# lets you model it to an extent). This results in community developing libraries like https://github.com/sourcegraph/conc which attempt to replicate Rust's Futures / C#'s Tasks.
    
    jpc0 7 months ago
    
    Writing to a channel of size 1 feels a lot like a yeild to me, you can even do it in a loop.
    A task is an abstraction over those primatives in any language. To my knowledge TBB task graph abstract over a threadpool using exactly that concept.
    From what I've seen swift is the only language that properly handles concurrency. I'm taking another crack at rust but the fact that everyone uses tokio for anything parallel makes me feel like the language doesn't have great support for concurrency, it just has decent typing which isn't a surpise to anyone.
    
    ratorx 8 months ago
    
    For C++, abseil’s thread annotations are quite nice for getting closer to the Rust style of locking. Of course, the Rust style is still much easier to understand and less manual.
    
    gf000 7 months ago
    
    None of them solve the problems associated with the general category of race conditions. You can trivially create live/dead locks with channel/message-passing, and rust only prevents data races, though ownership is definitely a step in the right direction.
    (Well, go is not even memory safe under data races!)
    Also, Java is one of the languages where you can just add `synchronized` as part of the method signature, and while this definitely doesn't solve the problem, I don't think "divorced from the data" is accurate.
    
    mrkeen 7 months ago
    
    Re: 'synchronized' and data. It is a good distinction to make because sync does indeed lock control, not data. With ACID transactions or STM, an atomic section will run as-if-sequentially, full stop, since the data is locked. With Java sync, you get 'no other thread is in these lines of code' and you have to hope that's enough for the system to run as-if-sequentially.
    
    gf000 7 months ago
    
    I think it's an interesting distinction, but it's sort of OOP's main thing to encapsulate the local (possibly mutable) state, so that using an object externally will leave the inner state correct.
    I think the primary issue here is that synchronized should be the default case, with optionally lifting that very strict restriction for multi-threaded access.
    Nonetheless, objects with methods that would do STM on their inner state would be a pretty cool design.
    
    mrkeen 7 months ago
    
    I'd love to get some examples of Rust's best-practice shared-mutable-state code. So far when I ask around here I get answers equivalent to "Rust guarantees that you aren't doing that."
    
    DylanSp 8 months ago
    
    It's not a perfect situation, but C# has some dedicated collection classes for concurrent use - https://learn.microsoft.com/en-us/dotnet/api/system.collecti.... There's still some footguns possible, but knowing "I should use these collections instead of the regular versions" is less error-prone than needing to take/release locks at every single use site.
    
    sunshowers 8 months ago
    
    Concurrent maps are generally worse in terms of being able to understand the system than either non-concurrent maps guarded by a lock, or a channel/actor model with single ownership. Data-parallel algorithms should also generally use map-reduce rather than writing into the same map concurrently.
    I've written highly concurrent software with bog-standard hash maps plus channels. There are so many advantages to this style, such as events being linearized (and thus being easy to test against, log, etc).
    
    stouset 8 months ago
    
    > "just remember to take and release locks"
    If only it were so easy.
    
    sunshowers 8 months ago
    
    STM is not going to ever be a production thing outside of purely functional languages.
    
    saagarjha 8 months ago
    
    That’s what everyone thought about affine types, too.
    
    sunshowers 8 months ago
    
    True! I've been following STM and HTM research work for a while, and it all seems quite niche unless all side effects are captured (which is something purely functional languages can do). There isn't a real path to scalability I think, which there was with affine types.
    Optimistic concurrency in general is a useful design pattern in many cases, though.
- sunshowers 8 months ago
  
  The usual issue is code evolution over time, not the initial version which tends to be okay. You really want to have tooling strictly enforce invariants, and do so in a way that fails closed rather than open.
  In other words, use Rust.
- kachapopopow 8 months ago
  
  Tell that to inexperienced developers or making a massive single-thread project have multi-threaded capabilities.
  - stuff4ben 8 months ago
    
    I've been that developer making a single-threaded app multi-threaded. Best way to learn though!
  - baggy_trough 8 months ago
    
    Multi-threading - ain't nobody got time for that.
    
    mrkeen 8 months ago
    
    Yeah, our software politely waits for one customer to finish up with their GETs and POSTs before moving onto the next customer.
    We have almost one '9' of uptime!
    
    baggy_trough 8 months ago
    
    There are better ways than threading.
    
    mrkeen 8 months ago
    
    Yeah, like pretending you aren't
    
    baggy_trough 8 months ago
    
    I don't know what you mean.
    
    mrkeen 7 months ago
    
    If you write:
    int balance = 0; int get {...} void set(int balance) {...} void withdraw(int amt) { if amnt <= balance { balance -= amnt; } }
    it will be flagged immediately in code review as a race condition and/or something that doesn't guarantee its implied balance>=0 invariant. Because threading.
    But lift the logic up into a REST controller like pretty much all web app backends:
    GET /balance/get POST /balance/set POST /balance/withdraw
    And it will sail straight through review. (Because we pretend each caller isn't its own Thread.)
    
    baggy_trough 7 months ago
    
    You're right, we can't get away from concurrency control with database, cache, even the file system. I'm still happy to never have to think about in-process multithreading bugs.
    
    saagarjha 8 months ago
    
    async/await, for example.
foobarian 8 months ago

I ran into my share of concurrency bugs, but one thing I could never intentionally trigger was any kind of inconsistency stemming from removing a "volatile" modifier from a mutable field in Java. Maybe the JVM I tried this with was just too awesome.
- hashmash 8 months ago
  
  Were you only testing on x86 or any other "total store order" architecture? If so, removing the volatile modifier has less of an impact.
bob1029 8 months ago

I've universally found that even when I am convinced that I am OK with the consequences of sharing something that isn't synchronized, the actual outcome is something I wasn't expecting.
- loeg 8 months ago
  
  The only things that should be shared without synchronization are readonly objects where the initialization is somehow externally serialized with accessors, and atomic scalars -- C++ std::atomic, Java has something similar, etc.
saagarjha 8 months ago

This is kind of a hot take but I actually prefer debugging races in C/C++ for this reason. Yes, the language prescribes insane semantics (basically none) when it happens, but in practice you’ll get memory corruption or other noisy issues pretty often, and the fact that races are mostly illegal means you can write something like thread sanitizer without needing source code changes to indicate semantics. Meanwhile in Java you’ll never have UB but often you’ll have two fields be subtly out of sync and it’s a lot harder to track this kind of thing down.
ivanjermakov 8 months ago

Some (maybe most?) operations on Java Collections perform integrity checks to warn about such issues, for example map throwing ConcurrentModificationException
- kachapopopow 8 months ago
  
  ConcurrentModificationException does not check threads, it triggers when it is already too late. It also triggers on the same thread if you remove while iterating an iterator
- smarks 8 months ago
  
  ConcurrentModificationException is typically thrown from an iterator when it detects that it’s been invalidated by a modification to the underlying collection. It’s harder to check for the case described in this article, which is about multiple threads calling put() concurrently on a non thread safe object.

bmm6o 7 months ago

I've seen this in production C#. Same symptoms, process dump showed a corrupt dictionary. We thought we were using ConcurrentDictionary everywhere but this one came in from a library. We were using .net framework, IIRC .net core has code to detect concurrent modification. I don't know how they implement it but it could be as simple as a version counter.

It's odd he gets so hung up on npe being a critical ingredient when that doesn't appear to be present in the original manifestation. There's no reason to think you can't have a bug like this in C just because it doesn't support exceptions.

To me it's all about class invariants. In general, they don't hold while a mutator is executing and will only be restored at the end. Executing another mutator before the invariants are reestablished is how you corrupt your data structure. If you're not in a valid state when you begin you probably won't be in a valid state when you finish.

TYMorningCoffee 7 months ago

It came down to poor logic. I got hung up on it because I got unlucky and couldn't reproduce it with uncaught NPE so I incorrectly concluded that uncaught NPE was a necessary condition.

scottlamb 8 months ago

> Could an unguarded TreeMap cause 3,200% utilization?

I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.

I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.

[1] This undersynchronization was far from the only problem with that codebase.

TYMorningCoffee 8 months ago

Oh no I didn't know this can happen with HashMaps too! Something to do with the linked list they use for collisions?
- scottlamb 7 months ago
  
  There's a dead comment replying to yours with this link: https://mailinator.blogspot.com/2009/06/beautiful-race-condi...
  ...which I don't think I've ever seen before, but nicely explains the problem I saw, and coincidentally is from the same year!
- vineet_rc 7 months ago
  
  [dead]

jonathanlydall 8 months ago

What about spotting a cycle by using an incrementing counter and then throwing an exception if it goes above the tree depth or collection size (presuming one of these is tracked)?

Unlike the author’s hash set proposal it would require almost no memory or CPU overhead and may be more likely to be accepted.

That being said, in the decade plus I’ve used C# I’ve never found that I failed to consider concurrent access on data structures in concurrent situations.

saagarjha 8 months ago

It’s not a horrible idea but adding useful preconditions in the face of data races is quite hard in general. This is just one of the ways the tree can break.
exDM69 7 months ago

Yes, this is a good idea. I've done this before with binary search and tree structures. I avoid unbounded loops wherever possible and the unavoidable cases are rather rare.
It's not a fix but it is a good mitigation strategy.
Infinite loops are one of the nastiest kind of bugs. Although they are easy to spot in a debugger they can have really unfriendly consequences like OP's "can barely ssh in" situation.
Infinite loops in library code are particularly unpleasant.
TYMorningCoffee 8 months ago

That's much better. Constant memory. The number of nodes is guaranteed to be less than or equal to the height of the tree.

deskr 8 months ago

Exceptions in threads are an absolute killer.

Here's a story of a horror bughunt where the main characters are C++, select() and thread brandishing an exception: https://news.ycombinator.com/item?id=42532979

TYMorningCoffee 8 months ago

I remember reading that article but being unable to understand it due to my lack of knowledge in the area. I will have to give it another go.

hinkley 8 months ago

The author has discovered a flavor of the Poison Pill. More common in event sourcing systems, it’s a message that kills anything it touches, and then is “eaten” again by the next creature that encounters it which also dies a horrible death. Only in this case it’s live-locked.

Once the data structure gets into the illegal state, every subsequent thread gets trapped in the same logic bomb, instead of erroring on an NPE which is the more likely illegal state.

thinkingemote 8 months ago

"I could barely ssh onto it"

Is there a way to ensure that whatever happens (CPU, network overloaded etc) one can always ssh in? Like reserve a tiny bit of stuff to the ssh daemon?

homebrewer 8 months ago

Create a systemd override (by using systemctl edit sshd) and add MemoryMin=32M (or whatever makes sense for your system). This makes sure sshd is never pushed out into swap.
https://www.freedesktop.org/software/systemd/man/latest/syst...
You can also use sibling knobs to increase the CPU and IO weights of the unit, for example but setting this to something higher than 100:
https://www.freedesktop.org/software/systemd/man/latest/syst...
- masklinn 7 months ago
  
  Is there a way to do that for alternate tty?
  From time to time I’ll run something dumb on my machine (e.g. GC aggressive the wrong repo) and if I don’t catch the ramp up the machine will lock up until the oom killer find the right process. Sufficiently locked up, accessing alternate ttys to kill the offending process won’t work either.
  I guess I could just reserve ssh then ssh into it from an other computer but…
beisner 8 months ago

On Linux I’ve done this by pinning processes to a certain range of CPU cores, and the scheduler will just keep one core free or something. Which allows whatever I need in terms of management to execute on that one core, including SSH orUI.
LtdJorge 8 months ago

Nice? Or maybe give the Systemd slice a special cgroup with a reservation.
- Aloisius 8 months ago
  
  I'd consider doing the inverse and nice the JVM with a lower priority instead in certain situations.

bsder 8 months ago

Does a ConcurrentSkipListMap not give the correct O(log N) guarantees on top of being concurrency friendly?

java.util.concurrent is one of the best libraries ever. If you do something related to concurrency and don't reach for it to start, you're gonna have a bad time.

kccqzy 8 months ago

It's often a better design if you manage the concurrency in a high-level architecture and not have to deal with concurrency at the data structure level.
- bsder 8 months ago
  
  Designing your own concurrency structures instead of using ones designed by smart people who thought about the problem more collective hours than your entire lifetime is unwarranted hubris.
  The fact that ConcurrentTreeMap doesn't exist in java.util.concurrent should be ringing loud warning bells.
  - wat10000 8 months ago
    
    It's not like you have a choice. Thread safety doesn't compose. A function that only uses thread-safe constructs may not itself be thread safe. This means that using concurrent data structures isn't any sort of guarantee, and doesn't save you from having to think about concurrency. It will prevent you from running into certain kinds of bugs, and that may be valuable, but you'll still have to do your own work on top.
    If you're doing that anyway, it tends to be easier and more reliable to forget about thread safety at the bottom, and instead design your program so that thread safety happens in larger blocks, and you have big chunks of code that are effectively single-threaded and don't have to worry about it.
  - layer8 8 months ago
    
    The GP comment is not about designing your own concurrency data structures. It’s about the fact that if your higher-level logic doesn’t take concurrency into account, using the concurrent collections as such will not save you from concurrency bugs. A simple example is when you have two collections whose contents need to be consistent with each other. Or if you have a check-and-modify operation that isn’t covered by the existing collection methods. Then access to them still has to be synchronized externally.
    The concurrent collections are great, but they don’t save you from thinking about concurrent accesses and data consistency at a higher level, and managing concurrent operations externally as necessary.
  - kccqzy 8 months ago
    
    The people behind the concurrent containers in java.util.concurrent are smart, but they are limited by the fact that they are working on a necessarily lower-level API. As an application programmer, you can easily switch the high-level architecture so as not to require any concurrent containers. Perhaps you have sharded your data beforehand. Perhaps you use something like map-reduce architecture where the map step is parallel and require no shared state.
  - mrkeen 8 months ago
    
    Once they expose data structures that allow generic uses like "if size<5, add item" I'll take another look. Until then, their definition of thread-safety isn't quite the same as mine.
    
    kccqzy 8 months ago
    
    You can take a look at Haskell's Software Transactional Memory then. Or you can take a look at something like the Linux kernel's Read-Copy-Update (RCU) abstraction, add some persistent data structures and a retry loop on top. It's indeed a very programmer friendly way of doing concurrency.
TYMorningCoffee 8 months ago

You're right. That would have been a better choice.

johnklos 8 months ago

Does it not strike anyone else as odd that if someone said they had a single CPU, and that CPU were running a normal priority task at 100%, and that caused the machine to barely allow ssh, we'd say there's a much bigger problem than that someone is letting on?

No 32 core (thread, likely) machine should ever normally be in a state where someone can "barely ssh onto it". Is Java really that janky? Or is "barely ssh onto it" a bit hyperbolic?

sc68cal 8 months ago

I've absolutely experienced situations like this where the system is maxed out and ssh becomes barely usable.
The problem is that all processes run with pretty much the same priority level, so there's no differentiation in the scheduler between interactive ssh sessions and other programs that are consuming all the CPU, and most of the time you only realize this fact far after the point where you can SSH in and renice the misbehaving processes.
spockz 7 months ago

My macOS install got into a state before the Christmas holiday where something would spawn #corecount amount of process. They seemed to be related to processing thumbnails for contents of directories.
This would go on for a few minutes in which the machine was completely unresponsive including the mouse.
So it seems possible still to bring a high core count machine to its knees. But something then is indeed very wrong.

lolc 8 months ago

What I like here is the discovery of the extra loop, and then still digging down to discover the root cause of the competing threads. I think I would have removed the loop and called it done.

loeg 8 months ago

Just to probe the code review angle a little bit: shared mutable state should be a red/yellow flag in general. Whether or not it is protected from unsafe modification. That should jump out to you as a reviewer that something weird is happening.

MadVikingGod 8 months ago

I was excited to see that not only does this article cover other languages, but that this error happens in go. I was a bit surprised because map access is generally protected by a race detector, but the RedBlack tree used doesn't store anything in a map anywhere.

I wonder if the full race detector, go run -race, would catch it or not. I also want to explore if the RB tree used a slice instead of two different struct members if that would trigger the runtime race detector. So many things to try when I get back to a computer with go on it.

w10-1 8 months ago

Very well said, and very nice to see references to others on point.

As a sidebar: I'm almost distracted by the clarity. The well-formed structure of this article is a perfect template for an AI evaluation of a problem.

It'd be interesting to generate a bunch of these articles, by scanning API docs for usage constraints, then searching the blog-sphere for articles on point, then summarizing issues and solutions, and even generating demo code.

Then presto! You have online karma! (and interviews...)

But then if only there were a way to credit the authors, or for them to trace some of the good they put out into the world.

So, a new metric: PageRank is about (sharing) links-in karma, but AuthorRank is partly about the links out, and in particular the degree to which they seem to be complete in coverage and correct and fair in their characterization of those links. Then a complementary page-quality metric identifies whether the page identifies the proper relationships between the issues, as reflected elsewhere, as briefly as possible in topological order.

Then given a set of ordered relations for each page, you can assess similarity with other pages, detect copying (er, learning), etc.

moffkalast 8 months ago

Anyone else mildly peeved by how CPU load is just load per core summed up to an arbitrary percentage all too often?

Why not just divide 100% by number of cores and make that the max, so you don't need to know the number of cores to know the actual utilization? Or better yet, have those microcontrollers that Intel tries to pass off as E and L cores take up a much smaller percentage to fit their general uselessness.

mystifyingpoi 8 months ago

IDK but the current convention makes it easy to see single-threaded bottlenecks. So if my program is using 100% CPU and cannot go faster, I know where to look.
ahoka 8 months ago

This is “Irix” vs “Solaris” mode of counting, the latter being summed up to to 100% for all cores. I think the modern approach would be to see how much of its TDP budget the core is using.
shrx 8 months ago

If you think that's confusing, let me introduce you to CPU load average values on android... https://stackoverflow.com/questions/10829546/how-to-read-the...

svilen_dobrev 8 months ago

i found this article very/deeply informative , memory-wise , concurency vs optimizations and troubles thereof:

Programming Language Memory Models

https://research.swtch.com/plmm

https://news.ycombinator.com/item?id=27750610

jchw 7 months ago

Somewhat relatedly, I've always appreciated Go adding some runtime checks to detect race conditions, like concurrent map writes. It's not as good as having safe concurrency from the ground up, but on the other hand it's a lot better than not having detection, as everyone makes mistakes from time to time, and imperfect as it may be, the detection usually does catch it quickly. Especially nice since it is on in your production builds; a big obstacle with a lot of debugging tools is they're hard to get them where you need them...

hyperhello 8 months ago

Should I read this as the Java TreeMap itself is thread unsafe, and the JVM is in a loop, or that the business logic itself was thread unsafe, and just needed to lock around its transactions?

otterley 8 months ago

https://docs.oracle.com/en/java/javase/21/docs/api/java.base...
"Note that this implementation is not synchronized. If multiple threads access a map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with an existing key is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedSortedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map..."
- layer8 8 months ago
  
  To add to this: Java’s original collection classes (Vector, Hashtable, …) were thread-safe, but it turned out that the performance penalty for that was too high, all the while still not catching errors when performing combinations of operations that need to be a single atomic transaction. This was one of the motivations for the thread-unsafe classes of the newer collection framework (ArrayList, HashMap, …).
  - wging 8 months ago
    
    > still not catching errors when performing combinations of operations that need to be a single atomic transaction
    This is so important. The idea that code is thread-safe just because it uses a thread-safe data structure, and its cousin "This is thread-safe, because I have made all these methods synchronized" are... not frequent, but I've seen them expressed more often than I'd like, which is zero times.
    
    magicalhippo 7 months ago
    
    Shows up in other places as well, such as file systems. Trying to check if a filename exists, and if it doesn't then create the file for example. Instead one should simply try to create the file and handle the error if it already exists.
lucianbr 8 months ago

Java TreeMap is thread unsafe.
The business logic is thread unsafe because it uses a TreeMap concurrently. It should either use something else or lock around all usages of the TreeMap. It does not seem to be "in itself" wrong, meaning for any other cause than the usage of TreeMap.
TYMorningCoffee 8 months ago

Both! The TreeMap is thread unsafe. The business logic needs to protect against concurrent access to the TreeMap or not use the treemap at all.

neonsunset 8 months ago

In practice, it's rarely an issue in C# because it offers excellent concurrent collections out of box, together with channels, tasks and other more specialized primitives. Worst case someone just writes a lock(thing) { ... } and calls it a day. Perhaps not great but not the end of the world either.

I did have to hand-roll something that partially replicates Rust's RWLock<T> recently, but the resulting semantics turned out to be decent, even if not providing the exact same level of assurance.

fulafel 7 months ago

> A while back my machine was so messed up that I could barely ssh onto it. 3,200% CPU utilization - all 32 cores on the host were fully utilized!

You wouldn't notice the load from 32 cpu pegging threads or processes on a 32-core host when ssh'ing in. Sounds like the OP is maybe leaving out what the load on the machine was, maybe it was more like thousands of spinning threads?

doctor_phil 8 months ago

Why does the fix need to remember all the nodes we have visited? Can't we just keep track of what span we are in? That way we just need to keep track of 2 nodes.

In the graphic from the example we would keep track like this:

  low: - high -
  low: 11 high: -
  low: 23 high: -
  low: 23 high: 26
  Error: now we see item 13, but that is not inside our span!

kristianp 7 months ago

Has anyone noticed an inability to horizontally scroll the code samples on chrome Android phones? I've noticed it on a few different blogs. The window has to be dragged at lower point on the screen to scroll the code section.

TYMorningCoffee 7 months ago

Fixed! It was because the table overflowed. This created a scroll bar for the entire page. The entire page scrollbar had a weird interaction with the code block scroll bars. Now that there is no more whole page scroll bar, it works.
- kristianp 7 months ago
  
  Yep, works fine now. Thanks!
TYMorningCoffee 7 months ago

Yes. Sorry about that. Will look into it when I have time.
- TYMorningCoffee 7 months ago
  
  I think I tracked it down to the table in the article adding a scroll bar for the entire page. This scrollbar fights with the code blocks scrollbar. I will try to refactor the table to fit in a phone width somehow.

memoryfault 7 months ago

This is a fairly common bug in multithreaded use of .NET Dictionary, too.

xandrius 7 months ago

Does CPU utilisation calculations work this way?

- 1 CPU at 100% = 100%

- 10 CPUs at 100% = 100%

- 1 CPU at 100% + 9 at 0% = 10%

Is that not right? Or is CPU utilisation not usually normalised?

TYMorningCoffee 7 months ago

It depends which OS. Windows yes, Linux no.

xyst 8 months ago

At least it was using all of the cores. The CPU running this application was cooking.

yburkov 7 months ago

junior level error, dont understand whats to talk here about... Use ConcurrentSkipListMap and never write our own concurrent data structure unless you Doug Lea

NovaX 7 months ago

How about the senior engineer level error of forking Doug Lea's concurrent data structures only to make them operate in worst-case time complexity? Found this doozy recently.
[1] https://gist.github.com/ben-manes/6312727adfa2235cb7c5e25cae...

mvc 8 months ago

Haha, I was recently running a backfill was quite pleased when I managed to get it humming along at 6400% CPU on a 64vcpu machine. Fortunately ssh was still receptive.

krick 7 months ago

Wow, that's pretty high-effort blog post (taking time to compare across 12 mainstream languages).

aqueueaqueue 7 months ago

Serious though. What 100% means on cpu dashboards is as inconsistent as what time zone dates are in.

cytocync 7 months ago

[dead]

procaryote 8 months ago

TL;DR; don't use thread unsafe data structures from multiple threads at once

khana 7 months ago

[dead]

opentokix 8 months ago

click Java close

tasty_freeze 8 months ago

you missed the part where he reproduced it in a number of other languages, and some where he was unable to reproduce it.

lcfcjs6 8 months ago

Java is usually shockingly inefficient.