lidavidm a day ago

Arrow has several other related projects in this space:

Arrow Flight SQL defines a full protocol designed to support JDBC/ODBC-like APIs, but using columnar Arrow data transfer for performance (why take your data and transpose it twice?)

https://arrow.apache.org/blog/2022/02/16/introducing-arrow-f...

There's an Apache-licensed JDBC driver that talks the Flight SQL protocol (i.e. it's a driver for _any_ server that implements the protocol): https://arrow.apache.org/blog/2022/11/01/arrow-flight-sql-jd...

(There's also an ODBC driver, but at the moment it's GPL - the developers are working on upstreaming it and rewriting the GPL bits. And yes, this means that you're still transposing your data, but it turns out that transferring your data in columnar format can still be faster - see https://www.vldb.org/pvldb/vol10/p1022-muehleisen.pdf)

There's an experiment to put Flight SQL in front of PostgreSQL: https://arrow.apache.org/blog/2023/09/13/flight-sql-postgres...

There's also ADBC; where Flight SQL is a generic protocol (akin to TDS or how many projects implement the PostgreSQL wire protocol), ADBC is a generic API (akin to JDBC/ODBC in that it abstracts the protocol layer/database, but it again uses Arrow data): https://arrow.apache.org/blog/2023/01/05/introducing-arrow-a...
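
For a sense of what ADBC looks like from Python, here's a rough sketch against a Flight SQL endpoint using the adbc_driver_flightsql package; the server address and table name are placeholders:

    import adbc_driver_flightsql.dbapi as flightsql

    # connect to any server that speaks the Flight SQL protocol (address is a placeholder)
    conn = flightsql.connect("grpc://localhost:31337")
    cur = conn.cursor()
    cur.execute("SELECT * FROM trades WHERE price > 100")  # "trades" is a made-up table
    table = cur.fetch_arrow_table()  # results arrive as Arrow batches, no row transposition
    cur.close()
    conn.close()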

  • majoe 18 hours ago

    Arrow is pretty cool, although I haven't had the opportunity yet to use it.

    I skimmed the paper you linked and wondered how one measures the ser/de time a query takes, or more generally how one would estimate the possible speedup of using Arrow Flight to communicate with a database.

    Do you by chance have any insights in that direction?

    At work we have a Java application that produces a large amount of simulation results (ca. 1 TB per run), which are stored in a database. I suspect that a lot of time is wasted on ser/de when aggregating the results, but it would be good to have some numbers.

    • lidavidm 14 hours ago

      To get a really precise answer you'd have to profile or benchmark. I'd say it's also hard to do an apples-to-apples comparison (if you only replace the data format in the wire protocol, the database probably still has to transpose the data to ingest it). And it's hard to run a benchmark in the first place, since your database's wire protocol probably isn't exposed in a way that lets you benchmark it in isolation.

      You can sort of see what benefits you might get from a post like this, though: https://willayd.com/leveraging-the-adbc-driver-in-analytics-...

      While we're not using Arrow on the wire here, the ADBC driver uses Postgres's binary format (which is still row oriented) + COPY and can get significant speedups compared to other Postgres drivers.
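
      A rough way to get some numbers, if you can point two drivers at the same database, is just to time both fetch paths. This is only a sketch; the URI and table name are placeholders, and both drivers would need to be installed:

          import time
          import adbc_driver_postgresql.dbapi as adbc
          import psycopg

          URI = "postgresql://user:pass@localhost/simdb"   # placeholder
          QUERY = "SELECT * FROM results LIMIT 1000000"    # "results" is a made-up table

          # row-oriented path
          start = time.perf_counter()
          with psycopg.connect(URI) as conn, conn.cursor() as cur:
              cur.execute(QUERY)
              rows = cur.fetchall()
          print("psycopg:", time.perf_counter() - start)

          # columnar path (Postgres binary COPY under the hood)
          start = time.perf_counter()
          conn = adbc.connect(URI)
          cur = conn.cursor()
          cur.execute(QUERY)
          table = cur.fetch_arrow_table()
          print("adbc:", time.perf_counter() - start)
          cur.close()
          conn.close()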

      The other thing might be to consider whether you can just dump to Parquet files or something like that and bypass the database entirely (maybe using Iceberg as well).

      • majoe 35 minutes ago

        Thanks for the answer.

        We will start a refactoring of the application in a few weeks to get rid of the performance problems. I will keep your advice in mind and do some thorough benchmarks in the meantime.

  • chrisjc 20 hours ago

    And all the Arrow parts work together quite nicely.

        ADBC client --> Flight SQL (duckdb/whatever) --> Flight --> ?
    
    The result highlights your exact point: why take your data and transpose it twice?

    It's quite an exciting space, and lots of projects popping up around Arrow Flight and duckdb.

ianmcook a day ago

Arrow developer here. We've invested a lot in seamless DuckDB interop, so it's great to see it getting traction.

Recent blog post here that breaks down why the Arrow format (which underlies Arrow Flight) is so fast in applications like this: https://arrow.apache.org/blog/2025/01/10/arrow-result-transf...

  • chrisjc 20 hours ago

    Thank you for all the work you guys do. The Arrow ecosystem is just absolutely incredible.

    My few gripes with duckdb interop are about Arrow scanning/pushdowns. This extends to interop with other projects like pyiceberg too.

    Registering an Arrow Dataset (or pyiceberg scan) as a "duckdb relation" (virtual view) is still a little problematic. Querying these "relations" does not always result in an optimal outcome.

    For Arrow datasets, you can intercept the duckdb pushdown, but duckdb will have already "optimized" the plan to its liking, and any scanning restrictions that may have been more advantageous based on the nuances of the dataset might have been lost. Eg:

        WHERE A IN (3, 5, 7)
    
    is presented to the Arrow scanner (pushdown) as "A is between 3 and 7 inclusive" (https://duckdb.org/docs/guides/performance/indexing.html#zon...).

    Perhaps in a similar way, turning a pyiceberg scan into a relation for duckdb effectively materializes the entire scan as an Arrow Table rather than some kind of pushdown/"scan plan" that duckdb could potentially make more efficient with its READ_PARQUET() functionality.
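
    A minimal sketch of where this shows up (the Parquet path and column name are just placeholders):

        import duckdb
        import pyarrow.dataset as ds

        dataset = ds.dataset("results/", format="parquet")   # placeholder path
        con = duckdb.connect()
        con.register("events", dataset)
        # The IN list may reach the Arrow scanner only as a zone-map style range
        # (A >= 3 AND A <= 7), so the scanner can't prune as tightly as the
        # original predicate would allow.
        tbl = con.execute("SELECT * FROM events WHERE A IN (3, 5, 7)").fetch_arrow_table()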

    Most of this is probably dependent on duckdb development, but all of the incredible interop work done across communities/ecosystems so far gives me a lot of confidence that these will soon be matters of the past.

    • 1egg0myegg0 12 hours ago

      IN list filter predicate pushdown is much improved in DuckDB 1.2, coming in about a week! I am not sure if it applies to Arrow yet or not. Disclaimer: I work at MotherDuck and DuckDB Labs

      • ianmcook 10 hours ago

        @1egg0myegg0 that's great to hear. I'll check to see if it applies to Arrow.

        Another performance issue with DuckDB/Arrow integration that we've been working to solve is that Arrow lacked a canonical way to pass statistics along with a stream of data. So for example if you're reading Parquet files and passing them to DuckDB, you would lose the ability to pass the Parquet column statistics to DuckDB for things like join order optimization. We recently added an API to Arrow to enable passing statistics, and the DuckDB devs are working to implement this. Discussion at https://github.com/apache/arrow/issues/38837.

  • praptak 21 hours ago

    Congratulations! Now you can say you have your ducks in Arrow.

mritchie712 a day ago

Author here. I was surprised by how easy this was. The server is only ~50 lines of Python and it works quite well!
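
For readers who want a feel for the shape of it (this is not the article's exact code), a stripped-down DuckDB-backed Flight server might look roughly like this; the table-name-in-descriptor and SQL-in-ticket conventions are assumptions:

    import duckdb
    import pyarrow.flight as flight

    class DuckDBFlightServer(flight.FlightServerBase):
        def __init__(self, location="grpc://0.0.0.0:8815", db_path="flight.duckdb"):
            super().__init__(location)
            self._conn = duckdb.connect(db_path)  # the single process that owns all writes

        def do_put(self, context, descriptor, reader, writer):
            table_name = descriptor.path[0].decode()   # client names the target table in the descriptor
            incoming = reader.read_all()               # Arrow table uploaded by the client
            self._conn.register("incoming", incoming)  # expose it to DuckDB without copying
            # assumes the target table already exists with a matching schema (and no input validation; it's a sketch)
            self._conn.execute(f"INSERT INTO {table_name} SELECT * FROM incoming")
            self._conn.unregister("incoming")

        def do_get(self, context, ticket):
            query = ticket.ticket.decode()             # ticket carries a SQL string
            result = self._conn.execute(query).fetch_arrow_table()
            return flight.RecordBatchStream(result)

    if __name__ == "__main__":
        DuckDBFlightServer().serve()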

memhole a day ago

Arrow Flight is a very neat project! How do you avoid race conditions, or data overwriting each other, now that you have multiple writers? I think this is typically solved with WAL files. So maybe it's something internal to the Flight server?

  • koromak a day ago

    Yeah, I'm confused as to what it's actually doing.

    • neomantra a day ago

      I do like this article a lot for showing how to do this pattern of slurping data and inserting it into a DB, in the context of Arrow Flight.

      The concurrency rules of DuckDB are here [1]. Reads/writes need to happen in the same process, but multiple threads within that process can do so.

      This is putting a server in front of a DuckDB instance, so all reads/writes are funneled through that one process. DuckDB takes care of the concurrency within the process via MVCC.

      You could do the same thing with an HTTP server or other system, but this shows it with the Flight RPC framework.
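
      In code terms, a rough sketch of one of many writer processes talking to such a server (the address, table name, and SQL-in-ticket convention are all assumptions, mirroring the server sketch above):

          import pyarrow as pa
          import pyarrow.flight as flight

          client = flight.connect("grpc://localhost:8815")

          # upload a batch; the server serializes all writes through its one DuckDB connection
          table = pa.table({"id": [1, 2], "value": [3.0, 4.0]})
          writer, _ = client.do_put(flight.FlightDescriptor.for_path("measurements"), table.schema)
          writer.write_table(table)
          writer.close()

          # read back via a SQL string in the ticket
          reader = client.do_get(flight.Ticket(b"SELECT count(*) FROM measurements"))
          print(reader.read_all())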

      NOTE: I had an incorrect comment for about 2 minutes that I deleted and restructured here. Sorry if you saw that noise.

      [1] https://duckdb.org/docs/connect/concurrency

      • memhole a day ago

        Thanks! That makes sense now.

lmeyerov a day ago

Just sanity checking here: with Flight write streams to duckdb, I'm guessing there is no notion of a transactional boundary, so if we want data consistency during reads, that's another layer of manual app responsibility? And atomicity is there, but only at the single record batch or row group level?

Ex: if we have a streaming financial ledger as 2 tables, that is 2 writes, and a reader might see an inconsistent state of only 1 write

Ex: streaming ledger as one table, and the credit+debit split into 2 distanced rowgroups, same inconsistency?

Ex: in both cases, we might have the server stream back an ack of what was written, so we could at least get a guarantee of which timestamps are fully written for future reads, and queries can manually limit to known-complete intervals

We are looking at adding streaming writes to GFQL, an open source columnar (arrow-native) CPU/GPU graph query language, where this is the same scenario: appends mean updating both the nodes table and the edges table

  • eatonphil a day ago

    Yes, reading this post (working around a database's concurrency control) made me raise an eyebrow. If you are OK with inconsistent data, then that's fine. Or if you handle consistency at a higher level, that's fine too. But if either of these is the case, why would you be going through DuckDB? You could write out Parquet files directly.

    • westurner 17 hours ago

      cosmos/iavl is a Merkleized AVL tree.

      https://github.com/cosmos/iavl :

      > Merkleized IAVL+ Tree implementation in Go

      > The purpose of this data structure is to provide persistent storage for key-value pairs (say to store account balances) such that a deterministic merkle root hash can be computed. The tree is balanced using a variant of the AVL algorithm so all operations are O(log(n)).

      Integer Vector clock or Merkle hashes?

      Why shouldn't you store account balances in git, for example?

      Or, why shouldn't you append to Parquet or Feather and LZ4 for strongly consistent transactional data?

      Centralized databases can have Merkle hashes, too;

      "How Postgres stores data on disk" https://news.ycombinator.com/item?id=41163785 :

      > Those systems index Parquet. Can they also index Feather IPC, which an application might already have to journal and/or log, and checkpoint?

      DLT applications for strong transactional consistency sign and synchronize block messages and transaction messages.

      Public blockchains have average transaction times and costs.

      Private blockchains also have TPS Transactions Per Second metrics, and unknown degrees of off-site redundancy for consistent storage with or without indexes.

      Blockchain#Openness: https://en.wikipedia.org/wiki/Blockchain#Openness :

      > An issue in this ongoing debate is whether a private system with verifiers tasked and authorized (permissioned) by a central authority should be considered a blockchain. [46][47][48][49][50] Proponents of permissioned or private chains argue that the term "blockchain" may be applied to any data structure that batches data into time-stamped blocks. These blockchains serve as a distributed version of multiversion concurrency control (MVCC) in databases. [51] Just as MVCC prevents two transactions from concurrently modifying a single object in a database, blockchains prevent two transactions from spending the same single output in a blockchain. [52]

      > Opponents say that permissioned systems resemble traditional corporate databases, not supporting decentralized data verification, and that such systems are not hardened against operator tampering and revision. [46][48] Nikolai Hampton of Computerworld said that "many in-house blockchain solutions will be nothing more than cumbersome databases," and "without a clear security model, proprietary blockchains should be eyed with suspicion." [10][53]

      Merkle Town: https://news.ycombinator.com/item?id=38829274 :

      > How CT works > "How CT fits into the wider Web PKI ecosystem": https://certificate.transparency.dev/howctworks/

      From "PostgreSQL Support for Certificate Transparency Logs Now Available" https://news.ycombinator.com/item?id=42628223 :

      > Are there Merkle hashes between the rows in the PostgreSQL CT store like there are in the Trillian CT store?

      > Sigstore Rekor also has centralized Merkle hashes.

      • yencabulator 14 hours ago

        I think you replied in the wrong post.

        • westurner 12 hours ago

          No, I just explained how the world does strongly consistent distributed databases for transactional data, which is the exact question here.

          DuckDB does not yet handle strong consistency. Blockchains and SQL databases do.

          • lmeyerov 9 hours ago

            Blockchains are a fantastic way to run things slowly ;-) More seriously: Making crypto fast does sound like a fun technical challenge, but well beyond what our finance/gov/cyber/ai etc customers want us to do.

            For reference, our goal here is to run around 1 TB/s per server, and many times more on a beefier server. The same tech just landed at spot #3 on the Graph 500 on its first try.

            To go even bigger & faster, we are looking for ~phd intern fellows to run on more than one server, if that's your thing: OSS GPU AI fellowship @ https://www.graphistry.com/careers

            The Flight perspective aligns with what we're doing. We skip the duckdb CPU indirections (why drink through a long twirly straw?) and go straight to Arrow in GPU RAM. For our other work, if duckdb does give reasonable transactional guarantees here, that's interesting... hence my (in earnest) original question. AFAICT, the answers so far rest on operational details & docs that don't connect to how we normally talk about databases giving you consistent vs. inconsistent views of data.

          • westurner 9 hours ago

            How would you recommend that appends to Parquet files be distributedly synchronized with zero trust?

            Raft, Paxos, BFT, ... /? hnlog paxos ... this is about "50 years later, is two-phase locking the best we can do?" https://news.ycombinator.com/item?id=37712506

            To have consensus about protocol revisions; To have data integrity and consensus about the merged sequence of data in database {rows, documents, named graphs, records,}.

rustyconover 20 hours ago

This is a great lead up to my talk at DuckCon 6 on Friday in Amsterdam.

Its title is: “Airport for DuckDB: Letting DuckDB take Apache Arrow Flights”

You can stream it at: https://duckdb.org/events/2025/01/31/duckcon6/

You’ll see how DuckDB can integrate with Arrow Flight for reading/writing/creating tables, scalar user defined functions, table returning functions, table in/out streaming functions, row/column filtering and anything else I can breeze through in 15 minutes.

It all starts with a single call to ATTACH with the URL of the Arrow Flight server.

  • chrisjc 20 hours ago

    I've been eagerly awaiting this for a couple of months now. And I've long wondered why there hasn't been such an implementation/extension for Flight, especially as there are extensions for mysql, postgres, sqlite...

    It only seems natural that an extension could be developed for ATTACHing to a Flight server.

    Looking forward to watching your talk.

simlevesque a day ago

> 1. DuckDB doesn't support concurrent writers

I just write Parquet files to S3 and query them using duckdb, every minute. I can have as many writers as I want. For analytics.
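
Roughly this pattern, with a placeholder bucket and path (DuckDB's httpfs extension handles the S3 access; credentials/region would come from your own config):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region='us-east-1'")  # placeholder region

    # each writer drops its own Parquet file; the reader just globs them all
    result = con.execute(
        "SELECT count(*) FROM read_parquet('s3://my-bucket/results/*.parquet')"
    ).fetch_arrow_table()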

hoerzu a day ago

Curious about ideas for how to make it distributed. Might be the wrong use case, but that's usually the bottleneck?

  • chrisjc 20 hours ago

    For read-oriented interactions...

        ADBC --sql--> Flight SQL Server --flight--> Flight Servers (plural) --> datasources
    
    or

        SELECT * FROM <relation> WHERE ...
    
    where the "relation" is a collection of Arrow streams from a cluster of Flight Servers.

    Remember that Flight/Flight-SQL is not a query engine replacement, but rather a way to communicate and exchange data efficiently between components with different responsibilities.