← Ingestions

Ingestion bef18149 extracted

Format
transcript
Kind
talk
External ID
Chris Hasiński - Fantastic Databases and Where to Find Them - wroc_love.rb 2023.txt
Content hash
ec970ec9c809
Source at
2023-03-31 09:00
Manual extractions are temporarily disabled.

Extractions (2)

Status Model Tokens (in/out) Duration Cost Nodes/edges Read set (nodes/edges) Time
completed claude-opus-4-7
463,912 / 17,062
133,059 cached · 10,241 write
269.4s - 29 / 55 107 / 9 2026-04-17 22:11
failed claude-opus-4-7 RubyLLM::BadRequestError: You have reached your specified API usage limits. You will regain access on 2... 2026-04-17 16:18

Content

um Andre introduced me as Kristoff but I


know that for English-speaking people


that's pretty much unpronounceable and


difficult to spell so I usually go by


Chris


and I want to talk to you today about


the some mythical creatures


so heavy dragons


hippie dragons is a phrase that you use


when you have a map and you don't know


what lines there so you put hippie


dragons so people understand that you


don't know that and you're supposed to


explore


this is a phrase that is actually modern


the original phrase was here be Lions


but uh of course ancient map doesn't


have uh dragons in them however there is


one particular


um not a map but the globe Lennox globe


that actually has this phrase written on


it there is a small inscription in Latin


draconess which means Herbie dragons


um I want you guys to remember this for


a second why I talked much about the


slightly more more scary stuff which are


layoffs


layouts are the main topic for our


industry for the last two years I think


but with all the ends come New


Beginnings which usually begin with


recruitments


so we are very much prepared for


recruitment interviews and during those


you're supposed to ask questions about


the project because you're supposed to


show that you're interested you actually


care about those recruitments after some


so long of not caring because you had so


many of them


so you ask questions


um one of the good questions is which


rails version do you guys use because


you're interested in the project if it's


seven that's cool that is his sixth and


it's okay if it's five there is a lot of


swell


unease in the air and if it's for you


actually want to charge them extra some


other questions is is it DDD because


that's a trendy thing yeah to use even


sourcing yeah that's that's yeah you


might actually want to take a pay hit


when you actually get the project like


that


or maybe you are front-ending line and


you want to


get some info whatever you're using


typescript or perhaps react how do you


manage State on the front end


or maybe not maybe you're scared of


front-end and you want to look at the


other side you want to look at the


process The Continuous integration The


Continuous deployment perhaps if you're


into kubernetes or Nomad that's also a


good question that's a valid question


during a job interview


um some people will ask about tests


which are really good questions to be


honest perhaps mutation testing which


you guys didn't get to at the workshops


I think they didn't actually commence


yeah so these are all those questions


but uh the other person usually get the


good answers until you ask them


can you tell me a little bit more about


the database


because you'll get also like oh it's


postgres and uh which version I I don't


remember I'll have to check I'll get


back with you to you with this yeah


most of the real shop talk that we do


and hallway track


interviews at work are related to


storing and manipulating data and that's


precisely what a database does


if you don't believe me let's take a


look at typical rails application


this is extremely simplified and wrong


version of the diagram that shows you a


rails application but it gets a point


across because it can be divided into


three sections the first one is the


front end which is the part that usually


then the user cares about because that's


how you get customer support tickets


something doesn't work and you're


supposed to figure out what actually


doesn't work but they look at the front


end then there is the big part the big


part is Big because we actually care


about this we are Ruby devs we want to


look at this part and this part only


this part is the interesting part but


there is a part that actually does most


of the work which is a database and if


you still don't believe me that this is


the case let's take a look at something


called test pyramid if you Google test


pyramid this is this idealized version


how you suppose the test you're supposed


to have most of your Tesla's unit test


you're supposed to have some integration


test with off of things that are not


supposed to be tested in isolation and a


small layer of end-to-end test because


these are expensive these are difficult


to maintain and there are error problems


if you Google the opposite of this this


is ice cream cone which has uh only a


few unit tests more integration tests


and then you have a lot of end-to-end


tests and if you really want the ice


cream these are the manual test but for


the typical rails app none of those


really apply


this is the idealized version


and this is the rails version of things


we actually can test only a few things


in unit tests


we sometimes have this layer of


end-to-end test because people like the


automate browser for some reason but


majority of the tester integration tests


the reason for them being integration


test is you integrate with the database


you're supposed to hit the database to


actually check if the logic works right


most rails application are pretty


useless without the database


because the applications are made as a


point of interaction with the data


storage with a database


and we still treat those database like


some storage box something that we're


not supposed to really look at these are


exotic these are weird


these are the dragons on the map


I'd like to talk to you about how to


handle those mythical creatures


first of all let's acknowledge databases


are weird like really weird they are not


object oriented they are not functional


they're declarative which is a very very


foreign paradigm


well they are not aware of most of your


abstractions including the user's data


separation which means that well we are


connect only as one user to the database


and the database doesn't know who's


logged in because it's only one user the


rails know which user is it


but they have one redeeming quality they


are very very good at processing data


and much faster than Ruby


okay let's start with the smallest


creature there is


um sqlite by the way images generated by


bin journey I asked to generate a


database as an animal


all right so sqlite typically used for


sqlite is I've made this size during my


free time as a project that I do after


work


and I really want to show it to the


world


but I don't feel like setting up a full


database it's not my thing and if you


look at the sqlite documentation there


is this long text about how to use them


on websites and yeah you can actually


use them for simple white what


if you actually multiply those numbers


they're handling 20 million SQL queries


in a day


on a server that's in a VM seven years


ago Hardware so not your latest thing


the crappy Skylake of the time


and they share this VM with 23 other


applications


this is Escalade website so turns out


that the clock end of sqlite is actually


Superman


and yeah sqlite is extremely extremely


fast


and it's a public domain


and it has very easy backups which those


things are like marketing ideas but


think about this way it can handle lots


and lots of traffic but free to your


hosting


it can be embedded into a gem and you


don't really care if it's well the


license won't work because it's public


domain


and also if you think that the backups


are important think about this way you


create one set of fixtures on one set of


tests or even just a schema and you copy


this file five times and you have five


databases working for your tests and you


can run them in parallel which is


extremely fun


they have some downsides with those like


um


in typical implementation every single


query will look Ruby so you're supposed


to use processes instead of threads


and because there is only one writer you


might get some database unavailable so


you're supposed to do retries when you


insert data or update data


and there is no full text search which


is really handy for small websites


because search is some a popular feature


but there are ways around it


yes so this very small database has lots


of usages and you might want to consider


it even for larger scale product it has


lots of reads


there is also of course the elephant in


the room


and this one is cool because it's it's


probably the best open source project


ever at least in my opinion it's


extremely versatile you can replace all


of your stack with just postgres yes all


of it


if you read this post by


um Stefan Schmidt I think this is


amazing cto's blog this was very popular


some time ago this list the cases in


which postgres can replace some other


part of your stock it can be a graphql


it can be a cache it can be a queue it


can be mostly anything


amazing article it was real popular in


hacking is for some time


and I extracted some of the examples


because I think they are cool the one


that really hit me was the unlock table


this is basically cash


unlocked means that it doesn't write


write the headlock to the disk which


means that if your database goes down or


your tables are gone like cash but it's


very very fast compared to the normal


storage and it's still SQL rails


actually supports this starting version


6 I think you can create unlock tables


straight from migrations


really cool


um but you think that yeah I already


have read this what do I need this for


this is cool as a proof of concept but


it's pretty much useless don't do this


it's a thing you append to our spec on


on tests and if you use this all of the


tables that rspec creates will be


unlocked


which means they are much faster I've


tested in a couple test Suites from


different clients and this managed to


shave off like three and a half minutes


of tests and on a bad test suit and


about 20 seconds on the good without any


of the changes it works really good if


you have lots of Rights so factoric


Cascades and stuff really cool but


remember if you restart your database


the tables are gone so you have to do


migrations again good for CI maybe not


so good in development


but holy this is fast this is


really cool it's the only other features


that we can use like have actually


anyone actually browse the features of


postgres


it's a really long document it was


really fun like there are interesting


things in it


um


I'll skip to the advanced features


section because we like Advanced


features and in my humble opinion the


most fun begins over here in window


functions


have anyone ever used a window function


unfortunately okay it's an interesting


case


window functions are basically parallel


calculations so you get some results but


you want to apply some calculation for


every row


there are context awareness of the aware


of other rows and they can do lots of


things they are mostly misunderstood


people think of them as some exotic SQL


that the data engineers and they have


the data scientists will use in some


other queries that you're supposed to do


in reporting or business intelligence


but actually they are freakishly


powerful


here are some examples I extracted from


various different projects and yeah


let's start with the first one which is


uh calculating some average originally


it was an employee's salary average but


this one actually speaks better


um to you um


and SQL you basically do this over


and give it some idea on how to apply


the calculation for the particular row


so in this case this will actually split


this entire thing into salary per


average salary of uh every Department


because this is the partition


and yeah this looks like SQL and it


looks like SQL that you you're not used


to so you don't really like it but you


can write it in rails and it's perfectly


reasonable you can even put it into some


kind of dto or the kind of object and


you will get a perfectly object-oriented


code out of this without just this tiny


tiny bit of SQL sprinkled over it and


this is much faster than doing this


calculation in rails with two queries


and combining the data from the two


okay second example cheapest product per


category this is useful for e-commerce


stuff like you want to upsell something


something cheap someone but something


expensive so now everything looks cheap


so you sell them the thing in a


particular category


yeah so um


this thing


will assign a row number


to every category and will order them by


price and name the name is important


because now the ordering is stable so


you'll get the same products every


single time


and yeah you could do things like you


can get all the cheapest item for


categories and you want to combine them


and you can do this in Ruby that's fun


but you wouldn't have this much fun


without pattern matching like functional


programming and declarative programming


super cool and yeah you can still do it


in rails yeah it's possible to do this


thing in rails and it's perfectly fine


it works it's wonderful even the pattern


matching thing works


so the last item from the window


functions I'd like to show you is one


that is extremely Googled for everyone


the stack Overflow answer has like


thousands of those how to do the


cumulative sum so something


um one on this then one this plus the


first the previous month in the previous


month and you add something in a


cumulative way


and I actually managed to refactor this


into two different things


which is cool it has come composition


inside of SQL it's not just one single


big query you can understand the parts


and what's even cooler is that oh by the


way this sounds like an English right it


is kind of confusing thing this many


keywords in a row yeah I think this


that's weird but still it's pretty cool


you can still do this in rails including


composition it works you can just do one


query and then do from from that query


and you can do the window function and


it will actually do all rows between the


beginning of time to current row apply


this to to the sum


really cool and this is also really fast


all those queries are really really fast


compared to duty doing this thing in


Ruby


and why let me explain


is a feature that people are aware of


but they don't know how to read it


like raise your hand if you know what


the first thing on the cost estimation


actually means because you have always


from this to this


yeah some people know but it's very very


um I would say weirdly explained in


postgres documentation


it contains a hidden feature postgres


actually keeps a table of statistics and


you can use it for various different


things do you know if you use Gmail and


you search for something and you get


page one of about 300 something that


about 300 something is actually feature


of explain function which will try to


estimate the number of rows for a query


and there is an explain method in active


record but it is kind of bad and


somebody can write the pull request it


will be really really cool to accept


other options because there are options


to explain it's not just like explain


and that's it you don't have anything


else there are lots of different options


that you can enable to get some more


data out of it


let's try a non-trivial but easy query


like just just one join there are two


conditions one of them and join the


other one is in where


nothing too fancy and uh


the one of the features that is


available in explain are costs I've


disabled them explicitly to get the


simplest format possible


and this is the explain for this query


on a small table


okay how to read this thing


the things with arrows are strategies


you can look up each strategy sequential


scan is just each in our term so it's


right over the entire thing


hash is to build a hash and sequential


scan so this Loop can have a filter


which will actually remove some elements


out of it


and if you have a nested item that means


that this strategy is required for this


one so sequential scan is required for


this hash to be built


so you have a list well another list a


tree of strategies to apply


okay sounds simple


let's turn on the default now we have


some more data


and in this particular part you have


this estimation I actually wrote written


a gem that extract this so we can just


do estimate count instead of count and


you get an estimate I know there are


some other gems that do this but they


require migrations and mine doesn't plus


is mine Shameless plug


okay but let's go to to to the original


topic which is explain how to read this


thing


okay


so the first one which is really tricky


for some people is the cost to get the


first row because you have to remember


that those strategies actually streaming


data and they try to parallelize as much


as possible


so this is the cost to get the first row


out of this strategy this is the cost of


getting all rows from the strategy this


is the number of rows that we estimate


to find and this is the size in bytes of


a particular row


really cool if you do limit one you will


only okay get this cost because the


basically we only care about the first


item if you don't do limit you'll get


discussed because we need all items


all right sounds simple


and you have to remember that the costs


are well sometimes the query have to


wait for something because you basically


need to strategy downwards to be


completed before you can do something so


in this case this hash gets written


instantly from the sequential scan all


the sequential scans can start


immediately so the zero is the cost for


the first row because you're basically


iterating and you already got the first


row the first time you iterate


but this hash join has to wait on this


hash that we built to actually do


something


so the strategies can block other


strategies


time to first row is important to know


what is actually waiting for something


to be done to start executing this


strategy


sounds difficult but to be honest it


isn't and you can always validate if


this is actually the case by adding


analyze to explain


analyzer would actually run the query


so you get the real time of how long


something took and if you want to do the


explain analyze on insert or update or


delete rather than transaction and roll


back that's the thing that they don't


mention in docs for some reason I don't


know why people like inserting tons of


rows because they don't understand that


they can roll back after explaining


something


all right so you get some extra data


there is a second parenthesis under this


so in the second parenthesis


there is an actual time to first row


which is in milliseconds time for all


rows number of rows and Loops means that


some strategies execute more than once


so you know how many times they they


executed


okay


sounds simple


you also get the planning time which is


how long postgres took to figure this


out the planning time are reused so the


second time will run and it'll be lower


um yeah you can you can actually tell


that this is basically


um the real time for those costs it's


pretty accurate but the actual time it


takes there is some actual time they


will take to to get the first row


because you have to read it from the


disk and you have to apply this filter


so the cost isn't actually matching the


actual time that closely


okay


but these features are probably you have


used at least I know that people use


explain analyze the one that will be


default soon is buffers


and buffers are important that's why


they will be default I think in postgres


16 or 17 they're still debating that


um


buffers will tell you how much data


was read from the disk and how much was


actually in Ram already


so you can see how much of a disk access


you need to have just something to be


used and you get the memory usage for


those and you eat you have the this um


um just a second


I think I lost one's like


yeah I think I lost one slide not not to


worry um if you rerun this you'll see


that the buffers change and you'll get


uh more hits uh there are the hits in


the in the buffer than reads for the


second time and also the execution time


will be lower because there is already


some data in the buffers


how can you optimize it even further


well use an index


and indexing is like I don't know this


is like magic to some people that will


try to do every single index or they try


to figure out some magical way of of


trying which index will be good there


are many many theories but to be honest


there are some rules that you can


actually apply like query can have only


one index it can be used for either


sorting or filtering or both but you


can't use two different indices one for


filtering one for sorting so we want to


take a look at order by and where


and the database design is which one to


use based on which one seems to be uh


the most beneficial


um


there is an app by Andrew Kane called PG


hero that has a very nice algorithm for


finding missing indices


there are different types that you might


want to try based on cardinality or type


of data


and it's very good to monitor the usage


of indexes which you can find in


postgres manual how to do


after you run vacuum analyze you'll be


able to see which indexes are induced


and if you're interested about how to


find those indexes this is the basic


algorithm that they use


it's I would say


very


um complete in most of the cases like


there aren't many cases in typical the


postgres data based on Rails apps that


this wouldn't find the right index to do


there might be some examples but I


wouldn't count them on them if you're


not to them yourself because this was


actually about dated five years ago so


pull requests are very welcome


there are some dark outs when it comes


to indexes like things that are


non-obvious


if your users are from all around the


world try all the combinations of


geodata on indices


because it seems like sometimes only the


latitude is important sometimes the only


longitude is important sometimes the


order is important of those because the


users are sometimes sliced into


continents or sometimes slides into


countries it depends on what actually


works so try all of them and see which


one will actually get picked


um you can try partial indexes which are


really nice because you can limit the


amount of data in index which will make


it much faster


the where part should mimic part of your


query so if they match the postgres


might pick this particular partial index


instead of the full one


and of course


there is a hidden feature that I haven't


seen in production it for some reason


like it's it's it's not that hidden but


I still think that people are missing it


you can include columns in your index


that are not being used for filtering or


soldering there are just uh so you don't


have to go to the database so if you


actually use the index and the columns


which are not really important but still


are being fetched then you can do an


index only query like you don't get hit


the database at all


and postgres is also fun when it comes


to extensions


this is the part why it's so versatile


the most popular ones are probably


postgis if you have like lots of


geographical data used postgres at some


point time scale for time series also


very popular


um


but the one that's underutilized this


site is it is I never know how to spell


and but to pronounce it and I I have to


look it up and I should probably do it


before this talk


they added tons of different features


because it's not only one thing they


actually merge several different


extensions into one


the most important I think are columnar


storage distributed tables and parallel


querying


raise your hand if you know what's a


columnar storage


some people do when database store the


data on the disk it's stored in rows so


you can actually fetch the entire row


each time but you can also do it the


opposite way which is for the store


columns so if you query have like lots


and lots of data


but only in a few columns the columnar


one is much much faster typical for


analytical use business intelligence and


stuff


but also because every single one of


those columns is the same type


you can compress them and store them in


chunks which means you'll have like 10


times less the storage


really really cool for lots of reads and


mostly multiple rows and or aggregates


inside this there is some downside


because site is on columnar storage


can't really update the data it's append


only but still very useful for


analytical


components of your app


and yeah the building compression is


really cool like you have 40 terabytes


table and now it's four so it fits on an


nvme expensive one but still nvme which


means it's faster


yeah it's very easy to use you just add


using column learn that's it there is no


other magic to it and you can mix and


match in a single query you can use some


tables called columnars some tables row


based and it will still work really cool


um


it's smaller faster for operations mix


match and the use case that I really


like


I remember yesterday we had to talk


about Cold Storage you do cold storage


with columnar database but the last


partition the newest data is in the row


database


so yeah this use case is really cool


like the data that is all that's


compressed is fast to retrieve in bulk


but the new one is easy to find


okay and then distributed tables with


multiple elephants


um


a few similar servers are usually


cheaper than the big one especially in


Cloud


the cloud doesn't really scale uh


vertically it likes to scale


horizontally


so you can distribute your load to


multiple tables


you can paste them on sorry user


location


and you can put the frequently used


together data together even if you copy


it several times the different different


servers


you pick the distribution column like


for example if you have a white label


app or some SAS you take the customer ID


you have this column into every single


table so they get distributed by it


and you use it in the you don't use it


in the tables that are shared which will


get copied to every single server so the


data is together


um you mark the shared table as


reference tables so postgres actually


has some idea on what you're trying to


do


and that's it there's only one function


to invoke which is a table name and


which column you want and for the shared


one is uh I don't have the shared one


but I have the collocate one which is


will be try to fit this data together


with the other one so it knows that the


store ID in this call is this table is


the same as the store ID in the other


table


these are the features that you can


actually use right now with just this


simple installing extension and and


running single SQL there's like nothing


nothing difficult to do here


but if you really want we can talk about


some future curriculum for Fantastic


Beasts


which are the other databases the one


that are more exotic which requires you


to Ping your devops and get some some


approval


cockroachdb is the fun one


it basically does what the scientist


does


but automatically you want to have this


sharding availability you have to want


the multiple servers you don't want to


do it yourself like inside this and


you're really worried that one of your


data centers will go down which is


something that might happen in the


modern world


yeah so it has Auto shorting outdated


distributions with hotspot detection


which means it will actually try to


redistribute data if it feels that it


needs to and it can be geographically


aware so we would knows that the users


in America probably want to get the data


from an American server and using Europe


want to get it from European server it


will do that


the second exotic database that you


probably well this one you actually used


what


what


really


it's like the default choice for PHP


there's it's it's


ugly and Oracle bought it like one rich


 called Larry Ellison bought this


thing


weird


but MySQL has been growing a lot of time


and well it's different now really it is


it is a lot of different thing


it's super fast


even compared to postgres this is really


really fast it can do in-memory tables


which postgres can't do even if you do


unlock tables they will still be written


sometimes sometimes to disk and this


will be entirely in memory perfect for


tests


and also


well it used to be a joke it's a serious


alternative like base camp uses it


probably right now or if hey maybe I


don't know which one of them but one of


them are using using MySQL


it has um if I ask you about what's the


difference between MySQL and postgres


most people will tell you well one is


the one that you choose because it's the


right one and the other one's the first


one which you choose when somebody else


actually picked it and you can't really


change it but the reason why it's so


fast is it handles transactions


differently it makes a different


assumption on how to insert the data


okay let's go to an example this is


postgres and uh this is a very


controversial topic right now because it


used to be like this that if you have


Thai script you're happy but on the


other case you're also happy but because


the AJ changed it you're now unhappy so


postgres will do it it will add the


second row


and Mark this one a stale so the older


queries will actually get this one but


the new one will operate on this one


and you have two rows you have to clean


them up in vacuum or something vacuum


will clean up this thing but until it


until it does this is actually taking


space and time


MySQL Works differently because it will


just update the spring and table and


they create a second table in which you


have the rollback as a segment which


means every single transaction that will


try to use the old data will get the new


data but it will apply the rollback


and because most of the transactions


actually commit pretty quickly and most


of the transactions don't really need


this assumption that you might get the


style data


and most of them will resolve correctly


they won't be rolled back this version


is just faster


it also has all the nice ideas of poster


as there is a Json operators window


functions full text search it all just


works now so it's right now a valid


choice for a database for your project


then there is more exotic fish is the


fish I don't know to be honest it's


something


this one came from Russia


and it was used for Yandex


they have a lot of marketing data this


is columnar database


the use case is my data team takes a


coffee break every time they run


something but if I'd give them access to


some cloud-based solution they will just


run the budget to the ground and oh my


God this thing has pre-calculated


Aggregates and everything is like super


fast the time in Russian is uh forgiving


Point am I Russian clickhouse yeah


term is it


somebody will correct me I'm quite sure


um which means clickhouse doesn't press


breaks these things is not definitely


not the slowest part of your system it


doesn't slow down the other parts the


other parts are the ones that are slow


it has multitude of different table


types which will do pre-calculated


Aggregates or automatically sum the rows


I will try to do some exotic kind of


updates in place really really cool also


really long documentation cloudflare


uses this to do like six million inserts


a second or something like that which is


crazy the ingestion rate on this is


amazing I highly recommend you read this


article they migrated from postgres to


this


yeah


but this is an exotic database and we


can get even more exotic databases


but unfortunately we're kind of running


out of time


and nosql is still a topic document


database or a topic graph database or a


topic


so perhaps a part two is in order next


year


if you really like it but in the


meantime feel free to explore in your


own the Fantastic world of databases


thank you


[Applause]


are there any questions


thank you for presentation yeah let's


see it was really good to hear it and I


have the question because you didn't


explain how to perceive the the cost


value


um it's 100 is high value or how to


evaluate it because I try to read it in


documentation but it's not so


um I don't know it's like story points


for me


it is it is exactly story points you can


actually check out the postgres


configuration and see which operation


has assigned which cost but it's


predetermined and you can modify them if


you really want if you think that oh I


think the sequentials scan actually is


more expensive for me because I have


very crappy hard drive then you assign


it a higher value and the plan will


change but by default there are some


well there are some defaults that are


taken from various use cases but there


are exactly just story points and it


just tries to sum the story points for a


particular part of the query so you're


right


with the unlocked tables you said um


they are effectively


wiped out when the database is restarted


is it is it can also work when the


connection is reset or is it I don't


think it's connection is reset but I I


definitely know that if I restart the


docker they will be gone I know if the


database will crash in any way they will


be gone


I haven't tested how far it goes but for


continuous integration usage like you


want to have the database written from


scratch and you want to run all the


tests and then you don't really care


what happens after it this is really


good but you have to append that if this


is run on CI on local this is actually


problematic because you download some


changes run some migrations and you're


sometimes in this weird state that have


I restarted Docker have I restarted my


computer was it restarted automatically


my tables are gone I'm supposed to run


the entire migration set again oh my God


I don't want to do this yeah cool


okay thank you Chris thank you very much


foreign