
Mastering linked data in PHP: graphs and algorithms

SymfonyCon, Presentation, Git, data, CLI, rust, automation
24 September 2024
Christian Rades
Principal Engineer

Video transcript

We utilized ChatGPT to enhance the grammar and syntax of the transcript.

Hello and welcome, everyone, to this almost-last slot of the day. Today we are going to talk about graphs. And just to be precise one more time: not the colorful kind of graphs, but the boring kind that only considers points and how they are connected.

My name is Christian, and I currently work at Shopware where I’m the resident madman doing all kinds of crazy experiments and working with whatever technology might be interesting for the future or hasn’t been explored yet, like for example, graphs. I’m also active on Mastodon, so feel free to connect with me there if you want.

Today, it's all about tangles: how things get connected when you put them together, and how you can get them back out. There are different kinds of connections, really. As the title of the talk, "Caught in a Web," suggests, we're talking about webs: they're tangled. There are cities seen from above, and there are maps. All of them share something in common: they are visually busy and highly connected. We'll explore that further.

Untangling structures like these isn't a new issue. It's been around for hundreds of years. For example, the first problem of this type is the Seven Bridges of Königsberg, now Kaliningrad, which appeared in 1736. The question was: "How can I move across the city and cross each bridge only once?"

What started as a drunken question over a few beers in the evening turned out to be mathematically profound, because there's no easy or elegant solution to this problem with the usual tools like geometry or algebra. The mathematician Leonhard Euler, whom you might know because he solved a ton of things, had the idea to break the problem down into its parts. The parts are the places where you can be: in this case, the small island, the big island, and the two sides of the river. The bridges linking them became connections.

The problem no longer depended on a geometric model; it was just about points connected by edges. Euler figured out that to pass through a place, you have to enter a node and then exit it again. This means that every node along your walk (except possibly the start and end) needs an even number of connections, because each visit uses one connection to enter and another to exit. If more than two nodes have an odd number of connections, you can't cross every edge without using one of them twice.

Euler proved it's not mathematically possible to find a path that crosses each bridge exactly once. But how do mathematicians today describe graphs? In theory it's quite simple: a set of vertices, a set of edges, and a mapping between them. But it's not that easy in practice, so I'll give a short introduction.
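Euler's argument can be checked mechanically. Here is a minimal Python sketch, with the four land masses and seven bridges modeled as an edge list (the place labels are my own):

```python
from collections import Counter

# The seven bridges as edges between the four land masses
# (north bank, south bank, the central island, and the eastern island).
bridges = [("north", "island"), ("north", "island"),
           ("south", "island"), ("south", "island"),
           ("north", "east"), ("south", "east"),
           ("island", "east")]

# Count how many bridges touch each land mass (its degree).
degree = Counter()
for a, b in bridges:
    degree[a] += 1
    degree[b] += 1

# Euler's criterion: a walk crossing every bridge exactly once exists
# only if at most two land masses have an odd number of bridges.
odd = [place for place, d in degree.items() if d % 2 == 1]
print(len(odd))       # 4
print(len(odd) <= 2)  # False: no such walk exists
```

All four land masses of Königsberg have odd degree, so the criterion fails, exactly as Euler proved.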

So, "V" stands for vertices. Vertices are the points you are connecting: just points with an ID, a single number. For example, you have vertex one, vertex two, vertex three, and so on. Then you have edges. Edges are also just identifiers, but each one connects two vertices. The mapping "F" takes each edge to the pair of vertices it connects, so an edge describes a connection between two vertices.

An edge connects two parts. For example, you have vertex one and vertex two, and the vertices come from the set of all vertices in your graph. Still easy? Not really, but there's a human way to look at it, a non-mathematician way. Vertices are points, and edges show how those points are connected. Graphs are only about how data points are connected. They don't have coordinates like you might expect from computer graphics. The coordinates are arbitrary, and how the points are placed in space is just to make the graph easier to understand as a human.
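The V, E, F notation maps directly onto plain data structures. A minimal sketch (the variable names and the tiny three-vertex graph are my own):

```python
# A graph as plain data: vertices are IDs, and the mapping assigns
# each edge ID the pair of vertices it connects (the V, E, F picture).
vertices = {1, 2, 3}
edges = {
    "e1": (1, 2),  # edge e1 connects vertex 1 and vertex 2
    "e2": (2, 3),  # edge e2 connects vertex 2 and vertex 3
}

def neighbors(v, edges):
    """All vertices connected to v by some edge (undirected view)."""
    result = set()
    for a, b in edges.values():
        if a == v:
            result.add(b)
        elif b == v:
            result.add(a)
    return result

print(neighbors(2, edges))  # {1, 3}
```

Notice that there are no coordinates anywhere: the structure is entirely about which IDs are connected.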

This is what unites nets, cities, and maps: they can all be displayed and transformed into graphs. Every time two roads meet, you can describe that as a point, with the roads as the edges. This is how tools like Google Maps can use graph algorithms to trace your path from A to B and give you directions.

With that, we see that graphs imply some form of navigation. But this isn't just physical navigation. As web developers, we rarely think about which roads we need to take to get from one place to the next. What we deal with all the time is data. And this is where I want to relate it to a tool called jq.

How many of you know jq? Okay, I see lots of hands, but not all of them. The quick explanation is: jq is a command-line tool that makes it easy to read, edit, and modify JSON data. It has its own little syntax for accessing and modifying data in JSON objects. What I want to show you is that this syntax is nothing more than navigating through a piece of JSON.

On the left side, you see a random API response. It contains two entities, each with a name and an address. The jq script accesses that data, and on the right, you see the data we want to get out. The red arrows show where we are in the data. So, let's play jq: I'm the computer now.

We start by accessing the root element, the top-level JSON object. Then, we move to the first named attribute, which is "data," and it contains an array. We access the first element in the array, which is a JSON object containing an address field, which in turn contains the street. At the end of our jq script, the value returned is "Test Street."
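The walk above is just a path through nested data. A Python sketch of the same navigation (the JSON shape is reconstructed from the description; every field value except "Test Street" is made up):

```python
import json

# A response shaped like the one in the talk: a "data" array
# of entities, each with a name and an address.
raw = '{"data": [{"name": "Alice", "address": {"street": "Test Street"}}]}'
doc = json.loads(raw)

# The jq expression .data[0].address.street is a path of edges,
# walked one hop at a time from the root object.
value = doc["data"][0]["address"]["street"]
print(value)  # Test Street
```

Each subscript is one edge of the graph; the jq syntax simply spells out the whole path at once.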

This is how it looks as a graph. It's the "Google Maps" view of this data structure. You can give the connecting edges meaningful names and plot a path through the data by transforming your jq statement into this abstract structure: a path through the structure. This is what I wanted to demonstrate: the structure of data has nothing to do with the data itself. You can completely separate the two and operate solely on the structure.

Why is this powerful? Because you can use sub-elements of those paths. For example, you might say, "I just want to get the street of every address." You navigate to the address field, get the street field, and use this subgraph to query even larger datasets. You don’t care about anything that isn’t on the red path.

But before we get too far ahead of ourselves and end up lost in theory, let's get back to something we deal with more often: objects versus tables. This is a common problem for developers; the tools built to solve it even have a name: object-relational mappers (ORMs). Don't worry, this talk isn't about ORMs, but it is about the people who have to maintain them, which includes me.

The issue is that objects often exist only in memory. We don't usually think about the order in which they were created. But when we want to turn those objects into database tables, we already have all the objects we want, and we then have to figure out how to get them into the database. Why? Because the database has a relational model, and that model uses foreign keys. Foreign keys mean you can’t reference a piece of data that the database doesn’t already know about.

At Shopware, we had a particular issue when we wanted to write an API that allowed developers to insert data in a friendly format, like the following JSON object, into our database. For example, you want to insert an "Awesome Product" with some categories into Shopware’s database. The developer's goal was to provide inventory data in this friendly format.

However, we quickly realized that you can’t create a product with categories unless those categories already exist in the database. You need to know the IDs of those categories, because without an ID, you can’t do any inserts into the database. Foreign keys must exist.

This situation reminded me of a directed graph. I started to think of this as a directed graph problem, which then led us to sort the data logically. But first, let’s define what a directed graph is.

A directed graph is a graph where the direction of connections matters. For instance, in a database, one table can know about another table through a foreign key, but the reverse may not be true. Otherwise, you'd create a cycle. A directed graph means that the connection from node A to node B is not the same as a connection from B to A. In contrast, an undirected graph doesn’t care about direction—the connections are mutual.

Now that we know we’re dealing with a directed graph and need to solve dependencies, we can think about how to order the graph. Fortunately, there are existing algorithms for this. You can look up "topological sorting" on Wikipedia. This is also what tools like dependency injection containers use to figure out the order in which objects should be instantiated for dependency injection.

I want to show you an algorithm called Kahn's Algorithm, which determines the order in which to insert the data from the JSON format I showed you earlier. To do that efficiently, we need to assign a value to each node indicating its number of incoming edges (its in-degree).

What's an incoming connection? It's an edge whose arrow points at the node. Let's assign values to our graph. The "Awesome Product" at the root of the document has no incoming connections, so it gets a value of zero. Our categories, "Awesome Things" and "Products," each have one incoming connection, so they get a value of one.

Now we can apply Kahn’s Algorithm. First, we select a node with zero incoming connections. If no such node exists, you have a cycle, which is something we’ll deal with later. For now, assume the user input is correct. We have one or more nodes with zero incoming edges. We select the first one—this will be the last one we need to insert into the database.

We then remove its edges from the graph and decrement the number of incoming edges for the connected nodes. As a result, two more nodes now have zero incoming connections. We repeat the process, selecting a node with zero incoming edges, placing it in the stack, and decrementing the incoming edges of connected nodes.

Eventually, we end up with an insertion order. From left to right: "Root," "Products," "Awesome Things," and finally, "Awesome Product." With this order, you can insert the data into the database without violating any foreign key constraints, as this is the order in which the entities depend on each other.
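The walkthrough above can be sketched as a small Python implementation. This is my own minimal version: the node names follow the example, and I model each edge as "a must be inserted before b," which is the reverse of the arrows drawn in the JSON document:

```python
from collections import deque

def kahn_order(nodes, edges):
    """Topological order via Kahn's algorithm.

    `edges` is a list of (a, b) pairs meaning "a must come before b".
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {n: 0 for n in nodes}
    adjacent = {n: [] for n in nodes}
    for a, b in edges:
        adjacent[a].append(b)
        indegree[b] += 1

    # Start with every node that has no incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        # Remove the node's outgoing edges; pick up any node
        # whose in-degree just dropped to zero.
        for nxt in adjacent[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

nodes = ["Awesome Product", "Awesome Things", "Products", "Root"]
deps = [("Root", "Products"), ("Root", "Awesome Things"),
        ("Products", "Awesome Product"), ("Awesome Things", "Awesome Product")]
print(kahn_order(nodes, deps))
# ['Root', 'Products', 'Awesome Things', 'Awesome Product']
```

The resulting order is exactly the insertion order from the talk, so no foreign key constraint is violated.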

But what about cycles? How do we deal with those? My answer is: you can Google it. Now that we know how to model our problem as a graph, we can look up how to handle cycles in a graph. If you Google "cycle detection in a graph," you’ll find algorithms like depth-first search (DFS).

Let’s take an example diagram. I’ve intentionally built in a loop to see how we can detect this using depth-first search. We pick a starting node—let’s say our current node is "Handle Basket." On the left side, we have our insertion stack. We examine the node and add it to the stack. We also check its neighbors, which are the next nodes to look at.

Our "Products" node has a connection to the "Root" node, so we add "Products" to the stack and move "Root" to the "Next to Look At" list. We do the same again, and then we arrive at "Awesome Things." When we look at the neighbors of "Awesome Things," we see that it refers back to "Products," creating a cycle.

Why is detecting cycles useful? Because now you can decide what to do with the cycle. In some domains, it might be valid to look at the data and decide that one of the nodes should be processed first. Or, depending on your goal, you might choose to break the connection temporarily and reconnect it later. In our example, however, this is broken data—you can’t get this into a database. You’d need to know the IDs of the rows before they even exist. So, in this case, you can inform the user that there’s a cycle connecting "Products," "Root," and "Awesome Things," which prevents the data from being inserted.
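The depth-first search described above can be sketched like this (my own minimal version; the adjacency list reconstructs the example, so the exact edges are assumptions):

```python
def find_cycle(adjacent):
    """Depth-first search for a cycle in a directed graph.

    `adjacent` maps each node to the nodes it points at.
    Returns a list of nodes forming a cycle, or None.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {n: WHITE for n in adjacent}
    stack = []  # the current DFS path

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in adjacent.get(node, []):
            if color.get(nxt, WHITE) == GRAY:
                # Back edge: everything from nxt onward is the cycle.
                return stack[stack.index(nxt):]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for n in list(adjacent):
        if color[n] == WHITE:
            found = dfs(n)
            if found:
                return found
    return None

graph = {
    "Handle Basket": ["Products"],
    "Products": ["Root"],
    "Root": ["Awesome Things"],
    "Awesome Things": ["Products"],  # the loop back
}
print(find_cycle(graph))  # ['Products', 'Root', 'Awesome Things']
```

A returned cycle is exactly what you would show the user: the list of entities that reference each other and therefore cannot be inserted.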

With this approach, you have all the tools you need to take JSON data from an external source and inject it into a database in a correct, ordered way.

But we're not done yet. There's another graph we all know, some of us perhaps a bit too well, and that's Git. Git's history is a directed graph (a tree, once you set merges aside), and it's directed because it points from the present back into the past.

This is an example Git tree. At the top, we have our head commit—where we are currently. There’s a branch up there, marking a point where somebody made changes. We have a merge commit that has two parents, and back at the bottom is the root. Everything is connected from now back to the past.

What’s interesting is that we can use this information, this history of our project, to optimize our architecture. You might wonder why we’re now talking about architecture, but the answer is: this is also just a graph problem. If you know certain algorithms, you can extract useful information.

How do we do this? We do it by transforming our Git data into a social network. In this context, a social network means we have tons of files, and these files "talk" to each other. "Talking" means they get changed together. A commit is a meaningful change to the codebase. For instance, a commit might create or modify files A, B, and C.

Here’s how our commits relate to files: the root commit created our first two files, A and B. The next commit added file C, and so on. These commits connect the files, and these connections mean something—like if file D is a new service in your architecture, file A may need to change because a test is now failing.

These changes create a connection between files, and we can transform this into a graph showing how the files are connected. This is how Git sees it—it’s useful for going back to a particular commit and restoring the code to that state. But we can also look at it from a different perspective—seeing which files are changed together. For example, file A is closely related to file D, but not at all to file C. There’s no commit where A and C appear together, so they’re not closely related.

We can then use this information to chart a path through the graph. The numbers you see next to the connections are called weights. A weight represents how often two files are committed together. You can attach arbitrary extra data to the nodes and edges of a graph; a number attached to an edge is called a weight. In our case, the weights are simply integers counting how many times two files were committed together.
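Computing those weights from a history is a short exercise. A Python sketch with a made-up commit history (in practice the file sets would come from something like `git log --name-only`):

```python
from collections import Counter
from itertools import combinations

# Each commit is just the set of files it touched (invented history).
commits = [
    {"A", "B"},   # root commit creates the first two files
    {"C"},        # next commit adds file C on its own
    {"A", "D"},   # A and D change together...
    {"A", "D"},   # ...twice
]

# Edge weight = how often a pair of files was committed together.
weights = Counter()
for files in commits:
    for pair in combinations(sorted(files), 2):
        weights[pair] += 1

print(weights[("A", "D")])  # 2
print(weights[("A", "C")])  # 0: A and C never appear together
```

Pairs that never co-occur simply get no edge, which is precisely the "not closely related" case from the example.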

This seems easy for a small graph like this, but when you run it on a large codebase with thousands of files, you quickly end up with a graph containing hundreds of thousands of edges, which can be difficult to work with.

By calculating paths through the graph, we can apply a metric called "betweenness centrality." This metric was originally invented for social networks, used by companies like Facebook and Twitter to determine which people are important, or central, within the network. In our case, we use it to figure out which files are central in our codebase—those that connect different components of our architecture.

Some of you might have already noticed that files A and D look quite important. Why is that? It’s because if you need to get from files G, E, or F to files A, J, B, or C, you have to go through file D. D is like the one single bridge connecting two parts of the codebase.
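Betweenness centrality can be computed with Brandes' algorithm, which is the standard approach for unweighted graphs (NetworkX implements it too). A self-contained sketch; the adjacency list below reconstructs the bridge example, so the exact edges are assumptions:

```python
from collections import deque

def betweenness(adjacent):
    """Brandes' algorithm for betweenness centrality on an
    unweighted, undirected graph given as an adjacency dict."""
    centrality = {v: 0.0 for v in adjacent}
    for source in adjacent:
        # BFS from source, counting shortest paths (sigma).
        stack, queue = [], deque([source])
        preds = {v: [] for v in adjacent}
        sigma = {v: 0 for v in adjacent}
        dist = {v: -1 for v in adjacent}
        sigma[source], dist[source] = 1, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacent[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies back down the BFS tree.
        delta = {v: 0.0 for v in adjacent}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Undirected graph: every pair was counted twice.
    return {v: c / 2 for v, c in centrality.items()}

# Two clusters of files joined only by the bridge D-A.
graph = {
    "G": ["D"], "E": ["D"], "F": ["D"],
    "D": ["G", "E", "F", "A"],
    "A": ["D", "J", "B", "C"],
    "J": ["A"], "B": ["A"], "C": ["A"],
}
scores = betweenness(graph)
print(max(scores, key=scores.get))  # 'D' (tied with 'A', the bridge files)
```

The two bridge files dominate every score on the leaves, which is the intuition behind the Sunburst analysis that follows.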

How does this look in practice? I wrote a tool that calculates how central each file in our system is, and I ran some data analysis on it. The result was this colorful visualization called a Sunburst diagram. This shows how much each folder in our project contributes to the betweenness centrality.

For example, our "Snippet" folder contributes to 25% of all connectedness. Why? Because, as you can see from the blue inner ring, our "Administration" (UI layer) is quite large. When you change something in the UI, you often need to update the snippets as well, because new UI elements require new strings to describe what they do. This creates a lot of connections between the UI code and the snippets.

Similarly, our end-to-end (E2E) tests, though not as connected, were for a long time quite brittle. Every time someone changed a bit of JavaScript or CSS, the E2E tests would break. So, we knew that any UI change required touching the E2E tests. This is something many of you probably know—poorly executed E2E tests can be more brittle than unit tests.

But this doesn’t excuse our unit tests. I also ran the analysis on our PHP codebase and checked the tests. The betweenness centrality score told me how likely it was for a test to break when a random change was made to the codebase.

The most brittle test was our API-aware test. When I looked into it, I saw it was designed to compare the entire API schema against a committed JSON file to verify field annotations. This made the test purposefully very brittle. Anytime there was a change in the API schema, the test would fail, and developers would have to fix it.

I could identify this brittle test by analyzing the Git history alone. We’ve moved through a lot of topics—from maps, history, and mathematics, to computer science, databases, and graphs. But what’s in it for us? Why do we do all this?

That’s the key takeaway. There are standard ways to handle data—we have patterns like factories and command patterns to describe how a system should behave. These patterns are well understood. But what about structure? This is something we don’t think about as often as data.

We do have patterns for structure, like graphs. Using algorithms that analyze the layout of data can help us derive certain attributes and verify things. And not just that—we can also calculate new data from structure, like looking at the commit history of a project or the dependency graph of classes. We can use standard algorithms, many of which are already well established and easy to find.

I hope this talk has been inspiring for you. As I said before, graphs are everywhere. Have a great evening, and enjoy the last slot of the day!

Q&A

Question: What are some good reasons to use graph structures instead of more traditional structures, outside of the examples you’ve shown? Do you have any more examples to explore further?
Christian: Yes, absolutely. One example is using a tool like Deptrac, which analyzes how classes depend on each other. I used the information from Deptrac to display the dependencies as a graph. Then, I applied layout algorithms to organize the graph spatially, which allowed us to see which parts of our system were strongly coupled.

We colored the nodes according to the folder they belonged to, and this helped us identify when two folders that weren’t supposed to depend on each other were closely related. For example, we could see how closely our checkout system was related to our database layer, or how our shipment logic was connected to our payment services. Layout algorithms can help make these relationships visible in a dependency graph.

Question: Is there any toolset you use to analyze your repositories and generate the graphs showing the relationships between your classes?
Christian: The standard answer would be to use Python with the NetworkX library. However, NetworkX can be slow for very large graphs like an entire repository. So, in this case, I rewrote the Python algorithm in Rust and ran it in parallel. This reduced the processing time from 20 hours to just 2 minutes.

It’s an unfinished tool that I have on my GitHub. It outputs a CSV file, and you can use Excel or other tools to visualize it. In this case, I used a graphing library from the statistics language R to generate the Sunburst diagrams.

Question: Do you have the code where you calculated centrality for certain files? Is that available somewhere?
Christian: Yes, it's on my GitHub, and the project is called "Rawal"—it’s named after a whale for some reason. If you're interested, I can share the link in the Symfony Slack.

© 2025 Platform.sh. All rights reserved.