Monday 25 February 2013

You don't have to be Facebook to use the graph

In this post/tutorial, I'm going to introduce graph databases. I'll explain what these new things are, why they exist, why you might want one and then I'm going to help you install one and get playing around with it!
I am by no means a guru on graph databases, Neo4J or databases in general, but I have come into contact with them and have been undergoing my own trial and error learning, so there's every chance some of my findings will help someone out there.

So what are these things?

OK, so first things first: what? I'll keep this relatively short, sweet and easy to understand. I don't intend to bore anyone to death with theory. Anybody who's done any of the more theory-heavy computer science stuff at university will have come across the graph data structure. In a graph, data is stored on nodes; think of them as 'things' if it helps. These nodes can be connected together by relationships; think of them as, well, 'relationships'.
The nodes in the graph could loosely be said to be like an entity from an Entity Relationship Diagram for a traditional RDBMS, or an object in object-oriented programming. Basically, it's a thing. An address, a person, a car, a cat, whatever. The data about the node is stored in properties, and you define relationships with, well, relationships. To make things slightly more confusing, relationships have properties too, e.g. the year in which a sportsperson joined a particular team, when person A first became friends with person B, etc. Of course, none of this would be complete without a nice graphical representation or pretty picture. And appropriately, there's one on the Neo4J website (more on that later).
An image of a simple graph, showing a graph node connected by a records relationship to 2 nodes called nodes and relationships. Relationships has a relationship to nodes called organise, and both have a relationship called have that points to another node called properties.
So in summary. Graph databases have nodes, which hold your data in the form of properties. These nodes are connected by relationships, which can also have information stored about them in properties. Nodes, relationships, properties, pretty pictures. Beautiful!
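If it helps to see that summary as code, here's a minimal sketch of the idea in plain Python. To be clear, this is not how Neo4J stores anything internally; the class names, the PLAYS_FOR relationship type and the sample data are all made up by me for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """A 'thing' in the graph; its data lives in the properties dict."""
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed link between two nodes, with its own properties."""
    start: Node
    rel_type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def add_node(self, **properties: Any) -> Node:
        # IDs are simply handed out in creation order in this toy version.
        node = Node(node_id=len(self.nodes), properties=dict(properties))
        self.nodes.append(node)
        return node

    def relate(self, start: Node, rel_type: str, end: Node,
               **properties: Any) -> Relationship:
        rel = Relationship(start, rel_type, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node: Node, rel_type: str) -> List[Relationship]:
        """Relationships of the given type that start at this node."""
        return [r for r in self.relationships
                if r.start is node and r.rel_type == rel_type]


# Hypothetical sample data: a sportsperson who joined a team in 2010.
graph = Graph()
player = graph.add_node(name="Sam", kind="sportsperson")
team = graph.add_node(name="Rovers", kind="team")
graph.relate(player, "PLAYS_FOR", team, joined=2010)
```

Note that the year the player joined lives on the relationship, not on either node, which is exactly the "relationships have properties too" point.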

So why would I want to stop using my nice easy tables?

Good question. With relational databases being so popular and entrenched in today's systems, most things work pretty much straight off. The expertise is widely available, the tools are out there and are mature and stable, and there's the simple fact that some people still don't know that there are other ways. But when it comes to highly connected data, well, a graph database might just make your life easier.
Say you're storing data for a recommendation engine. Your app is the next big thing in telling consumers what they want before they even know they want it. Great. Though it has a pretty inter-related dataset. Person A bought something that their friend, inventively called person B, doesn't own. Their friend, person C, also has that thing on their wishlist. How do we check each product we're displaying to person B to see if it's a worthwhile recommendation? In SQL, I doubt it would be impossible, but it might well be a costly operation and might be a headscratcher for all but a guru to write in an efficient way. But even that feels a bit simple. Let's throw person D into the mix and say that they are the same age and from the same location as person B, and went for something completely different, oh, and they aren't personally listed as friends. Now how would we aggregate that up? It might well involve a couple of self joins on the customers table, a join to the customerpurchases table which in turn would go to product, then an addresses lookup and maybe some kind of postcode table as well, as an off-the-cuff guess. At the very least, it would use a lot of joins. I'm not saying that this app is necessarily a good idea, and I don't know what criteria these engines really use, but I hope it illustrates the point that this would be difficult in SQL. Not so in a graph database.
In a graph database, you'd enter the graph at the node representing your consumer, person B. You'd then look at their friends relationships and see what they bought, again stored in other nodes. You could then look at person B's city and see what other people in that city are buying. You could look at person B's year of birth and see if any emerging trends are there as well. If anything's a recurring hit in these graph traversals, then you might have found what person B didn't even know they were looking for.
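To make that traversal a bit more concrete, here's a rough sketch in plain Python over a tiny in-memory list of relationships. Everything in it is an assumption of mine: the relationship names (FRIEND_OF, BOUGHT, WISHES_FOR, LIVES_IN), the products, the city and the "count the hits" scoring. A real engine would no doubt be cleverer, and I've left the year of birth out to keep it short.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical toy data: (start node, relationship type, end node).
EDGES: List[Tuple[str, str, str]] = [
    ("B", "FRIEND_OF", "A"),
    ("B", "FRIEND_OF", "C"),
    ("A", "BOUGHT", "headphones"),
    ("C", "WISHES_FOR", "headphones"),
    ("C", "BOUGHT", "kettle"),
    ("B", "BOUGHT", "kettle"),
    ("B", "LIVES_IN", "Leeds"),
    ("D", "LIVES_IN", "Leeds"),
    ("D", "BOUGHT", "headphones"),
    ("D", "BOUGHT", "tent"),
]


def targets(start: str, rel_type: str) -> List[str]:
    """Follow relationships of one type outwards from a node."""
    return [e for s, t, e in EDGES if s == start and t == rel_type]


def sources(end: str, rel_type: str) -> List[str]:
    """Follow relationships of one type backwards into a node."""
    return [s for s, t, e in EDGES if e == end and t == rel_type]


def recommend(person: str) -> List[Tuple[str, int]]:
    """Count how often each product turns up near the person in the graph."""
    owned = set(targets(person, "BOUGHT"))
    hits: Counter = Counter()

    # What have the person's friends bought or wished for?
    for friend in targets(person, "FRIEND_OF"):
        for rel_type in ("BOUGHT", "WISHES_FOR"):
            for product in targets(friend, rel_type):
                hits[product] += 1

    # What are other people in the same city buying?
    for city in targets(person, "LIVES_IN"):
        for neighbour in sources(city, "LIVES_IN"):
            if neighbour != person:
                for product in targets(neighbour, "BOUGHT"):
                    hits[product] += 1

    # Drop anything the person already owns; most frequent hits come first.
    return [(p, n) for p, n in hits.most_common() if p not in owned]


print(recommend("B"))
```

No joins anywhere: each step just hops from a node along its relationships, which is the whole appeal.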
Graphs, as far as I can tell, aren't a panacea, they aren't the be all and end all of data storage. Relational databases are still your best bet for aggregated data. If you want to work out the average salary of a large dataset, you'll want SQL. If you're writing a route planning system, a recommendations engine, a social network or anything else that involves traversing relationships in data, maybe the graph is your best bet. Go for whichever suits your needs. Square pegs in round holes isn't going to help anyone.

I'm interested. Show me more.

The best way to learn is to get one and play around with it. I'm using one called Neo4J. As you may have guessed by the 'J', it's written in Java. Not ideal if you want to connect from non-JVM languages, but it has a REST interface as well, so anyone can benefit from it. I will (hopefully) be following this post up with a post on how I've found integrating Neo4J into a project I'm undertaking, for better and worse. But I digress. Let's grab ourselves a copy of Neo4J. The following section is written with Windows in mind. Neo4J is a Java-based system, and examples of running it on UNIX OSs or Mac are, at least in my experience, easier to find. It was a learning curve for me, so hopefully I'll make it a little easier for you.
Go to Neo4J.org and grab yourself a copy of the Windows community edition (the milestone release seems to be a bit better). It's a zip, so no install required. Handy!
Unzip the zip to a folder of your choosing. c:\neo4j\ is always a nice one if you can. Ok, let's boot up the database server. Go to the Neo4J location and open up a command prompt in the bin folder (protip: shift and right click a folder and click "Open Command Window Here"). In your prompt, execute the neo4j.bat file.
C:\neo4j\neo4j 1.9.M04\bin>Neo4j.bat
Error: Unable to access jarfile C:\neo4j\NEO4J1~1.M04\bin\windows-service-wrapper-*.jar
If you're getting this message too, don't worry, it's nothing you've done. From reading around online, this is a problem that hasn't been fixed yet in Neo4J. You'll have to edit their base.bat file and change something. Not ideal, but it's easy, I promise. Open up base.bat and go to the line that says:
set wrapperJarFilename=windows-service-wrapper-*.jar
For me, it's on about line 49. Change the "*" to "4". You'll notice in the bin folder when you went to open up base.bat there was a file called "windows-service-wrapper-4.jar". This is what we're referencing. Once that's done, let's try again.
(lots of times, dates and jibba jabba)
... org.neo4j.server.AbstractNeoServer INFO: Server started on [http://localhost:7474/]

That looks promising. So what happens when we navigate with a browser to localhost:7474?

Neo4J web admin panel

A screenshot of the web admin console
"oooo!" I hear you exclaim in delight. Yeah, it's all up and running. But it's virtually empty. For the purpose of this tutorial, let's get some sample data. Neo4J handily has a page of example data sets for you to take and use to play around with. I'm personally getting the Dr Who one, but it's up to you which you grab. Follow the instructions on that page, namely turn the server off. Copy the files into your graph.db folder in data, turn the server on again and go back to the web admin panel at localhost:7474.
You should notice those numbers have now grown to a still rather small but tutorial-worthy amount. We have a graph of 1062 nodes and 2292 relationships. Ok, so, now what? Well, let's look around the graph.
Go to the data browser on the web admin. I won't bombard you with screenshots here, since it will depend whether you prefer text or more pretty pictures. In either case, in the query box at the top of the data browser screen, remove all the code and replace it with the solitary number 1. Press ctrl+enter or click the execute button. You will now see your first node (well, all graphs have a reference node called 0, but I personally never found a use for it). On the Dr Who dataset, it's the character of the doctor. It has key value pairs on screen; these are your properties. Here is where the data on the node goes, so in the case of the doctor, his name, which is, well, Doctor. Feel free to punch in other numbers in the query text area and execute them to go to different nodes. Those numbers represent NodeIDs. These IDs are useful in a limited way, but they are not guaranteed to remain the same between each server start up, so I don't recommend getting attached.
Ok, so those are the nodes. But what about their relationships? Well, you might have already done it, but clicking on "Show Relationships" shows a big list of relationships, showing the relationship ID (yes they have IDs too), the start node, the relationship type (it is actually more than just a label for the relationship) and the end node. As you might guess, clicking on the end node takes you to that node and you can view its properties as well. Now, if you're of a more visual persuasion, click the icon in the top right that is labelled "Switch view mode", or you can just press V (as long as you're not in a textbox anywhere). Your graph will then be shown to you in a nice visual display, centered around your current node. At the minute, it only shows things as node IDs, but it is possible to display properties of nodes instead, which would be much more useful. Play around with the style menu, make a new style and see what you can do. Setting the display property to {props} might well speed you on your way, but it's up to you how you want to view it.
I'll show you one last thing before I wrap up what was supposed to be a short tutorial, my apologies. So this is all nice, but what if we want to actually query the database? No problem. As you'd expect from a graph database, your typical query is based on relationships. Here's an example one for the Dr Who data set. Let's find out the species of all characters that the character The Doctor loves. To do this, we're going to write a short query in a new language called Cypher. It's not a huge jump from SQL to Cypher, but the syntax will look weird. No, it's not just me having fun with ASCII art. Cypher isn't the only language you can query graph databases in, but it's my personal pick. Take a look at this query:
start dr = node(1) // here is our start point, we'll start at node 1
MATCH (dr)-[:LOVES]->(lover)-[:IS_A]->(species) // this is where we draw out the pattern of the sub graphs we're looking for
RETURN distinct species; // what we want returned.

You should get a single node, it turns out the good Doctor only loves humans. Good choice. I warned you the syntax might look a little weird for Cypher, so I'll explain it briefly.
The start clause is needed for all Cypher queries, it's where you begin in the graph. From our recommendation engine earlier, it's the person we're recommending to. In this case, it's the Doctor. You need a start point to traverse around the graph from. This can be done with node IDs as above, or through the use of Neo4J's support for Lucene indexes, but I'll save that for another time.
Match is where we've first encountered a pattern. A pattern will typically go (node)-[:relation_type]-(other_node), but that isn't a given. Remember I mentioned earlier that relationships have a type? Well, we can use those to filter relationships down to a specific type. So in this case, we were only interested in the LOVES and the IS_A relationships. So between nodes, put a dash "-", and between square brackets "[]", put a colon and then the relationship type you're looking for. The type is optional; we could have just had (dr)-->(lover)-->(species), but we would have got data from relationships that we weren't interested in and it wouldn't have fit with what we were looking for. It would have essentially been anything that is related to something that is related to the doctor, which is too broad for what we're after. We also used a greater than sign ">" in the query, to denote which way the relationship was going. It is also possible to do (dr)<--(lover) or even to omit a relationship direction entirely to get both, i.e. (dr)--(lover). It looks nice in its ASCII art format, though. And speaking of the ASCII art, the parentheses around the nodes are optional. I put them there to make the nodes stand out more, but many queries you may see will leave them off. So now we've done this match, the lover and species nodes are available for each subgraph that matches, or each row in our results set in other words. I hope the SQL comparison hasn't just made it more confusing.
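If the ASCII art is still a bit abstract, here's a small Python sketch of what that pattern is asking for. It has nothing to do with how Neo4J evaluates Cypher; it just walks a made-up list of relationships by hand. Apart from "Doctor", LOVES and IS_A, every name in it is invented for the example and is not from the real data set.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical miniature graph: (start node, relationship type, end node).
EDGES: List[Tuple[str, str, str]] = [
    ("Doctor", "LOVES", "Alice"),
    ("Doctor", "LOVES", "Bob"),
    ("Doctor", "FOUGHT", "Zog"),
    ("Alice", "IS_A", "Human"),
    ("Bob", "IS_A", "Human"),
    ("Zog", "IS_A", "Alien"),
    ("Carol", "LOVES", "Doctor"),
]


def step(start: str, rel_type: Optional[str] = None,
         direction: str = "out") -> Iterator[str]:
    """Yield the nodes one hop away from start.

    rel_type=None is like leaving the type out of the pattern.
    direction "out" is like -->, "in" is like <--, "both" is like --.
    """
    for s, t, e in EDGES:
        if rel_type is not None and t != rel_type:
            continue
        if direction in ("out", "both") and s == start:
            yield e
        if direction in ("in", "both") and e == start:
            yield s


def match_loves_is_a(dr: str) -> List[Tuple[str, str]]:
    """Roughly (dr)-[:LOVES]->(lover)-[:IS_A]->(species): one row per match."""
    return [(lover, species)
            for lover in step(dr, "LOVES")
            for species in step(lover, "IS_A")]


def match_untyped(dr: str) -> List[Tuple[str, str]]:
    """Roughly (dr)-->(lover)-->(species): any relationship type will do."""
    return [(a, b) for a in step(dr) for b in step(a)]


print(match_loves_is_a("Doctor"))
print(match_untyped("Doctor"))
```

The untyped version drags in Zog the Alien through the FOUGHT relationship, which is the "too broad" problem in miniature. Swapping the direction to "in" finds who loves the Doctor instead.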
Return does what it says on the tin, it returns the columns or data given. Think like a SELECT statement from SQL. In this case, we want to return the species nodes. We also don't want duplicates. So we're using distinct. For people who've used SQL, you know what that does. And yes, it still incurs a slight performance penalty as well. If we want to be more specific, we can return a single property using dot notation.

RETURN distinct species.species; // what we want returned.

will get the species name instead of the node itself.
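Since I mentioned the REST interface earlier, here's a hedged sketch of sending that same query from Python instead of the web admin. I'm assuming the server from this tutorial is running on localhost:7474 and that it exposes the Cypher endpoint at /db/data/cypher, accepting a JSON body with "query" and "params" and answering with "columns" and "data". Check the REST documentation for your version before relying on any of that. The function names and the sample response are mine.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumption: endpoint path of the REST Cypher API on the local server.
CYPHER_URL = "http://localhost:7474/db/data/cypher"

QUERY = (
    "START dr = node(1) "
    "MATCH (dr)-[:LOVES]->(lover)-[:IS_A]->(species) "
    "RETURN distinct species.species"
)


def build_request(query: str, url: str = CYPHER_URL) -> urllib.request.Request:
    """Wrap a Cypher query in a JSON POST request (nothing is sent yet)."""
    body = json.dumps({"query": query, "params": {}}).encode("utf-8")
    headers = {"Content-Type": "application/json",
               "Accept": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers,
                                  method="POST")


def run(query: str) -> Dict[str, Any]:
    """Send the query; only works while the server is actually running."""
    with urllib.request.urlopen(build_request(query)) as response:
        return json.loads(response.read().decode("utf-8"))


def column(result: Dict[str, Any], name: str) -> List[Any]:
    """Pull one named column out of a columns-plus-rows style response."""
    index = result["columns"].index(name)
    return [row[index] for row in result["data"]]


# Hypothetical response, shaped the way I'd expect the server to answer.
sample = {"columns": ["species.species"], "data": [["Human"]]}
print(column(sample, "species.species"))
```

With the server up, column(run(QUERY), "species.species") should give you the same species names you saw in the data browser.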

I'm still interested

Want to know more? Then try out some of the material on the Neo4J learning pages to give yourself a good starting point. It goes into more depth on some of the things talked about here, though it is less specific in some instances. You might also find the Cypher manual of some interest, in particular a chapter on going from SQL to Cypher.
We've by no means covered everything, but I hope this has served to be a quick intro to graph databases in the best way I know how, by letting you get your hands on one and do some magic of your own. I'd appreciate any thoughts, comments or constructive tips for improvements in the comments section.
(craig)-[:SAYS]->(goodbye)
