Recently, I worked on a project to build a genealogy portal that gathers information about individuals and their family links. Through the portal, users could access and collaborate on comprehensive family trees. The target users were curious amateurs, genealogists, researchers and other professionals working in this space. To serve these different users, we needed the flexibility to build different search interfaces.
For this project, we evaluated relational databases, NoSQL, search engines and graph databases. Here’s what we considered:
Relational Databases
Relational databases represent relationships only indirectly, through foreign keys and join tables. Retrieving deeply connected data, such as several generations of a family, therefore requires numerous JOINs, which makes queries complex and can hurt performance.
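To make the JOIN cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The person and parent_of tables and the sample names are hypothetical and are not the schema from our project. Going up two generations already takes two joins on the link table plus one to fetch the name, and every extra generation adds another.

```python
import sqlite3

# Hypothetical schema: one table of people, one table of parent links.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE parent_of (
        parent_id INTEGER NOT NULL REFERENCES person(id),
        child_id  INTEGER NOT NULL REFERENCES person(id)
    );
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Rosa"), (2, "Maria"), (3, "Anna")])
# Rosa is the parent of Maria, Maria is the parent of Anna.
conn.executemany("INSERT INTO parent_of VALUES (?, ?)", [(1, 2), (2, 3)])


def grandparents(conn, person_id):
    """Return the names of a person's grandparents, sorted by name."""
    rows = conn.execute("""
        SELECT gp.name
        FROM parent_of AS p1
        JOIN parent_of AS p2 ON p2.child_id = p1.parent_id
        JOIN person    AS gp ON gp.id = p2.parent_id
        WHERE p1.child_id = ?
        ORDER BY gp.name
    """, (person_id,)).fetchall()
    return [name for (name,) in rows]


print(grandparents(conn, 3))
```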
Elasticsearch
Elasticsearch is a full-text search engine. It is based on Lucene and is highly scalable. It exposes a RESTful web interface and stores schema-free JSON documents. Elasticsearch achieves fast search responses because it searches an index instead of scanning the text directly.
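The idea of searching an index rather than the text can be illustrated with a toy inverted index. This is a simplified sketch of the general technique and not how Lucene is actually implemented. The documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Maria Rossi born in Turin",
    2: "Anna Rossi born in Milan",
    3: "Maria Bianchi",
}
index = build_inverted_index(docs)
print(search(index, "maria rossi"))
```

A lookup touches only the entries for the query tokens, so its cost does not depend on rereading every document.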
Graph Database
Graph databases use graph theory to store, map and query relationships. A graph database is essentially a collection of nodes and edges. Each node represents an entity (such as a person) and each edge represents a connection or relationship between two nodes. Each node also has a set of properties to define it.
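As a rough mental model, the node/edge/property structure can be sketched in a few lines of Python. The class names, the MOTHER_OF relationship type and the sample people are assumptions made for illustration. A real graph database keeps adjacency information with each node instead of scanning a flat edge list as this sketch does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, rel_type, target_id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_edge(self, source_id, rel_type, target_id):
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source_id, rel_type, target_id))

    def related(self, source_id, rel_type):
        """Nodes reachable from source_id over one edge of the given type."""
        return [self.nodes[t] for s, r, t in self.edges
                if s == source_id and r == rel_type]


# Hypothetical sample data.
g = Graph()
g.add_node("p1", full_name="Maria Rossi", born=1950)
g.add_node("p2", full_name="Anna Rossi", born=1975)
g.add_edge("p1", "MOTHER_OF", "p2")
print([n.properties["full_name"] for n in g.related("p1", "MOTHER_OF")])
```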
Compared with relational databases, graph databases often deliver faster performance for associative data sets. They can scale more naturally to large data sets because they do not typically require expensive JOIN operations. In our experience, they can efficiently traverse on the order of 100,000 nodes and relationships, regardless of depth.
Some of the popular graph databases are Neo4j, OrientDB and Titan.
What Fit Our Requirements?
Genealogy is about relationships, not text patterns. We needed an engine that could navigate and update graphs very quickly.
We listed some of the sample queries that are commonly required for genealogy portals:
- Show all relationship types such as Father, Mother or Spouse (with constraints to get the graph shape we are interested in)
- Show all relationship nodes
- Find siblings for any node
- Extract ancestors or descendants to a given depth
- Simple queries such as finding nodes where the full name starts with Maria, along with other conditions
- Build a family tree
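To show why these queries are traversals rather than text searches, here is a small in-memory sketch of three of them. The PARENTS mapping, the names in it and the function names are hypothetical. It illustrates the shape of the queries and is not the code we ran against the portal.

```python
# Hypothetical sample data: child -> list of parents.
PARENTS = {
    "Anna": ["Maria", "Luca"],
    "Paolo": ["Maria", "Luca"],
    "Maria": ["Rosa", "Carlo"],
    "Luca": [],
    "Rosa": [],
    "Carlo": [],
}


def siblings(person):
    """People who share at least one parent with the given person."""
    own = set(PARENTS.get(person, []))
    return sorted(p for p, ps in PARENTS.items()
                  if p != person and own & set(ps))


def ancestors(person, max_depth):
    """Map each ancestor to its generation distance, up to max_depth."""
    found = {}
    frontier = [person]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for current in frontier:
            for parent in PARENTS.get(current, []):
                if parent not in found:
                    found[parent] = depth
                    next_frontier.append(parent)
        frontier = next_frontier
    return found


def names_starting_with(prefix):
    """Simple property filter: names that begin with the prefix."""
    return sorted(name for name in PARENTS if name.startswith(prefix))


print(siblings("Anna"))
print(ancestors("Anna", 2))
print(names_starting_with("Maria"))
```

Each generation is one more hop along the parent links, so deeper queries extend the same loop instead of adding new joins.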
Keeping these use cases in mind, we compared Elasticsearch and graph databases on their ability to return results for these scenarios.
In Elasticsearch, we would need to flatten/denormalize the data when saving it. A graph database, on the other hand, stores and maintains data relationships by default.
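The flattening step might look something like the sketch below, which copies the names of related people into each person's document before it is indexed. The field names (father_id, mother_id, children_names and so on) and the sample records are assumptions for illustration and do not come from an actual Elasticsearch mapping.

```python
# Hypothetical normalized records: relatives are referenced by id.
people = {
    1: {"full_name": "Rosa Bianchi", "father_id": None, "mother_id": None},
    2: {"full_name": "Maria Rossi", "father_id": None, "mother_id": 1},
    3: {"full_name": "Anna Rossi", "father_id": None, "mother_id": 2},
}


def denormalize(person_id, people):
    """Build a flat, self-contained document for one person."""
    person = people[person_id]

    def name_of(pid):
        return people[pid]["full_name"] if pid in people else None

    children = sorted(
        p["full_name"] for p in people.values()
        if person_id in (p["father_id"], p["mother_id"])
    )
    return {
        "id": person_id,
        "full_name": person["full_name"],
        "father_name": name_of(person["father_id"]),
        "mother_name": name_of(person["mother_id"]),
        "children_names": children,
    }


print(denormalize(2, people))
```

The cost of this approach is that a change to one person, such as a corrected name, has to be propagated to every document that copied it.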
We further narrowed it down to the Neo4j graph engine, which matched the requirements for the genealogy use case. We did not need to write complex queries to retrieve nested data. Moreover, the schema stayed simple even though the data contained a large number of nodes.
Cypher Query Language (CQL)
Without a graph-oriented query language, extracting data from a graph database could have posed a huge challenge. Thankfully, the Cypher Query Language (CQL) in Neo4j allows users to extract very complex graph shapes quickly. It lets us merge nodes and provides string, aggregate and relationship functions. The Cypher query optimizer, which produces an optimized query plan, helps keep execution times low.
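For a flavor of what such queries can look like, the sketch below builds a few Cypher statements as Python strings. The Person label, the PARENT_OF relationship type and the id and full_name properties are a hypothetical data model and not the one from our portal. The traversal depth is inlined as a validated integer because, to our knowledge, Cypher does not accept a parameter for the bounds of a variable-length pattern.

```python
def ancestors_query(max_depth):
    """Cypher text for all ancestors of $id up to max_depth generations."""
    if isinstance(max_depth, bool) or not isinstance(max_depth, int) or max_depth < 1:
        raise ValueError("max_depth must be a positive integer")
    return (
        f"MATCH (p:Person {{id: $id}})<-[:PARENT_OF*1..{max_depth}]-(a:Person) "
        "RETURN DISTINCT a.full_name AS name"
    )


# Siblings: people who share a parent with $id (assumed data model).
SIBLINGS_QUERY = (
    "MATCH (p:Person {id: $id})<-[:PARENT_OF]-(parent:Person)"
    "-[:PARENT_OF]->(s:Person) "
    "WHERE s <> p "
    "RETURN DISTINCT s.full_name AS name"
)

# Prefix search on a property, e.g. $prefix = 'Maria'.
NAME_PREFIX_QUERY = (
    "MATCH (p:Person) "
    "WHERE p.full_name STARTS WITH $prefix "
    "RETURN p.full_name AS name"
)

print(ancestors_query(3))
```

The $id and $prefix placeholders are query parameters, so values would be supplied separately when the statement is executed instead of being concatenated into the text.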
Graph databases are well suited to analyzing interconnections. Hence, they are widely used in social media, fraud detection and recommendation engines.