Database Query API

[This is a work in progress.]

Overview

Starting with Gramps 5.0, there are two database backends: Berkeley DB (BSDDB) and Python's DB-API (https://www.python.org/dev/peps/pep-0249/). BSDDB is a data store with much of the database code written in Python; we have used it in Gramps for many years. DB-API is a common interface to the popular SQL engines.

Previously, Gramps had a very modular design when it came to accessing the data. For example, consider getting the People for the flat view. First we get a cursor that iterates over the data. Then we sort it on whatever criteria we have requested. To sort the data, we not only need to iterate over all of it, but also to unpickle the raw data and turn it into Person objects. Of course, it would be much faster if we could store the order in an index, so that we could fetch just the data we need. This is even more important when we only want to see a filtered segment of the database, say those people who have a surname of "Smith."
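
Roughly, that older pattern looks like the sketch below. The method names (get_person_cursor, Person.unserialize) follow the Gramps cursor API, but treat this as an illustration of the pattern rather than the actual view code:

# Iterate with a cursor, rebuild every object, then sort in Python.
from gramps.gen.lib import Person

def get_people_sorted(db):
    people = []
    with db.get_person_cursor() as cursor:
        for handle, raw_data in cursor:
            person = Person()
            person.unserialize(raw_data)   # every record must be unpickled
            people.append(person)
    # Sorting can only happen after every record has been loaded and rebuilt.
    people.sort(key=lambda p: (p.primary_name.surname_list[0].surname
                               if p.primary_name.surname_list else "",
                               p.primary_name.first_name))
    return people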

In order to make this fast, we need to know the filter information and the sort order when we ask for the data. With SQL we can simply add WHERE and ORDER BY clauses to the basic SELECT statement. But these are only useful if we have indexes on the relevant data.
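
The point about indexes can be seen with plain sqlite3 (the table and column names below are hypothetical, chosen only to mirror the query we want):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person "
            "(handle TEXT PRIMARY KEY, surname TEXT, given_name TEXT, gramps_id TEXT)")
# The WHERE and ORDER BY clauses work without these indexes too, but then the
# engine scans every row; with them it can jump straight to the matches.
con.execute("CREATE INDEX idx_person_surname ON person (surname)")
con.execute("CREATE INDEX idx_person_gramps_id ON person (gramps_id)")

rows = con.execute(
    "SELECT handle, gramps_id FROM person "
    "WHERE surname = ? AND given_name LIKE ? "
    "ORDER BY gramps_id",
    ("Smith", "J%")).fetchall()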

This is made more difficult because Gramps uses a hierarchical representation of data. For example, we might wish to have the People data sorted by "surname, given" of the primary_name. But that information is actually in:

  • person.primary_name.surname_list[0].surname
  • person.primary_name.first_name

respectively. We could make special fields for these, and special indexes. But it would be much more flexible if we could create a variety of ad hoc queries on the fly.

The BSDDB datastore doesn't have any schema, which means that it has no idea of "gramps_id" or "primary_name" or any other field. A schema concept has been developed over the last few years. This makes the Database Differences Report possible without having to write any field-specific code: the data knows its own structure.

The schema idea has been augmented with additional methods based on the idea of "fields". Now, you can ask a person object:

>>> person.get_field("primary_name.first_name")
"Sarah"

and with some additional syntax:

>>> person.get_field("primary_name.surname_list.0.surname")
"Johnson"

Building on this field syntax, a db.select method has been added so that you can get all of those fields from all People, and sort and filter on all of the regular (string, int) fields. This works independently of SQL. However, if you re-implement that method for DB-API, with the appropriate SQL fields and SQL indexes, then you get a large speed-up on large data.

So, our old system required a scan of all data, unpickling, creating objects, and sorting for any use. If you have 100k records, that required processing all of them. With the new DB-API implementation, you can do that same query in a fraction of that time. Views (written appropriately) will appear very quickly (milliseconds) regardless of the size of the database.

To interactively test it out, you can try these methods with a regular BSDDB database, but you won't see any speed enhancements. To see it work fast, you'll need a DBAPI database built from the latest addons-source/DBAPIBackend. The new webapp turns search text directly into a SQL statement, which shows off the speed and flexibility of this functionality.

In this stringified field-based API, we would write code as follows. Consider that we want to select the handle and gramps_id of all people whose surname is "Smith" and whose given name starts with "J", ordered by gramps_id. We would write:

db.select("Person", 
          ["handle", "gramps_id"], 
          filter=["AND", [("primary_name.surname_list.0.surname", "=", "Smith"), 
                          ("primary_name.first_name", "LIKE", "J%")]],
          order_by=[("gramps_id", "ASC")])

This code works on BSDDB as well as DB-API. Let's see the difference in timing on databases that have 187,294 people (created with GenFan; this is 20 full generations).

Here is a summary:

Backend | Filter | Select All | Sort All
--------|--------|------------|---------
BSDDB   | 20.6s  |    9.2s    |  18.1s
DB-API  |  0.3s  |    8.7s    |   8.2s

So, where we can access the data via SQL, we get a speedup; the biggest gain will always be in the filter, because it means we don't have to load many objects into Python. We have linear code in many places that could benefit from using db.select().
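
For instance, a linear pass like the one below could hand its filtering to the backend instead. This is only a sketch: db stands for an open database, iter_people() is the usual Gramps iterator, and the filter form mirrors the db.select() example above:

# Before: scan and rebuild every person in Python just to find the Smiths.
smith_handles = []
for person in db.iter_people():
    surnames = person.primary_name.surname_list
    if surnames and surnames[0].surname == "Smith":
        smith_handles.append(person.handle)

# After: let the backend do the filtering (and, under DB-API, use its indexes).
rows = db.select("Person",
                 ["handle"],
                 filter=["AND", [("primary_name.surname_list.0.surname", "=", "Smith")]])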

Currently, DB-API automatically creates SQL fields and indexes for all regular (non-list, standard Python types) primary-object attributes (like gramps_id, privacy, etc.). This takes time and space, so we may want to manage it a bit more carefully.