S4

From XMMS2
Jump to: navigation, search

S4 is an attempt to create a new database backend for XMMS2. It lives in the s4 git repository, and you can find a version of XMMS2 with S4 and Collections 2.0 in the xmms2-cippo repository (the s4coll2 branch).

Design

The XMMS2 media library does not map very well to a relational database. Every song can have a variable number of attributes, and they can be specific to just one song. That means the database can not be normalized the way you normally do with a relational database. The way it is handled with SQlite for XMMS2 is that a table called Media with the format {id, key, value, source} is created, and every attribute a song has is stored in there. This will obviously lead to a lot of duplication as every song by the artist Foobar will have {id, "artist", "Foobar", "plugin/id3v2"} set. S4 was designed to remove some of this redundancy and make a database optimized for XMMS2.

The basic structure in the S4 database is the entry. An entry maps to a song in XMMS2. An entry can have a variable number of attributes, but only a single key (that would be the song id in XMMS2).The structure looks something like this:

key: "song_id", 1  -- The key is actually a key and a value
attributes:
  "title", "Two-Headed Boy", "plugin/id3v2"            -- Every attribute has a key, a value and a source
  "artist", "Neutral Milk Hotel", "plugin/id3v2"       -- There can also be attributes with the same key
  "artist", "Neutral Milk Hotel", "client/xmms2-nycli" -- but different value and/or source.

But how does this remove redundancy? There is still one artist attribute per song, but behind the scene S4 stores equal strings only once. There is still some redundancy in that the three pointers to the key, value and source are duplicated, but the data they point to are shared.

To speed up queries, S4 also provides indexes. By default there is only an index on the entry key, so in the above example searching for "song_id" and a value would be fast, while searching for "artist", or any other attribute would mean that all entries would have to be searched. The user may create indexes on any attribute, for example "artist", "title" or "album". XMMS2 creates indexes on "url" and "status" by default, because they are searched for a lot and need to be fast. Using an index is an O(log (n)) operation, where n is the number of entries in the index, while searching without an index is an O(n log (m)) operation, where n is the number of entries and m the maximum number of attributes an entry has (this is actually only true if there is one attribute with the key that is being searched for, if there are several attributes with the same key it will be slower).

API

The main API for S4 consists of just six calls

s4_t* s4_open  (const char *filename, const char **indexes, int flags);
int   s4_close (s4_t *s4);
void  s4_sync  (s4_t *s4);
int   s4_add   (s4_t *s4,
                const char *key_a, const s4_val_t *val_a,
                const char *key_b, const s4_val_t *val_b,
                const char *source);
int   s4_del   (s4_t *s4,
                const char *key_a, const s4_val_t *val_a,
                const char *key_b, const s4_val_t *val_b,
                const char *source);

s4_resultset_t *s4_query (s4_t *s4, s4_fetchspec_t *fs, s4_condition_t *cond);

There is also support functions to create conditions, fetch specification, source preferences and other structures needed for the querying. For more information on the different functions look at the doxygen documentation.

Querying

Perhaps the most interesting function is s4_query. It takes an S4 handle, a fetch specification and a condition and returns the data as the fetch specification requested for the entries matching the condition. The fetch specification is basically a list of (key, source-preference) pairs. It fetches the key using the attribute (or attributes) with the highest priority source according to the source-preference. If the key is NULL it will fetch everything.

Internal Design

Internally S4 consists of three parts: memory database, log writer and file reader/writer. On startup S4 reads in the data from the file and loads it into the memory database. The requested indexes are created (no indexes are saved on disk, only data) and it is ready to work. Whenever the user adds or deletes something from the database the copy in memory is modified to reflect this and a log entry is written to disk. The log can be used later to redo everything since the last time we wrote everything to disk. The log is implemented as a circular file, whenever we run out of space the database is written to disk before work can be resumed.

Memory database

This can be thought of as the working copy of the database. It also adds indexes to speed up searching. The basic structures are the entry defined as

typedef struct {
        const char *key;     /* The key */
        const s4_val_t *val; /* and value of the entry (unique) */
        int size, alloc;     /* The used and allocated size of the attribute list */
        GStaticMutex lock;    

        entry_data_t *data;  /* The attribute list, resized with realloc whenever size > alloc */
} entry_t;

/* One of those per attribute in the attribute list */
typedef struct {
        const char *key;
        const s4_val_t *val;
        const char *src;
} entry_data_t;

Log

The log is responsible for redoing any changes done after the last time the database was written to disk. Normally there will be nothing to redo when opening a database, but if the database was closed abruptly (for example if the program using the database crashed) there might be some changes not yet in the on-disk database. The log does not protect fully against power loss, because it just uses fflush (forcing the data to the kernel) but not sync (forcing the data on disk). This is because sync takes much longer than fflush. The FS will normally write data to the disk every 30. second (this depends on your system settings) and you will not loose more data than what was changed in those 30 seconds.

As have already been mentioned the log file is a circular file. The log size is defined in log.c as LOG_SIZE and is currently set to 2 MB. The log file will write log-entries until it hits then end, then it will write a special wrap-around log-entry and start writing at the beginning again. The log will never overwrite log-entries that contains data that has not been written to the on-disk database yet. A log-entry is defined as

struct log_header {
        log_type_t type;  /* The type, can be add, delete or wrap-around */
        log_number_t num; /* The log number, this contains the offset into the log file
                           * and the round number (how many times the log file has been wrapped around)*/
        int32_t ka_len;   /* The length of the key_a string */
        int32_t va_len;   /* The length of the value_a string, or -1 if it is an integer */
        int32_t kb_len;   /* The length of the key_b string */
        int32_t vb_len;   /* The length of the value_b string, or -1 if it is an integer */
        int32_t s_len;    /* The length of the source string */
};

If it is an add or delete log-entry it is followed by the key_a, value_a, key_b, value_b and source.

On-disk database

The on-disk database has a very simple format. This is to make code reading and writing it simple and thus less likely of having bugs. The database-file starts with this header

Offset Bytes  Description
0       4     Magic number
4       4     Version number
8      16     UUID
24      4     Last logpoint   

The file starts with a magic string ("s4db") so we can check if it actually is an S4 database, followed by a version number to find out what format it is using. Then comes an UUID (this is randomly created when the database is created, and can be used to identify a database without using the filename). Last is the number of the last log-entry written when this file was written to disk. This can be used to determine if the log contains new data added after this file was written.

After the header a list of strings follow. They have the following format:

(Offsets relative to the end of the last string
 or the header if this is the first string)

Offset Bytes  Description
0       4     String ID
4       4     String length
8       x     The actual string

To indicate that there are no more strings 0xffffffff (-1) is written instead of an id.

After the strings the actual data comes. The data consists of quintuples of int32_ts defined like this

(Offsets relative to the previous data entry,
 or the end of the strings if this is the first data)

Offset Bytes  Description
 0     4      String ID of entry key
 4     4      String ID or integer value of entry value
 8     4      String ID of attribute key
16     4      String ID or integer value of attribute value
20     4      String ID of attribute source

If the entry value is an integer, the entry key string ID is negated. If the attribute value is an integer value, the attribute key string ID is negated.

Here is a short example. Take the database

key = "song_id"
value = 1
attributes:
  "title"  = "Song A"   (source "plugin/id3v2")
  "artist" = "Artist A" (source "plugin/id3v2")

key = "song_id"
value = 2
attributes:
  "title"  = "Song B"   (source "plugin/id3v2")
  "artist" = "Artist A" (source "plugin/id3v2")

which could result in the following file

(Header)
magic:           "midb"
version:         1
uuid:            a083753e-cfb5-49a7-b0de-2ca8fe2ee016
last_checkpoint: 200

(Strings)
1, 7, "song_id"
2, 5, "title"
3, 6, "artist"
4, 6, "Song A"
5, 8, "Artist A"
6, 6, "Song B"
7, 12, "plugin/id3v2"

-1

(Pairs)
(-1, 1, 2, 4, 7)
(-1, 1, 3, 5, 7)
(-1, 2, 2, 6, 7)
(-1, 2, 3, 5, 7)

CLI Tool

There is a simple command-line tool for S4, named "s4", that can be used to examine an S4 database without using XMMS2. It supports querying, adding and deleting properties. There are five kinds of statements in the tool: statements with no return value, statements that return a condition, statements that return a fetch specification, statements that returns a result and statements that returns a list. If you write ".help;" in the tool you get the following list of statements

All statements must end with a semicolon

Statements with no value:
.add <list>, <list>   - For every (key, val) from the first list it adds
                        the attributes (key, val, src) from the second list
.del <list>, <list>   - For every (key, val) from the first list it deletes
                        the attributes (key, val, src) from the second list
.exit                 - Exit the program
.help                 - Prints this help
.set key value        - Sets the option key to val
.set key              - Shows the value of the key
.set                  - Shows the value of all keys
.vars                 - Prints all bound variables

?var = <cond>         - Assigns cond to the condition variable var
%var = <fetch>        - Assigns fetch to the fetch variable var
@var = <result>       - Assigns var to something returning result
$var = <list>         - Assigns the list to the list variable var
#var = <pref>         - Assigns the pref to the source preference variable var

Conditions (<cond>):
?var                  - Returns the condition bound to var
!cond                 - Matches everything cond does not match
cond1 & cond2         - Matches if both cond1 and cond2 matches
cond1 | cond2         - Matches if cond1 or cond2 matches

Filter conditions
key = value           - Matches all entries where key equals value
key ~ value           - Matches all entries where key matches value
key < value           - Matches all entries where key is smaller than value
key > value           - Matches all entries where key is greater than value
key ^ token           - Matches all entries where key has a token equal to token
key != value          - Matches all entries where key does not equal value
key <= value          - Matches all entries where key is smaller or equal to value
key >= value          - Matches all entries where key is greater or equal to value
= value               - Matches all entries where one or more keys equals value
~ value               - Matches all entries where one or more keys matches value
< value               - Matches all entries where one or more keys is smaller than value
> value               - Matches all entries where one or more keys is greater than value
^ token               - Matches all entries where one or more keys has token
!= value              - Matches all entries where one or more keys does not equal value
<= value              - Matches all entries where one or more keys is smaller or equal to value
>= value              - Matches all entries where one or more keys is greater or equal to value
+key                  - Matches all entries that has key
+                     - Matches everything
<pref> may be added after all filter conditions to use a source preference to only match
against the highest priority source in the source preference

Fetch specification (<fetch>):
%var                  - Returns the fetch spec bound to var
(key1, ..., keyn)     - Fetches keys 1 through n from matching entries
(key1 <pref>,...)     - Fetches key1 using the source preference given
key                   - Fetches key from matching entries
key <pref>            - Fetches key using the source preference given
_                     - Fetches everything from matching entries

Results (<result>):
.query <fetch> <cond> - Queries the database, returns a result

@var                  - Returns the result bound to var

Lists (<list>):
$var                  - Returns the list bound to the variable var
<result>{<rng>,<rng>} - Creates a list of the columns given by {row,col}.
<result>{<rng>, key}  - Creates a list of the column where the column key
                        equals key and the rows are in the range
[key val src, ...]    - Creates a list
[key val, ...]        - Creates a list where source is set to default_source

Ranges (<rng>):
start - stop          - Creates a range from start to stop (inclusive)
      - stop          - Creates a range from 0 to stop
start -               - Creates a range from start with no stop
      -               - Creates a range from 0 with no stop

Source preferences (<pref>):
#var                  - Returns the source preference bound to var
:src1:src2:...:srcn   - Creates a new source preference where src1 has the highest priority,
                        src2 seconds highest and so on

Shorthand:
.q = .query
.a = .add
.d = .del
.v = .vars
.h = .help
.? = .help
.s = .set
.e = .exit

So an example session might be something like this (everything after the # are comments and not in the actual session)

s4> ?cond = album="temple of the dog"; # Creates a condition and assigns it to the condition variable cond.
s4> %fetch = (title, tracknr);         # Creates a fetch specification and assigns it to the variable fetch.
s4> @res = .query %fetch ?cond;        # Queries the database with the condition and fetchspec above and saves the result in res
s4> $titles = @res{-, 0};              # Creates a list of all the titles (column 0)
s4> $tracknrs = @res{-, 1};            # Creates a list of all the tracknrs (column 1)
s4> $tracknrs = @res{-, tracknr};      # The same as the line above, except using column-key indexing
s4> $titles;                           # Prints the value in titles
[title=All Night Thing (plugin/id3v2),
title=Call Me a Dog (plugin/id3v2),
title=Four Walled World (plugin/id3v2),
title=Hunger Strike (plugin/id3v2),
title=Pushin Forward Back (plugin/id3v2),
title=Reach Down (plugin/id3v2),
title=Say Hello 2 Heaven (plugin/id3v2),
title=Times of Trouble (plugin/id3v2),
title=Wooden Jesus (plugin/id3v2),
title=Your Saviour (plugin/id3v2)]

s4> $tracknrs;                         # Prints the value in tracknrs
[tracknr=10 (plugin/id3v2),
tracknr=5 (plugin/id3v2),
tracknr=9 (plugin/id3v2),
tracknr=3 (plugin/id3v2),
tracknr=4 (plugin/id3v2),
tracknr=2 (plugin/id3v2),
tracknr=1 (plugin/id3v2),
tracknr=6 (plugin/id3v2),
tracknr=7 (plugin/id3v2),
tracknr=8 (plugin/id3v2)]

s4> $everything = @res{-, -};          # Puts everything in the list everything
s4> $everything;                       # Prints the value in everything
[title=All Night Thing (plugin/id3v2),
tracknr=10 (plugin/id3v2),
title=Call Me a Dog (plugin/id3v2),
tracknr=5 (plugin/id3v2),
title=Four Walled World (plugin/id3v2),
tracknr=9 (plugin/id3v2),
title=Hunger Strike (plugin/id3v2),
tracknr=3 (plugin/id3v2),
title=Pushin Forward Back (plugin/id3v2),
tracknr=4 (plugin/id3v2),
title=Reach Down (plugin/id3v2),
tracknr=2 (plugin/id3v2),
title=Say Hello 2 Heaven (plugin/id3v2),
tracknr=1 (plugin/id3v2),
title=Times of Trouble (plugin/id3v2),
tracknr=6 (plugin/id3v2),
title=Wooden Jesus (plugin/id3v2),
tracknr=7 (plugin/id3v2),
title=Your Saviour (plugin/id3v2),
tracknr=8 (plugin/id3v2)]

s4> $first_title = @res{0, title};     # Puts the value in row 0, title column in first_title
s4> $first_title;
[title "All Night Thing" plugin/id3v2]

s4> @res{0, title};                    # It doesn't have to go through a variable
[title "All Night Thing" plugin/id3v2]

s4> $first_track = @res{0, tracknr};   # Puts the value in row 0, tracknr column in first_track
s4> $first_track;
[tracknr 10 plugin/id3v2]

s4> $ids = .query song_id album="Wrong Album Title"{-, 0};
s4> .add $ids, album "Right Album Title" "client"; # Adds (album, "Right Album Title", "client") to all the songs found in the above query

s4> .set;                              # Prints the value of all configuration variables
default_source = s4

print_mode = verbose
Possible values:
- verbose
- pretty
- compact

s4> .set default_source "server";  # Sets the configuration variable "default_source" to "server"
s4> .set default_source;
default_source = server

# The default print mode is verbose. It prints a full "key=value (source)" for every column value
s4> @res;
 row  |  col  | data
    0 |     0 | title=All Night Thing (plugin/id3v2)
      |     1 | tracknr=10 (plugin/id3v2)
    1 |     0 | title=Call Me a Dog (plugin/id3v2)
      |     1 | tracknr=5 (plugin/id3v2)
    2 |     0 | title=Four Walled World (plugin/id3v2)
      |     1 | tracknr=9 (plugin/id3v2)
    3 |     0 | title=Hunger Strike (plugin/id3v2)
      |     1 | tracknr=3 (plugin/id3v2)
    4 |     0 | title=Pushin Forward Back (plugin/id3v2)
      |     1 | tracknr=4 (plugin/id3v2)
    5 |     0 | title=Reach Down (plugin/id3v2)
      |     1 | tracknr=2 (plugin/id3v2)
    6 |     0 | title=Say Hello 2 Heaven (plugin/id3v2)
      |     1 | tracknr=1 (plugin/id3v2)
    7 |     0 | title=Times of Trouble (plugin/id3v2)
      |     1 | tracknr=6 (plugin/id3v2)
    8 |     0 | title=Wooden Jesus (plugin/id3v2)
      |     1 | tracknr=7 (plugin/id3v2)
    9 |     0 | title=Your Saviour (plugin/id3v2)
      |     1 | tracknr=8 (plugin/id3v2)

# The 'pretty' print mode is a little less verbose, and puts everything in a nice table.
# It prints "value (source)" for every column value.
s4> .set print_mode pretty;
s4> @res;
| row  | title                                                | tracknr                                              |
|------|------------------------------------------------------|------------------------------------------------------|
|    0 | All Night Thing (plugin/id3v2)                       | 10 (plugin/id3v2)                                    |
|    1 | Call Me a Dog (plugin/id3v2)                         | 5 (plugin/id3v2)                                     |
|    2 | Four Walled World (plugin/id3v2)                     | 9 (plugin/id3v2)                                     |
|    3 | Hunger Strike (plugin/id3v2)                         | 3 (plugin/id3v2)                                     |
|    4 | Pushin Forward Back (plugin/id3v2)                   | 4 (plugin/id3v2)                                     |
|    5 | Reach Down (plugin/id3v2)                            | 2 (plugin/id3v2)                                     |
|    6 | Say Hello 2 Heaven (plugin/id3v2)                    | 1 (plugin/id3v2)                                     |
|    7 | Times of Trouble (plugin/id3v2)                      | 6 (plugin/id3v2)                                     |
|    8 | Wooden Jesus (plugin/id3v2)                          | 7 (plugin/id3v2)                                     |
|    9 | Your Saviour (plugin/id3v2)                          | 8 (plugin/id3v2)                                     |

# Since 'pretty' does not print the key it can be confusing if you have columns with
# different keys (which you get when you use the '_' as the fetch key).
s4> .query _ artist~burning*;
| row  | _                                                                                                            |
|------|--------------------------------------------------------------------------------------------------------------|
|    0 | 2 (plugin/mad)                                                                                               |
|      | -1 (server)                                                                                                  |
|      | 254223 (plugin/mad)                                                                                          |
|      | 13 (plugin/id3v2)                                                                                            |
|      | 230253 (plugin/mad)                                                                                          |
|      | Tastykake (plugin/id3v2)                                                                                     |
|      | 1 (plugin/mad)                                                                                               |
... and so on ...
# As you can see the value without the key does not make much sense. If you plan on using the
# '_' fetch key the 'verbose' print mode is more helpful.

# 'compact' is like pretty, except it does not print the source.
s4> .set print_mode compact;
s4> @res;
| row  | title                                                | tracknr                                              |
|------|------------------------------------------------------|------------------------------------------------------|
|    0 | All Night Thing                                      | 10                                                   |
|    1 | Call Me a Dog                                        | 5                                                    |
|    2 | Four Walled World                                    | 9                                                    |
|    3 | Hunger Strike                                        | 3                                                    |
|    4 | Pushin Forward Back                                  | 4                                                    |
|    5 | Reach Down                                           | 2                                                    |
|    6 | Say Hello 2 Heaven                                   | 1                                                    |
|    7 | Times of Trouble                                     | 6                                                    |
|    8 | Wooden Jesus                                         | 7                                                    |
|    9 | Your Saviour                                         | 8                                                    |

s4> .exit;       # ^D (Control-D) can also be used to exit

Caveat

S4 does not do any locking of the database yet. That means that two programs can write to a database at the same time. This might lead to loss of data. Running one program that writes to the database and another just reading the database should in theory be fine as long as you exit the reading program before the writing one, but no guarantees are made. To be sure not to lose any data, run only one program at a time. Hopefully this will be fixed soon.