Gillius's Programming

RealDB

Old Progress Reports

22-Nov-2009

Currently I am working on completing milestone 4, which has forced me to come up with a full solution to the issue of treating the database as a database of continuous signals/streams rather than a series of points. I did not encounter many challenges in defining a data model to represent a "stream interval" over a time range. But then I faced a dilemma about what classes of interpolation ("compression") methods to support -- what is the minimum that I need in my API to support everything that I want? There were two major problems with my original concept:

  1. The compression algorithms were able to store additional data with the fields so that intervals could be reconstructed later -- for example, a "slope" with linear compression. However, I had no way of representing these hidden data elements in the "raw" data model (preventing copying/duplication/movement of data), and the physical data format would be tied to the interpolation algorithm used (and could not be updated or changed).
  2. The compression algorithms worked by being given only the "current" point and asked whether to drop or write it. However, making an interpolation decision from a single point works only for zero-order interpolation.

For a while, every adjustment I tried to make to that design turned out to be unworkable in some way. I didn't want this to hold up my project any longer, especially since I was trying to focus on reliability and performance rather than on complex interpolation methods. But then I finally had a breakthrough after reviewing some typical methods. Many reasonable methods don't require anything but the points themselves, although they do sometimes need to know about points coming ahead in the stream. Those that take more parameters than the control points are either too complicated, I think (NURBS), or can be set to reasonable, "generic" values (Hermite). I realized this only after some research on interpolation methods; the most concise and useful reference I found is at Paul Bourke's site. So I made two fundamental changes to my design, based on the problems above and on the new knowledge of what the algorithms would really need:

  1. Remove the ability for codecs to save custom fields in the records. Now, the stored format of the record is exactly the same as the measured format from the user. This allows verbatim copying between databases if needed without involving the codec. The codec will need to reconstruct stream intervals using only the measured values.
  2. Allow the codecs to see a finite number (defined by them at configuration time) of adjacent samples before and after the sample they are outputting/reconstructing.

The first change also allowed the separation of the serialization code from the compression code, which significantly simplified the design by dividing the responsibilities better, as well as allowing the compression code to potentially be used outside of the RealDB context. The new model for both compression and reconstruction is a stream transformation function. The design enables the "look ahead" and "look behind" features by allowing the stream transformation function's output to be delayed. For example, see the table of interactions below for a theoretical transformation of a stream with 5 samples into a stream with 3:

Input   Output
2.0     null
3.0     null
4.0     2.0
4.0     null
4.5     4.0
null    null
null    4.5

The compression algorithm delays the output by 2 samples (because it wants to see 2 samples ahead to determine if a sample is needed), and preserves the first and last samples.
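
As a concrete illustration of the delayed-transformation model, here is a hypothetical sketch in Java (the names are illustrative only, not the actual RealDB codec API). The caller pushes one sample per call and receives either an older sample or null while the transform is buffering; pushing null signals end of input and drains the buffer, matching the trailing nulls in the table.

import java.util.ArrayDeque;

interface StreamTransform {
    /** Returns an output sample, or null while the transform is buffering
        for look-ahead (or has dropped the sample / finished the stream). */
    Double push(Double sample);
}

/** Identity transform that only holds samples back by a fixed delay; it shows
    how look-ahead appears to the caller as delayed output. */
class DelayTransform implements StreamTransform {
    private final ArrayDeque<Double> buffer = new ArrayDeque<>();
    private final int delay;

    DelayTransform(int delay) { this.delay = delay; }

    @Override
    public Double push(Double sample) {
        if (sample != null) {
            buffer.addLast(sample);
        }
        // Emit the oldest buffered sample once 'delay' newer samples have been
        // seen, or when draining (null input) with samples still buffered.
        // A real codec would decide *here* whether to emit or silently drop it,
        // which is exactly where the look-ahead samples in the buffer are useful.
        if (buffer.size() > delay || (sample == null && !buffer.isEmpty())) {
            return buffer.removeFirst();
        }
        return null;
    }
}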

It is expected that zero-order interpolation (deadbands, always-sample) needs 0 look-ahead, first-order needs 1 look-ahead, and higher orders like cubic or Hermite need 2 or more look-ahead points. Until I thought of the better design, I was ready to say that RealDB would support nothing more complicated than linear interpolation, under the theory that anything else would be too slow. However, looking at the implementation, cubic and/or Hermite interpolation does not involve a high number of complex operations.
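
To convince myself of the cost: evaluating one cubic Hermite segment is only a handful of multiplications and additions per reconstructed sample. A standalone sketch of the standard formula (generic math, not RealDB code), where v1 and v2 are the endpoint values at times t1 and t2, and m1 and m2 are the endpoint slopes (for example, finite-difference estimates from neighboring samples):

final class HermiteEval {
    /** Evaluate the cubic Hermite segment at time t; assumes t1 < t2. */
    static double hermite(double t, double t1, double t2,
                          double v1, double v2, double m1, double m2) {
        double h  = t2 - t1;
        double s  = (t - t1) / h;        // normalized position in [0, 1]
        double s2 = s * s;
        double s3 = s2 * s;
        return (2 * s3 - 3 * s2 + 1) * v1   // basis h00
             + (s3 - 2 * s2 + s) * h * m1   // basis h10, scaled by the interval
             + (-2 * s3 + 3 * s2) * v2      // basis h01
             + (s3 - s2) * h * m2;          // basis h11, scaled by the interval
    }
}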

The other thought I had about methods like cubic and Hermite interpolation, which work on 4 control points, is that the complexity of finding the best points could be something like NP-complete, and I almost ruled them out on that basis, until I realized that it is conceivable to have O(n) methods with a chance of being quite good even if not perfect. I don't intend to research these techniques in this project -- I only want to design a framework that could support such a technique were someone to design it. The example technique I used to prove to myself that it could be practical in terms of performance is this simple shell of a heuristic point-selection algorithm for a cubic/Hermite interpolation function:

  1. Know the last 2 points output by the algorithm
  2. Look at the m+2 most recent points in the stream (the finite size m keeps this algorithm linear in the number of samples in the stream, with complexity O(n*m))
  3. Consider the last 2 output points and the 2 most recent input points. Form the interpolation based on these 4 points and check the error of the interpolation on the m (middle) points.
    1. If the interpolation's error is under the threshold, drop the 3rd most recent point.
    2. If the error exceeds the threshold, output the 3rd and 2nd most recent points (and the most recent point becomes part of the "m" set for the next period).
  4. If the "m" buffer is full, output the 2 most recent points.

There is some handling of edge cases in the above that I left out for simplicity; the point is simply that such an algorithm is feasible and could succeed in eliminating a significant portion of points. A rough sketch of the idea in code follows.
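
The sketch below is one possible reading of the shell above, heavily simplified (the exact bookkeeping of which point is confirmed dropped at each step is glossed over, and the error function and output sink are left abstract). The names are illustrative only; points are {time, value} pairs.

import java.util.ArrayDeque;
import java.util.Deque;

abstract class HeuristicPointSelector {
    private final int m;             // bound on the "middle" buffer; keeps cost O(n*m)
    private final double threshold;  // maximum tolerated reconstruction error

    private double[] out0, out1;     // the last 2 points actually output
    private double[] previous;       // 2nd most recent input, not yet decided
    private final Deque<double[]> middle = new ArrayDeque<>(); // deferred (droppable) points

    HeuristicPointSelector(int m, double threshold) {
        this.m = m;
        this.threshold = threshold;
    }

    /** Worst error of the 4-point (cubic/Hermite) fit, measured on the middle points. */
    protected abstract double maxError(double[] p0, double[] p1,
                                       Deque<double[]> middlePoints,
                                       double[] p2, double[] p3);

    /** Write one point to the compressed output. */
    protected abstract void emit(double[] point);

    void accept(double[] point) {
        if (out0 == null) { out0 = point; emit(point); return; }   // always keep the
        if (out1 == null) { out1 = point; emit(point); return; }   // first two points
        if (previous == null) { previous = point; return; }        // need 2 recent inputs

        // Step 3: fit on (out0, out1, previous, point); measure error on the middle set.
        double err = maxError(out0, out1, middle, previous, point);
        if (err <= threshold && middle.size() < m) {
            middle.addLast(previous);    // roughly step 3.1: defer (drop) another point
            previous = point;
        } else {
            // Roughly steps 3.2 and 4: error too large or buffer full, so commit the
            // pending points and start a new period from the newest input.
            double[] third = middle.pollLast();   // 3rd most recent point, if any
            if (third != null) {
                emit(third);
            }
            emit(previous);                        // 2nd most recent point
            out0 = (third != null) ? third : out1;
            out1 = previous;
            middle.clear();
            previous = point;                      // newest point starts the next period
        }
    }

    /** End of stream: the last point is always preserved (other edge cases omitted). */
    void finish() {
        if (previous != null) {
            emit(previous);
        }
    }
}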

Current Progress

Both steps above have been finished and tested (via unit testing).

Next Steps

29-Jun-2009

In the last 4 months, there was development and bug-fixing work leading up to a graphical demonstration program that works on physical disks; it was demonstrated on a hard drive and on removable flash. The demo produced in the middle of May would occasionally fail to create the database, which was surprising given the automated testing I had done.

I set out to create an automated test based on the demo and was able to reproduce the problem in some cases. There were two bugs:

  1. When advancing the head (oldest end) of the index block "queue", I write a "headAdvanced" flag to the header that denotes that the head pointer points to the actual "real" block and that the backup block should not be considered. The bug is that when writing to the "alternate" head, I don't clear the "headAdvanced" flag -- if the database fails immediately afterward, then on reloading, the old version of the head is used, causing previously removed data blocks to appear at the start of a stream. This usually leads to the same data block being referenced by two indices (since the DB is always full and transfers blocks in a rolling queue).
  2. There was an error in calculating the length of the queue when the head is physically after the tail (wrap-around): the function returned a value 1 too large. This effectively caused the data index iterator to traverse the "tail" (newest) block of the queue twice. (A corrected calculation is sketched below.)
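
For reference, a correct length calculation for a circular queue of this kind looks roughly like the sketch below, assuming inclusive head and tail slot indices (the actual RealDB fields and method names surely differ):

final class QueueLengthSketch {
    /** Sketch (not RealDB code): number of occupied slots in a circular queue of
        'capacity' slots, where 'head' and 'tail' are inclusive slot indices.
        The wrap-around branch is the case the original function got wrong by one. */
    static int queueLength(int head, int tail, int capacity) {
        if (head <= tail) {
            return tail - head + 1;            // contiguous: head ... tail
        }
        return capacity - head + tail + 1;     // wrapped: head ... end, then 0 ... tail
    }
}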

I wondered: why didn't my tests, which inject failures at every possible I/O operation, detect these bugs? Because I chose parameters such that a single index block was sufficient to hold all of the indices of a stream -- a supported but degenerate case where head == tail and the head never moves.

I fixed the two bugs and have yet to get the demo to fail again. I updated the original tests and ensured that they fail with the old code. With the new code they get much farther into the test, but they still fail, so I am still finding bugs. These tests are needed not just because it's virtually impossible to reproduce all of the write faults in a real situation; they will also be critical when I start optimizing some of the naive code that was written just to get things to "work", in order to meet the complexity goals for the project (for example, turning O(n) recovery into O(1)).

1-Feb-2009

Updated the milestone 3 acceptance test framework to extend to n streams, where each stream skips every x-th record. This allows streams to write between 50% and 100% of the full dataset. I haven't figured out how to "test" the performance aspects -- the test checks only the literal guarantees made by RealDB. Because RealDB is allowed to reclaim space (deleting records), technically the test can pass as long as there are some records, they are in order, and they are records that I've written. I've added some print statements that give me a "heuristic" way to check that the test is reasonable. Here is an example of the results of a full (non-corrupting) run of an 8-stream test with skips = 0, 5, 2, 3, 0, 6, 50, 90. This means that the first stream (Test0) skips no records (writes 100%), the third (Test2) skips 1 out of every 2 records (writes 50%), and the seventh (Test6) skips 1 out of every 50 records (writes 98%).

Total operations performed: 19623
Stream Test0 (ID 0) has 12999 records from 37001 to 50000
Stream Test1 (ID 1) has 10399 records from 37001 to 49999
Stream Test2 (ID 2) has 6499 records from 37001 to 49999
Stream Test3 (ID 3) has 8667 records from 37000 to 50000
Stream Test4 (ID 4) has 13000 records from 37000 to 50000
Stream Test5 (ID 5) has 10833 records from 37000 to 50000
Stream Test6 (ID 6) has 12739 records from 37001 to 49999
Stream Test7 (ID 7) has 12856 records from 37000 to 50000

Note that the time ranges are fairly close together -- the deletion algorithm always picks the oldest block, which is supposed to keep the tails of the streams close together. The fact that the streams each contain proportionally different record counts is also a good sign. However, it's hard to write a JUnit assertion for this since I haven't defined the precise tolerance -- because of flushing and block boundaries, a stream that outputs half of the records of another will likely not have exactly half (but close -- within a block's worth).

Every 250 iterations I output the statistics again. Below is an example of the statistics on iteration 10250 (meaning that after the 10,250th I/O operation the test throws an exception and, if that operation was a write, changes the target block or blocks to random data).

Beginning iteration 10250
Stream Test0 (ID 0) has 12999 records from 13501 to 26500
Stream Test1 (ID 1) has 10460 records from 13424 to 26499
Stream Test2 (ID 2) has 6499 records from 13501 to 26499
Stream Test3 (ID 3) has 8666 records from 13501 to 26500
Stream Test4 (ID 4) has 12999 records from 13500 to 26499
Stream Test5 (ID 5) has 10909 records from 13408 to 26499
Stream Test6 (ID 6) has 12890 records from 13346 to 26499
Stream Test7 (ID 7) has 12854 records from 13501 to 26499

We can see from this example that the tails are pretty close together and so are the ratios, which gives a good impression that the transaction and recovery handling is not just passing the strict requirements but also performing reasonably. An additional piece of evidence is that the total number of records is close to the non-failing example.

Milestone 3 Caveats

In my last status update I forgot to mention that not everything is clean -- there are some limitations:

29-Jan-2009

Summary: Although 2 months have passed without a progress report, a lot of work has been done. The version on 25-Nov was not capable of space management (reclaiming the oldest block to write new data), nor was it capable of handling corruption. 39 revisions of code later, RealDB now has space management and is capable of handling file storage failures at any point. Tentatively I am calling milestone 3 completed, but I want to clean up a few loose ends.

Features/Details of work

Milestone 3 Acceptance Test

As an acceptance test to confirm completion of milestone 3, I created a JUnit-based automated test similar to the one in milestone 2. The key component in this test is the FailingBlockFile, which throws an IOException after x calls into the API and, if that call was a write, overwrites the entire block (or set of blocks for a multi-block call) with random data. The steps in this test are as follows:

  1. Create a database with an RDL file input and pointed to an in-memory BlockFile implementation (ByteArrayBlockFile)
  2. Wrap the file with a ProfilingBlockFile, run the test, and record the number of I/O operations performed (7969 in my test):
    1. Write 200,000 unique, ordered records into a single stream. The database is small enough that 200,000 records is enough to fill up the whole database many times.
    2. Every 500 records, perform a manual flush operation.
    3. Close the database and reload it (creating a wholly new set of objects)
    4. After the write is complete, re-read the database and ensure that it contains a set of records that is entirely in order and includes records in sequence up to the last flushed or written record. For example, say we write records 0 to 1345, so that the last flushed record is 1000. If the database contains records 567 to 1000, or 950 to 1340, those are passing results. If the records are in a different order, or data from before the last flush is lost (355 to 900, for example), the test fails.
  3. For every integer X from 1 to the number of operations performed:
    1. Rerun the test, this time with a FailingBlockFile that fails after X operations are performed, overwriting the target block(s) with random data if the failing operation is a write.
    2. When the failure occurs, skip to step 3 of the test: reload the database and re-read all records. The database should reload without exceptions (recover in 100% of cases), must not have lost any records from before the last flush, and all records must be in order and contain the correct data.

The test runs in 461 seconds on my machine, looping over a failure at each of the 7969 operations. It is important to note that the 7969 operations include the operations needed to actually create the database from an empty file, so the test also covers failures that occur while creating the database. However, RealDB does not guarantee that you can reload such a database: if the database throws an IOException before it is built, the test does not attempt to read any data and passes.
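
In outline, the failure-injection loop looks something like the sketch below. The block file class names are the real ones mentioned above, but every constructor and helper signature here is an assumption, not the actual RealDB test code.

import java.io.IOException;
import org.junit.Test;

public class FailureInjectionSketch {

    @Test
    public void recoversAtEveryFailurePoint() throws Exception {
        // Baseline: run the workload against an in-memory file wrapped in a
        // profiling decorator to count the total number of I/O operations.
        long totalOps = runBaselineAndCountOperations();   // 7969 in the reported run

        for (long failAt = 1; failAt <= totalOps; failAt++) {
            InMemoryImage image = new InMemoryImage();
            try {
                // Same workload, but the block file throws IOException on the
                // failAt-th operation and, if it was a write, corrupts the block.
                runWorkloadWithFailureAt(image, failAt);
            } catch (IOException expected) {
                if (!databaseWasCreated(image)) {
                    continue;   // no reload guarantee before the DB is fully built
                }
            }
            // Reload from the (possibly corrupted) image and verify: no exception,
            // records in order, nothing lost from before the last successful flush.
            reloadAndVerify(image);
        }
    }

    // Hypothetical helpers standing in for code built on ByteArrayBlockFile,
    // ProfilingBlockFile, and FailingBlockFile; bodies omitted for brevity.
    private long runBaselineAndCountOperations() { throw new UnsupportedOperationException(); }
    private void runWorkloadWithFailureAt(InMemoryImage image, long failAt) throws IOException { throw new UnsupportedOperationException(); }
    private boolean databaseWasCreated(InMemoryImage image) { throw new UnsupportedOperationException(); }
    private void reloadAndVerify(InMemoryImage image) { throw new UnsupportedOperationException(); }

    /** Placeholder for the reusable in-memory storage shared across one run. */
    private static final class InMemoryImage { }
}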

Transactions

Unfortunately I ended up having to implement a lightweight transaction system in RealDB. I initially wanted to stay away from transactions and rely more on "checkpoints", which scale with the number of flush operations. However, to adhere to RealDB's guarantees, I only have to use transactions for the two data index operations: moving a block from the free pool to an index, and moving a block from one index to another (or to itself).

The key feature of the transaction system for improving performance is not only letting multiple transactions be pending at the same time, but also allowing the operations to be scheduled for a later time. Each transaction is a list of operations represented as object instances implementing java.io.Flushable. The first operation of a transaction is always to open the transaction by writing it to disk. The add operation returns a Flushable object representing a task that must be performed before the given operation is performed. This ensures that operations are always performed in order, which is essential for allowing transaction recovery.

An example might be a transaction that requires the steps A, B, C, D, E. A coordinator gets a request to perform an operation that is to be done as a single transaction. It schedules A, B, C, D, and E to occur in order, through calls like "addBlock( int block, Transaction tx )", although none of the operations actually occur yet. If commit is called directly on the transaction, all 5 operations will occur at that time, in order. But of course there is more than one piece of data in every block in RealDB, which necessarily means that if a block is committed for an index, for example, a whole number of other transactions may be affected.

Let's say that D is the operation to add the item to the index. The coordinator calls index.addBlock( block, tx ). The index calls tx.addAction( D ), and that method returns an object that performs {A,B,C} in order. If index.flush() is called -- say because the client wants to commit some other operation "F" -- this might mean that D is about to be written to disk. Therefore, before the flush occurs, the index calls its returned object, which performs {A,B,C}, and then the index performs D directly. A Transaction remembers what has already been done, so if later a piece of code asks for C to happen, it looks to run {A,B,C}, sees that those steps are already completed, and skips them.
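
A hedged sketch of this deferred-transaction idea is shown below; the real RealDB Transaction class surely differs in detail. Each scheduled step is a java.io.Flushable. In the write-up the index performs its own step directly after flushing the prerequisites; this sketch simplifies that by letting the returned handle run everything up to and including the step it scheduled.

import java.io.Flushable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class TransactionSketch {
    private final List<Flushable> actions = new ArrayList<>();
    private int completed = 0;    // actions[0 .. completed-1] have already run

    /** Schedule a step; returns a handle that, when flushed, runs every step
        up to and including this one, in order, skipping anything already done. */
    Flushable addAction(Flushable action) {
        actions.add(action);
        final int upTo = actions.size();
        return () -> runUpTo(upTo);
    }

    /** Commit the whole transaction: run whatever has not run yet, in order. */
    void commit() throws IOException {
        runUpTo(actions.size());
    }

    private void runUpTo(int upTo) throws IOException {
        while (completed < upTo) {
            actions.get(completed).flush();   // perform the next pending step
            completed++;                      // remember it so it never re-runs
        }
    }
}

With this shape, flushing the handle returned when D was added runs A, B, C, and D exactly once and in order, while a later commit() runs only E.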

I am still somewhat concerned about circular dependencies, so I will have to make sure that whatever can be a transaction is managed properly, and that if a circular dependency could happen, it forces an early commit of part of a transaction. In fact, this is what happens on a smaller scale when a transaction transfers a block from the end of a stream to the front of that same stream. To properly support transaction recovery, I have to make sure that the removal happens before the add -- and because the head and tail of the queue could rest in the same physical block, it is possible that the block cannot be removed before it is added. It would be possible to perform this operation entirely without a Transaction, but that would require the coordinator to have too deep an understanding of the DataIndex implementation.

Future Plans

I would like to solidify that I've met milestone 3 by testing a multiple-stream database and by testing on real flash card hardware, creating faults by actually pulling the drive out of the socket while it is writing. I have high confidence that both situations will work because:

  1. The multiple-stream transaction handling is actually simpler than the single-stream case, because in the single-stream case I have to move a block from a stream to itself in one transaction, which requires extra effort to track, especially when the head and tail of the stream are in the same physical block.
  2. I have high confidence that this will work on real hardware because I tested failures at each of the 7969 possible failure points on the in-memory file in the previous test.

I would also like to extend the test a bit to check that no free blocks are unaccounted for. The database could "leak" blocks but still pass in the scenarios above.

25-Nov-2008

Next Steps to complete milestone 3:

28-Oct-2008

12-Oct-2008

29-Sept-2008

2-Sept-2008

1-Sept-2008

18-Aug-2008

7-Aug-2008

24-July-2008

23-July-2008

17-July-2008

13-July-2008

12-June-2008

27-May-2008

20-May-2008