Writing a Database Storage Engine


This semester I am in the advanced database class at the University of Alabama.  Me and a friend convinced the professor to have the final project to be an implementation of a NoSQL database.  I have always been interested in creating things and making a database is one of things I have always wanted to make.  Before I have not had to time due to class and work but now it is required for class so no one can yell at me for “wasting my time”.

We have written a fair bit of the database but the current part we are working on is the storage engine.  We didn’t want to use anything pre-built or create a naive solution like a giant flat file where we have to copy data down to insert in the middle.  In the end we designed a fake file system on top of the current file system with the extra method called insert.  Insert allows the end user to write in the middle of a document without moving all the extra data down.  For this we had to create a few classes which emulate common file system operations.  We create a FileSystem class, File class, FileReader class, and a FileWriter class.


The FileSystem class is the main entry point for everything.  It is what is used to open/create files and we also provide private functions which can be used by other classes in the same namespace.  The conceptual idea of the FileSystem is that there are some files stored within it and we provide access to loading those files and to read and write to and from them.  This requires us to save meta-data about the FileSystem and also meta-data about each File as well.  This class ensures that everything is consistent when things are written and read from disk.  Also it is tasked with creating a FileSystem on the disk if one does not exists and also allows for reading/saving the file systems data as well.


The File class is just an empty class that keeps track of the current properties about the file.  It also has a pointer to the FileSystem object which created it.  This was mainly to be used to allow readers and writers access to a File objects actual content on disk.  The only problem I have with this is that if you open multiple files and one is modified the properties of the others is not.  In the future this needs to be changed.


The FileReader class takes in a File and then provides a read interface for the file.  This class does not really do any of the heavy lifting but rather leaves it up to the file system to handle that.  I would like to be able to extend this to allow for more helpful write methods such as writeString, writeInt, and so forth.


Like the FileReader class this class facilities writing to a File but does no actual heaving lifting itself.  Like the FileReader class I also want to implement methods like readInt and readString and so on.

I think that these are good classes to have for right now but I wish I could either pull out more things from the FileSystem class or merge them into smaller functions.  Currently it has a lot of private methods and there are a few which repeat some work but with a little extra in it.

Will try to update more about this project in the coming weeks.


Total Views (243)

Bringing CUDA to Go

Lately I have been playing around with Go. If you have not seen it you should check it out over [here].

One of the things that would be really cool to me is to be able to program in Go but have the code execute on the GPU. The only problem is that Go does not support CUDA. Currently if you wanted to write something in Go which used CUDA you would need to write some CUDA code in a language which has support and then call that from Go. If you provided enough calls you could write a decent amount of code in Go but you incur a performance overhead due to communication from starting and then stopping kernels.

The two alternatives that I have been thinking about is

  1. Somehow write CUDA code using a library written in Go which transforms specified functions into code which can then be compiled for CUDA
  2. Write some type of interpreter which runs on the GPU and accepts byte-code that is given to it from Go

In both cases I would have to modify the Go compiler.  In the first case I would have to modify it to be able to split the code into two parts: CUDA code and host code.  What I would have to do with the CUDA code depends on how much work I want to do.  I could take the AST and convert it into C/C++ code and then compile that using the CUDA compiler.  If done this way I would have to devise strategies on how to emulate Go functionality on the GPU which does not hurt performance too much.  One example of this is channels.

If I had a channel which I passed to the GPU then I would have to generate code which would handle it.  This would require either limiting the options to the programmer or hiding the limitations somehow.

That is all for now but I will follow up soon.  This is really burning in my brain and I want to do it.

Total Views (344)

Google Interview

Today was my google interview and I think it went okay.  The only problem I had was when they asked me about a specific problem I had in the past that I had to solve.  I don’t really remember those because they are usually something that I solve and never see again.

They did ask me this coding question:

Given a sorted list of distinct integers from 0 to 99, such as [0, 1, 2, 40, 42, 86], produce a string that describes numbers missing from the list: in this case “3-39,41,43-85,87-99”. The items should be sorted in ascending order and separated by commas. When a gap spans only one number, the item is the number itself; when a gap is longer, the item comprises the start and the end of the gap, joined with a minus sign.

I wrote a solution, which was partially correct but still not entirely correct.  Mine was O(n) on the input data in the worst case and O(1) in the best (technically though the worst cast is O(1) because the array length is constant).  I then modified it to allow for arbitrary ranges as well. After cleaning it up and removing all the bugs here I my final solution.

public class Main {

    private static String formatString( int a , int b , boolean addComma ) {
        String s;
        // Construct string
        if( a == b ) {s = "" + a;}
        else {s = a + "-" + b;}
        // Add comma is necessary
        if( addComma ) s = "," + s;
        return s;

    public static String generateString( int[] data, int start , int max ) {
        if( data == null || data.length == 0) return "0-" + max;
        if( data.length == (max - start) ) return "";

        StringBuffer sb = new StringBuffer();

        int current = start;
        boolean hasAdded = false;
        int pos = 0;
        while( pos < data.length - 1) {
            if( data[pos] != current ) { // Missing current to data[pos] - 1
                sb.append( formatString( current , data[pos] - 1 , hasAdded ) );
                hasAdded = true;
                current = data[pos];
        if( current != max ) {sb.append( formatString( current + 1 , max , hasAdded ) );}
        return sb.toString();

Total Views (443)


One of my weakest points in my computer science career is the writing. I never know what to say or how to say it; even worse is that I am very concise. For instance I can write my entire research in a few paragraphs. This leads me to my current problem: I have to write a journal paper. I did take a technical writing course in my masters but most of that was fixing chapters out of the professors, to be published, book. We did get some insight on how to write a technical document given the source material, but not so much in how to start from scratch.

This entire post is just going to be describing what I did, some of the results, and what they mean. This is about as good as I get on writing a technical anything, short, and only the bare minimum. Some background is needed as such: I took a previous implementation of a TSP solver which uses 2-opt written for NVIDIA GPUs and I re-wrote it with the hopes of it having better performance. So here I go.

I took some previous work, modified it, now it runs better. The code runs on NVIDIA GPUs and was a single self-contained kernel function. What the code did was take in a TSP problem and randomly generate tours and then apply the 2-opt algorithm on each tour. The best tour found was the winner. How it was divided for CUDA was that each block would get it’s own tour and perform 2-opt on it. End the end there was a single spot in memory where the best tour length would be saved and the CPU would print out how long it took to find this solution and what the length was. The input was how many retries you wanted to perform and the tour.

First I took this code and broke it into its logical parts so that each part would only have to handle its own concerns. Next I looked at each part independently to see if there was any improvement that could be done. Also for some reason I took data that was related, i.e x and y pairs of coordinates, and put them in a struct. I think I originally did this because I would always get x and then y, but it doesn’t really matter because everything thread grabs the x and then the y as shown below.

Comparison of separate arrays or a merged array of elements.

Comparison of separate arrays or a merged array of elements.

What I found was that the code was already really good but I did make some minor improvements. I was not sure how to improve upon it and I almost gave up. But while I was running some experiments I noticed some really weird things going on with the thermal readings for the GPU. What was happening was that there was a good bit of thermal fluctuation as the kernel was running. What I realized was that blocks were stopping and starting and it would have some idle times and then some busy times and this would lead to the ups and downs. After thinking about it I realized I don’t need to start some variable number of blocks. I could just start as many blocks that can be active on the GPU and give each block some set of tours to work on. This gave me two things: thermal fluctuation dropped and it allowed more memory to be used per block. In the original program all the memory would be allocated up front for each block that would be running. This means that if your problem instance took 1MB and you wanted to launch 1024 blocks you would end up using 1GB of memory. Now you can have as many units of work, and divide them among the blocks that are running. The card I have is a GTX 780 which can have 16 resident blocks per SMX and it has 12 SMX processors on it giving me a total of 192 blocks which can be active. Because of this change my card which has 3GB of data can have tour lengths over 1 million cities but with 1000 restarts (But would take a year to finish).

Notice the fluctuations over time.  Especially that drop near the end.

Notice the fluctuations over time. Especially that drop near the end.

More stable readings after fixing block count.

More stable readings after fixing block count.

So far with what I have my performance has not really increased that much, maybe about 10-15% faster than the original. Of course you could argue for large enough problem sizes this is okay because 15% of a month is a pretty good amount of time. I did however finally get the performance to improve dramatically but was more related to the algorithm itself than anything to do with CUDA. What I noticed was that each thread finds the best solution for it’s subset of data. Then of all the solutions per thread the best is picked and we apply that solution. What I noticed was that sometimes solutions do not overlap and if you apply both you get converge much much quicker. For an input of size 33k I get a speedup from the original problem around 20x.

In the previous work the authors stated that 128 threads is the way to go but I ended up finding that setting the threads to the minimum between 1024 and the number of cities (but multiples of 32) seemed to have the best performance. Also the shared memory is better when I do the same thing.

Total Views (360)

Possible Job

So I got referred to Microsoft by my friend. I already had my first interview about two weeks ago. They just contacted me yesterday and told me that I would have an onsite interview.

The only problem is that I applied for an internship but they said I could not do one because I would be graduating the semester after the internship. So I am going in for a full-time position. This interview will be a lot harder than the internship interviews. My recruiter has already given me a list of recommended (decade old) books to review. I am about half-way through one of them (Programming Interviews Exposed) so that is good. The only problem is that I cannot afford to purchase them so I have to… find them online.

So far the book seems pretty straight forward. The main thing that the author(s) mention is to never stop thinking out loud. They also push think before acting a lot, which to me seems like common sense. I guess this is because I am 30 and have a few more year of life experience on the usual programmers that get a job when they graduate at 22?

Overall I am really happy and excited but still nervous. I won’t hear back from Microsoft until about October through February, quite the gap.

Well back to reading!

Total Views (486)

Strange Results

Like I mentioned before I am doing some experiments to evaluate my 2-opt solution.  The only problem is now that I have the solution I have some interesting results that I cannot explain right away.  A little background should be in order before I just dump the graph.  My experiments are how the thread size, and shared buffer size, might affect performance.  So I have a fully connected digraph with 1400 nodes and I ran over all thread and shared memory combinations that are multiples of 32.  Thread count is obvious but shared memory needs more explanation.  The shared memory allocation represents the amount of nodes in my TSP path I can store for quick lookup.  Each node holds the position, two floats, and the weight to the next city in the path, an int.  This means each shared memory spot holds 12 bytes.

The issue is that around thread count of 576 I start to notice strange results around shared memory size of 544.  Below are a few graphs that start to show what I am talking about:

576 threads

576 threads

608 threads

608 threads

640 threads

640 threads

As you can see above something strange starts to happen around allocating 544 slots in shared memory.  I have tested to make sure that there were no banks conflicts and none were reported.  Another even stranger thing is that my occupancy is higher when shared memory is 544 than before or after that jump!

Good ole CUDA, keeping you on your toes.

Total Views (331)

Running Experiments

I find running experiments to be a little overwhelming sometimes.  For instance I just ran CUDA experiments with my TSP implementation to see how multiple configurations affect performance.  In the end I had 800 different configurations which took around 11 hours.  The real problem was that I did not automate the process because I was using Windows and was not familiar with batch programming.

It would be nice if I could just write a bash script to automate everything like I currently do for grading the algorithms and data structures class I am assigned too.  It is not that I don’t have a Linux machine but that I don’t have much experience doing real development without a GUI interface.  The biggest thing that gives me pause is that I would have to set aside time from work and research to learn how to use the command line tools, which would be really good to know, and how to parse the output they give.  Currently I can do the debugging using the NSIGHT debugger in Visual Studio and profile my applications using the NVIDIA Visual Profiler or with NSIGHT’s Visual Studio Profiler.

The Plan

I think that I will try to learn how to batch program.  One issue is that I used #define to control some of the shared memory sizes and to help let the compiler unroll loops where they are used.  Now I have to move the #define statements into templates so that I can pass the parameters in from the command prompt and also not have to recompile my code each time I make a change.  I am not too familiar with templates so now I will need to learn templates while I refactor my code.  So in recap I have to:

  1. Learn Windows batch programming
  2. Learn C++ templates
  3. Refactor my code

I hope in the long run I get good results, learn how to make my code better, and be able to automate tasks in windows using scripts.

Preliminary Results

So I made some quick changes to my code to use templates.  I was hoping that I would see some performance increase but it seems that It made my code worse.  Originally it took 89 seconds to find a length of 2073 for a complete asymmetric graph with 1400 cities, now it takes  121 seconds.  It seems that I need to work on it a little more and figure out where I can fix things.

Total Views (384)

Reading the documentation

Some background information may be of use.  My current work involves me taking a working implementation of 2-opt and optimizing it the best I can.  For those who don’t know, 2-opt [wiki] is a heuristic to finding a solution to the Travelling Salesman Problem [wiki] (TSP).

The idea behind 2-opt is that you find two edges that cross and then reconnect them so they don’t cross.  The algorithm performs these swaps until no more swaps are possible.  One thing to note is that 2-opt does not guarantee an optimal solution.

2-opt edge swap

2-opt edge swap

Now that I have explained what I am doing let me explain the totally unrelated error I had.

My algorithm needs some memory to keep track of the path and the edge weights along the path.  To do this I allocate memory at the top and then before I leave I free it.  So far so good.  However, what happens if you don’t free the memory?  You get things like this :

2-opt results before and after


On the left is the original time to complete each problem.  On the right is the new time after I don’t free.  If you don’t know the correct answer then you more than likely will come to my conclusion as well.  Releasing memory is really really slow for some reason.  Of course since you are reading this you can guess that is probably not what is happening.

Like an idiot I decided to let the whole world know I was an idiot by posting my question asking what was going on to Stack Overflow [link].  Within minutes I was informed I was an idiot and told that when allocating memory inside a CUDA kernel you have a default heap size of 8MB to allocate memory from.  Sure enough in section B.18.1 in the “CUDA C Programming Guide“, (v6.0), it states

The device memory heap has a fixed size that must be specified before any program using malloc() or free() is loaded into the context.  A default heap of eight megabytes is allocated if any program uses malloc without explicitly specifying the heap size.

Then it goes further to state I should use cudaDeviceSetLimit( cudaLimitMallocHeapSize , size_in_bytes) to update the heap size.

Once I allocate a heap size large enough and run my program runs as though I never made any optimizations and now I am back to the beginning of what I was supposed to do. Next time I will probably read the documentation just a little better.

Total Views (416)

Working With Cuda Without Aspirin

I find that working with CUDA is overall a fun experience.  The only issues I seem to have is that sometimes I get very unexpected results.

For instance let’s say I have this code below :

for(int k = threadIdx.x; k &lt; TileSize; k += blockDim.x) {
    int index = front + k;
    if( index &gt; lower_bound ) {
        shared_x[k] = glob[index].x;
        shared_y[k] = glob[index].y;

If your threadDim.x is always less than TileSize (which in my case it was) then you can replace the above code with :

if(threadIdx.x &lt; TileSize) {
    int index = front + threadIdx.x;
    if( index &gt; lower_bound ) {
        shared_x[threadIdx.x] = glob[index].x;
        shared_y[threadIdx.x] = glob[index].y;

Apparently not on my machine though.  When I make this change my kernel crashes with a memory access violation.  It cannot be shared memory because the blockDim.x is always less or equal to TileSize.  This leaves the global memory allocations.  But this cannot be because any access made in the second had to be made in the first!  It seems I will never be a great CUDA developer.

Total Views (1323)