Lexers and Lexing


Lexical analysis is the process of constructing lexemes* from a sequence of characters. One way to think of this is that we take the contents of a file and split it into strings, where each string represents some basic idea such as an identifier, an operator, or a left parenthesis. We call these generated strings lexemes. In addition to creating these strings we also pair each with its type, resulting in a token. For instance, you might generate a token with type identifier for the lexeme value printf.


* The raw text is the lexeme and tokens are tuples containing the lexeme type and lexeme value.

Some lexical analysis is so common that some languages provide it natively, in some limited fashion. For instance, Java has the Scanner class and C (and by extension C++) has the strtok function. These implementations limit users to constructing lexemes, not tokens, based on delimiters. For instance, the default delimiter is whitespace, so they return every contiguous string of non-whitespace characters.

For handling programming languages we need something more powerful. For instance, given the input below how should we tokenize it?

    int main() {
        print("Hi");
    }

Would you expect these lexemes (types omitted on purpose)

     <int> <main()> <{> <print("Hi");> <}>

or these lexemes (types omitted on purpose)

      <int> <main> <(> <)> <{> <print> <(> <"Hi"> <)> <;> <}>

In the first example we see something that Scanner or strtok would return, while the second example requires a more complicated solution, which you will end up creating in this course.

A program which is dedicated to creating lexemes, and additionally tokens, from some input is called a lexer. To determine how a lexer constructs lexemes we need to look at regular expressions.

Regular Expressions

A regular expression is a string which can be used to match other strings. This might sound confusing, so we provide an example below which matches all strings of upper and lowercase letters of length at least 1

    [a-zA-Z]+
The brackets and plus have special meanings which are described in the list of common operations below

    []  - character classes ([a-z] matches all lowercase letters)
    ()  - grouping
    +   - 1 or more times
    *   - 0 or more times
    ?   - 0 or 1 times
    ^   - Depending on context: the beginning of a line, or negation when it appears first inside a character class ([^\s] means any character that is not whitespace)
    \s  - Whitespace characters, equivalent to [ \t\r\n\v\f]

For instance the regular expression below would match an identifier in C

    [a-zA-Z_][a-zA-Z0-9_]*

which matches strings that begin with a letter or underscore followed by 0 or more letters, digits, or underscores.
If you want to match strings which contain the symbols in the table above you need to escape the character by preceding it with a backslash. This also means that if you want to match a string containing backslashes you need to write \\ to match a single backslash.

Lexemes and Tokens

Now that we have talked a little about regular expressions we can talk about how they relate to lexical analysis. Usually, we want to associate a lexeme type with a regular expression. For this post we use the notation:

    TYPE       = RegEx

Using the above definition and regular expressions we can now examine the two examples we gave above. We can see that for the first example the regular expression is any string of non-whitespace characters of length at least 1, given, for lack of a better name, type LEXEME:

    LEXEME      = [^\s]+

and the second is much more complicated. We match quoted strings, identifiers, periods, parentheses, braces, and the semicolon

    STRING      = \"[a-zA-Z]*\"
    IDENT       = [a-zA-Z]+
    PERIOD      = \.
    L_PAREN     = \(
    R_PAREN     = \)
    L_BRACE     = \{
    R_BRACE     = \}
    SEMICOLON   = ;

From this we can see that tokenizing the second example with these lexeme types would have resulted in

    <IDENT,int> <IDENT,main> <L_PAREN> <R_PAREN> <L_BRACE> <IDENT,print> <L_PAREN> <STRING,"Hi"> <R_PAREN> <SEMICOLON> <R_BRACE>

Notice that for some of the tokens we don’t include the lexeme value since the lexeme type is enough to generate the value if needed.

Lexer (theory)

Now that we know how regular expressions map to lexeme types and generate tokens, let's take a look at how they can be used in a lexer. Conceptually, a lexer reads the text of a file and tries to match the longest sequence of characters it can to a lexeme's regular expression. For instance, given 1234.6 we could match the characters 123 to a lexeme of type INTEGER and 4.6 to a lexeme of type REAL. However, we want to match the longest sequence of characters possible, so we match 1234.6 to REAL.

More concretely, it starts at a position in the file, either at the beginning or just after reading a lexeme, and performs the following

        REGS = ALL_REGS
        CURR = START
        S = ""
        WHILE CURR < SIZE(DATA):
            S = CONCAT(S, DATA[CURR])
            FOR R IN REGS:
                IF NO_CHANCE(R, S):                      # There is no chance that R can match S
                    REGS = REGS - {R}
                    IF REGS == {}:
                        IF ACCEPTS(R, S[START:CURR-1]):  # The regular expression R matches S[START:CURR-1]
                            RETURN (R, S[START:CURR-1])
                        ERROR("Unknown lexeme type")
            CURR += 1
        IF SIZE(REGS) == 1:                              # If a lexeme is accepted by two regular expressions you will have a bad day
            RETURN (REGS[0], S)
        ERROR("Unknown lexeme type")

You should be able to follow most of what the pseudocode is doing except for NO_CHANCE and ACCEPTS.

What NO_CHANCE (and similarly ACCEPTS) does is check whether, given this string, it would ever be possible to match this regular expression. To help with this let's look at three regular expressions and some input.

    R1      =   [a-z]+[0-9]?[A-Z]+
    R2      =   [a-zA-Z]+
    R3      =   [a-z]+

    S1      =   helloEveryone

    After reading "h" (and similarly up through "hello"):

        R1 -> has a partial match against [a-z]+[0-9]?[A-Z]+
        R2 -> has a match against [a-zA-Z]+
        R3 -> has a match against [a-z]+

    After reading "helloE":

        R1 -> has a match against [a-z]+[0-9]?[A-Z]+ ([0-9] is optional remember!)
        R2 -> has a match against [a-zA-Z]+
        R3 -> cannot match, remove R3 from REGS

    After reading "helloEv":

        R1 -> cannot match (nothing may follow the final [A-Z]+), remove R1 from REGS
        R2 -> has a match against [a-zA-Z]+

    After reading the rest of the input:

        R2 -> still has a match, so when the input ends we return (R2,"helloEveryone")

Lexer (practice)

Now that you have some idea of what is going on conceptually, how would you implement this in code? There are general-purpose programs, such as lex and Flex, which use regular expressions to generate lexers. Luckily you don't have to work with regular expressions if your lexeme types are simple. For instance, recall identifiers in C: they begin with either a letter or an underscore followed by 0 or more letters, underscores, or digits.

How would you implement this in code? You would not have to create a regular expression but rather just check to make sure you meet the requirements listed

// Assumption: we already validated the first character
Token* readIdentifier(char* data, int* pos, int N) {
    int start = pos[0];
    while(pos[0] < N && (isLetter(data[pos[0]]) || isDigit(data[pos[0]]) || isUnderscore(data[pos[0]]))) {
        pos[0]++;
    }
    int length = pos[0] - start;
    return createToken(IDENTIFIER, data + start, length);
}

Token* readLexeme(char* data, int* pos, int N) {
    /* ... checks for other lexeme types ... */
    else if(identifierWaiting(data,pos,N)) return readIdentifier(data,pos,N);
    /* ... */
}

Some of the more astute readers might ask: what if I want to have a lexeme type for each keyword in my language? Luckily most keywords match the identifier pattern, so when you read in an identifier just check to see if it is a keyword and change the type from IDENTIFIER to your keyword type, for instance FOR or IF.
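As a sketch, a hypothetical lookupKeywordType helper (the keyword set and type constants here are made up for illustration) might look like:

```c
#include <string.h>

enum TokenType { IDENTIFIER, KW_FOR, KW_IF, KW_ELSE };

// After reading an identifier, promote its type if the lexeme is a keyword.
// A linear scan is fine for a handful of keywords; a hash table pays off
// for larger languages.
int lookupKeywordType(const char* lexeme) {
    if(strcmp(lexeme, "for") == 0)  return KW_FOR;
    if(strcmp(lexeme, "if") == 0)   return KW_IF;
    if(strcmp(lexeme, "else") == 0) return KW_ELSE;
    return IDENTIFIER;
}
```

After readIdentifier produces a lexeme, a single call to lookupKeywordType decides whether to promote its type.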

From the above paragraph, and the regular expression matching example, you might also ask the question: what do I do if it is possible for two lexeme types to start with the same characters? You will need to read until there is a difference between the two. This can happen if you want lexeme types INTEGER and REAL. Both can begin with a digit, but as you read you might realize you are reading a REAL instead of an INTEGER or vice versa. To handle this you can have a generic readNumber function which handles both cases inside.
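A sketch of such a readNumber (the INTEGER and REAL constants are placeholders, and a real lexer would also build the Token) might decide between the two types like this:

```c
#include <ctype.h>

enum { INTEGER, REAL };

// Reads digits; if a '.' followed by another digit appears, keep reading
// digits and report REAL, otherwise report INTEGER. Advances *pos past
// the number and returns the lexeme type.
int readNumber(const char* data, int* pos, int N) {
    int type = INTEGER;
    while(*pos < N && isdigit((unsigned char)data[*pos])) (*pos)++;
    // Only commit to REAL once we see ".<digit>" -- the longest match wins
    if(*pos + 1 < N && data[*pos] == '.' && isdigit((unsigned char)data[*pos + 1])) {
        type = REAL;
        (*pos)++;
        while(*pos < N && isdigit((unsigned char)data[*pos])) (*pos)++;
    }
    return type;
}
```

Given the input 1234.6, the function consumes all of 1234.6 and reports REAL, which is the longest-match behavior described earlier.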


In your language you are required to have comments. The easiest comments to handle are single-character comments such as the # symbol. While you are reading whitespace, if you see the # symbol read until the end of the line. If you want to support more complicated markers such as // or /* … */ you will need to do some extra work. The best way to handle this is to have a check in the readLexeme function which checks for the first character, then checks the second, and handles the remaining characters as necessary. Make sure you can handle escaped characters if you do multiline comments.
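As a sketch of the single-character case (skipIgnored is a hypothetical helper, and the # comment style matches the one described above), whitespace and comments can be consumed in one loop:

```c
#include <ctype.h>

// Skip whitespace and #-style comments in one loop.
// Advances *pos to the next character that starts a lexeme.
void skipIgnored(const char* data, int* pos, int N) {
    while(*pos < N) {
        if(isspace((unsigned char)data[*pos])) {
            (*pos)++;
        } else if(data[*pos] == '#') {
            // A comment runs to the end of the line
            while(*pos < N && data[*pos] != '\n') (*pos)++;
        } else {
            break;
        }
    }
}
```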

Error Reporting

One issue that you will be required to deal with is error reporting (in both your lexer and parser). What happens if you read an unsupported character? You should be able to display the line and column where the error appears. Additionally, what happens in your parser when you find an operator, say division, where the left value is an integer and the right is an array? You should be able to print out the file, line, and column this error appears on.

The best way to handle this is to have your lexer keep track of the current line and column. When you read a lexeme, save the file, line, and starting column with your lexeme. This can then be used by your parser in error messages. In addition, you can store extra information, such as the starting position of the line the lexeme is on, if you wish to print the entire line.
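A sketch of that bookkeeping (this Lexer struct is an assumption for illustration; yours will have more fields):

```c
// Tracks position as characters are consumed so that every token can
// record where it started.
typedef struct {
    const char* data;
    int pos;
    int line;   // 1-based line number
    int col;    // 1-based column number
} Lexer;

// Consume one character, updating the line and column counters.
char lexerNextCharacter(Lexer* l) {
    char c = l->data[l->pos++];
    if(c == '\n') {
        l->line++;
        l->col = 1;
    } else {
        l->col++;
    }
    return c;
}
```

Reading the line and column just before a lexeme starts gives you exactly the values to store in the Token.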

Case Study

Let’s look at a lexer which supports the following lexeme types:

    IDENT       = [a-zA-Z]+
    L_PAREN     = \(
    R_PAREN     = \)

The three lexeme types we have are left parenthesis, right parenthesis, and identifiers. The parentheses should be obvious, and for identifiers we accept 1 or more letters.

Let us look at some C code to see how we might structure our code to handle these requirements. There are two methods that are called the most by the user: lexerHasToken and lexerNextToken. The lexerHasToken function returns true if there are (possibly) more Tokens to be read. The lexerNextToken function tries to read the next token; it prints an error if one occurs, otherwise it returns a new Token.

The code for lexerHasToken is below and returns true if there is (possibly) another lexeme waiting to be read

    int lexerHasToken(Lexer* l) {
        //  We ignore whitespace
        while(lexerHasMore(l) && isSpace(lexerPeek(l))) {
            lexerNextCharacter(l);
        }
        return lexerHasMore(l);
    }

In this function we have four helper functions that you will have to create.

  • lexerHasMore takes in the lexer and returns true if there is another character in the file.
  • isSpace returns true if the character supplied is whitespace
  • lexerPeek returns the current character without moving to the next
  • lexerNextCharacter moves to the next character in the file

In addition to handling whitespace, this is where you would handle comments. This can be implemented by having a loop with two jobs: one is to remove whitespace, the other is to remove comments. If you look at the attached code I have provided you can see how I perform these operations. Unfortunately, you will have to change how this is done in your lexer because you are only allowed to read one character at a time from the file.

The code below is for lexerNextToken and returns the next token, unless there is an error

    Token* lexerNextToken(Lexer* l) {

        if(lexerPeek(l) == '(') {
            return lexerReadLeftParen(l);
        }else if(lexerPeek(l) == ')') {
            return lexerReadRightParen(l);
        }else if(isLetter(lexerPeek(l))) {
            return lexerReadIdentifier(l);
        }else {
            lexerError(l, "Unsupported character");
        }

        return NULL;
    }

We can see there are five new functions

  • lexerReadLeftParen reads in the left parenthesis and constructs a new Token for it; this will move to the next character in the file
  • lexerReadRightParen reads in the right parenthesis and constructs a new Token for it; this will move to the next character in the file
  • isLetter checks to see if the current character is a letter as defined by the regular expression [a-zA-Z]
  • lexerReadIdentifier will read in 1 or more characters and construct a new Token with type IDENT; this will move over 1 or more characters in the file
  • lexerError will print out any information we have about the error, including possibly the file, line, and column, along with an error message

If you wanted to add a new lexeme type, this is where you would add a new else if statement. For instance, adding a new lexeme type for the + operator would add

    }else if(lexerPeek(l) == '+') {
        return lexerReadOperator(l);
    }

If you are interested in how we handle identifiers you can either check out the attached code or scroll up and examine the code for parsing C identifiers.

Extra Fun

Here are some things you can try to add to my lexer, in order from easiest to most difficult

  • Handle operators
  • Handle integers
  • Handle characters and strings
  • Handle reals
  • Realize you can parse scheme

Attached Code

The attached code can be compiled using either the makefile or the command

	gcc lexer.c -o lexer

You can then use the program with the command

	./lexer filename

I have attached three sample files, due to attachment limitations, which you can run on the lexer.



Writing a Database Storage Engine


This semester I am in the advanced database class at the University of Alabama.  A friend and I convinced the professor to make the final project an implementation of a NoSQL database.  I have always been interested in creating things, and a database is one of the things I have always wanted to make.  Before now I have not had the time due to class and work, but now it is required for class so no one can yell at me for “wasting my time”.

We have written a fair bit of the database, but the current part we are working on is the storage engine.  We didn’t want to use anything pre-built or create a naive solution like a giant flat file where we have to copy data down to insert in the middle.  In the end we designed a fake file system on top of the real file system with an extra method called insert.  Insert allows the end user to write in the middle of a document without moving all the following data down.  For this we had to create a few classes which emulate common file system operations.  We created a FileSystem class, a File class, a FileReader class, and a FileWriter class.


The FileSystem class is the main entry point for everything.  It is what is used to open/create files, and we also provide private functions which can be used by other classes in the same namespace.  The conceptual idea of the FileSystem is that some files are stored within it and we provide access to loading those files and to reading and writing them.  This requires us to save meta-data about the FileSystem and about each File as well.  This class ensures that everything is consistent when things are written to and read from disk.  It is also tasked with creating a FileSystem on the disk if one does not exist, and it allows for reading/saving the file system’s data as well.


The File class is just a thin class that keeps track of the current properties of the file.  It also has a pointer to the FileSystem object which created it.  This is mainly used to allow readers and writers access to a File object’s actual content on disk.  The only problem I have with this is that if you open multiple files and one is modified, the properties of the others are not updated.  In the future this needs to be changed.


The FileReader class takes in a File and then provides a read interface for the file.  This class does not really do any of the heavy lifting but rather leaves it up to the file system to handle that.  I would like to be able to extend this to allow for more helpful read methods such as readString, readInt, and so forth.


Like the FileReader class, the FileWriter class facilitates writing to a File but does no actual heavy lifting itself.  Here I also want to implement methods like writeInt and writeString and so on.

I think that these are good classes to have for right now, but I wish I could either pull more things out of the FileSystem class or merge some of its methods into smaller functions.  Currently it has a lot of private methods, and there are a few which repeat some work with small variations.

Will try to update more about this project in the coming weeks.



Bringing CUDA to Go

Lately I have been playing around with Go. If you have not seen it you should check it out over [here].

One of the things that would be really cool to me is to be able to program in Go but have the code execute on the GPU. The only problem is that Go does not support CUDA. Currently if you wanted to write something in Go which used CUDA you would need to write some CUDA code in a language which has support and then call that from Go. If you provided enough calls you could write a decent amount of code in Go but you incur a performance overhead due to communication from starting and then stopping kernels.

The two alternatives that I have been thinking about are

  1. Somehow write CUDA code using a library written in Go which transforms specified functions into code which can then be compiled for CUDA
  2. Write some type of interpreter which runs on the GPU and accepts byte-code that is given to it from Go

In both cases I would have to modify the Go compiler.  In the first case I would have to modify it to be able to split the code into two parts: CUDA code and host code.  What I would have to do with the CUDA code depends on how much work I want to do.  I could take the AST and convert it into C/C++ code and then compile that using the CUDA compiler.  If done this way I would have to devise strategies for emulating Go functionality on the GPU without hurting performance too much.  One example of this is channels.

If I had a channel which I passed to the GPU then I would have to generate code which would handle it.  This would require either limiting the options to the programmer or hiding the limitations somehow.

That is all for now but I will follow up soon.  This is really burning in my brain and I want to do it.


Google Interview

Today was my Google interview and I think it went okay.  The only problem I had was when they asked me about a specific problem I had in the past that I had to solve.  I don’t really remember those because they are usually something that I solve and never see again.

They did ask me this coding question:

Given a sorted list of distinct integers from 0 to 99, such as [0, 1, 2, 40, 42, 86], produce a string that describes numbers missing from the list: in this case “3-39,41,43-85,87-99”. The items should be sorted in ascending order and separated by commas. When a gap spans only one number, the item is the number itself; when a gap is longer, the item comprises the start and the end of the gap, joined with a minus sign.

I wrote a solution which was partially, but not entirely, correct.  Mine was O(n) on the input data in the worst case and O(1) in the best (technically the worst case is O(1) because the array length is bounded by a constant).  I then modified it to allow for arbitrary ranges as well.  After cleaning it up and removing all the bugs, here is my final solution.

public class Main {

    private static String formatString( int a , int b , boolean addComma ) {
        String s;
        // Construct string
        if( a == b ) {s = "" + a;}
        else {s = a + "-" + b;}
        // Add comma if necessary
        if( addComma ) s = "," + s;
        return s;
    }

    public static String generateString( int[] data, int start , int max ) {
        if( data == null || data.length == 0) return formatString( start , max , false );
        if( data.length == (max - start + 1) ) return "";

        StringBuffer sb = new StringBuffer();

        int current = start;
        boolean hasAdded = false;
        int pos = 0;
        while( pos < data.length ) {
            if( data[pos] != current ) { // Missing current to data[pos] - 1
                sb.append( formatString( current , data[pos] - 1 , hasAdded ) );
                hasAdded = true;
                current = data[pos];
            }
            current++;
            pos++;
        }
        if( current <= max ) {sb.append( formatString( current , max , hasAdded ) );}
        return sb.toString();
    }
}



One of my weakest points in my computer science career is writing. I never know what to say or how to say it; even worse is that I am very concise. For instance, I can write up my entire research in a few paragraphs. This leads me to my current problem: I have to write a journal paper. I did take a technical writing course during my masters, but most of that was editing chapters of the professor’s soon-to-be-published book. We did get some insight on how to write a technical document given the source material, but not so much on how to start from scratch.

This entire post is just going to describe what I did, some of the results, and what they mean. This is about as good as I get at writing a technical anything: short, and only the bare minimum. Some background: I took a previous implementation of a TSP solver which uses 2-opt, written for NVIDIA GPUs, and I re-wrote it with the hope of better performance. So here I go.

I took some previous work, modified it, and now it runs better. The code runs on NVIDIA GPUs and was a single self-contained kernel function. What the code did was take in a TSP problem, randomly generate tours, and then apply the 2-opt algorithm to each tour. The best tour found was the winner. How it was divided for CUDA was that each block would get its own tour and perform 2-opt on it. In the end there was a single spot in memory where the best tour length would be saved, and the CPU would print out how long it took to find this solution and what the length was. The input was how many retries you wanted to perform and the tour.

First I took this code and broke it into its logical parts so that each part would only have to handle its own concerns. Next I looked at each part independently to see if there was any improvement to be made. Also, for some reason, I had taken data that was related, i.e. x and y pairs of coordinates, and put them in a struct. I think I originally did this because I would always get x and then y, but it doesn’t really matter because every thread grabs the x and then the y, as shown below.

Comparison of separate arrays or a merged array of elements.


What I found was that the code was already really good, but I did make some minor improvements. I was not sure how to improve upon it and I almost gave up. But while running some experiments I noticed some really weird things going on with the thermal readings for the GPU: there was a good bit of thermal fluctuation as the kernel was running. Blocks were stopping and starting, so the GPU would have some idle times and then some busy times, and this led to the ups and downs. After thinking about it I realized I don’t need to start some variable number of blocks. I could just start as many blocks as can be active on the GPU and give each block some set of tours to work on. This gave me two things: the thermal fluctuation dropped, and more memory could be used per block. In the original program all the memory would be allocated up front for each block that would be running. This means that if your problem instance took 1MB and you wanted to launch 1024 blocks you would end up using 1GB of memory. Now you can have as many units of work as you like and divide them among the blocks that are running. The card I have is a GTX 780, which can have 16 resident blocks per SMX and has 12 SMX processors, giving me a total of 192 blocks which can be active. Because of this change my card, which has 3GB of memory, can handle tours of over 1 million cities with 1000 restarts (but it would take a year to finish).

Notice the fluctuations over time.  Especially that drop near the end.


More stable readings after fixing block count.


So far with what I have, my performance has not really increased that much, maybe about 10-15% faster than the original. Of course, you could argue that for large enough problem sizes this is okay, because 15% of a month is a pretty good amount of time. I did, however, finally get the performance to improve dramatically, but it was more related to the algorithm itself than anything to do with CUDA. Each thread finds the best solution for its subset of data; then, of all the per-thread solutions, the best is picked and applied. What I noticed was that sometimes solutions do not overlap, and if you apply both you converge much, much quicker. For an input of size 33k I get a speedup over the original of around 20x.

In the previous work the authors stated that 128 threads is the way to go, but I ended up finding that setting the thread count to the minimum of 1024 and the number of cities (rounded to a multiple of 32) seemed to give the best performance. Sizing the shared memory the same way also works better.


Possible Job

So I got referred to Microsoft by my friend. I already had my first interview about two weeks ago. They just contacted me yesterday and told me that I would have an onsite interview.

The only problem is that I applied for an internship, but they said I could not do one because I would be graduating the semester after the internship. So I am going in for a full-time position. This interview will be a lot harder than the internship interviews. My recruiter has already given me a list of recommended (decade-old) books to review. I am about half-way through one of them (Programming Interviews Exposed), so that is good. The only problem is that I cannot afford to purchase them so I have to… find them online.

So far the book seems pretty straightforward. The main thing that the author(s) mention is to never stop thinking out loud. They also push “think before acting” a lot, which to me seems like common sense. I guess this is because I am 30 and have a few more years of life experience than the usual programmer who gets a job when they graduate at 22?

Overall I am really happy and excited but still nervous. I won’t hear back from Microsoft until sometime between October and February, quite the gap.

Well back to reading!


Strange Results

Like I mentioned before, I am doing some experiments to evaluate my 2-opt solution.  The only problem is that now that I have the solution, I have some interesting results that I cannot explain right away.  A little background is in order before I just dump the graphs.  My experiments measure how the thread count, and shared buffer size, might affect performance.  So I have a fully connected digraph with 1400 nodes and I ran over all thread and shared memory combinations that are multiples of 32.  Thread count is obvious, but shared memory needs more explanation.  The shared memory allocation represents the number of nodes in my TSP path I can store for quick lookup.  Each node holds the position, two floats, and the weight to the next city in the path, an int.  This means each shared memory slot holds 12 bytes.

The issue is that around thread count of 576 I start to notice strange results around shared memory size of 544.  Below are a few graphs that start to show what I am talking about:

576 threads


608 threads


640 threads


As you can see above, something strange starts to happen around 544 slots of shared memory.  I have tested to make sure that there were no bank conflicts, and none were reported.  Another even stranger thing is that my occupancy is higher when the shared memory size is 544 than before or after that jump!

Good ole CUDA, keeping you on your toes.


Running Experiments

I find running experiments to be a little overwhelming sometimes.  For instance, I just ran CUDA experiments with my TSP implementation to see how multiple configurations affect performance.  In the end I had 800 different configurations, which took around 11 hours.  The real problem was that I did not automate the process, because I was using Windows and was not familiar with batch programming.

It would be nice if I could just write a bash script to automate everything like I currently do for grading in the algorithms and data structures class I am assigned to.  It is not that I don’t have a Linux machine, but that I don’t have much experience doing real development without a GUI.  The biggest thing that gives me pause is that I would have to set aside time from work and research to learn the command line tools, which would be really good to know, and how to parse the output they give.  Currently I can debug using the NSIGHT debugger in Visual Studio and profile my applications using the NVIDIA Visual Profiler or NSIGHT’s Visual Studio profiler.

The Plan

I think that I will try to learn how to batch program.  One issue is that I used #define to control some of the shared memory sizes and to help the compiler unroll loops where they are used.  Now I have to move the #define statements into templates so that I can pass the parameters in from the command prompt and not have to recompile my code each time I make a change.  I am not too familiar with templates, so I will need to learn them while I refactor my code.  In recap I have to:

  1. Learn Windows batch programming
  2. Learn C++ templates
  3. Refactor my code

I hope in the long run I get good results, learn how to make my code better, and be able to automate tasks in windows using scripts.

Preliminary Results

So I made some quick changes to my code to use templates.  I was hoping that I would see some performance increase, but it seems that it made my code worse.  Originally it took 89 seconds to find a tour of length 2073 for a complete asymmetric graph with 1400 cities; now it takes 121 seconds.  It seems that I need to work on it a little more and figure out where I can fix things.


Reading the documentation

Some background information may be of use.  My current work involves taking a working implementation of 2-opt and optimizing it the best I can.  For those who don’t know, 2-opt [wiki] is a heuristic for finding a solution to the Travelling Salesman Problem [wiki] (TSP).

The idea behind 2-opt is that you find two edges that cross and then reconnect them so they don’t cross.  The algorithm performs these swaps until no more swaps are possible.  One thing to note is that 2-opt does not guarantee an optimal solution.

2-opt edge swap


Now that I have explained what I am doing let me explain the totally unrelated error I had.

My algorithm needs some memory to keep track of the path and the edge weights along the path.  To do this I allocate memory at the top of the kernel and free it before I leave.  So far so good.  However, what happens if you don’t free the memory?  You get things like this:

2-opt results before and after


On the left is the original time to complete each problem.  On the right is the new time when I don’t free.  If you don’t know the correct answer then you will more than likely come to my conclusion as well: releasing memory is really, really slow for some reason.  Of course, since you are reading this, you can guess that is probably not what is happening.

Like an idiot I decided to let the whole world know I was an idiot by posting my question asking what was going on to Stack Overflow [link].  Within minutes I was informed I was an idiot and told that when allocating memory inside a CUDA kernel you have a default heap size of 8MB to allocate from.  Sure enough, section B.18.1 of the “CUDA C Programming Guide” (v6.0) states

The device memory heap has a fixed size that must be specified before any program using malloc() or free() is loaded into the context.  A default heap of eight megabytes is allocated if any program uses malloc without explicitly specifying the heap size.

Then it goes further to state I should use cudaDeviceSetLimit( cudaLimitMallocHeapSize , size_in_bytes) to update the heap size.

Once I allocated a large enough heap, my program ran as though I never made any optimizations, and now I am back to the beginning of what I was supposed to do. Next time I will probably read the documentation just a little better.


Working With Cuda Without Aspirin

I find that working with CUDA is overall a fun experience.  The only issue I seem to have is that sometimes I get very unexpected results.

For instance, let’s say I have this code below:

for(int k = threadIdx.x; k < TileSize; k += blockDim.x) {
    int index = front + k;
    if( index > lower_bound ) {
        shared_x[k] = glob[index].x;
        shared_y[k] = glob[index].y;
    }
}

If your blockDim.x is always less than or equal to TileSize (which in my case it was) then you can replace the above code with:

if(threadIdx.x < TileSize) {
    int index = front + threadIdx.x;
    if( index > lower_bound ) {
        shared_x[threadIdx.x] = glob[index].x;
        shared_y[threadIdx.x] = glob[index].y;
    }
}

Apparently not on my machine though.  When I make this change my kernel crashes with a memory access violation.  It cannot be the shared memory, because blockDim.x is always less than or equal to TileSize.  That leaves the global memory accesses.  But this cannot be, because any access made in the second version had to be made in the first!  It seems I will never be a great CUDA developer.
