KivEnt Rendering Speed Improved 38X

Fri 05 June 2015

One of the main goals of the entire KivEnt framework is to allow you to write tight update loops for your game systems that avoid touching python objects as much as possible, while still maintaining a fairly pythonic approach to writing your update code. One of the advantages of cython here is that we can freely use Python if we want, or if we are perhaps transitioning from pure python code.

However, there is a cost to making use of Python objects in your code and it is larger than I expected. For a while now, there has been a small section of code in the Renderer.update function that indexes into a python list per entity. It has been on my list to replace with something that doesn't use python, but I was waiting to redo the way we were handling model data entirely.

Anyway, I finally got around to making the changes and I am astounded by the performance increase I have observed. Using the 3_creating_a_gamesystem example from KivEnt and drawing 30,000 entities on my computer I get the following profiling results:

Old Method:

Runtime: 12.260 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      161    6.177    0.038    6.180    0.038 kivent_core\systems\renderers.pyx:521(update)
      161    1.701    0.011    7.882    0.049 kivent_core\gameworld.pyx:421(update)
    30000    0.618    0.000    1.295    0.000 kivent_core\gameworld.pyx:346(init_entity)

As you can see the update function with 30,000 entities is taking up the majority of the update time for the entire frame, and taking much too long to be running at 60 fps (entire game must update in under < .016 ms).

New Method:

Runtime: 10.595 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      497    1.553    0.003    2.007    0.004 kivent_core\gameworld.pyx:436(update)
    30000    0.636    0.000    1.364    0.000 kivent_core\gameworld.pyx:361(init_entity)
      497    0.443    0.001    0.452    0.001 kivent_core\systems\renderers.pyx:492(update)

Huge improvement! We are now capable of easily running at 60 fps, and are only using 25% of the frametime available. This leaves a lot of time for performing more game logic. So what is going on in this particular case?

The Implementation

Let's take a look at the old update function, the key line to look at is Line 43 where we index into the meshes list using the index of the model stored in the RenderStruct. If you would like an indepth introduction to how the update function of Renderer works read here

Old code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
def update(self, dt):
    cdef IndexedBatch batch
    cdef list batches
    cdef unsigned int batch_key
    cdef unsigned int index_offset, vert_offset
    cdef RenderStruct* render_comp
    cdef PositionStruct2D* pos_comp
    cdef VertexFormat4F* frame_data
    cdef GLushort* frame_indices
    cdef VertMesh vert_mesh
    cdef float* mesh_data
    cdef VertexFormat4F* vertex
    cdef unsigned short* mesh_indices
    cdef unsigned int used, i, real_index, component_count

    cdef ComponentPointerAggregator entity_components
    cdef int attribute_count = self.attribute_count
    cdef BatchManager batch_manager = self.batch_manager
    cdef dict batch_groups = batch_manager.batch_groups
    cdef list meshes = model_manager.meshes
    cdef CMesh mesh_instruction
    cdef MemoryBlock components_block
    cdef void** component_data

    for batch_key in batch_groups:
        batches = batch_groups[batch_key]
        for batch in batches:

            entity_components = batch.entity_components
            components_block = entity_components.memory_block
            used = components_block.used_count
            component_count = entity_components.count
            component_data = <void**>components_block.data
            frame_data = <VertexFormat4F*>batch.get_vbo_frame_to_draw()
            frame_indices = <GLushort*>batch.get_indices_frame_to_draw()
            index_offset = 0
            for i in range(used):
                real_index = i * component_count
                if component_data[real_index] == NULL:
                    continue
                render_comp = <RenderStruct*>component_data[real_index+0]
                vert_offset = render_comp.vert_index
                vert_mesh = meshes[render_comp.vert_index_key]
                vertex_count = vert_mesh._vert_count
                index_count = vert_mesh._index_count
                if render_comp.render:
                    pos_comp = <PositionStruct2D*>component_data[
                        real_index+1]
                    mesh_data = vert_mesh._data
                    mesh_indices = vert_mesh._indices
                    for i in range(index_count):
                        frame_indices[i+index_offset] = (
                            mesh_indices[i] + vert_offset)
                    for n in range(vertex_count):
                        vertex = &frame_data[n + vert_offset]
                        vertex.pos[0] = pos_comp.x + mesh_data[
                            n*attribute_count]
                        vertex.pos[1] = pos_comp.y + mesh_data[
                            n*attribute_count+1]
                        vertex.uvs[0] = mesh_data[n*attribute_count+2]
                        vertex.uvs[1] = mesh_data[n*attribute_count+3]
                    index_offset += index_count
            batch.set_index_count_for_frame(index_offset)
            mesh_instruction = batch.mesh_instruction
            mesh_instruction.flag_update()

Our code essentially loops through every batch, looking at each entity in the batch, looks up the model for that entity as well as some other game data and combines this data, writing out to the actual frame_data and frame_indices arrays.

The new code eschews with the list lookup entirely by storing a pointer to the actual model object. We can then cast this pointer directly to the object, avoiding the cost of looking into the list at all, as in Line 40 here:

New code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
def update(self, dt):
    cdef IndexedBatch batch
    cdef list batches
    cdef unsigned int batch_key
    cdef unsigned int index_offset, vert_offset
    cdef RenderStruct* render_comp
    cdef PositionStruct2D* pos_comp
    cdef VertexFormat4F* frame_data
    cdef GLushort* frame_indices
    cdef VertexFormat4F* vertex
    cdef VertexModel model
    cdef GLushort* model_indices
    cdef VertexFormat4F* model_vertices
    cdef VertexFormat4F model_vertex
    cdef unsigned int used, i, real_index, component_count, n
    cdef ComponentPointerAggregator entity_components
    cdef BatchManager batch_manager = self.batch_manager
    cdef dict batch_groups = batch_manager.batch_groups
    cdef CMesh mesh_instruction
    cdef MemoryBlock components_block
    cdef void** component_data

    for batch_key in batch_groups:
        batches = batch_groups[batch_key]
        for batch in batches:
            entity_components = batch.entity_components
            components_block = entity_components.memory_block
            used = components_block.used_count
            component_count = entity_components.count
            component_data = <void**>components_block.data
            frame_data = <VertexFormat4F*>batch.get_vbo_frame_to_draw()
            frame_indices = <GLushort*>batch.get_indices_frame_to_draw()
            index_offset = 0
            for i in range(used):
                real_index = i * component_count
                if component_data[real_index] == NULL:
                    continue
                render_comp = <RenderStruct*>component_data[real_index+0]
                vert_offset = render_comp.vert_index
                model = <VertexModel>render_comp.model
                if render_comp.render:
                    pos_comp = <PositionStruct2D*>component_data[
                        real_index+1]
                    model_vertices = <VertexFormat4F*>(
                        model.vertices_block.data)
                    model_indices = <GLushort*>model.indices_block.data
                    for i in range(model._index_count):
                        frame_indices[i+index_offset] = (
                            model_indices[i] + vert_offset)
                    for n in range(model._vertex_count):
                        vertex = &frame_data[n + vert_offset]
                        model_vertex = model_vertices[n]
                        vertex.pos[0] = pos_comp.x + model_vertex.pos[0]
                        vertex.pos[1] = pos_comp.y + model_vertex.pos[1]
                        vertex.uvs[0] = model_vertex.uvs[0]
                        vertex.uvs[1] = model_vertex.uvs[1]
                    index_offset += model._index_count
            batch.set_index_count_for_frame(index_offset)
            mesh_instruction = batch.mesh_instruction
            mesh_instruction.flag_update()

The astute reader will now call me out on the fact that more than just the list lookup has changed. This is True, and I haven't tested the exact impact of just the list lookup to pointer change, however the other changes are mostly just changing the type of data fetched as a result of expanding the types of data we support. In this particular test case, the change is mostly semantic, we still have a 4 float array (albeit stored as GLfloats this time), taking up the exact same amount of memory and everything.

The New Model Format

In the first code, our mesh is represented by an array of floats to hold the vertex data and an array of unsigned shorts to hold the index data. This comes from the model format used internally by Kivy, where all geometry is represented by a 2 float pos and a 2 float uvs attribute. One of my longstanding goals for KivEnt was to expand the type of supported geometry to cover more of the types GL allows us.

Since Kivy and KivEnt are purely OpenGL, we can actually store our model data directly in GL types, as we will always be using those types by the time we are submitting our vertex data for the frame. Kivy uses pure floats because it makes it fairly easy to wrap in python as you just need to know the index of your attribute in a float array. However, even if Kivy did support other types (it doesn't the support was never completed), we would be forced to separate data of different types into completely different VBOs. Whereas, it is most efficient for vertex data to be interleaved together. This may involve assuming a small cost in padding if array elements need it, but in generally this padding should be considered helpful as it better aligns the data for the processor to read.

KivEnt already has everything in place to use a struct containing arbitrary GL types for rendering, and I was hoping to create a python object that could arbitrary wrap any of the vertex formats we create for rendering, as KivEnt is designed to let you rapidly introduce new rendering behavior, and we actually already have the information to do so since we need it to describe our vertex structure to GL.

For instance, our most basic vertex format mirrors Kivy's default:

ctypedef struct VertexFormat4F:
    GLfloat[2] pos
    GLfloat[2] uvs

Which is described by the list of tuples:

vertex_format_4f = [
    (b'pos', 2, b'float', 0), 
    (b'uvs', 2, b'float', 8),
]

The last value is the offset in the struct of the start of the attribute, and is important as we must call glVertexAttribPointer with an invocation like this:

glVertexAttribPointer(attr.index, attr.size, attr.type,
                GL_FALSE, <GLsizei>vbytesize, 
                <GLvoid*><long>offset)

The tuple in the format holds most of this data, the name is used to fetch the attr.index (the actual index is determined by GL), the second entry is the attr.size, the third is the attr.type, and the final is the offset. The vbytesize is the sizeof of the entire struct, the GL_FALSE specifies whether the data should be normalized.

Since we have this data we can use it to arbitrary wrap any of our vertex formats, which is implemented in the new rendering.model.Vertex and rendering.model.VertexModel classes that will be replacing the old VertMesh that stored its vertex data as floats.

First we transform our list of tuples into a dict for ease of use:

vertex_format_4f = [
    (b'pos', 2, b'float', 0), 
    (b'uvs', 2, b'float', 8),
]

becomes:

format_dict = {
    b'pos': (2, b'float', 0),
    b'uvs': (2, b'float', 8),
    }

Now we use this dict as the basis for a new __getattr__ and __setattr__ for our Vertex class. We need to support types of: GLfloat, GLint, GLuint, GLshort, GLushort, GLbyte, and GLubyte. Each vertex gets passed a pointer to the start of its struct, and then casts that pointer as a char* so that we can index by bytes. We can now use the offset contained in our vertex format in order to fetch the actual location in memory of the attribute of our struct. We cast this data to the appropriate type and then either retrieve the values or set them to the new values.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
cdef class Vertex:

    def __cinit__(self, dict format):
        self.vertex_format = format

    def __getattr__(self, name):
        cdef int count
        cdef unsigned int offset
        cdef bytes attr_type
        cdef char* data = <char*>self.vertex_pointer
        cdef GLfloat* f_data
        cdef GLint* i_data
        cdef GLuint* ui_data
        cdef GLshort* s_data
        cdef GLushort* us_data
        cdef GLbyte* b_data
        cdef GLubyte* ub_data
        if isinstance(name, unicode):
            name = bytes(name, 'utf-8')
        if name in self.vertex_format:
            attribute_tuple = self.vertex_format[name]
            count = attribute_tuple[0]
            attr_type = attribute_tuple[1]
            offset = attribute_tuple[2]
            if attr_type == b'float':
                f_data = <GLfloat*>&data[offset]
                ret = [<float>f_data[x] for x in range(count)]
            elif attr_type == b'int':
                i_data = <GLint*>&data[offset]
                ret = [<int>i_data[x] for x in range(count)]
            elif attr_type == b'uint':
                ui_data = <GLuint*>&data[offset]
                ret = [<unsigned int>ui_data[x] for x in range(count)]
            elif attr_type == b'short':
                s_data = <GLshort*>&data[offset]
                ret = [<short>s_data[x] for x in range(count)]
            elif attr_type == b'ushort':
                us_data = <GLushort*>&data[offset]
                ret = [<unsigned short>us_data[x] for x in range(count)]
            elif attr_type == b'byte':
                b_data = <GLbyte*>&data[offset]
                ret = [<char>b_data[x] for x in range(count)]
            elif attr_type == b'ubyte':
                ub_data = <GLubyte*>&data[offset]
                ret = [<unsigned char>ub_data[x] for x in range(count)]
            else:
                raise TypeError()
            if len(ret) == 1:
                return ret[0]
            else:
                return ret
        else:
            raise AttributeError()

    def __setattr__(self, name, value):
        cdef int count
        cdef unsigned int offset
        cdef bytes attr_type
        cdef char* data = <char*>self.vertex_pointer
        cdef GLfloat* f_data
        cdef GLint* i_data
        cdef GLuint* ui_data
        cdef GLshort* s_data
        cdef GLushort* us_data
        cdef GLbyte* b_data
        cdef GLubyte* ub_data
        if isinstance(value, tuple):
            value = list(value)
        if not isinstance(value, list):
            value = [value]
        if isinstance(name, unicode):
            name = bytes(name, 'utf-8')
        if name in self.vertex_format:
            attribute_tuple = self.vertex_format[name]
            count = attribute_tuple[0]
            attr_type = attribute_tuple[1]
            offset = attribute_tuple[2]
            if len(value) != count:
                raise AttributeCountError('Expected list of length {count} got'
                    'list of size {length}'.format(count=count, 
                    length=len(value)))
            for x in range(count):
                if attr_type == b'float':
                    f_data = <GLfloat*>&data[offset + x*sizeof(GLfloat)]
                    f_data[0] = <GLfloat>value[x]
                elif attr_type == b'int':
                    i_data = <GLint*>&data[offset + x*sizeof(GLint)]
                    i_data[0] = <GLint>value[x]
                elif attr_type == b'uint':
                    ui_data = <GLuint*>&data[offset + x*sizeof(GLuint)]
                    ui_data[0] = <GLuint>value[x]
                elif attr_type == b'short':
                    s_data = <GLshort*>&data[offset + x*sizeof(GLshort)]
                    s_data[0] = <GLshort>value[x]
                elif attr_type == b'ushort':
                    us_data = <GLushort*>&data[offset + x*sizeof(GLushort)]
                    us_data[0] = <GLushort>value[x]
                elif attr_type == b'byte':
                    b_data = <GLbyte*>&data[offset + x*sizeof(GLbyte)]
                    b_data[0] = <GLbyte>value[x]
                elif attr_type == b'ubyte':
                    ub_data = <GLubyte*>&data[offset + x*sizeof(GLubyte)]
                    ub_data[0] = <GLubyte>value[x]
                else:
                    raise TypeError()
        else:
            raise AttributeError()

Vertex forms the basis of the new VertexModel, which also handles allocating its memory using KivEnt's MemoryBlock system, instead of mallocing all over the place. This ensures that all models of the same format are stored contiguously, hopefully improving cache miss rate for our Renderers as they will be pulling models from the same area of memory. The other changes in the update function are a result of switching from the float array to the array of arbitrary structs consisting of GLtypes.

You can expect to see these changes in KivEnt 2.1.0, hopefully releasing in the next few weeks. If you want to preview these changes you can visit the new_model_format branch on github.

blogroll