  From: Brian S. Julin <bri@tull.umassp.edu>
  To  : Andreas Beck <becka@rz.uni-duesseldorf.de>
  Date: Fri, 6 Aug 1999 00:44:28 -0400 (EDT)

Re: Accel command queues (was Re: Matrox GGI accelerator)

On Fri, 6 Aug 1999, Andreas Beck wrote:
> This is IMHO not a very good idea. It strongly depends on the card's
> protocol whether that makes sense. An abstract protocol might save space
> in the buffers, thus optimizing throughput.

Agreed.  My exposure to accels is the somewhat simple wd90cxx engine.
In this case, all that really needs to be screened is the DMA flags
before an "execute" command.  For a much more complex card, some abstraction
may actually simplify the kernel driver.  Let's just settle
on 1) making the kernel driver as simple as it can be, 2) without doing
anything that really hinders performance.

> > It was mentioned that 
> > having the KGI driver filter these accesses would be a hassle --
> > well the driver authors are free to make the filter as restrictive as 
> > they want, 
> 
> ... thus also making the driver complex, which is what they should help
> to avoid.

Actually, what I meant here was making the driver simpler by screening
out unused registers, so they don't have to be "validated".

> > Second, IMHO we need to have accel buffers work independent of 
> > the mechanisms they use (IOCTL, MMAP, MMAP+ping-pong, whatever.)
> 
> Yes. This is handled in the OS glue layer.

I have to admit that I haven't done a thorough reading of Steffen's
latest permedia driver; however, I haven't seen this part of the glue layer
implemented.  If it were, I'd get back to coding drivers :).  If
there's a header with accel queue management macros that abstract
the mechanism, point me to it.

> > Maybe I'm overly averse to wheel reinvention, but it seems to me
> > it would be very good if the drivers could just hook their
> > queue management onto one of these mechanisms through a simple
> > #define or something.
> 
> They shouldn't even know how the commands came in. They should just accept
> queue commands.

Exactly.

> > Also the queues need to be more bidirectional so that result
> > codes from the engine can be used in userspace when handy.
> 
> This is usually very difficult due to the asynchronous nature of queue
> execution. A "backward queue" with requested answer codes would help here.

An implementation like the following might be neat:

forward control queue -- this takes the extra data
that would get in the way of pumping the data into the
accel engine with a tight loop or DMA.

forward data queue -- this holds the data to put in the registers,
such that chunks can be DMA'd while the kernel driver is doing other
stuff, if that works out to be a speed gain.  Sometimes the userspace
side leaves pads in here for the kernel driver to fill, or the kernel
driver will play with the data a bit before sending it (e.g. clipping
on broken engines that would otherwise lock up).

This is not to say the two queues can't use the same ping-pong/IOCTL
buffer, or can't be interleaved in some chunkwise fashion in the same
stream, but conceptually there's the above separation.
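
Roughly, in C terms (just a sketch, none of these structures exist
anywhere yet and the names are made up):

    #define ACCEL_DATA_QUEUE_WORDS  1024

    /* Forward control queue entry: the bookkeeping that must not get
       in the way of a tight register-load loop or a DMA transfer. */
    struct accel_ctrl_entry {
            unsigned int    token;          /* e.g. command_semaphore,
                                               do_solid_box, end_block */
            unsigned int    data_offset;    /* where its operands start
                                               in the data queue       */
            unsigned int    data_words;     /* how many 32-bit words   */
    };

    /* Forward data queue: only the values destined for the registers,
       possibly with pads for the kernel driver to fill in or fix up. */
    struct accel_data_queue {
            unsigned int    head;
            unsigned int    tail;
            unsigned int    buf[ACCEL_DATA_QUEUE_WORDS];
    };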

(The following example may not make my point as convincingly as
it could; this is a rushed e-mail.)
Suppose the accelerator has consecutive registers like so, and takes
data by DMA given a starting index followed by values (i.e., it won't
take index values in the DMA data):

BLT_SRC         -- sensitive, must filter out requests for DMA
BLT_BGCOLOR
BLT_FGCOLOR
BLT_TEXTURE_SEL
BLT_X
BLT_Y
BLT_H
BLT_W
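
(In C terms the register block above might be declared like this -- a
sketch only; BLT_NUM_REGS is just added for bookkeeping:)

    /* Consecutive register indices, as the DMA engine would see them. */
    enum {
            BLT_SRC = 0,    /* sensitive, kernel must filter this one */
            BLT_BGCOLOR,
            BLT_FGCOLOR,
            BLT_TEXTURE_SEL,
            BLT_X,
            BLT_Y,
            BLT_H,
            BLT_W,
            BLT_NUM_REGS
    };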

This is what would appear in the queues, in human readable form:

Type of command:       control queue               data queue
-------------------------------------------------------------------------
abstract               command_semaphore           pad for BLT_SRC
                       do_solid_box                value for BLT_BGCOLOR
                                                   value for BLT_FGCOLOR
                                                   pad for BLT_TEXTURE_SEL
                                                   value for BLT_X
                                                   value for BLT_Y
                                                   "X2" value/pad for BLT_H
                                                   "Y2" value/pad for BLT_W

pretranslated          start_block_semaphore       pad for BLT_SRC
                         set_bgcolor               value for BLT_BGCOLOR
                         set_fgcolor               value for BLT_FGCOLOR
                         set_texture               value for BLT_TEXTURE_SEL
                         set_coords                value for BLT_X
                         do_box_fill               value for BLT_Y
                       end_block                   "X2" value/pad for BLT_H
                                                   "Y2" value/pad for BLT_W

incomplete command     incomplete_block_semaphore  [incomplete data]
                       [incomplete control data]


The "semaphores" (term used very loosly) get written in userspace last, 
and are invalidated by the kernel driver when the command is transfered.  
Mainly this is done because of Steffen's multiple-queue concept, so 
that the kernel driver is told when it can interleave commands from 
another queue, but also to reduce race condition problems under SMP.  
Other tokens like saving a state in kernel space and restoring it 
would be good.  In this example the driver author has decided that it's 
fastest to pass x1,y1,x2,y2 coords in the buffer and let the kernel 
clip and translate x2,y2 to w,h.
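
On the kernel side that fixup might look roughly like this (a sketch
only -- the function and its arguments are made up):

    /* Userspace packs x1,y1,x2,y2 (inclusive); the kernel clips against
       the visible area and rewrites the last two values as the width
       and height the engine actually wants. */
    static int fixup_box(unsigned int *x1, unsigned int *y1,
                         unsigned int *x2, unsigned int *y2,
                         unsigned int xres, unsigned int yres)
    {
            if (*x1 >= xres || *y1 >= yres || *x2 < *x1 || *y2 < *y1)
                    return -1;              /* fully clipped or bogus, drop */
            if (*x2 >= xres)
                    *x2 = xres - 1;
            if (*y2 >= yres)
                    *y2 = yres - 1;
            *x2 = *x2 - *x1 + 1;            /* becomes the width  */
            *y2 = *y2 - *y1 + 1;            /* becomes the height */
            return 0;
    }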

If the chipset takes register indexes along with the data, register indexes
can appear in the data queue.  Basically, whatever's fastest; for the most
part that format is under the driver author's control.  All the OS glue
layer would do is implement some macros for putting commands in the queue
in userspace, e.g. put_accel_cmd(queue, len, buf), and for getting the
next complete command in kernel space, get_accel_cmd(&buf, &len).  Oh,
and of course macros/functions to find/create queues in the first place
and to force a flush.  The kernel driver would have to know whether it
had access to DMA'able memory to decide what method to use when
implementing the register load, of course.
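
In prototype form, roughly (these are just the names I made up above;
nothing of this exists in the glue layer yet):

    #include <stddef.h>

    struct accel_queue;     /* opaque handle handed out by the glue layer */

    /* Userspace: append one complete command; the glue layer hides
       whether it lands in a mmap'ed buffer, a ping-pong buffer, or is
       shipped down by ioctl. */
    int put_accel_cmd(struct accel_queue *queue, size_t len, const void *buf);

    /* Kernel: fetch the next *complete* command, however it arrived. */
    int get_accel_cmd(void **buf, size_t *len);

    /* Plus something to find/create queues and to force a flush. */
    struct accel_queue *accel_open_queue(int fd, unsigned int which);
    int accel_flush_queue(struct accel_queue *queue);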

Then there would be a backward queue for results, where useful.
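
For instance, entries along these lines (again just a sketch):

    /* The kernel driver posts one of these to the backward queue when a
       forward command that asked for an answer completes. */
    struct accel_result {
            unsigned int    cmd_serial;     /* which forward command     */
            unsigned int    status;         /* engine status / error     */
            unsigned int    value;          /* e.g. a read-back register */
    };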

Anyway, I should reread that metalanguage stuff in the new KGI until I can
trace the flow and understand how it works.  Maybe it's enough
to work with already.

--
Brian

