Hey all, as you may know from recent history, I have been capturing using the v4l2 streaming interface and displaying to the screen using a number of different methods. Here are some results (rough code sketches of all three paths are at the end of this mail)...

OpenGL display system: I used (basically) a glClear, then a glRasterPos, then a glDrawPixels to display to the screen. On the capture side I used the RGB32 (or BGR32, I'm not sure which) pixel format. I had to offset the buffer I handed to glDrawPixels by 1 byte (I'll send you the code if you don't believe me, but it's designed to work with a bttv card); after that I could draw to the screen using GL_RGBA as the pixel format. This used 50-60% processor, as reported by top.

Xvideo display system: basically, I used the standard extensions, copied the UYVY buffer from the capture stream into an XVideo shared-memory buffer, and displayed that. Processor usage: 55-65%.

XDGA display system: I set the display into a depth-24, 32-bits-per-pixel mode at 1024x768. Then I could transfer the capture buffer (with the capture device set to BGR32, as best I remember) straight to the hardware framebuffer. For that transfer I used a simple (unrolled) for loop, such as outLong[i] = inLong[i]. This used 30-40% processor.

I found out that on my machine (a PII-400), memcpy was the slower way to copy each scanline (each scanline is 4*640 = 2560 bytes, so not a trivial amount); array indexing was FASTER. Odd, huh? Any ideas? I am also not sure there isn't a faster way to transfer the memory from here to there, but I don't know how to initiate a string of DMA transfers from userspace memory to the framebuffer. Any ideas would be nice.

Chris
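
P.S. In case it helps anyone reproduce these numbers, here are rough sketches of the three paths. None of this is my exact code; buffer names and all setup are placeholders.

The GL path, assuming a current OpenGL context (say, from GLUT) and a 640x480 capture frame. The raster-position/pixel-zoom calls are the usual way to anchor glDrawPixels at the top-left; the +1 byte is the bttv alignment hack described above:

    #include <GL/gl.h>

    #define CAP_W 640
    #define CAP_H 480

    /* frame_buf is a placeholder for the mmap'ed v4l2 capture buffer. */
    void draw_frame(const unsigned char *frame_buf)
    {
        glClear(GL_COLOR_BUFFER_BIT);
        glRasterPos2f(-1.0f, 1.0f);   /* top-left corner (identity transforms) */
        glPixelZoom(1.0f, -1.0f);     /* capture rows are top-down; GL draws bottom-up */
        /* +1 byte shifts the BGR32 channels so GL_RGBA reads them the way I
         * wanted on this card; the mmap'ed capture buffer is bigger than one
         * frame, so the one-byte over-read at the end is harmless here. */
        glDrawPixels(CAP_W, CAP_H, GL_RGBA, GL_UNSIGNED_BYTE, frame_buf + 1);
        /* caller swaps buffers / glFlush()es afterwards */
    }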
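
The Xvideo path, assuming dpy, win, and gc are already set up and xv_port was found earlier (XvQueryAdaptors and friends); UYVY's FourCC is 0x59565955:

    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>
    #include <X11/extensions/Xvlib.h>

    #define FOURCC_UYVY 0x59565955
    #define CAP_W 640
    #define CAP_H 480

    void show_uyvy_frame(Display *dpy, Window win, GC gc, XvPortID xv_port,
                         const unsigned char *capture_buf)
    {
        static XvImage *image;
        static XShmSegmentInfo shminfo;

        if (!image) {   /* one-time setup; cleanup (XShmDetach/shmctl) omitted */
            image = XvShmCreateImage(dpy, xv_port, FOURCC_UYVY, NULL,
                                     CAP_W, CAP_H, &shminfo);
            shminfo.shmid = shmget(IPC_PRIVATE, image->data_size,
                                   IPC_CREAT | 0777);
            shminfo.shmaddr = image->data = shmat(shminfo.shmid, NULL, 0);
            shminfo.readOnly = False;
            XShmAttach(dpy, &shminfo);
        }
        /* UYVY is 2 bytes/pixel, so one frame is CAP_W*CAP_H*2 bytes */
        memcpy(image->data, capture_buf, CAP_W * CAP_H * 2);
        XvShmPutImage(dpy, xv_port, win, gc, image,
                      0, 0, CAP_W, CAP_H,    /* source rectangle      */
                      0, 0, CAP_W, CAP_H,    /* destination rectangle */
                      False);
        XSync(dpy, False);
    }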
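
And the DGA scanline copy the 30-40% figure came from; fb_line and cap_line are placeholders for a scanline in the mapped framebuffer and the matching scanline of the capture buffer:

    #include <stdint.h>

    #define LINE_WORDS 640   /* 640 pixels * 4 bytes/pixel = 640 32-bit words */

    /* 4-way unrolled word-at-a-time copy; on my PII-400 this beat memcpy()
     * for the same 2560-byte scanline. */
    void copy_scanline(volatile uint32_t *fb_line, const uint32_t *cap_line)
    {
        int i;
        for (i = 0; i < LINE_WORDS; i += 4) {
            fb_line[i]     = cap_line[i];
            fb_line[i + 1] = cap_line[i + 1];
            fb_line[i + 2] = cap_line[i + 2];
            fb_line[i + 3] = cap_line[i + 3];
        }
    }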