Re: Webcam (V4L-V4L2) — Video for Linux

You can still use MMX and keep the code portable: compile the MMX path in
conditionally (generate it only when building for x86), and do a run-time check
to determine whether the x86 machine you are running on has MMX.  I don't have
that run-time code here, but I can probably find it at home if you really need
it; I believe you can also find it on the Intel developer site under the MMX
documentation, or perhaps other readers could help you.  MMX-based code will be
much faster: it is SIMD, so more pixels are done per instruction, and clamping
is "free" with MMX because it needs no branches.
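
For what it's worth, here is a minimal sketch of such a check.  It is not the
code I mentioned above; it assumes a GCC or Clang toolchain, whose <cpuid.h>
header provides __get_cpuid(), and the name cpu_has_mmx() is made up for
illustration.  MMX support is reported in bit 23 of EDX for CPUID leaf 1.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>      /* GCC/Clang helper header (toolchain assumption) */
#define BUILD_IS_X86 1
#else
#define BUILD_IS_X86 0
#endif

/* Hypothetical helper: returns 1 if the CPU reports MMX, 0 otherwise. */
int cpu_has_mmx(void)
{
#if BUILD_IS_X86
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* CPUID leaf 1 not available */
    return (int)((edx >> 23) & 1u);   /* CPUID.1:EDX bit 23 = MMX */
#else
    return 0;                   /* not x86: only the portable C path exists */
#endif
}
```

The caller would pick the MMX routine when this returns 1 and fall back to the
plain C routine otherwise, so non-x86 builds never see any MMX code.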


For non-MMX, my experience is that a lookup table "can't" be safely used for
YUV / YCbCr to RGB conversion.  I found that most video clips would be ok, but
occasionally one frame (or even one pixel) would yield a value that was way
off - mostly negative, if I remember.  In that special case you index outside
the table and probably crash, so lookup tables can't be used.  I don't know if
that is true for RGB to YCbCr/YUV conversion.
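
To get a feel for how far out of range the results can go, you can simply try
every 8-bit input.  The sketch below does that for one common fixed-point
conversion; the BT.601 video-range coefficients (298, 409, 100, 208, 516,
scaled by 256) and all the names are my assumptions for illustration, not
something taken from a particular codec.  It divides by 256 rather than
shifting so that the handling of negative values is well defined in C.

```c
/* Range of component values a conversion can produce. */
typedef struct { int lo, hi; } Range;

static void widen(Range *r, int v)
{
    if (v < r->lo) r->lo = v;
    if (v > r->hi) r->hi = v;
}

/* Hypothetical helper: brute-force every (Y, Cb, Cr) triple and record the
   smallest and largest R, G or B value before any clamping.
   Assumed coefficients: BT.601 video range, scaled by 256. */
Range ycbcr_to_rgb_range(void)
{
    Range r = { 0, 255 };

    for (int y = 0; y < 256; y++)
        for (int cb = 0; cb < 256; cb++)
            for (int cr = 0; cr < 256; cr++) {
                int c = y - 16, d = cb - 128, e = cr - 128;

                widen(&r, (298 * c + 409 * e + 128) / 256);            /* R */
                widen(&r, (298 * c - 100 * d - 208 * e + 128) / 256);  /* G */
                widen(&r, (298 * c + 516 * d + 128) / 256);            /* B */
            }
    return r;
}
```

The resulting lo and hi tell you how much room a clamp table would need on
each side of 0..255 for this particular formula; a different formula or input
range needs its own run.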

I stopped using lookup tables because I found a safe way to clamp that was about
as fast on the machines of the time, and is probably faster now: machines have
gotten faster, but memory and even cache haven't so much, so memory and cache
fetches cost relatively more.

For YCbCr to RGB, a clamp is usually not needed.  I measured it on a bunch of
MPEG-1 videos and I remember that clamps were less than 10% of the cases; well
under 5%, I think.  So the trick is to not branch unless there is a clamp.

The obvious way to clamp is:

    int value;
    if (value > max)
        value = max;
    else if (value < 0)
        value = 0;

Which generates the (pseudo) machine code (where the comparisons are signed):

    if (value <= max) goto notgreater;
    value = max;
    goto done;
notgreater:
    if (value >= 0) goto done;
    value = 0;
done:

That's two compares and at least one branch per clamp.  You can do much better
by using a single unsigned compare:

    int value;
    if ((unsigned int)value > max) {
        if (value < 0)
            value = 0;
        else value = max;
    }

which does only one compare to determine if a clamp is needed.  If it is, a
second compare figures out which clamp it is, but since clamps are relatively
rare it ends up being faster.  Note that this works because a negative value,
treated as an unsigned, will always be greater than the max positive value;
that's just two's-complement arithmetic.  (You wanted the code to be portable,
but I don't think you have to support ones'-complement machines :-)  It does
assume that max itself is not negative.  I won't bother to show the generated
code.

But note that done this way you always take a branch when you don't clamp, to
skip over the code that does the clamping.  So - and you purists out there are
going to hate this, because this is "assembler written in C"! - the fastest way
to clamp, when clamping is occasional, is:

    int R, G, B;
    /* Generate R, G, B here.  It may be better to generate before clamping to
       save registers, or to generate R, G, clamp R, generate B, clamp G, etc.
       so the clamp can be dual-issued with generation.  Move the code around
       and benchmark it; the compiler will probably move it for you anyway. */
    if ((unsigned int)R > maxR)
        goto clampR;        /* mostly not taken; no branch if no clamp */
clampRDone:
    if ((unsigned int)G > maxG)
        goto clampG;
clampGDone:
    if ((unsigned int)B > maxB)
        goto clampB;
clampBDone:

    /* Rest of code, e.g. pack to 565 and write.  BTW write 32 bits, on a
       32-bit boundary, if writing to the PCI bus! */
    return;    /* skip the clamping code below */

clampR:
    if (R < 0) R = 0; else R = maxR;
    goto clampRDone;
clampG:
    if (G < 0) G = 0; else G = maxG;
    goto clampGDone;
clampB:
    if (B < 0) B = 0; else B = maxB;
    goto clampBDone;

}    /* actual end of function */


Yes, it's ugly, but it is faster, at least on the hardware I tested it on.  I
would give it a try.
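
If you want something you can paste into a file and compile, here is the same
pattern as a complete function.  It is a sketch under my own assumptions: the
name pack_rgb565_clamped(), the single maximum of 255 for all three
components, and the 565 packing are for illustration, and whether it beats the
plain if/else on your machine is something only a benchmark can tell you.

```c
#include <stdint.h>

/* Hypothetical helper: clamp R, G, B to 0..255 and pack them as RGB565. */
uint16_t pack_rgb565_clamped(int R, int G, int B)
{
    const unsigned int maxAll = 255;

    if ((unsigned int)R > maxAll)
        goto clampR;            /* rarely taken */
clampRDone:
    if ((unsigned int)G > maxAll)
        goto clampG;
clampGDone:
    if ((unsigned int)B > maxAll)
        goto clampB;
clampBDone:
    /* 5 bits of red, 6 bits of green, 5 bits of blue. */
    return (uint16_t)(((R >> 3) << 11) | ((G >> 2) << 5) | (B >> 3));

    /* Out-of-line clamp code, reached only through the gotos above. */
clampR:
    R = (R < 0) ? 0 : (int)maxAll;
    goto clampRDone;
clampG:
    G = (G < 0) ? 0 : (int)maxAll;
    goto clampGDone;
clampB:
    B = (B < 0) ? 0 : (int)maxAll;
    goto clampBDone;
}
```

In the common case the function falls straight through the three compares to
the return; only an out-of-range component jumps to the code at the bottom and
back.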

In the above, "maxR", "maxG" and "maxB" may all be the same, i.e. 255.  You could
try to get it into a register with:

    unsigned int maxAll = 255;

Or you could try not tying up a register at all; here the branch is taken if the
upper 24 bits are nonzero, i.e. the number is negative or is > 255:

    if ((value >> 8) != 0)
        goto clampValue;

You would think this one wouldn't be better on x86, because it has to shift then
compare as two separate instructions.  But those instructions are shorter than
if you compare to literal "255", and you save a register, which you don't have
many of on x86.  I would give it a try and benchmark it.
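
As a self-contained sketch of the shift test (function names made up for
illustration): right-shifting a negative signed int is implementation-defined
in C, so this version casts to unsigned before shifting, which keeps the
"upper bits nonzero" meaning on any compiler.  That cast is my addition and is
not in the snippet above.

```c
/* Hypothetical helper: nonzero if value is outside 0..255, i.e. if any bit
   above the low 8 is set (this includes every negative value). */
int needs_clamp_u8(int value)
{
    return ((unsigned int)value >> 8) != 0;
}

/* Hypothetical helper: clamp to 0..255 using the shift test. */
int clamp_u8(int value)
{
    if (needs_clamp_u8(value))
        value = (value < 0) ? 0 : 255;
    return value;
}
```

Note this only works when the maximum is one less than a power of two, such as
255; for any other maximum you are back to the unsigned compare.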

Note that this clamping technique also works when clamping output from an IDCT -
but an IDCT is *much* faster using MMX.

Peter K



Benedict Bridgwater wrote:

> Justin Schoeman wrote:
> >
> > Benedict Bridgwater wrote:
> > >
> > > Justin Schoeman wrote: (regarding RGB/YUV conversion)
> > > >
> > > > I actually benchmarked these two versions a while ago, and it turns out
> > > > that on a reasonably new PC (K6/PII or up), fixed point arithmetic is
> > > > quite a bit faster than tables - It turns out that a fixed point
> > > > multiply is a lot quicker than a memory access (especially when taking
> > > > the heavy cache usage of image processing into account).  On older PCs
> > > > (Pentium/K5/etc) the table version is quicker.
> > >
> > > Justin, do you remember if this difference would show in a simple
> > > conversion loop benchmark, or did it have to be in context of an actual
> > > image conversion with corresponding cache usage?
> > >
> > > Ben
> >
> > I didn't test it in simple loops, but the difference will definitely be
> > more marked with real images (due to cache contention for the look-up
> > tables).  As a rough estimate, I would expect the two routines to come
> > out approximately equal on simple conversion loops (a L-1 cache hit is
> > about as expensive as an integer multiply  on newer CPUs).  One area
> > where tables can help is the final downscale and clamp operation.
> > Conditional jumps based on (relatively) random data can be very hard on
> > deeply pipelined CPUs, while a table lookup will usually hit L-1
> > cache...
>
> Thanks for the info - I'll redo my conversion code when I get a chance
> (I don't want to use MMX since I want to keep it portable). I was too
> lazy to use lookup for the clamping since I couldn't be bothered to
> figure out how much of a "guard band" was necessary outside of the legal
> (clamped) range. Is there a simple way to figure this other than
> experimentally?
>
> Ben