You can still use MMX and keep the code portable: conditionally compile
the MMX code in based on the platform (generate it only when compiling
for x86), and do a run-time check to determine whether the x86 machine
you are running on has MMX. I don't have that run-time code here, but I
can probably find it at home if you really need it. I believe you can
also find it on the Intel developer site under the MMX documentation,
or perhaps other readers could help you. MMX-based code will be much
faster: it is SIMD, so more pixels are done per instruction, and
clamping is "free" with MMX (saturating arithmetic, no branches).

For non-MMX, my experience is that a lookup table "can't" be safely
used for YUV / YCbCr to RGB conversion. I found that most video clips
would be OK, but occasionally one frame (or even one pixel) would yield
a value that was way off - mostly negative, if I remember. Since you
then reference outside the table, and probably crash, lookup tables
can't be used. I don't know if that is true for RGB to YCbCr/YUV
conversion. I stopped using lookup tables because I found a safe way to
clamp that was about as fast on the machines of the time, and is
probably faster now: CPUs have gotten faster, but memory and even cache
haven't kept pace, so memory/cache fetches cost relatively more.

For YCbCr to RGB, the clamp is generally not taken. I measured it on a
bunch of MPEG-1 videos and I remember that clamps happened in less than
10% of the cases; much less than 5%, I think. So the trick is to not
branch unless there is a clamp.

The obvious way to clamp is:

    int value;

    if (value > max)
        value = max;
    else if (value < 0)
        value = 0;

which generates the (pseudo) machine code (where the comparisons are
signed):

        if (value <= max) goto notgreater;
        value = max;
        goto done;
    notgreater:
        if (value >= 0) goto done;
        value = 0;
    done:

That's two compares and at least one branch per clamp.

You can do much better by using a single unsigned compare:

    int value;

    if ((unsigned int)value > max) {
        if (value < 0)
            value = 0;
        else
            value = max;
    }

which does only one compare to determine whether a clamp is needed. If
it is, a second compare figures out which clamp it is (two compares in
total), but since clamps are relatively rare it ends up being faster.
Note that this works because a negative value, treated as unsigned,
will always be greater than the max positive value; that's just
twos-complement arithmetic. (You wanted the code to be portable, but I
don't think you have to support ones-complement machines :-)

I won't bother to show the generated code. But note that this way you
are always taking a branch when you don't clamp, to skip over the code
that does the clamping. So - and you purists out there are going to
hate this, because this is "assembler written in C"! - the fastest way
to clamp, when clamping is occasional, is:

    /* ... start of function ... */
    int R, G, B;

    /* Generate R, G, B here. It may be better to generate all three
       before clamping to save registers, or to generate R, G, clamp R,
       generate B, clamp G, etc. so a clamp can be dual-issued with
       generation. Move the code around and benchmark it; the compiler
       will probably move it for you anyway. */

        if ((unsigned int)R > maxR) goto clampR;  /* mostly not taken;
                                                     no branch if no clamp */
    clampRDone:
        if ((unsigned int)G > maxG) goto clampG;
    clampGDone:
        if ((unsigned int)B > maxB) goto clampB;
    clampBDone:
        /* Rest of code, e.g. pack to 565 and write. BTW write 32 bits,
           on a 32-bit boundary, if writing to the PCI bus! */
        return;    /* skip the clamping code below */

    clampR:
        if (R < 0) R = 0; else R = maxR;
        goto clampRDone;
    clampG:
        if (G < 0) G = 0; else G = maxG;
        goto clampGDone;
    clampB:
        if (B < 0) B = 0; else B = maxB;
        goto clampBDone;
    }   /* actual end of function */

Yes, it's ugly, but it is faster, at least on the hardware I tested it
on. I would give it a try. In the above, "maxR", "maxG" and "maxB" may
all be the same, i.e. 255.
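
For anyone who wants to check the single-unsigned-compare clamp in
isolation before touching real conversion code, here is a minimal
sketch. The function names are made up for illustration (they are not
from any library), and it assumes max is non-negative; treat it as a
starting point for your own benchmark, not as tuned code.

```c
#include <assert.h>

/* Hypothetical helper names, for illustration only. */

/* Reference version: up to two signed compares on every call. */
static int clamp_naive(int value, int max)
{
    if (value > max)
        return max;
    if (value < 0)
        return 0;
    return value;
}

/* One unsigned compare on the common (no clamp) path.
   Assumption: max >= 0. A negative value converts to a very large
   unsigned number, so a single test catches both "too big" and
   "negative"; only then do we look at the sign. */
static int clamp_unsigned(int value, int max)
{
    assert(max >= 0);
    if ((unsigned int)value > (unsigned int)max)
        return (value < 0) ? 0 : max;
    return value;
}

/* Compare the two versions over a range wider than 0..max and
   return the number of inputs on which they disagree. */
static int count_clamp_mismatches(int max)
{
    int mismatches = 0;
    for (int v = -1024; v <= 1024; v++) {
        if (clamp_naive(v, max) != clamp_unsigned(v, max))
            mismatches++;
    }
    return mismatches;
}
```

Calling count_clamp_mismatches(255) is a quick sanity check that both
versions agree before you start timing them against each other.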

You could try to get that 255 limit into a register with:

    unsigned int maxAll = 255;

Or you could try not tying up a register at all, with a test whose
branch is taken if the upper 24 bits are nonzero, i.e. the number is
negative or is > 255:

    if ((value >> 8) != 0) goto clampValue;

You would think this one wouldn't be better on x86, because it has to
shift and then compare as two separate instructions. But those
instructions are shorter than a compare against the literal "255", and
you save a register, which you don't have many of on x86. I would give
it a try and benchmark it.

Note that this clamping technique also works when clamping output from
an IDCT - but an IDCT is *much* faster using MMX.

Peter K

Benedict Bridgwater wrote:
> Justin Schoeman wrote:
> >
> > Benedict Bridgwater wrote:
> > >
> > > Justin Schoeman wrote: (regarding RGB/YUV conversion)
> > > >
> > > > I actually benchmarked these two versions a while ago, and it turns out
> > > > that on a reasonably new PC (K6/PII or up), fixed point arithmetic is
> > > > quite a bit faster than tables - It turns out that a fixed point
> > > > multiply is a lot quicker than a memory access (especially when taking
> > > > the heavy cache usage of image processing into account). On older PCs
> > > > (Pentium/K5/etc) the table version is quicker.
> > >
> > > Justin, do you remember if this difference would show in a simple
> > > conversion loop benchmark, or did it have to be in context of an actual
> > > image conversion with corresponding cache usage?
> > >
> > > Ben
> >
> > I didn't test it in simple loops, but the difference will definitely be
> > more marked with real images (due to cache contention for the look-up
> > tables). As a rough estimate, I would expect the two routines to come
> > out approximately equal on simple conversion loops (a L-1 cache hit is
> > about as expensive as an integer multiply on newer CPUs). One area
> > where tables can help is the final downscale and clamp operation.
> > Conditional jumps based on (relatively) random data can be very hard on
> > deeply pipelined CPUs, while a table lookup will usually hit L-1
> > cache...
>
> Thanks for the info - I'll redo my conversion code when I get a chance
> (I don't want to use MMX since I want to keep it portable). I was too
> lazy to use lookup for the clamping since I couldn't be bothered to
> figure out how much of a "guard band" was necessary outside of the legal
> (clamped) range. Is there a simple way to figure this other than
> experimentally?
>
> Ben
>
> _______________________________________________
> Video4linux-list mailing list
> Video4linux-list@xxxxxxxxxx
> https://listman.redhat.com/mailman/listinfo/video4linux-list