Linode Forum
Linode Community Forums
 FAQFAQ    SearchSearch    MembersMembers      Register Register 
 LoginLogin [ Anonymous ] 
Post new topic  Reply to topic
Author Message
PostPosted: Wed Sep 07, 2011 11:52 am 
Offline
Senior Member
User avatar

Joined: Sun Aug 10, 2008 11:26 am
Posts: 104
Location: ~$
I have two one-megabyte buffers that I need to get the bitwise AND of. It's fine to overwrite one of the buffers to store the result. Currently I'm doing

Code:
#include <stdint.h>
#include <malloc.h>

#define BUFSIZE 0x20000

void main(int argc, char ** argv) {
   uint64_t * buf1;
   uint64_t * buf2;
   
   buf1 = (uint64_t *) malloc(BUFSIZE * sizeof(uint64_t));
   buf2 = (uint64_t *) malloc(BUFSIZE * sizeof(uint64_t));
   
   // (initialization stuff)
   
   for(int i=0; i<BUFSIZE; i++)
      buf1[i] &= buf2[i];
      
   // the above loop takes about 800 microseconds
}

I need to do this hundreds of times in a row, so I care about the speed of each iteration. Is there any faster way to do this, like a memcpy() variant with bitwise operations or something to that effect?


Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 12:16 pm 
Offline
Senior Member

Joined: Fri Dec 07, 2007 1:37 am
Posts: 385
Location: NC, USA
Depending on the compiler optimizations that are enabled, this might be faster:
Code:
  uint64_t *p1;
  uint64_t *p2;

...

  p1 = buf1;
  p2 = buf2;

  for (i = 0; i < BUFSIZE; i++)
    *p1++ &= *p2++;

I also find that some machines can count down faster than counting up, because they can compare to zero more quickly than comparing to the loop limit, e.g.:
Code:
  i = BUFSIZE;
  while (i--)
    *p1++ &= *p2++;

edit: left out the & :oops:


Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 12:59 pm 
Offline
Senior Member
User avatar

Joined: Sun Aug 10, 2008 11:26 am
Posts: 104
Location: ~$
Cool suggestion about counting down to zero. I remember doing that on some FPGA for a class years ago.

Anyway, I did something ugly: I programmatically unrolled the entire loop, taking advantage of the fixed-size buffers. This produced 132K lines of C code, and a 6MB executable. It runs twice as fast!

Nobody said it has to be pretty. :D


Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 4:01 pm 
Offline
Senior Member
User avatar

Joined: Tue May 26, 2009 3:29 pm
Posts: 1691
Location: Montreal, QC
Aren't there some vector versions of this that might speed it up? PAND or something? I think that works on 64-bit chunks in vectors, dunno if the compiler is smart enough to use the MMX instructions for it.


Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 6:24 pm 
Offline
Senior Member
User avatar

Joined: Sun Aug 10, 2008 11:26 am
Posts: 104
Location: ~$
Guspaz wrote:
Aren't there some vector versions of this that might speed it up? PAND or something? I think that works on 64-bit chunks in vectors, dunno if the compiler is smart enough to use the MMX instructions for it.

Thanks, this is what I was looking for. The compiler won't do it automatically; looks like it requires using inline assembly to call the MMX instructions.

I bet CUDA or OpenCL could parallelize this like crazy. Too bad Linodes don't have a GPU!


Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 6:28 pm 
Offline
Senior Member

Joined: Mon Dec 07, 2009 6:46 am
Posts: 331
Stever wrote:
Depending on the compiler optimizations that are enabled, this might be faster:
Code:
  uint64_t *p1;
  uint64_t *p2;

...

  p1 = buf1;
  p2 = buf2;

  for (i = 0; i < BUFSIZE; i++)
    *p1++ &= *p2++;


(stupid comment on BUFSIZE vs 64-bit pointer step removed because I didn't read the code properly...)

Edit: I'm an idiot. Forget I said anything..... :oops:


Last edited by Azathoth on Wed Sep 07, 2011 6:38 pm, edited 2 times in total.

Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 6:30 pm 
Offline
Senior Member

Joined: Mon Dec 07, 2009 6:46 am
Posts: 331
funkytastic wrote:
... I programmatically unrolled the entire loop ...

-funroll-loops?

http://gcc.gnu.org/onlinedocs/gcc-4.5.3 ... ze-Options


Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 6:56 pm 
Offline
Senior Member
User avatar

Joined: Fri Oct 24, 2003 3:51 pm
Posts: 965
Location: Netherlands
funroll-loops.info

_________________
/ Peter


Top
   
 Post subject:
PostPosted: Wed Sep 07, 2011 7:56 pm 
Offline
Senior Member
User avatar

Joined: Sun Aug 10, 2008 11:26 am
Posts: 104
Location: ~$
Azathoth wrote:

Wow, that's so much better than the kludge I had going. It runs four times as fast as my pre-unrolled behemoth, and doesn't take 30 seconds for gcc to compile! Can't believe I didn't know about that. Thanks.

pclissold wrote:

How informative. :)


Top
   
Display posts from previous:  Sort by  
Post new topic  Reply to topic


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  
RSS

Powered by phpBB® Forum Software © phpBB Group