2008年10月23日星期四

(转)如何加速for loop? (C/C++)

Introduction
原题如下
有一个for循环,从0加到100,可是我觉得它不够快,要怎样才能让她更快呢?(不可以用数学公式)

for(i = 0; i <= 100; i++)
s = s + i;


我能想到的解法
1.调compiler参数,vc++和gcc都有-o可以调,可以optimize for speed,是最懒的方式。

2.善用多核心:

/*
(C) OOMusou 2008 http://oomusou.cnblogs.com

Filename : parallel_for_100.c
Compiler : Visual C++ 8.0
Description : Demo how to parallel sum 0 to 100
Release : 03/14/2008 1.0
*/
#include <stdio.h>
#include "omp.h"

int main() {
int i;
int s = 0;

printf("Number of core : %d\n", omp_get_num_procs());
printf("Number of threads : %d\n", omp_get_num_threads());

#pragma omp parallel for
for(i = 0; i <= 100; ++i) {
printf("thread %d : %d\n", omp_get_thread_num(), s = s + i);
}
printf("%d\n", s);
}

行结果

mber of core : 2
mber of threads : 1
read 0 : 0
read 1 : 51
read 0 : 52
read 0 : 106
read 1 : 104
read 0 : 109
read 1 : 162
read 0 : 166
read 1 : 220
read 0 : 225
read 1 : 280
read 0 : 286
read 1 : 342
read 0 : 349
read 1 : 406
read 0 : 414
read 1 : 472
read 0 : 481
read 1 : 540
read 0 : 550
read 1 : 610
read 0 : 621
read 1 : 682
read 0 : 694
read 1 : 756
read 0 : 769
read 1 : 832
read 0 : 846
read 1 : 910
read 0 : 925
read 1 : 990
read 0 : 1006
read 1 : 1072
read 0 : 1089
read 1 : 1156
read 0 : 1174
read 1 : 1242
read 0 : 1261
read 1 : 1330
read 0 : 1350
read 1 : 1420
read 0 : 1441
read 1 : 1512
read 0 : 1534
read 1 : 1606
read 0 : 1629
read 1 : 1702
read 0 : 1726
read 1 : 1800
read 0 : 1825
read 1 : 1900
read 0 : 1926
read 1 : 2002
read 0 : 2029
read 1 : 2106
read 0 : 2134
read 1 : 2212
read 0 : 2241
read 1 : 2320
read 0 : 2350
read 1 : 2430
read 0 : 2461
read 1 : 2542
read 0 : 2574
read 1 : 2656
read 0 : 2689
read 1 : 2772
read 0 : 2806
read 1 : 2890
read 0 : 2925
read 1 : 3010
read 0 : 3046
read 1 : 3132
read 0 : 3169
read 1 : 3256
read 0 : 3294
read 1 : 3382
read 0 : 3421
read 1 : 3510
read 0 : 3550
read 1 : 3640
read 0 : 3681
read 1 : 3772
read 0 : 3814
read 1 : 3906
read 0 : 3949
read 1 : 4042
read 0 : 4086
read 1 : 4180
read 0 : 4225
read 1 : 4320
read 0 : 4366
read 1 : 4462
read 0 : 4509
read 1 : 4606
read 0 : 4654
read 1 : 4752
read 0 : 4801
read 1 : 4900
read 0 : 4950
read 1 : 5050
50

以上结果可以发现,thread 0由0开始跑,thread 1由51开始跑,但之后就随机分配到两个核心,所以充分利用两个核心来运算。

inline assembly:我asm忘了差不多了, 一时也写不出来XD

改用unsigned int,而不用int,这种写法经我测试可以快一倍速度,主要是int须动用到有号数运算,速度较慢,详细原理请参考(原创) 无号数及有号数的乘加运算电路设计 (初级) (IC Design) (Verilog) (Linux)

/*
(C) OOMusou 2007 http://oomusou.cnblogs.com

Filename : for_loop_original.c
Compiler : Visual C++ 8.0
Description : Demo how to optimize for loop
Release : 01/05/2008 1.0
*/

#include <stdio.h>

int foo(int n) {
unsigned int i, s = 0;
for(i = 0; i <= n; i++)
s = s + i;

return s;
}

int main() {
int s;
s = foo(100000000000);
printf("%d\n", s);
}

使用Loop unwinding,请参考Wiki Loop unwinding

/*
(C) OOMusou 2007 http://oomusou.cnblogs.com

Filename : for_loop_unwiding.c
Compiler : Visual C++ 8.0
Description : Demo how to optimize for loop
Release : 01/11/2008 1.0
*/

#include <stdio.h>

int foo(int n) {
unsigned int i, s = 0;
for(i = 1; i <= n; i+=10) {
s += i;
s += i+1;
s += i+2;
s += i+3;
s += i+4;
s += i+5;
s += i+6;
s += i+7;
s += i+8;
s += i+9;
}

return s;
}

int main() {
int s;
s = foo(100);
printf("%d\n", s);
}

使用C++的Template Meta Programming方式,由于这是一种在compile time的运算,而不是在run time计算,所以在run time速度很快。

/*
(C) OOMusou 2007 http://oomusou.cnblogs.com

Filename : for_loop_tmp_cpp.cpp
Compiler : Visual C++ 8.0
Description : Demo how to optimize for loop by TMP
Release : 01/05/2008 1.0
*/

#include <iostream>

using namespace std;

template <unsigned n>
struct foo {
enum { value = n + foo<n-1>::value };
};

template<>
struct foo<0> {
enum { value = 0 };
};

int main() {
cout << foo<100>::value << endl;
}

C++的template类似C的macro,是否可用c来达成呢?我试着用以下的C,但无法compile成功,主要是C并不支持recursive macro,所以无福消受。

1#include <stdio.h>
2#define foo(x) ((x)?((x)+foo(x-1)):(0))
3
4int main() {
5 int s = foo(100);
6 printf("%d\n", s);
7}

7.使用硬件加速编译器;

没有评论: