Then I tested matrix multiplication for two sizes, 256 and 512 dimensions. (I actually tried to run the 1012 case as well, but CUDA could not handle that size.)
Since GPU computing is advertised as very fast, something like 100 times faster than a CPU, I was expecting a big difference, and did not expect it to be slower than the CPU!
Here are the results:
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 9800 GT" with compute capability 1.1
[CUDA]
512x512 matrix multiplication: 200.571 msec
[Single Thread Java]
dim: 256, elapsed time: 30 msec
dim: 512, elapsed time: 217 msec
[4 Threads Java(4 cores)]
dim: 256, elapsed time: 18.0 msec, dim^3; 16,777,216, ration(time/dim^3)=1.073
dim: 512, elapsed time: 105.0 msec, dim^3; 134,217,728, ration(time/dim^3)=0.782
From this result, we can say that CUDA (on the 9800 GT) performs about the same as single-threaded Java for the 512 case (200.571 msec vs. 217 msec),
and with the 4-thread Java version, the CPU (Intel i5) is about twice as fast as CUDA on the 9800 GT (105 msec vs. 200.571 msec).
The 9800 GT is a rather old GPU. If we compare it with NVIDIA's high-end GTX 780 and assume the 780 is about 4 times faster than the 9800 GT, then CUDA on a GTX 780 would probably take around 50 msec for the 512 case, about twice as fast as the Intel i5 with 4 threads (105 msec).
However, Intel is planning to release the 8-core Haswell-E processor in Q3 2014. If the multiplication scales with the number of cores, that CPU would bring the time down to a similar level as CUDA on a GTX 780.
It is also not clear how large a matrix the GTX 780 can handle. If its limit is the same as the 9800 GT's, the CPU version is the more capable one.
Also, CUDA requires complex, non-portable code for this 'optimized' GPGPU computing, while the Java version is much simpler. It may also be possible to use 16 cores with a dual-CPU motherboard, in which case that version would be about twice as fast as CUDA on a GTX 780.
The bottom line is that if the performance gain from using CUDA (GPU) is this small, it does not make sense to use a GPU.
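To give an idea of how simple the multi-threaded Java side can be, here is a minimal sketch of a row-partitioned matrix multiplication with timing. This is not the exact program that produced the numbers above; the class name `MatMulBench`, the helper `randomMatrix`, the use of `float` and the loop order are all my assumptions for illustration. It prints the elapsed time and the time/dim^3 ratio in nanoseconds, which seems to be the unit of the ratio in the output above (18 msec / 16,777,216 is about 1.073 nsec).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical benchmark class, a sketch and not the original test program.
class MatMulBench {

    // Multiply two square matrices, giving each worker thread a contiguous block of result rows.
    static float[][] multiply(float[][] a, float[][] b, int threads) {
        int n = a.length;
        float[][] c = new float[n][n];
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int first = t * n / threads;       // first row for this worker (inclusive)
            final int last = (t + 1) * n / threads;  // last row for this worker (exclusive)
            Thread w = new Thread(() -> {
                for (int i = first; i < last; i++) {
                    for (int k = 0; k < n; k++) {
                        float aik = a[i][k];
                        // i-k-j order walks rows of b and c sequentially in memory
                        for (int j = 0; j < n; j++) {
                            c[i][j] += aik * b[k][j];
                        }
                    }
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // wait until every block of rows is done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return c;
    }

    // Fill an n x n matrix with random values (fixed seed so runs are repeatable).
    static float[][] randomMatrix(int n, long seed) {
        Random rnd = new Random(seed);
        float[][] m = new float[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                m[i][j] = rnd.nextFloat();
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int[] dims = {256, 512};
        int[] threadCounts = {1, 4};
        for (int threads : threadCounts) {
            System.out.println("[" + threads + " thread(s) Java]");
            for (int dim : dims) {
                float[][] a = randomMatrix(dim, 1L);
                float[][] b = randomMatrix(dim, 2L);
                multiply(a, b, threads); // warm-up run so the JIT can compile the loops
                long start = System.nanoTime();
                multiply(a, b, threads);
                long elapsedNs = System.nanoTime() - start;
                double dim3 = (double) dim * dim * dim;
                System.out.printf("dim: %d, elapsed time: %.1f msec, nsec/dim^3 = %.3f%n",
                        dim, elapsedNs / 1e6, elapsedNs / dim3);
            }
        }
    }
}
```

The timings will of course depend on the machine and the JVM, so the numbers from this sketch should not be expected to match the ones listed above.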
-----
BTW, the following is my CPU spec. It is an Intel i5 with 4 cores and 32GB RAM:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Stepping: 9
CPU MHz: 1600.000
BogoMIPS: 6935.22
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
-----