Saturday, March 8, 2014

ShadowRoot

I had not noticed that dart:html includes a basic mechanism for registering new HTML element types.
If we add internal variables to a new type of element, plain JavaScript-style DOM tree manipulation is not sufficient; we need a way to dispatch from a DOM node to its associated object. This createShadowRoot mechanism looks like a genuinely useful feature.
Ideally, if Dart supported nested classes, I would define MyElement (see below) as an inner class of the Component class. Another approach is simply to merge those features into a single class (MyElement), which looks feasible.
But if we want to use techniques based on annotating fields, we may need to define a MirrorHtmlElement class as a subclass of HtmlElement, and the 'component' class should inherit from it.
There are some subtleties in the order of initialization, however, so this approach needs further investigation.
import 'dart:html';

void main() {
  document.register('my-element', MyElement);
  document.body.children.add(new Element.tag('my-element'));
}

class MyElement extends HtmlElement {

  MyElement.created() : super.created() {
    var shadowRoot = this.createShadowRoot();
    var button = new ButtonElement()..text = '0';
    var clickCount = 0;
    button.onClick.listen((e) {
      button.text = '${++clickCount}';
    });
    shadowRoot.children.add(button);
  }
}

The HTML is nearly empty except for the necessary script imports:

<!DOCTYPE html>
<html>
  <body>
    <script type="application/dart" src="custom.dart"></script>
    <script src="packages/custom_element/custom-elements.debug.js"></script>
    <script src="packages/browser/dart.js"></script>
  </body>
</html>

Part-Based Programming vs Component-Based Programming

Dart is designed to support component-based web programming.
But the meaning of 'component' seems to differ from person to person.
For me, a component is about coarse-grained composability, not fine-grained reusability.
'Component' is best illustrated by a component stereo system: we can combine an amplifier and a preamplifier from different companies, and just by connecting them with wires, it works.
Another example is a PC motherboard: we can combine any power supply, DRAM, HDD, optical drive, or graphics board.
On the other hand, these components themselves use parts, e.g., capacitors and transistors (though these days the number of such discrete parts is decreasing).
What Polymer seems to be trying to achieve is to provide such reusable 'parts'.
Components should have larger, self-consistent functionality and should not expose their internals.
So the approach to designing a 'component'-based framework and a 'part'-based framework will differ.

Such a component system works because there is a fixed API/standard that every component provider must follow. In a programming language, that contract can be defined by an interface (strangely, Dart dropped interfaces, but we can still use abstract classes).
To connect such providers and integrate them into a functional (functorial) whole, we can use a generic class.
So by analogy, a class corresponds to a 'part', and a generic class corresponds to a 'component'.
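To make the analogy concrete, here is a minimal Java sketch. All names (Amplifier, Speaker, Stereo) are hypothetical illustrations, not from any real framework: the fixed contract is an interface (in Dart, an abstract class), a 'part' is a concrete implementation, and the 'component'-level integrator is a generic class that accepts any conforming implementation.

```java
// The fixed contracts every vendor must follow.
interface Amplifier { int amplify(int signal); }
interface Speaker { String play(int signal); }

// The 'component'-level integrator: a generic class that wires together
// any amplifier and any speaker that satisfy the contracts.
class Stereo<A extends Amplifier, S extends Speaker> {
    private final A amp;
    private final S speaker;
    Stereo(A amp, S speaker) { this.amp = amp; this.speaker = speaker; }
    // Any amplifier feeds any speaker, like connecting stereo units by wire.
    String playThrough(int signal) { return speaker.play(amp.amplify(signal)); }
}

public class ComponentDemo {
    public static void main(String[] args) {
        Amplifier amp = s -> s * 10;        // "vendor A" part
        Speaker sp = s -> "volume:" + s;    // "vendor B" part
        Stereo<Amplifier, Speaker> stereo = new Stereo<>(amp, sp);
        System.out.println(stereo.playThrough(3)); // prints volume:30
    }
}
```

Any implementation of Amplifier can be swapped in without touching Stereo, which is the 'component' property described above.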

Monday, February 10, 2014

Cuda Sample Matrix performance (with 9800 GT) vs Java

I installed the CUDA SDK 5.5 and ran one of the sample CUDA matrix programs.
Then I tested matrix multiplication for two cases, dimensions 256 and 512. (I actually tried to run the 1024 case, but CUDA could not handle that size.)

Since GPU computing is advertised as very fast, like 100 times faster than a CPU, I was expecting a big difference, and did not expect it could be slower than the CPU!

Here is the result:
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 9800 GT" with compute capability 1.1

[CUDA]
256x256 matrix multiplication: 25.803 msec
512x512 matrix multiplication: 200.571 msec

[Single Thread Java]
dim: 256, elapsed time: 30 msec
dim: 512, elapsed time: 217 msec

[4 Threads Java(4 cores)]
dim: 256, elapsed time: 18.0 msec, dim^3; 16,777,216, ratio(time/dim^3)=1.073
dim: 512, elapsed time: 105.0 msec, dim^3; 134,217,728, ratio(time/dim^3)=0.782

From this result, we may say CUDA (with the 9800 GT) has performance similar to single-threaded Java,
and if we use the 4-thread Java version, the CPU (Intel i5) is about twice as fast as CUDA (9800 GT).
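The 4-thread Java version compared here is not listed in this post. One common shape for such an implementation, and a plausible sketch of it, is to split the result rows across a fixed thread pool (the class and method names below are my own, not the actual benchmark code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMatMul {
    // Multiply dim x dim matrices by assigning contiguous blocks of
    // result rows to each worker thread; the row blocks are disjoint,
    // so no synchronization is needed on the writes to c.
    static long[][] mult(long[][] a, long[][] b, int threads) {
        final int dim = a.length;
        final long[][] c = new long[dim][dim];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (dim + threads - 1) / threads;
            Future<?>[] done = new Future<?>[threads];
            for (int t = 0; t < threads; t++) {
                final int from = t * chunk;
                final int to = Math.min(dim, from + chunk);
                done[t] = pool.submit(() -> {
                    for (int i = from; i < to; i++)
                        for (int j = 0; j < dim; j++) {
                            long v = 0;
                            for (int k = 0; k < dim; k++) v += a[i][k] * b[k][j];
                            c[i][j] = v;
                        }
                });
            }
            for (Future<?> f : done) f.get();   // wait for every row block
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return c;
    }

    public static void main(String[] args) {
        // Thread count close to the core count, as in the 4-core runs above.
        int cores = Runtime.getRuntime().availableProcessors();
        long[][] a = {{1, 2}, {3, 4}}, b = {{5, 6}, {7, 8}};
        long[][] c = mult(a, b, cores);
        System.out.println(c[0][0] + " " + c[0][1]);  // 19 22
        System.out.println(c[1][0] + " " + c[1][1]);  // 43 50
    }
}
```

Sizing the pool with Runtime.getRuntime().availableProcessors() matches the 4-core setup used for these measurements.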

The 9800 GT is a somewhat old GPU, but if we compare with NVIDIA's top model, the GTX 780, the 780 is about 4 times faster than the 9800 GT, so CUDA (GTX 780) would probably be about twice as fast as the Intel i5.

But Intel is planning to release the 8-core Haswell-E processor in Q3 2014, so with that new CPU the performance should become similar to CUDA (GTX 780).

But it is not clear how large a dimension the GTX 780 can handle; if the limit is the same as the 9800 GT's, the CPU version is more capable.
CUDA also requires unportable, complex code for this 'optimized' GPGPU computing, while the Java version is much simpler. It may also be possible to use 16 cores with a dual-CPU motherboard, in which case that version would be about twice as fast as CUDA (GTX 780).

The bottom line is that if the performance gain from CUDA (the GPU) is this small, it does not make sense to use the GPU.

-----
By the way, the following is my CPU spec: an Intel i5, 4 cores, 32 GB RAM.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               1600.000
BogoMIPS:              6935.22
Virtualization:        VT-x
L1d cache:             32K                                                                                                     
L1i cache:             32K                                                                                                     
L2 cache:              256K                                                                                                    
L3 cache:              6144K                                                                                                   
NUMA node0 CPU(s):     0-3                     
-----

Monday, November 25, 2013

Java Sumatra project for OpenCL

I found an interesting article about Java's Sumatra project.
http://pc.watch.impress.co.jp/docs/column/kaigai/20130826_612591.html

I was interested in using OpenCL from Java three years ago; at that time there were no such projects, but about two years ago a solid project to support OpenCL in Java, Project Sumatra, seems to have been started by Oracle/AMD.
OpenCL is a very important standard for supporting heterogeneous hardware environments including GPUs, CPUs, etc. Apple's OS X has been based on LLVM since Snow Leopard and supports OpenCL, so this move by Java is actually following Apple.

Project Sumatra

The primary goal of this project is to enable Java applications to take advantage of graphics processing units (GPUs) and accelerated processing units (APUs)--whether they are discrete devices or integrated with a CPU--to improve performance.
We expect the initial focus to be on the Hotspot JVM, enabling code generation, runtime support and garbage collection on GPUs. Once basic capabilities are in place in the JVM, we will also examine how to best expose GPU support to application and/or library developers, initially leveraging the new Java 8 Lambda language and library features. As the project progresses, we may identify challenges with the Java API and constructs which may lead to new language, JVM or library extensions that will need standardization under the JCP process. While the Java language is an obvious focus, we also anticipate that this project will provide guidance for other JVM-hosted languages such as JavaScript/Nashorn, Scala and JRuby.
You can follow the project and get involved on the sumatra-dev mailing list.

Related projects and material

This Project is sponsored by the HotSpot Group.

This is quite an interesting direction; Java 8's support for lambdas (closures) also seems associated with this parallel computing direction based on GPGPU and the OpenCL approach.

I installed the two SDKs below on Windows, but it looks like Intel only supports Windows (except for Xeon). To use this on Linux, we may need to install AMD's SDK. Intel's SDK seems usable with the graphics core integrated in the CPU (i5, etc.).

1) https://code.google.com/p/aparapi/
2a) http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
2b) http://software.intel.com/en-us/vcsource/tools/opencl

These are other related projects.

http://openjdk.java.net/groups/hotspot/

http://openjdk.java.net/projects/mlvm/index.html

http://download.java.net/openjdk/jdk7/

Here is a quote from the JDK 8 project; it actually seems related to the Sumatra project:
 http://mreinhold.org/blog/jdk8-preview

JDK 8 Developer Preview
2013/09/09 10:39:26 -07:00 
The JDK 8 Developer Preview (a.k.a. Milestone 8) builds are now available!
This milestone is intended for broad testing by developers. We've run all tests on all Oracle-supported platforms and haven't found any glaring issues. We've also fixed many of the bugs discovered since we reached the Feature Complete milestone back in June.
The principal feature of this release is Project Lambda (JSR 335), which aims to make it easier to write code for multicore processors. It adds lambda expressions, default methods, and method references to the Java programming language, and extends the libraries to support parallelizable operations upon streamed data.
There are, of course, many other new features, including a new Date and Time API (JSR 310), Compact Profiles, the Nashorn JavaScript Engine, and even some anti-features such as the removal of the Permanent Generation from the HotSpot virtual machine. A complete list of new features is available on the JDK 8 Features page.
If you've been watching JDK 8 evolve from afar then now is an excellent time to download a build and try it out, the sooner the better! Let us know if your existing code doesn't compile and run correctly on JDK 8, if it runs slower than before, if it crashes the JVM, or if there are any remaining design issues in the new language and API features.
We'll do our best to read, evaluate, and act on all feedback received via the usual bug-reporting channel between now and the end of October. After that we'll gradually ramp down the rate of change in order to stabilize the code, so bugs reported later on might not get fixed in time for the GA release.
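The "parallelizable operations upon streamed data" mentioned in the announcement became the Streams API. A tiny sketch of what that looks like: the same pipeline is spread across cores just by adding .parallel().

```java
import java.util.stream.LongStream;

public class LambdaParallel {
    public static void main(String[] args) {
        // Sum of squares 1..1,000,000; .parallel() lets the runtime split
        // the range across worker threads (one per core by default).
        long sum = LongStream.rangeClosed(1, 1_000_000)
                             .parallel()
                             .map(n -> n * n)
                             .sum();
        System.out.println(sum);
    }
}
```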
Other Java 8 related info:

http://www.infoq.com/news/2013/06/jdk8-almost-feature-complete

https://blogs.oracle.com/thejavatutorials/entry/jdk_8_documentation_developer_preview

This kind of direction should be considered for Dart as well.
It will affect both client and server. Since JavaScript seems to be heading toward supporting an OpenCL-like architecture, Dart needs to handle it too, but server-side support would be a more natural fit for high-performance computation using OpenCL.

In fact, I was surprised to learn that the direction of Java 8 had shifted this way. Previously it was aiming at module support, etc., but the focus seems to have changed quite a lot.

Wednesday, November 20, 2013

Dart vs Java performance comparison with Recursive Matrix Multiplication (Part 3)

I ported the latest Java implementation of recursive matrix multiplication to Dart.
It is interesting to see the actual performance difference between the two languages.
Such a comparison will depend on the type of application and the level of expected performance.

As we have seen, Java allows a considerably fast implementation of recursive matrix multiplication. It would be interesting to compare its performance with C++.
But its roughly 1.6 gigaflops (1 billion multiplications in 0.6 sec) is definitely assembler-level performance.

The following is the result of the performance measurement of Dart.
From this, Dart is about 4 times slower than Java compared with the single-threaded Java version, and about 13 times slower compared with the 4-thread (4-core) version.
A 4x difference is, in a sense, close to Java. Dart does not allow sharing memory among isolates, so even if we use isolates it will be more costly because matrix elements must be copied; still, it would be interesting to measure the performance using isolates.
Even though Dart does not support true arrays, if the List representation internally uses an array (by delegating to native code), that should not cause a major performance difference. Another likely reason is the representation of int: we might see a similar result in Java if we replaced long with BigInteger.
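To illustrate that speculation about int representation, here is a rough, hypothetical micro-benchmark sketch (not the post's benchmark code) comparing an inner product with primitive long against the same computation with boxed BigInteger:

```java
import java.math.BigInteger;

public class BoxedCost {
    // Inner product with primitive long arithmetic: no allocation at all.
    static long longDot(long[] a, long[] b) {
        long v = 0;
        for (int i = 0; i < a.length; i++) v += a[i] * b[i];
        return v;
    }

    // The same inner product with BigInteger: every step allocates objects,
    // roughly the kind of overhead a non-primitive int representation pays.
    static BigInteger bigDot(long[] a, long[] b) {
        BigInteger v = BigInteger.ZERO;
        for (int i = 0; i < a.length; i++)
            v = v.add(BigInteger.valueOf(a[i]).multiply(BigInteger.valueOf(b[i])));
        return v;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long[] a = new long[n], b = new long[n];
        for (int i = 0; i < n; i++) { a[i] = i % 50; b[i] = (i * 7) % 50; }

        long t0 = System.nanoTime();
        long r1 = longDot(a, b);
        long t1 = System.nanoTime();
        BigInteger r2 = bigDot(a, b);
        long t2 = System.nanoTime();

        System.out.printf("long: %d ms, BigInteger: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        // Both paths must agree on the result.
        if (!r2.equals(BigInteger.valueOf(r1))) throw new AssertionError();
    }
}
```

This is only a sketch; a proper comparison would need JIT warm-up and repeated runs.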

[Dart]
rec_test_loop: 12
>> dim: 2, elapsed time: 5 milli sec, dim^3; 8, ratio(time/dim^3)=625000.0
>> dim: 4, elapsed time: 0 milli sec, dim^3; 64, ratio(time/dim^3)=0.0
>> dim: 8, elapsed time: 1 milli sec, dim^3; 512, ratio(time/dim^3)=1953.125
>> dim: 16, elapsed time: 32 milli sec, dim^3; 4096, ratio(time/dim^3)=7812.5
>> dim: 32, elapsed time: 4 milli sec, dim^3; 32768, ratio(time/dim^3)=122.0703125
>> dim: 64, elapsed time: 5 milli sec, dim^3; 262144, ratio(time/dim^3)=19.073486328125
>> dim: 128, elapsed time: 16 milli sec, dim^3; 2097152, ratio(time/dim^3)=7.62939453125
>> dim: 256, elapsed time: 119 milli sec, dim^3; 16777216, ratio(time/dim^3)=7.092952728271484
>> dim: 512, elapsed time: 1239 milli sec, dim^3; 134217728, ratio(time/dim^3)=9.231269359588623
>> dim: 1024, elapsed time: 8027 milli sec, dim^3; 1073741824, ratio(time/dim^3)=7.475726306438446
>> dim: 2048, elapsed time: 60587 milli sec, dim^3; 8589934592, ratio(time/dim^3)=7.0532551035285
>> dim: 4096, elapsed time: 484160 milli sec, dim^3; 68719476736, ratio(time/dim^3)=7.045455276966095

[Java with single thread]
>> dim: 2, elapsed time: 1 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 0 milli sec
>> dim: 32, elapsed time: 5 milli sec
>> dim: 64, elapsed time: 15 milli sec
>> dim: 128, elapsed time: 5 milli sec
>> dim: 256, elapsed time: 30 milli sec
>> dim: 512, elapsed time: 217 milli sec
>> dim: 1024, elapsed time: 1915 milli sec
>> dim: 2048, elapsed time: 14691 milli sec
>> dim: 4096, elapsed time: 115860 milli sec

[Java with 4 threads(4 cores)]
>> dim: 2, elapsed time: 0.0 milli sec, dim^3; 8, ratio(time/dim^3)=0
>> dim: 4, elapsed time: 0.0 milli sec, dim^3; 64, ratio(time/dim^3)=0
>> dim: 8, elapsed time: 1.0 milli sec, dim^3; 512, ratio(time/dim^3)=1,953.125
>> dim: 16, elapsed time: 1.0 milli sec, dim^3; 4,096, ratio(time/dim^3)=244.141
>> dim: 32, elapsed time: 3.0 milli sec, dim^3; 32,768, ratio(time/dim^3)=91.553
>> dim: 64, elapsed time: 17.0 milli sec, dim^3; 262,144, ratio(time/dim^3)=64.85
>> dim: 128, elapsed time: 4.0 milli sec, dim^3; 2,097,152, ratio(time/dim^3)=1.907
>> dim: 256, elapsed time: 18.0 milli sec, dim^3; 16,777,216, ratio(time/dim^3)=1.073
>> dim: 512, elapsed time: 105.0 milli sec, dim^3; 134,217,728, ratio(time/dim^3)=0.782
>> dim: 1024, elapsed time: 602.0 milli sec, dim^3; 1,073,741,824, ratio(time/dim^3)=0.561
>> dim: 2048, elapsed time: 4771.0 milli sec, dim^3; 8,589,934,592, ratio(time/dim^3)=0.555
>> dim: 4096, elapsed time: 38704.0 milli sec, dim^3; 68,719,476,736, ratio(time/dim^3)=0.563

Here again is a puzzling result: the performance degradation of simple matrix multiplication in Dart. Until dim 512, the simple version is faster than the recursive one, but beyond that the simple version becomes slower, and at 4096 it is 3.3 times slower.

[Simple Matrix]
>> dim: 2, elapsed time: 4 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 4 milli sec
>> dim: 32, elapsed time: 6 milli sec
>> dim: 64, elapsed time: 2 milli sec
>> dim: 128, elapsed time: 7 milli sec
>> dim: 256, elapsed time: 55 milli sec
>> dim: 512, elapsed time: 483 milli sec
>> dim: 1024, elapsed time: 12269 milli sec
>> dim: 2048, elapsed time: 118535 milli sec
>> dim: 4096, elapsed time: 1597558 milli sec

[Recursive Matrix]
>> dim: 2, elapsed time: 5 milli sec, dim^3; 8, ratio(time/dim^3)=625000.0
>> dim: 4, elapsed time: 0 milli sec, dim^3; 64, ratio(time/dim^3)=0.0
>> dim: 8, elapsed time: 1 milli sec, dim^3; 512, ratio(time/dim^3)=1953.125
>> dim: 16, elapsed time: 32 milli sec, dim^3; 4096, ratio(time/dim^3)=7812.5
>> dim: 32, elapsed time: 4 milli sec, dim^3; 32768, ratio(time/dim^3)=122.0703125
>> dim: 64, elapsed time: 5 milli sec, dim^3; 262144, ratio(time/dim^3)=19.073486328125
>> dim: 128, elapsed time: 16 milli sec, dim^3; 2097152, ratio(time/dim^3)=7.62939453125
>> dim: 256, elapsed time: 119 milli sec, dim^3; 16777216, ratio(time/dim^3)=7.092952728271484
>> dim: 512, elapsed time: 1239 milli sec, dim^3; 134217728, ratio(time/dim^3)=9.231269359588623
>> dim: 1024, elapsed time: 8027 milli sec, dim^3; 1073741824, ratio(time/dim^3)=7.475726306438446
>> dim: 2048, elapsed time: 60587 milli sec, dim^3; 8589934592, ratio(time/dim^3)=7.0532551035285
>> dim: 4096, elapsed time: 484160 milli sec, dim^3; 68719476736, ratio(time/dim^3)=7.045455276966095

The following is the Dart code used for this benchmark test.

library rec_matrix_v6;

import "dart:math";

Random rand = new Random(); 

List<List<int>> _randomMatrix(int dim) {
  //final List<List<int>> matrix = new int[dim][dim];
  final List<List<int>> matrix = new List(dim);
  for (int i = 0; i < dim; i++) {
    matrix[i] = new List(dim);
    for (int j = 0; j < dim; j++) {
      matrix[i][j] = rand.nextInt(50);
    }
  }
  return matrix;
}
List<List<int>> _zeroMatrix(int dim) {
  final List<List<int>> matrix = new List(dim);
  for (int i = 0; i < dim; i++) {
    matrix[i] = new List(dim);
    for (int j = 0; j < dim; j++) {
      matrix[i][j] = 0;
    }
  }
  return matrix;
}

//
// AbstractMatrix
//
abstract class AbstractMatrix {
  int _dim;
  List<List<int>> _matrix;
  
  AbstractMatrix(List<List<int>> matrix) {
    this._dim = matrix.length;
    this._matrix = matrix;
  }
  
  int dim() {
    return _dim;
  }

  SimpleMatrix add(SimpleMatrix m) {
    List<List<int>> m0 = _matrix;
    List<List<int>> m1 = m._matrix;
    List<List<int>> m2 = new List(_dim);
    for (int i = 0; i < _dim; i++) {
      m2[i] = new List(_dim);
      for (int j = 0; j < _dim; j++) {
        m2[i][j] = m0[i][j]+m1[i][j];
      }
    }
    return new SimpleMatrix(m2);
  }

  SimpleMatrix subtract(SimpleMatrix m) {
    List<List<int>> m0 = _matrix;
    List<List<int>> m1 = m._matrix;
    List<List<int>> m2 = new List(_dim);
    for (int i = 0; i < _dim; i++) {
      m2[i] = new List(_dim);
      for (int j = 0; j < _dim; j++) {
        m2[i][j] = m0[i][j]-m1[i][j];
      }
    }
    return new SimpleMatrix(m2);
  }
  
  String toString() {
    return toString1(_matrix);
  }
  
  String toString1(List<List<int>> matrix) {
    int dim = matrix.length;
    StringBuffer sb = new StringBuffer();
    sb.write("(\n");
    for (int i = 0; i < dim; i++) {
      for (int j = 0; j < dim; j++) {
        int v = matrix[i][j];
        if (j != 0) {
          sb.write(", ");
        }
        sb.write(v);
      }
      sb.write("\n");
    }
    sb.write(")\n");
    return sb.toString();
  }
  
  bool equals(Object obj) {
    if (obj is SimpleMatrix) {
      SimpleMatrix m = obj;
      if (_dim != m._dim) {
        return false;
      }
      List<List<int>>matrix1 = m._matrix;
      for (int i = 0; i < _dim; i++) {
        for (int j = 0; j < _dim; j++) {
          if (_matrix[i][j] != matrix1[i][j]) {
            return false;
          }
        }
      }
      return true;
    } else {
      return false;
    }
  }
}

//
// SimpleMatrix
//
class SimpleMatrix extends AbstractMatrix {
  SimpleMatrix(List<List<int>> matrix): super(matrix) {}
  
  SimpleMatrix mult(SimpleMatrix m) {
    List<List<int>> m0 = _matrix;
    List<List<int>> m1 = m._matrix;
    List<List<int>> m2 = new List(_dim);
    for (int i = 0; i < _dim; i++) {
      m2[i] = new List(_dim);
      for (int j = 0; j < _dim; j++) {
        int v = 0;
        for (int k = 0; k < _dim; k++) {
          v += m0[i][k]*m1[k][j];
        }
        m2[i][j] = v;
      }
    }
    return new SimpleMatrix(m2);
  }
}

//
// RecMatrix
//
class RecMatrix extends AbstractMatrix {
  RecMatrix(List<List<int>> matrix): super(matrix) {}
  
  static RecMatrix randomMatrix(int dim) {
    List<List<int>> matrix = _randomMatrix(dim);
    return new RecMatrix(matrix);
  }

  RecMatrix mult(RecMatrix m) {
    RecMatrix m2 = new RecMatrix(_zeroMatrix(_dim));
    m2.mult0(_dim, _matrix, 0, 0, m._matrix, 0, 0);
    return m2;
  }
  
  void mult0(final int dim, final List<List<int>> m1, final int row_index1, final int column_index1, final List<List<int>> m2, final int row_index2, final int column_index2) {
    if (dim == 2) {
      for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
          int i1 = row_index1 | i;
          int j2 = column_index2 | j;
          int v = 0;
          for (int k = 0; k < 2; k++) {
            int j1 = column_index1 | k;
            int i2 = row_index2 | k;
            v += m1[i1][j1]*m2[i2][j2];
          }
          _matrix[i1][j2] += v;
        }
      }
    } else {
      final int dim0 = dim >> 1;
      for (int i = 0; i < 2; i++) {
        final int r_idx1 = (i == 0) ? row_index1 : (row_index1 | dim0);
        for (int j = 0; j < 2; j++) {
          final int c_idx2 = (j == 0) ? column_index2: (column_index2 | dim0);
          for (int k = 0; k < 2; k++) {
            final int c_idx1 = (k == 0) ? column_index1: (column_index1 | dim0);
            final int r_idx2 = (k == 0) ? row_index2 : (row_index2 | dim0);
            mult0(dim0, m1, r_idx1, c_idx1, m2, r_idx2, c_idx2);
          }
        }
      }
    }
  }
}

//
// Tests
//

// simple_matrix_test1
SimpleMatrix simple_matrix_test1(List<List<int>> matrx1, List<List<int>> matrx2, bool show) {
  SimpleMatrix m1 = new SimpleMatrix(matrx1);
  SimpleMatrix m2 = new SimpleMatrix(matrx2);
  SimpleMatrix m3 = m1.mult(m2);
  if (show) {
    print("m1: ${m1}");
    print("m2: ${m2}");
    print("m3: ${m3}");
  }
  return m3;
}

DateTime simple_matrix_test2(int dim) {
  List<List<int>> matrx1 = _randomMatrix(dim);
  List<List<int>> matrx2 = _randomMatrix(dim);
  
  simple_matrix_test1(matrx1, matrx2, false);

  return new DateTime.now();
}

void simple_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
    dim = dim*2;
    DateTime startTime = new DateTime.now();
    DateTime endTime = simple_matrix_test2(dim);
    Duration du = endTime.difference(startTime);
    print(">> dim: ${dim}, elapsed time: ${du.inMilliseconds} milli sec");
  }
}

// rec_matrix_test1
RecMatrix rec_matrix_test1(List<List<int>> matrx1, List<List<int>> matrx2, bool show) {
  RecMatrix rm1 = new RecMatrix(matrx1);
  RecMatrix rm2 = new RecMatrix(matrx2);
  RecMatrix rm3 = rm1.mult(rm2);
  if (show) {
    print("m1: ${rm1}");
    print("m2: ${rm2}");
    print("m3: ${rm3}");
  }
  return rm3;
}

DateTime rec_matrix_test2(int dim) {
  List<List<int>> matrx1 = _randomMatrix(dim);
  List<List<int>> matrx2 = _randomMatrix(dim);
  
  rec_matrix_test1(matrx1, matrx2, false);
  return new DateTime.now();
}

void rec_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
    dim = dim*2;
    DateTime startTime = new DateTime.now();
    DateTime endTime = rec_matrix_test2(dim);
    Duration duration = endTime.difference(startTime);
    int b_duration = duration.inMilliseconds;
    int dim_p3 = dim*dim*dim;
    double b_r = (b_duration/dim_p3)*1000000;
    print(">> dim: ${dim}, elapsed time: ${b_duration} milli sec, dim^3; ${dim_p3}, ratio(time/dim^3)=${b_r}");
  }
}

// verify_matrix_test1
void verify_matrix_test1(int dim, bool show) {
  List<List<int>> matrx1 = _randomMatrix(dim);
  List<List<int>> matrx2 = _randomMatrix(dim);
  
  SimpleMatrix sm3 = simple_matrix_test1(matrx1, matrx2, show);
  RecMatrix rm3 = rec_matrix_test1(matrx1, matrx2, show);
  SimpleMatrix srm3 = new SimpleMatrix(rm3._matrix);
  if (!sm3.equals(srm3)) {
    print("!!verify_matrix_test1: not equals ");
    print("m1: ${new SimpleMatrix(matrx1)}");
    print("m2: ${new SimpleMatrix(matrx2)}");
    print("sm3: ${sm3}");
    print("srm3:${srm3}");
    throw new Exception();
  } else if (show) {
    print("m1: ${new SimpleMatrix(matrx1)}");
    print("m2: ${new SimpleMatrix(matrx2)}");
    print("m3: ${sm3}");
  }
}

void verify_test_loop(int max, bool show) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
    dim = dim*2;
    verify_matrix_test1(dim, show);
  }
}

void test1(int max) {
  /*
  print("verify_test_loop: ${max}");
  verify_test_loop(max, false);
  
  print("simple_test_loop: ${max}");
  simple_test_loop(max);
   */
  print("rec_test_loop: ${max}");
  rec_test_loop(max);
}

void main() {
  int max = 12;
  test1(max);
  print(">> done");
  //verify_matrix_test1(2);
  //verify_matrix_test1(4);
  //verify_test_loop(max);
  
  //simple_test_loop(max);
  
  //rec_matrix_test1(4);
  //rec_test_loop(max);
}

Tuesday, November 19, 2013

Java implementation of the Recursive Matrix Multiplication Algorithm (Part 2)

This new implementation of recursive matrix multiplication produced a few unexpected observations.
The performance of this recursive implementation (without concurrency) is faster than simple matrix multiplication (without recursion)!
I still do not understand the reason, but it may be related to memory access patterns or something at that low level.

As this result shows, at dim 4096 the recursive version is about 9 times faster than the simple version.

[simple_test_loop]
>> dim: 2, elapsed time: 0 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 0 milli sec
>> dim: 32, elapsed time: 1 milli sec
>> dim: 64, elapsed time: 6 milli sec
>> dim: 128, elapsed time: 6 milli sec
>> dim: 256, elapsed time: 24 milli sec
>> dim: 512, elapsed time: 206 milli sec
>> dim: 1024, elapsed time: 6478 milli sec
>> dim: 2048, elapsed time: 90698 milli sec
>> dim: 4096, elapsed time: 1023754 milli sec

[rec_matrix_test1]
>> dim: 2, elapsed time: 1 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 0 milli sec
>> dim: 32, elapsed time: 5 milli sec
>> dim: 64, elapsed time: 15 milli sec
>> dim: 128, elapsed time: 5 milli sec
>> dim: 256, elapsed time: 30 milli sec
>> dim: 512, elapsed time: 217 milli sec
>> dim: 1024, elapsed time: 1915 milli sec
>> dim: 2048, elapsed time: 14691 milli sec
>> dim: 4096, elapsed time: 115860 milli sec

Also, the result I got with the first implementation was so slow that I thought there was no point implementing a fast algorithm in Java, but the new version is surprisingly fast, and no GC is involved during the calculation since no intermediate objects are created. Compared to the previous version, in the dim 256 case it is 13757/20 = 688 times faster. Also, if we closely examine the unit cost of the calculation, it is almost constant, so the performance is roughly linear in the number of multiplications.

[rec]
>> dim: 2, elapsed time: 3 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 2 milli sec
>> dim: 16, elapsed time: 32 milli sec
>> dim: 32, elapsed time: 378 milli sec
>> dim: 64, elapsed time: 494 milli sec
>> dim: 128, elapsed time: 1620 milli sec
>> dim: 256, elapsed time: 13757 milli sec

Further, if we introduce multiple threads (4 threads on 4 cores), the performance is roughly quadrupled; perhaps that is a slight exaggeration, but the speedup is in the 3-4x range.
The number of threads should be close to the number of CPU cores to get maximum performance.

>> dim: 2, elapsed time: 0.0 milli sec, dim^3; 8, ratio(time/dim^3)=0
>> dim: 4, elapsed time: 0.0 milli sec, dim^3; 64, ratio(time/dim^3)=0
>> dim: 8, elapsed time: 1.0 milli sec, dim^3; 512, ratio(time/dim^3)=1,953.125
>> dim: 16, elapsed time: 1.0 milli sec, dim^3; 4,096, ratio(time/dim^3)=244.141
>> dim: 32, elapsed time: 3.0 milli sec, dim^3; 32,768, ratio(time/dim^3)=91.553
>> dim: 64, elapsed time: 17.0 milli sec, dim^3; 262,144, ratio(time/dim^3)=64.85
>> dim: 128, elapsed time: 4.0 milli sec, dim^3; 2,097,152, ratio(time/dim^3)=1.907
>> dim: 256, elapsed time: 18.0 milli sec, dim^3; 16,777,216, ratio(time/dim^3)=1.073
>> dim: 512, elapsed time: 105.0 milli sec, dim^3; 134,217,728, ratio(time/dim^3)=0.782
>> dim: 1024, elapsed time: 602.0 milli sec, dim^3; 1,073,741,824, ratio(time/dim^3)=0.561
>> dim: 2048, elapsed time: 4771.0 milli sec, dim^3; 8,589,934,592, ratio(time/dim^3)=0.555
>> dim: 4096, elapsed time: 38704.0 milli sec, dim^3; 68,719,476,736, ratio(time/dim^3)=0.563

dim^3 is the actual number of multiplications of long values. So this took only about 0.6 seconds to perform 1 billion multiplications (the dim = 1024 case); not bad. And even if we change long to double, the performance is almost the same.
This also includes the index calculations needed to traverse matrix locations.

A matrix of dimension 1000 can express a network of 1000 nodes, so there should be practical applications.
And in this algorithm it is very easy to utilize more cores, 4, 16, 256, ..., so the performance should scale with the increase in core counts in the future.
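A hedged sketch of how that core scaling might look in Java: divide the result matrix into quadrants recursively, so the number of independent tasks multiplies by four at each level of recursion. The class names and the leaf cutoff below are my own choices, not the post's benchmark code.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkJoinMatMul {
    // Each task owns a disjoint square block of the result matrix, so the
    // four sub-tasks per level can run in parallel with no synchronization.
    static class BlockTask extends RecursiveAction {
        static final int LEAF = 64;       // stop splitting below this size
        final long[][] a, b, c;
        final int row, col, size, dim;
        BlockTask(long[][] a, long[][] b, long[][] c, int row, int col, int size) {
            this.a = a; this.b = b; this.c = c;
            this.row = row; this.col = col; this.size = size;
            this.dim = a.length;
        }
        @Override protected void compute() {
            if (size <= LEAF) {           // leaf: plain triple loop on the block
                for (int i = row; i < row + size; i++)
                    for (int j = col; j < col + size; j++) {
                        long v = 0;
                        for (int k = 0; k < dim; k++) v += a[i][k] * b[k][j];
                        c[i][j] = v;
                    }
            } else {                      // fork the four result quadrants
                int h = size / 2;
                invokeAll(new BlockTask(a, b, c, row,     col,     h),
                          new BlockTask(a, b, c, row,     col + h, h),
                          new BlockTask(a, b, c, row + h, col,     h),
                          new BlockTask(a, b, c, row + h, col + h, h));
            }
        }
    }

    static long[][] mult(long[][] a, long[][] b) {
        long[][] c = new long[a.length][a.length];
        // Pool parallelism defaults to the number of available cores,
        // so the same code keeps scaling as core counts grow.
        new ForkJoinPool().invoke(new BlockTask(a, b, c, 0, 0, a.length));
        return c;
    }

    public static void main(String[] args) {
        long[][] a = {{1, 2}, {3, 4}}, b = {{5, 6}, {7, 8}};
        long[][] c = mult(a, b);
        System.out.println(c[0][0] + " " + c[1][1]); // 19 50
    }
}
```

Work-stealing in ForkJoinPool keeps all cores busy even when block sizes are uneven, which is one reason this decomposition scales past a fixed thread pool.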

At this point, I checked a paper on a Strassen algorithm implementation on an NVIDIA GPU.

The numbers were actually quite impressive:
at dim = 2048, it took only 48 ms, about 100 times faster than the 4-thread (4-core) version!

By the way, the following is my CPU spec: an Intel i5, 4 cores, 32 GB RAM.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               1600.000
BogoMIPS:              6935.22
Virtualization:        VT-x
L1d cache:             32K                                                                                                    
L1i cache:             32K                                                                                                    
L2 cache:              256K                                                                                                   
L3 cache:              6144K                                                                                                  
NUMA node0 CPU(s):     0-3                       

The following code was used for this testing.

package org.zen;

import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.util.Random;

public class Matrix_v6 {
 static Random rand = new Random(); 
 static long[][] _randomMatrix(int dim) {
  final long[][] matrix = new long[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = rand.nextInt(50);
   }
  }
  return matrix;
 }
 static long[][] _zeroMatrix(int dim) {
  final long[][] matrix = new long[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = 0;
   }
  }
  return matrix;
 }
 
 //
 // AbstractMatrix
 //
 public static abstract class AbstractMatrix {
  protected int _dim;
  protected long[][] _matrix;
  
  public AbstractMatrix(long[][] matrix) {
   this._dim = matrix.length;
   this._matrix = matrix;
  }
  
  public int dim() {
   return _dim;
  }

  public SimpleMatrix add(SimpleMatrix m) {
   long[][] m0 = _matrix;
   long[][] m1 = m._matrix;
   long[][] m2 = new long[_dim][_dim];
   for (int i = 0; i < _dim; i++) {
    for (int j = 0; j < _dim; j++) {
     m2[i][j] = m0[i][j]+m1[i][j];
    }
   }
   return new SimpleMatrix(m2);
  }

  public SimpleMatrix subtract(SimpleMatrix m) {
   long[][] m0 = _matrix;
   long[][] m1 = m._matrix;
   long[][] m2 = new long[_dim][_dim];
   for (int i = 0; i < _dim; i++) {
    for (int j = 0; j < _dim; j++) {
     m2[i][j] = m0[i][j]-m1[i][j];
    }
   }
   return new SimpleMatrix(m2);
  }
  
  public String toString() {
   return toString(_matrix);
  }
  
  public static String toString(long[][] matrix) {
   int dim = matrix.length;
   StringBuilder sb = new StringBuilder();
   sb.append("(\n");
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     long v = matrix[i][j];
     if (j != 0) {
      sb.append(", ");
     }
     sb.append(v);
    }
    sb.append("\n");
   }
   sb.append(")\n");
   return sb.toString();
  }
  
  public boolean equals(Object obj) {
   if (obj instanceof SimpleMatrix) {
    SimpleMatrix m = (SimpleMatrix)obj;
    if (_dim != m._dim) {
     return false;
    }
    long[][]matrix1 = m._matrix;
    for (int i = 0; i < _dim; i++) {
     for (int j = 0; j < _dim; j++) {
      if (_matrix[i][j] != matrix1[i][j]) {
       return false;
      }
     }
    }
    return true;
   } else {
    return false;
   }
  }
 }
 
 //
 // SimpleMatrix
 //
 public static class SimpleMatrix extends AbstractMatrix {
  public SimpleMatrix(long[][] matrix) {
   super(matrix);
  }
  
  public SimpleMatrix mult(SimpleMatrix m) {
   long[][] m0 = _matrix;
   long[][] m1 = m._matrix;
   long[][] m2 = new long[_dim][_dim];
   for (int i = 0; i < _dim; i++) {
    for (int j = 0; j < _dim; j++) {
      long v = 0;
     for (int k = 0; k < _dim; k++) {
      v += m0[i][k]*m1[k][j];
     }
     m2[i][j] = v;
    }
   }
   return new SimpleMatrix(m2);
  }
 }

 //
 // RecMatrix
 //
 public static class RecMatrix extends AbstractMatrix {
  public RecMatrix(long[][] matrix) {
   super(matrix);
  }
  
  public static RecMatrix randomMatrix(int dim) {
   long[][] matrix = _randomMatrix(dim);
   return new RecMatrix(matrix);
  }

  public RecMatrix mult(RecMatrix m) {
   RecMatrix m2 = new RecMatrix(new long[_dim][_dim]);
   m2.mult0(_dim, _matrix, 0, 0, m._matrix, 0, 0);
   return m2;
  }
  public void mult0(final int dim, final long[][] m1, final int row_index1, final int column_index1, final long[][] m2, final int row_index2, final int column_index2) {
   if (dim == 2) {
    for (int i = 0; i < 2; i++) {
     int i1 = row_index1 | i;
     for (int j = 0; j < 2; j++) {
      int j2 = column_index2 | j;
      long v = 0;
      for (int k = 0; k < 2; k++) {
       int j1 = column_index1 | k;
       int i2 = row_index2 | k;
       v += m1[i1][j1]*m2[i2][j2];
      }
      _matrix[i1][j2] += v;
     }
    }
   } else if (dim == _dim) {
    Thread[] threads = new Thread[4];
    int idx = 0;
    final int dim0 = dim >> 1;
    for (int i = 0; i < 2; i++) {
     final int r_idx1 = (i == 0) ? row_index1 : (row_index1 | dim0);
     for (int j = 0; j < 2; j++) {
      final int c_idx2 = (j == 0) ? column_index2: (column_index2 | dim0);
      threads[idx] = new Thread(new Runnable() {
       public void run() {
              for (int k = 0; k < 2; k++) {
         final int c_idx1 = (k == 0) ? column_index1: (column_index1 | dim0);
         final int r_idx2 = (k == 0) ? row_index2 : (row_index2 | dim0);
         mult0(dim0, m1, r_idx1, c_idx1, m2, r_idx2, c_idx2);
              }
          }
      });
      threads[idx++].start();
     }
    }
    for (int i = 0; i < threads.length; i++) {
        try {
            threads[i].join();
        } catch (InterruptedException e) {
            // restore the interrupt flag instead of silently swallowing it
            Thread.currentThread().interrupt();
        }
    }
   } else {
    final int dim0 = dim >> 1;
    for (int i = 0; i < 2; i++) {
     final int r_idx1 = (i == 0) ? row_index1 : (row_index1 | dim0);
     for (int j = 0; j < 2; j++) {
      final int c_idx2 = (j == 0) ? column_index2: (column_index2 | dim0);
      for (int k = 0; k < 2; k++) {
       final int c_idx1 = (k == 0) ? column_index1: (column_index1 | dim0);
       final int r_idx2 = (k == 0) ? row_index2 : (row_index2 | dim0);
       mult0(dim0, m1, r_idx1, c_idx1, m2, r_idx2, c_idx2);
      }
     }
    }
   }
  }
 }
 
 //
 // Tests
 //
 
 // simple_matrix_test1
 public static SimpleMatrix simple_matrix_test1(long[][] matrx1, long[][] matrx2, boolean show) {
  SimpleMatrix m1 = new SimpleMatrix(matrx1);
  SimpleMatrix m2 = new SimpleMatrix(matrx2);
  SimpleMatrix m3 = m1.mult(m2);
  if (show) {
   System.out.println("m1: "+m1);
   System.out.println("m2: "+m2);
   System.out.println("m3: "+m3);
  }
  return m3;
 }
 
 public static long simple_matrix_test1(int dim) {
  long[][] matrx1 = _randomMatrix(dim);
  long[][] matrx2 = _randomMatrix(dim);
  
  simple_matrix_test1(matrx1, matrx2, false);

  long end = System.currentTimeMillis();
  return end;
 }
 public static void simple_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = simple_matrix_test1(dim);
   System.out.println(">> dim: "+dim+", elapsed time: "+(endTime - startTime)+" milli sec");
  }
 }
 
 // rec_matrix_test1
 public static RecMatrix rec_matrix_test1(long[][] matrx1, long[][] matrx2, boolean show) {
  RecMatrix rm1 = new RecMatrix(matrx1);
  RecMatrix rm2 = new RecMatrix(matrx2);
  RecMatrix rm3 = rm1.mult(rm2);
  if (show) {
   System.out.println("m1: "+rm1);
   System.out.println("m2: "+rm2);
   System.out.println("m3: "+rm3);
  }
  return rm3;
 }
 
 public static long rec_matrix_test1(int dim) {
  long[][] matrx1 = _randomMatrix(dim);
  long[][] matrx2 = _randomMatrix(dim);
  
  rec_matrix_test1(matrx1, matrx2, false);

  long end = System.currentTimeMillis();
  return end;
 }
 
 private static String formatBigDecimal(BigDecimal bd){
     DecimalFormat df = new DecimalFormat();
     return df.format(bd);
 }
 
 public static void rec_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = rec_matrix_test1(dim);
   double duration = endTime - startTime;
   BigDecimal b_duration = new BigDecimal(duration);
   BigDecimal dim_p3=new BigDecimal(dim).multiply(new BigDecimal(dim)).multiply(new BigDecimal(dim));
   BigDecimal b_r = b_duration.divide(new BigDecimal(dim)).divide(new BigDecimal(dim)).divide(new BigDecimal(dim)).multiply(new BigDecimal(1000000));
   System.out.println(">> dim: "+dim+", elapsed time: "+duration+" milli sec, dim^3: "+formatBigDecimal(dim_p3)+", ratio(time/dim^3)="+formatBigDecimal(b_r));
   //System.out.println(">> dim: "+dim+", elapsed time: "+duration+" milli sec, dim^3: "+dim_p3);
  }
 }
 
 // verify_matrix_test1
 public static void verify_matrix_test1(int dim, boolean show) {
  long[][] matrx1 = _randomMatrix(dim);
  long[][] matrx2 = _randomMatrix(dim);
  
  SimpleMatrix sm3 = simple_matrix_test1(matrx1, matrx2, show);
  RecMatrix rm3 = rec_matrix_test1(matrx1, matrx2, show);
  SimpleMatrix srm3 = new SimpleMatrix(rm3._matrix);
  if (!sm3.equals(srm3)) {
   System.out.println("!!verify_matrix_test1: not equals ");
   System.out.println("m1: "+new SimpleMatrix(matrx1));
   System.out.println("m2: "+new SimpleMatrix(matrx2));
   System.out.println("sm3: "+sm3);
   System.out.println("srm3:"+srm3);
   throw new RuntimeException();
  } else if (show) {
   System.out.println("m1: "+new SimpleMatrix(matrx1));
   System.out.println("m2: "+new SimpleMatrix(matrx2));
   System.out.println("m3: "+sm3);
  }
 }
 
 public static void verify_test_loop(int max, boolean show) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   verify_matrix_test1(dim, show);
  }
 }
 
 public static void test1(int max) {
  /*
  System.out.println("verify_test_loop: "+max);
  verify_test_loop(max, false);
  
  System.out.println("simple_test_loop: "+max);
  simple_test_loop(max);
  */
  System.out.println("rec_test_loop: "+max);
  rec_test_loop(max);
 }
 
 public static void main(String[] args) {
  int max = 12;
  test1(max);
  System.out.println(">> done");
  //verify_matrix_test1(2);
  //verify_matrix_test1(4);
  //verify_test_loop(max);
  
  //simple_test_loop(max);
  
  //rec_matrix_test1(4);
  //rec_test_loop(max);
 }
}

Java implementation of Recursive Matrix Multiplication Algorithm (Part 1)

I tried to make a test case that can be used to evaluate a language's capability.
Since I was also interested in the concurrent nature of the recursive matrix algorithm, I decided to use it as the sample.
But the subject itself is interesting, with a long history and a large body of research behind it.

I first found the algorithm in a paper on open-source hardware:

http://www.adapteva.com/wp-content/uploads/2013/07/adapteva_ipdps.pdf

http://www.parallella.org/board/

But later I recognized that the idea of recursive matrix multiplication is rooted in Strassen's fast matrix multiplication, which runs in O(n^(log_2 7)) versus the ordinary O(n^3) (see the book Modern Computer Algebra).

But these days, rather than reducing the number of multiplications from 8 to 7, it is more effective to employ as many cores as possible.

And this algorithm provides quite a clean separation of jobs to delegate to multiple cores. That is the more interesting aspect of the algorithm, and in practice more effective than reducing the number of multiplications (which may provide a 20% improvement).
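As a minimal sketch of that delegation (a toy example of my own, not the post's code): the four output quadrants of C = A*B are independent, so each one can be handed to its own thread. Since the quadrants are written to disjoint regions of the result, no locking is needed.

```java
// Toy sketch: one thread per output quadrant of C = A*B.
public class QuadrantThreads {

 // Fill one half-sized quadrant of c, anchored at (rowOff, colOff),
 // using the full row strip of a and the full column strip of b.
 static void multQuadrant(long[][] a, long[][] b, long[][] c,
                          int rowOff, int colOff, int half) {
  for (int i = 0; i < half; i++) {
   for (int j = 0; j < half; j++) {
    long v = 0;
    for (int k = 0; k < 2 * half; k++) {
     v += a[rowOff + i][k] * b[k][colOff + j];
    }
    c[rowOff + i][colOff + j] = v;
   }
  }
 }

 public static long[][] mult(final long[][] a, final long[][] b) {
  final int n = a.length;
  final int half = n / 2;
  final long[][] c = new long[n][n];
  Thread[] threads = new Thread[4];
  int idx = 0;
  for (int bi = 0; bi < 2; bi++) {
   for (int bj = 0; bj < 2; bj++) {
    final int rowOff = bi * half;
    final int colOff = bj * half;
    threads[idx] = new Thread(new Runnable() {
     public void run() {
      multQuadrant(a, b, c, rowOff, colOff, half);
     }
    });
    threads[idx++].start();
   }
  }
  for (Thread t : threads) {
   try {
    t.join();
   } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
   }
  }
  return c;
 }
}
```

For larger matrices the same fan-out can be applied at the top level of the recursion, which is what the threaded version does.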

Originally I was considering writing this in Go, but Go's syntax is not comfortable for me, so I decided to do it in Java first.
In a sense, the key point in implementing a fast algorithm in Java is to avoid creating intermediate objects, since otherwise GC eventually takes more time than the calculation it saves.
So what we need to do is write an objectless object-oriented program.
I think this is an interesting subject, and it is somewhat related to Go's object-oriented style.
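As a tiny illustration of that point (my own toy example, not the post's code): the two sums below compute the same value, but the boxed version allocates a fresh Long on nearly every step once the value leaves the small-integer cache. That is exactly the kind of short-lived garbage an object-per-element matrix representation generates.

```java
public class BoxedVsPrimitive {

 // Boxed accumulator: "acc = acc + i" unboxes, adds, then autoboxes the
 // result; outside the Long cache (-128..127) each step allocates a new Long.
 static long boxedSum(int n) {
  Long acc = 0L;
  for (int i = 0; i < n; i++) {
   acc = acc + i;
  }
  return acc;
 }

 // Primitive accumulator: same arithmetic, zero allocations in the loop.
 static long primitiveSum(int n) {
  long acc = 0;
  for (int i = 0; i < n; i++) {
   acc += i;
  }
  return acc;
 }
}
```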

The first implementation in Java was really object oriented. Although I could have used a more extreme style to avoid casting, this level of implementation is the most common and the easiest to compare with other languages whose generics are not as strong (or not supported). Later I may add the code of such an extreme version as well.

This first implementation follows the idea of the original algorithm closely.
Namely, we treat each inner matrix block as an element of the coefficient ring of the matrix (a 2x2 matrix over a coefficient ring R).
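That block view can be checked concretely. Below is a minimal sketch (the helper names `block` and `multViaBlocks` are mine): a 4x4 product computed as a 2x2 product whose "scalars" are 2x2 blocks, verified against the direct product.

```java
public class BlockView {
 // Direct n x n product.
 static long[][] mult(long[][] a, long[][] b) {
  int n = a.length;
  long[][] c = new long[n][n];
  for (int i = 0; i < n; i++)
   for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++)
     c[i][j] += a[i][k] * b[k][j];
  return c;
 }

 // Copy the (bi,bj) quadrant of m into a half-sized matrix.
 static long[][] block(long[][] m, int bi, int bj) {
  int h = m.length / 2;
  long[][] r = new long[h][h];
  for (int i = 0; i < h; i++)
   for (int j = 0; j < h; j++)
    r[i][j] = m[bi * h + i][bj * h + j];
  return r;
 }

 static long[][] add(long[][] a, long[][] b) {
  int n = a.length;
  long[][] c = new long[n][n];
  for (int i = 0; i < n; i++)
   for (int j = 0; j < n; j++)
    c[i][j] = a[i][j] + b[i][j];
  return c;
 }

 // 4x4 product via the 2x2-matrix-over-blocks formula:
 // C[bi][bj] = A[bi][0]*B[0][bj] + A[bi][1]*B[1][bj], blockwise.
 static long[][] multViaBlocks(long[][] a, long[][] b) {
  int n = a.length, h = n / 2;
  long[][] c = new long[n][n];
  for (int bi = 0; bi < 2; bi++)
   for (int bj = 0; bj < 2; bj++) {
    long[][] q = add(mult(block(a, bi, 0), block(b, 0, bj)),
                     mult(block(a, bi, 1), block(b, 1, bj)));
    for (int i = 0; i < h; i++)
     for (int j = 0; j < h; j++)
      c[bi * h + i][bj * h + j] = q[i][j];
   }
  return c;
 }
}
```

Applying the same split recursively to each block is what yields the recursive algorithm.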


package org.zen;

import java.util.Random;

public class Matrix_v3 {


 public static interface IRingClass {
  IRing create();
  
  IRing zero();

  IRing unit();
 }

 public static interface IRing {
  IRing add(IRing r);

  IRing subtract(IRing r);

  IRing mult(IRing r);
 }

 public static interface IMatrix extends IRing {
  int dim();

  IRing get(int row, int column);

  void set(int row, int column, IRing r);
 }

 public static class IntRingClass implements IRingClass {
  static final IRing zero = new IntRing(0);
  static final IRing unit = new IntRing(1);

  @Override
  public IRing create() {
   return new IntRing(0);
  }

  public IRing create(int i) {
   return new IntRing(i);
  }

  @Override
  public IRing zero() {
   return zero;
  }

  @Override
  public IRing unit() {
   return unit;
  }
 }

 public static class IntRing implements IRing {
  final int i;

  public IntRing(int i) {
   this.i = i;
  }

  @Override
  public IRing add(IRing r) {
   IntRing ir = (IntRing) r;
   return new IntRing(i + ir.i);
  }

  @Override
  public IRing subtract(IRing r) {
   IntRing ir = (IntRing) r;
   return new IntRing(i - ir.i);
  }

  @Override
  public IRing mult(IRing r) {
   IntRing ir = (IntRing) r;
   return new IntRing(i * ir.i);
  }
  
  @Override
  public String toString() {
   return ""+i;
  }

  @Override
  public boolean equals(Object obj) {
   // value equality, so element-wise matrix comparison works
   return (obj instanceof IntRing) && ((IntRing) obj).i == i;
  }

  @Override
  public int hashCode() {
   return i;
  }
 }

 public static abstract class AbstractMatrixClass implements IRingClass {
  protected final IRingClass coefficientRing;
  protected final int dim;

  protected AbstractMatrix zero = null;
  protected AbstractMatrix unit = null;

  public AbstractMatrixClass(IRingClass coefficientRing, int dim) {
   this.coefficientRing = coefficientRing;
   this.dim = dim;
  }
  
  void init() {
   this.zero = (AbstractMatrix)create();
   this.unit = (AbstractMatrix)create();
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     zero.set(i, j, coefficientRing.zero());
     unit.set(i, j, (i == j) ? coefficientRing.unit(): coefficientRing.zero());
    }
   }
  }

  @Override
  public IRing zero() {
   if (zero == null) {
    init();
   }
   return zero;
  }

  @Override
  public IRing unit() {
   if (unit == null) {
    init();
   }
   return unit;
  }

 }

 public static abstract class AbstractMatrix implements IMatrix {
  
  final AbstractMatrixClass _matrixClass;
  
  public AbstractMatrix(final AbstractMatrixClass _matrixClass) {
   this._matrixClass = _matrixClass;
  }

  @Override
  public int dim() {
   return _matrixClass.dim;
  }

  @Override
  public IRing add(IRing m) {
   AbstractMatrix m0 = (AbstractMatrix) this;
   AbstractMatrix m1 = (AbstractMatrix) m;
   AbstractMatrix m2 = (AbstractMatrix)_matrixClass.create();
   for (int i = 0; i < dim(); i++) {
    for (int j = 0; j < dim(); j++) {
     IRing r = m0.get(i, j).add(m1.get(i, j));
     m2.set(i, j, r);
    }
   }
   return m2;
  }

  @Override
  public IRing subtract(IRing m) {
   AbstractMatrix m0 = (AbstractMatrix) this;
   AbstractMatrix m1 = (AbstractMatrix) m;
   AbstractMatrix m2 = (AbstractMatrix)_matrixClass.create();
   for (int i = 0; i < dim(); i++) {
    for (int j = 0; j < dim(); j++) {
     IRing r = m0.get(i, j).subtract(m1.get(i, j));
     m2.set(i, j, r);
    }
   }
   return m2;
  }

  @Override
  public IRing mult(IRing m) {
   AbstractMatrix m0 = (AbstractMatrix) this;
   AbstractMatrix m1 = (AbstractMatrix) m;
   AbstractMatrix m2 = (AbstractMatrix)_matrixClass.create();
   for (int i = 0; i < dim(); i++) {
    for (int j = 0; j < dim(); j++) {
     IRing r = _matrixClass.coefficientRing.zero();
     for (int k = 0; k < dim(); k++) {
      r = r.add(m0.get(i, k).mult(m1.get(k, j)));
     }
     m2.set(i, j, r);
    }
   }
   return m2;
  }

  protected void rangeCheck(int row, int column) {
   if (row < 0 || row >= dim()) {
    throw new RuntimeException();
   }
   if (column < 0 || column >= dim()) {
    throw new RuntimeException();
   }
  }
 }

 public static class SimpleMatrixClass extends AbstractMatrixClass {
  
  public SimpleMatrixClass(final IRingClass coefficientRing, int dim) {
   super(coefficientRing, dim);
  }

  @Override
  public AbstractMatrix create() {
   return new SimpleMatrix(this);
  }
  
  public static SimpleMatrix create(int[][] matrix) {
   int dim = matrix.length;
    SimpleMatrixClass simpleMatrixClass = new SimpleMatrixClass(new IntRingClass(), dim);
    SimpleMatrix sm = new SimpleMatrix(simpleMatrixClass);
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     sm.set(i, j, new IntRing(matrix[i][j]));
    }
   }
   return sm;
  }
  
  public SimpleMatrix randomMatrix() {
   int[][] matrix = _randomMatrix(dim);
   SimpleMatrix sm = new SimpleMatrix(this);
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     sm.set(i, j, new IntRing(matrix[i][j]));
    }
   }
   return sm;
  }  
 }
 
 public static class SimpleMatrix extends AbstractMatrix {
  protected final IRing[][] _matrix;

  public SimpleMatrix(final AbstractMatrixClass _matrixClass) {
   super(_matrixClass);
   _matrix = new IRing[_matrixClass.dim][_matrixClass.dim];
   for (int i = 0; i < _matrixClass.dim; i++) {
    for (int j = 0; j < _matrixClass.dim; j++) {
     _matrix[i][j] = _matrixClass.coefficientRing.create();
    }
   }
  }
  public SimpleMatrix(final AbstractMatrixClass _matrixClass, IRing[][] _matrix) {
   super(_matrixClass);
   this._matrix = _matrix;
  }

  @Override
  public IRing get(int row, int column) {
   rangeCheck(row, column);
   return _matrix[row][column];
  }

  @Override
  public void set(int row, int column, IRing r) {
   rangeCheck(row, column);
   _matrix[row][column] = r;
  }
  
  @Override
  public String toString() {
   int dim = _matrix.length;
   StringBuilder sb = new StringBuilder();
   sb.append("(\n");
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     String s = _matrix[i][j].toString();
     if (j != 0) {
      sb.append(", ");
     }
     sb.append(s);
    }
    sb.append("\n");
   }
   sb.append(")\n");
   return sb.toString();
  }
  
  @Override
  public boolean equals(Object obj) {
   if (obj instanceof SimpleMatrix) {
    SimpleMatrix sm = (SimpleMatrix)obj;
    int dim = sm.dim();
    if (dim() != dim) {
     return false;
    }
    for (int i = 0; i < dim; i++) {
     for (int j = 0; j < dim; j++) {
       if (!_matrix[i][j].equals(sm._matrix[i][j])) return false;
     }
    }
    return true;
   } else {
    return false;
   }
  }
 }

 static Random rand = new Random(); 
 static int[][] _randomMatrix(int dim) {
  final int[][] matrix = new int[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = rand.nextInt(50);
   }
  }
  return matrix;
 }
 static int[][] _zeroMatrix(int dim) {
  final int[][] matrix = new int[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = 0;
   }
  }
  return matrix;
 }
 
 public static class RecMatrixClass extends SimpleMatrixClass {
  int flat_dim;
  public RecMatrixClass(int flat_dim) {
   // validate before the super() call recurses; flat_dim < 2 would otherwise recurse forever
   super((checkDim(flat_dim) == 2)?new IntRingClass():new RecMatrixClass(flat_dim >> 1), 2);
   this.flat_dim = flat_dim;
  }
  private static int checkDim(int flat_dim) {
   if (flat_dim < 2) {
    throw new RuntimeException(">> RecMatrixClass flat_dim: "+flat_dim);
   }
   return flat_dim;
  }
  
  public RecMatrix randomMatrix() {
   return new RecMatrix(this, _randomMatrix(flat_dim), flat_dim, 0, 0);
  }  
  
  @Override
  public AbstractMatrix create() {
   return new RecMatrix(this, new int[flat_dim][flat_dim], flat_dim, 0, 0);
  }

  public AbstractMatrix create(int[][] matrix, int dim, int row_index, int column_index) {
   return new RecMatrix(this, matrix, dim, row_index, column_index);
  }

 }
 
 public static class RecMatrix extends SimpleMatrix {
  public RecMatrix(final AbstractMatrixClass _matrixClass, int[][] matrix, int dim, int row_index, int column_index) {
   super(_matrixClass);
   
   if (_matrixClass.coefficientRing instanceof SimpleMatrixClass) {
    RecMatrixClass rmCls = (RecMatrixClass)_matrixClass.coefficientRing;
    for (int i = 0; i < 2; i++) {
     for (int j = 0; j < 2; j++) {
      final int dim0 = dim >> 1;
      final int row_index0 = (i == 0)?row_index:row_index|dim0;
      final int column_index0 = (j == 0)?column_index:column_index|dim0;
      set(i, j, rmCls.create(matrix, dim0, row_index0, column_index0));
     }
    }
   } else if (_matrixClass.coefficientRing instanceof IntRingClass) {
    IntRingClass irCls = (IntRingClass)_matrixClass.coefficientRing;
    for (int i = 0; i < 2; i++) {
     for (int j = 0; j < 2; j++) {
      set(i, j, irCls.create(matrix[row_index|i][column_index|j]));
     }
    }
   }
  }

  public String toString() {
   int[][] rep = getMatrixRep();
   int dim = rep.length;
   StringBuilder sb = new StringBuilder();
   sb.append("(\n");
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     int v = rep[i][j];
     if (j != 0) {
      sb.append(", ");
     }
     sb.append(v);
    }
    sb.append("\n");
   }
   sb.append(")\n");
   return sb.toString();
  }
  
  public int[][] getMatrixRep() {
   int flat_dim = ((RecMatrixClass)_matrixClass).flat_dim;
   int[][] matrix = new int[flat_dim][flat_dim];
   setMatrixRep(matrix, flat_dim, 0, 0);
   return matrix;
  }
  
  void setMatrixRep(int[][] matrix, int dim, int row_index, int column_index) {
   for (int i = 0; i < _matrixClass.dim; i++) {
    for (int j = 0; j < _matrixClass.dim; j++) {
     IRing ring = get(i, j);
     if (ring instanceof RecMatrix) {
      final int dim0 = dim >> 1;
      final int row_index0 = (i == 0)?row_index:row_index|dim0;
      final int column_index0 = (j == 0)?column_index:column_index|dim0;
      ((RecMatrix)ring).setMatrixRep(matrix, dim0, row_index0, column_index0);
     } else if (ring instanceof IntRing) {
      IntRing intRing = (IntRing)ring;
      matrix[row_index+i][column_index+j] = intRing.i;
     } else {
      throw new RuntimeException(">>  setMatrixRep unsupported ring: "+ring);
     }
    }
   }
  }
 }
 
 //
 // Tests
 //
 public static long simple_matrix_test1(int dim) {
  SimpleMatrixClass simpleMatrixClass = new SimpleMatrixClass(new IntRingClass(), dim);
  SimpleMatrix sm1 = simpleMatrixClass.randomMatrix();
  SimpleMatrix sm2 = simpleMatrixClass.randomMatrix();
  SimpleMatrix sm3 = (SimpleMatrix)sm1.mult(sm2);
  
  long end = System.currentTimeMillis();
  /*
  System.out.println("sm1: "+sm1);
  System.out.println("sm2: "+sm2);
  System.out.println("sm3: "+sm3);
  */
  return end;
 }
 public static long rec_matrix_test1(int dim) {
  RecMatrixClass recMatrixClass = new RecMatrixClass(dim);
  RecMatrix rm1 = recMatrixClass.randomMatrix();
  RecMatrix rm2 = recMatrixClass.randomMatrix();
  RecMatrix rm3 = (RecMatrix)rm1.mult(rm2);
  
  long end = System.currentTimeMillis();
  /*
  System.out.println("rm1: "+rm1);
  System.out.println("rm2: "+rm2);
  System.out.println("rm3: "+rm3);
  */
  return end;
 }
 
 public static void simple_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = simple_matrix_test1(dim);
   System.out.println(">> dim: "+dim+", elapsed time: "+(endTime - startTime)+" milli sec");
  }
 }
 public static void rec_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = rec_matrix_test1(dim);
   System.out.println(">> dim: "+dim+", elapsed time: "+(endTime - startTime)+" milli sec");
  }
 }
 
 public static void main(String[] args) {
  int max = 8;
  //simple_test_loop(max);
  rec_test_loop(max);
  
 }
}

Here are the execution results:
[simple]
>> dim: 2, elapsed time: 2 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 1 milli sec
>> dim: 16, elapsed time: 2 milli sec
>> dim: 32, elapsed time: 6 milli sec
>> dim: 64, elapsed time: 13 milli sec
>> dim: 128, elapsed time: 31 milli sec
>> dim: 256, elapsed time: 227 milli sec

------------------
[rec]
>> dim: 2, elapsed time: 3 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 2 milli sec
>> dim: 16, elapsed time: 32 milli sec
>> dim: 32, elapsed time: 378 milli sec
>> dim: 64, elapsed time: 494 milli sec
>> dim: 128, elapsed time: 1620 milli sec
>> dim: 256, elapsed time: 13757 milli sec

The main point of this approach is that there is no special implementation of the multiplication method for the recursive matrix; the algorithm is reflected in how the coefficient ring elements are created recursively.
But the performance is quite low.

Next, I will show a new version which improves performance significantly, and demonstrates the effectiveness of using 4 threads (indeed about 4 times faster).