Monday, November 25, 2013

Java Sumatra project for OpenCL

I found an interesting article about Java's Sumatra project.
http://pc.watch.impress.co.jp/docs/column/kaigai/20130826_612591.html

I was interested in using OpenCL in Java three years ago; at that time there was no such project. But two years ago a solid effort, Project Sumatra, was apparently started by Oracle/AMD to support OpenCL in Java.
OpenCL is a very important standard for supporting heterogeneous hardware environments including GPUs, CPUs, etc. Apple's OS X has been based on LLVM since Snow Leopard and supports OpenCL, so this move by Java actually follows Apple.

Project Sumatra

The primary goal of this project is to enable Java applications to take advantage of graphics processing units (GPUs) and accelerated processing units (APUs)--whether they are discrete devices or integrated with a CPU--to improve performance.
We expect the initial focus to be on the Hotspot JVM, enabling code generation, runtime support and garbage collection on GPUs. Once basic capabilities are in place in the JVM, we will also examine how to best expose GPU support to application and/or library developers, initially leveraging the new Java 8 Lambda language and library features. As the project progresses, we may identify challenges with the Java API and constructs which may lead to new language, JVM or library extensions that will need standardization under the JCP process. While the Java language is an obvious focus, we also anticipate that this project will provide guidance for other JVM-hosted languages such as JavaScript/Nashorn, Scala and JRuby.
You can follow the project and get involved on the sumatra-dev mailing list.

Related projects and material

This Project is sponsored by the HotSpot Group.

This is quite an interesting direction; Java 8's support for lambdas (closures) also seems tied to this approach to parallel computing based on GPGPU and OpenCL.

I installed the two projects below on Windows, but it looks like Intel's OpenCL SDK only supports Windows (except for Xeon). To use this on Linux, we may need to install AMD's SDK instead. It seems usable with the integrated graphics core in Intel CPUs (like the i5).

1) https://code.google.com/p/aparapi/
2a) http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
2b) http://software.intel.com/en-us/vcsource/tools/opencl

These are other related projects.

http://openjdk.java.net/groups/hotspot/

http://openjdk.java.net/projects/mlvm/index.html

http://download.java.net/openjdk/jdk7/

Here is a quote from the JDK 8 project; it actually seems related to Project Sumatra.
 http://mreinhold.org/blog/jdk8-preview

JDK 8 Developer Preview
2013/09/09 10:39:26 -07:00 
The JDK 8 Developer Preview (a.k.a. Milestone 8) builds are now available!
This milestone is intended for broad testing by developers. We've run all tests on all Oracle-supported platforms and haven't found any glaring issues. We've also fixed many of the bugs discovered since we reached the Feature Complete milestone back in June.
The principal feature of this release is Project Lambda (JSR 335), which aims to make it easier to write code for multicore processors. It adds lambda expressions, default methods, and method references to the Java programming language, and extends the libraries to support parallelizable operations upon streamed data.
There are, of course, many other new features, including a new Date and Time API (JSR 310), Compact Profiles, the Nashorn JavaScript Engine, and even some anti-features such as the removal of the Permanent Generation from the HotSpot virtual machine. A complete list of new features is available on the JDK 8 Features page.
If you've been watching JDK 8 evolve from afar then now is an excellent time to download a build and try it out; the sooner the better! Let us know if your existing code doesn't compile and run correctly on JDK 8, if it runs slower than before, if it crashes the JVM, or if there are any remaining design issues in the new language and API features.
We'll do our best to read, evaluate, and act on all feedback received via the usual bug-reporting channel between now and the end of October. After that we'll gradually ramp down the rate of change in order to stabilize the code, so bugs reported later on might not get fixed in time for the GA release.
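As a small illustration of the "parallelizable operations upon streamed data" mentioned in the quote, here is a minimal Java 8 sketch (the class name and numbers are my own, not from the JDK announcement):

```java
import java.util.stream.LongStream;

public class ParallelSum {
    public static void main(String[] args) {
        // Sequential reduction: sum of squares 1..1,000,000.
        long seq = LongStream.rangeClosed(1, 1_000_000).map(x -> x * x).sum();

        // The same pipeline run in parallel on the common fork/join pool.
        // Sumatra's longer-term idea is that data-parallel lambdas like this
        // could be offloaded to a GPU instead of CPU threads.
        long par = LongStream.rangeClosed(1, 1_000_000).parallel().map(x -> x * x).sum();

        System.out.println(seq == par); // prints true: same result either way
    }
}
```

The parallel() call changes only the execution strategy, not the result, which is exactly the property a GPU offload would rely on.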
Other Java 8 related info:

http://www.infoq.com/news/2013/06/jdk8-almost-feature-complete

https://blogs.oracle.com/thejavatutorials/entry/jdk_8_documentation_developer_preview

This kind of direction should be considered in Dart as well.
It will affect both client and server. Since JavaScript seems to be moving toward supporting an OpenGL-like architecture, Dart needs to handle it too. But the server side would be a more natural fit for this kind of high-performance computation.

In fact, I was surprised to learn that Java 8 had shifted in this direction. Previously it was aiming to support modules, etc., but the focus seems to have changed quite a lot.

Wednesday, November 20, 2013

Dart vs Java performance comparison with Recursive Matrix Multiplication (Part 3)

I ported the last Java implementation of recursive matrix multiplication to Dart.
It is interesting to see the actual performance difference between the two languages.
Such a comparison will depend on the type of application and the level of expected performance.

As we have seen, Java allows a considerably fast implementation of recursive matrix multiplication; it would also be interesting to compare the performance with C++.
Its roughly 1.6 gigaflops (about 1 billion multiplications in 0.6 seconds) is definitely assembler-level performance.

The following is the result of the performance measurement of Dart.
From this, Dart is about 4 times slower than the single-threaded version of Java, and about 13 times slower than the 4-thread (4-core) version.
A 4x difference is, in a sense, close to Java. However, Dart does not allow sharing memory among isolates, so even if we use isolates it will be more costly to copy matrix elements between them; still, it would be interesting to measure the performance using isolates.
Even though Dart does not support raw arrays, if the List representation internally uses an array (by delegating to native code), it should not cause a major performance difference. Another likely reason is the representation of int: we might see a similar result in Java if we replaced long with BigInteger.
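As a rough illustration of that last point, here is a small Java comparison of primitive long arithmetic versus BigInteger (my own sketch; it is only an analogy for Dart's arbitrary-precision int, not a model of its actual implementation):

```java
import java.math.BigInteger;

public class LongVsBigInteger {
    public static void main(String[] args) {
        final int n = 1_000_000;

        long t0 = System.currentTimeMillis();
        long acc = 0;
        for (int i = 1; i <= n; i++) {
            acc += (long) i * i;            // primitive multiply-add, no allocation
        }
        long t1 = System.currentTimeMillis();

        BigInteger bacc = BigInteger.ZERO;
        for (int i = 1; i <= n; i++) {
            BigInteger b = BigInteger.valueOf(i);
            bacc = bacc.add(b.multiply(b)); // allocates a new object on every step
        }
        long t2 = System.currentTimeMillis();

        System.out.println("long: " + (t1 - t0) + " ms, BigInteger: " + (t2 - t1) + " ms");
        System.out.println(bacc.equals(BigInteger.valueOf(acc))); // prints true
    }
}
```

On a typical JVM the BigInteger loop is several times slower, mostly due to per-step object allocation; exact figures vary by machine.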

[Dart]
rec_test_loop: 12
>> dim: 2, elapsed time: 5 milli sec, dim^3; 8, ration(time/dim^3)=625000.0
>> dim: 4, elapsed time: 0 milli sec, dim^3; 64, ration(time/dim^3)=0.0
>> dim: 8, elapsed time: 1 milli sec, dim^3; 512, ration(time/dim^3)=1953.125
>> dim: 16, elapsed time: 32 milli sec, dim^3; 4096, ration(time/dim^3)=7812.5
>> dim: 32, elapsed time: 4 milli sec, dim^3; 32768, ration(time/dim^3)=122.0703125
>> dim: 64, elapsed time: 5 milli sec, dim^3; 262144, ration(time/dim^3)=19.073486328125
>> dim: 128, elapsed time: 16 milli sec, dim^3; 2097152, ration(time/dim^3)=7.62939453125
>> dim: 256, elapsed time: 119 milli sec, dim^3; 16777216, ration(time/dim^3)=7.092952728271484
>> dim: 512, elapsed time: 1239 milli sec, dim^3; 134217728, ration(time/dim^3)=9.231269359588623
>> dim: 1024, elapsed time: 8027 milli sec, dim^3; 1073741824, ration(time/dim^3)=7.475726306438446
>> dim: 2048, elapsed time: 60587 milli sec, dim^3; 8589934592, ration(time/dim^3)=7.0532551035285
>> dim: 4096, elapsed time: 484160 milli sec, dim^3; 68719476736, ration(time/dim^3)=7.045455276966095

[Java with single thread]
>> dim: 2, elapsed time: 1 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 0 milli sec
>> dim: 32, elapsed time: 5 milli sec
>> dim: 64, elapsed time: 15 milli sec
>> dim: 128, elapsed time: 5 milli sec
>> dim: 256, elapsed time: 30 milli sec
>> dim: 512, elapsed time: 217 milli sec
>> dim: 1024, elapsed time: 1915 milli sec
>> dim: 2048, elapsed time: 14691 milli sec
>> dim: 4096, elapsed time: 115860 milli sec

[Java with 4 threads(4 cores)]
>> dim: 2, elapsed time: 0.0 milli sec, dim^p; 8, ration(time/dim^3)=0
>> dim: 4, elapsed time: 0.0 milli sec, dim^p; 64, ration(time/dim^3)=0
>> dim: 8, elapsed time: 1.0 milli sec, dim^p; 512, ration(time/dim^3)=1,953.125
>> dim: 16, elapsed time: 1.0 milli sec, dim^p; 4,096, ration(time/dim^3)=244.141
>> dim: 32, elapsed time: 3.0 milli sec, dim^p; 32,768, ration(time/dim^3)=91.553
>> dim: 64, elapsed time: 17.0 milli sec, dim^p; 262,144, ration(time/dim^3)=64.85
>> dim: 128, elapsed time: 4.0 milli sec, dim^p; 2,097,152, ration(time/dim^3)=1.907
>> dim: 256, elapsed time: 18.0 milli sec, dim^p; 16,777,216, ration(time/dim^3)=1.073
>> dim: 512, elapsed time: 105.0 milli sec, dim^p; 134,217,728, ration(time/dim^3)=0.782
>> dim: 1024, elapsed time: 602.0 milli sec, dim^p; 1,073,741,824, ration(time/dim^3)=0.561
>> dim: 2048, elapsed time: 4771.0 milli sec, dim^p; 8,589,934,592, ration(time/dim^3)=0.555
>> dim: 4096, elapsed time: 38704.0 milli sec, dim^p; 68,719,476,736, ration(time/dim^3)=0.563

Here again is a puzzling result for the performance degradation of simple matrix multiplication in Dart: until dim 512 the simple version is faster than the recursive version, but after that the simple version becomes slower, and at dim 4096 it is 3.3 times slower.

[Simple Matrix]
>> dim: 2, elapsed time: 4 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 4 milli sec
>> dim: 32, elapsed time: 6 milli sec
>> dim: 64, elapsed time: 2 milli sec
>> dim: 128, elapsed time: 7 milli sec
>> dim: 256, elapsed time: 55 milli sec
>> dim: 512, elapsed time: 483 milli sec
>> dim: 1024, elapsed time: 12269 milli sec
>> dim: 2048, elapsed time: 118535 milli sec
>> dim: 4096, elapsed time: 1597558 milli sec

[Recursive Matrix]
>> dim: 2, elapsed time: 5 milli sec, dim^3; 8, ration(time/dim^3)=625000.0
>> dim: 4, elapsed time: 0 milli sec, dim^3; 64, ration(time/dim^3)=0.0
>> dim: 8, elapsed time: 1 milli sec, dim^3; 512, ration(time/dim^3)=1953.125
>> dim: 16, elapsed time: 32 milli sec, dim^3; 4096, ration(time/dim^3)=7812.5
>> dim: 32, elapsed time: 4 milli sec, dim^3; 32768, ration(time/dim^3)=122.0703125
>> dim: 64, elapsed time: 5 milli sec, dim^3; 262144, ration(time/dim^3)=19.073486328125
>> dim: 128, elapsed time: 16 milli sec, dim^3; 2097152, ration(time/dim^3)=7.62939453125
>> dim: 256, elapsed time: 119 milli sec, dim^3; 16777216, ration(time/dim^3)=7.092952728271484
>> dim: 512, elapsed time: 1239 milli sec, dim^3; 134217728, ration(time/dim^3)=9.231269359588623
>> dim: 1024, elapsed time: 8027 milli sec, dim^3; 1073741824, ration(time/dim^3)=7.475726306438446
>> dim: 2048, elapsed time: 60587 milli sec, dim^3; 8589934592, ration(time/dim^3)=7.0532551035285
>> dim: 4096, elapsed time: 484160 milli sec, dim^3; 68719476736, ration(time/dim^3)=7.045455276966095

The following is the Dart code used for this benchmark.

library rec_matrix_v6;

import "dart:math";

Random rand = new Random(); 

List<List<int>> _randomMatrix(int dim) {
  //final List<List<int>> matrix = new int[dim][dim];
  final List<List<int>> matrix = new List(dim);
  for (int i = 0; i < dim; i++) {
    matrix[i] = new List(dim);
    for (int j = 0; j < dim; j++) {
      matrix[i][j] = rand.nextInt(50);
    }
  }
  return matrix;
}
List<List<int>> _zeroMatrix(int dim) {
  final List<List<int>> matrix = new List(dim);
  for (int i = 0; i < dim; i++) {
    matrix[i] = new List(dim);
    for (int j = 0; j < dim; j++) {
      matrix[i][j] = 0;
    }
  }
  return matrix;
}

//
// AbstractMatrix
//
abstract class AbstractMatrix {
  int _dim;
  List<List<int>> _matrix;
  
  AbstractMatrix(List<List<int>> matrix) {
    this._dim = matrix.length;
    this._matrix = matrix;
  }
  
  int dim() {
    return _dim;
  }

  SimpleMatrix add(SimpleMatrix m) {
    List<List<int>> m0 = _matrix;
    List<List<int>> m1 = m._matrix;
    List<List<int>> m2 = new List(_dim);
    for (int i = 0; i < _dim; i++) {
      m2[i] = new List(_dim);
      for (int j = 0; j < _dim; j++) {
        m2[i][j] = m0[i][j]+m1[i][j];
      }
    }
    return new SimpleMatrix(m2);
  }

  SimpleMatrix subtract(SimpleMatrix m) {
    List<List<int>> m0 = _matrix;
    List<List<int>> m1 = m._matrix;
    List<List<int>> m2 = new List(_dim);
    for (int i = 0; i < _dim; i++) {
      m2[i] = new List(_dim);
      for (int j = 0; j < _dim; j++) {
        m2[i][j] = m0[i][j]-m1[i][j];
      }
    }
    return new SimpleMatrix(m2);
  }
  
  String toString() {
    return toString1(_matrix);
  }
  
  String toString1(List<List<int>> matrix) {
    int dim = matrix.length;
    StringBuffer sb = new StringBuffer();
    sb.write("(\n");
    for (int i = 0; i < dim; i++) {
      for (int j = 0; j < dim; j++) {
        int v = matrix[i][j];
        if (j != 0) {
          sb.write(", ");
        }
        sb.write(v);
      }
      sb.write("\n");
    }
    sb.write(")\n");
    return sb.toString();
  }
  
  bool equals(Object obj) {
    if (obj is SimpleMatrix) {
      SimpleMatrix m = obj;
      if (_dim != m._dim) {
        return false;
      }
      List<List<int>>matrix1 = m._matrix;
      for (int i = 0; i < _dim; i++) {
        for (int j = 0; j < _dim; j++) {
          if (_matrix[i][j] != matrix1[i][j]) {
            return false;
          }
        }
      }
      return true;
    } else {
      return false;
    }
  }
}

//
// SimpleMatrix
//
class SimpleMatrix extends AbstractMatrix {
  SimpleMatrix(List<List<int>> matrix): super(matrix) {}
  
  SimpleMatrix mult(SimpleMatrix m) {
    List<List<int>> m0 = _matrix;
    List<List<int>> m1 = m._matrix;
    List<List<int>> m2 = new List(_dim);
    for (int i = 0; i < _dim; i++) {
      m2[i] = new List(_dim);
      for (int j = 0; j < _dim; j++) {
        int v = 0;
        for (int k = 0; k < _dim; k++) {
          v += m0[i][k]*m1[k][j];
        }
        m2[i][j] = v;
      }
    }
    return new SimpleMatrix(m2);
  }
}

//
// RecMatrix
//
class RecMatrix extends AbstractMatrix {
  RecMatrix(List<List<int>> matrix): super(matrix) {}
  
  static RecMatrix randomMatrix(int dim) {
    List<List<int>> matrix = _randomMatrix(dim);
    return new RecMatrix(matrix);
  }

  RecMatrix mult(RecMatrix m) {
    RecMatrix m2 = new RecMatrix(_zeroMatrix(_dim));
    m2.mult0(_dim, _matrix, 0, 0, m._matrix, 0, 0);
    return m2;
  }
  
  void mult0(final int dim, final List<List<int>> m1, final int row_index1, final int column_index1, final List<List<int>> m2, final int row_index2, final int column_index2) {
    if (dim == 2) {
      for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
          int i1 = row_index1 | i;
          int j2 = column_index2 | j;
          int v = 0;
          for (int k = 0; k < 2; k++) {
            int j1 = column_index1 | k;
            int i2 = row_index2 | k;
            v += m1[i1][j1]*m2[i2][j2];
          }
          _matrix[i1][j2] += v;
        }
      }
    } else {
      final int dim0 = dim >> 1;
      for (int i = 0; i < 2; i++) {
        final int r_idx1 = (i == 0) ? row_index1 : (row_index1 | dim0);
        for (int j = 0; j < 2; j++) {
          final int c_idx2 = (j == 0) ? column_index2: (column_index2 | dim0);
          for (int k = 0; k < 2; k++) {
            final int c_idx1 = (k == 0) ? column_index1: (column_index1 | dim0);
            final int r_idx2 = (k == 0) ? row_index2 : (row_index2 | dim0);
            mult0(dim0, m1, r_idx1, c_idx1, m2, r_idx2, c_idx2);
          }
        }
      }
    }
  }
}

//
// Tests
//

// simple_matrix_test1
SimpleMatrix simple_matrix_test1(List<List<int>> matrx1, List<List<int>> matrx2, bool show) {
  SimpleMatrix m1 = new SimpleMatrix(matrx1);
  SimpleMatrix m2 = new SimpleMatrix(matrx2);
  SimpleMatrix m3 = m1.mult(m2);
  if (show) {
    print("m1: ${m1}");
    print("m2: ${m2}");
    print("m3: ${m3}");
  }
  return m3;
}

DateTime simple_matrix_test2(int dim) {
  List<List<int>> matrx1 = _randomMatrix(dim);
  List<List<int>> matrx2 = _randomMatrix(dim);
  
  simple_matrix_test1(matrx1, matrx2, false);

  return new DateTime.now();
}

void simple_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
    dim = dim*2;
    DateTime startTime = new DateTime.now();
    DateTime endTime = simple_matrix_test2(dim);
    Duration du = endTime.difference(startTime);
    print(">> dim: ${dim}, elapsed time: ${du.inMilliseconds} milli sec");
  }
}

// rec_matrix_test1
RecMatrix rec_matrix_test1(List<List<int>> matrx1, List<List<int>> matrx2, bool show) {
  RecMatrix rm1 = new RecMatrix(matrx1);
  RecMatrix rm2 = new RecMatrix(matrx2);
  RecMatrix rm3 = rm1.mult(rm2);
  if (show) {
    print("m1: ${rm1}");
    print("m2: ${rm2}");
    print("m3: ${rm3}");
  }
  return rm3;
}

DateTime rec_matrix_test2(int dim) {
  List<List<int>> matrx1 = _randomMatrix(dim);
  List<List<int>> matrx2 = _randomMatrix(dim);
  
  rec_matrix_test1(matrx1, matrx2, false);
  return new DateTime.now();
}

void rec_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
    dim = dim*2;
    DateTime startTime = new DateTime.now();
    DateTime endTime = rec_matrix_test2(dim);
    Duration duration = endTime.difference(startTime);
    int b_duration = duration.inMilliseconds;
    int dim_p3 = dim*dim*dim;
    double b_r = (b_duration/dim_p3)*1000000;
    print(">> dim: ${dim}, elapsed time: ${b_duration} milli sec, dim^3; ${dim_p3}, ration(time/dim^3)=${b_r}");
  }
}

// verify_matrix_test1
void verify_matrix_test1(int dim, bool show) {
  List<List<int>> matrx1 = _randomMatrix(dim);
  List<List<int>> matrx2 = _randomMatrix(dim);
  
  SimpleMatrix sm3 = simple_matrix_test1(matrx1, matrx2, show);
  RecMatrix rm3 = rec_matrix_test1(matrx1, matrx2, show);
  SimpleMatrix srm3 = new SimpleMatrix(rm3._matrix);
  if (!sm3.equals(srm3)) {
    print("!!verify_matrix_test1: not equals ");
    print("m1: ${new SimpleMatrix(matrx1)}");
    print("m2: ${new SimpleMatrix(matrx2)}");
    print("sm3: ${sm3}");
    print("srm3:${srm3}");
    throw new Exception();
  } else if (show) {
    print("m1: ${new SimpleMatrix(matrx1)}");
    print("m2: ${new SimpleMatrix(matrx2)}");
    print("m3: ${sm3}");
  }
}

void verify_test_loop(int max, bool show) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
    dim = dim*2;
    verify_matrix_test1(dim, show);
  }
}

void test1(int max) {
  /*
  print("verify_test_loop: "+max);
  verify_test_loop(max, false);
  
  print("simple_test_loop: "+max);
  simple_test_loop(max);
   */
  print("rec_test_loop: ${max}");
  rec_test_loop(max);
}

void main() {
  int max = 12;
  test1(max);
  print(">> done");
  //verify_matrix_test1(2);
  //verify_matrix_test1(4);
  //verify_test_loop(max);
  
  //simple_test_loop(max);
  
  //rec_matrix_test1(4);
  //rec_test_loop(max);
}

Tuesday, November 19, 2013

Java implementation of Recursive Matrix Multiplication Algorithm (Part 2)

This new implementation of recursive matrix multiplication produced a few unexpected observations.
The performance of this recursive implementation (without concurrency) is faster than simple matrix multiplication (without recursion)!
I still do not understand the reason, but it may be related to how memory is accessed at a low level.
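My own guess at a low-level explanation is cache locality: in the simple triple loop, the innermost read of m1[k][j] walks down a column and touches a different cache line on almost every step, while the recursive version keeps its work inside small sub-blocks that fit in cache. One common way to test this theory is to reorder the simple loops to i-k-j so the inner loop scans rows sequentially; a sketch (my own code, not from the listings below):

```java
public class IkjMultiply {
    // i-k-j loop order: the innermost loop reads m1[k][j] and writes m2[i][j]
    // sequentially along rows, which is far more cache-friendly than the
    // textbook i-j-k order used by the simple version.
    static long[][] mult(long[][] m0, long[][] m1) {
        int dim = m0.length;
        long[][] m2 = new long[dim][dim];
        for (int i = 0; i < dim; i++) {
            for (int k = 0; k < dim; k++) {
                long a = m0[i][k];   // held in a register across the inner loop
                for (int j = 0; j < dim; j++) {
                    m2[i][j] += a * m1[k][j];
                }
            }
        }
        return m2;
    }

    public static void main(String[] args) {
        long[][] a = {{1, 2}, {3, 4}};
        long[][] b = {{5, 6}, {7, 8}};
        long[][] c = mult(a, b);
        // 1*5+2*7=19, 1*6+2*8=22, 3*5+4*7=43, 3*6+4*8=50
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
    }
}
```

If loop order alone recovers most of the gap, cache behavior (rather than the recursion itself) is the likely explanation.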

As this result shows, at dim 4096 the recursive version is 9 times faster than the simple version.

[simple_test_loop]
>> dim: 2, elapsed time: 0 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 0 milli sec
>> dim: 32, elapsed time: 1 milli sec
>> dim: 64, elapsed time: 6 milli sec
>> dim: 128, elapsed time: 6 milli sec
>> dim: 256, elapsed time: 24 milli sec
>> dim: 512, elapsed time: 206 milli sec
>> dim: 1024, elapsed time: 6478 milli sec
>> dim: 2048, elapsed time: 90698 milli sec
>> dim: 4096, elapsed time: 1023754 milli sec

[rec_matrix_test1]
>> dim: 2, elapsed time: 1 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 0 milli sec
>> dim: 16, elapsed time: 0 milli sec
>> dim: 32, elapsed time: 5 milli sec
>> dim: 64, elapsed time: 15 milli sec
>> dim: 128, elapsed time: 5 milli sec
>> dim: 256, elapsed time: 30 milli sec
>> dim: 512, elapsed time: 217 milli sec
>> dim: 1024, elapsed time: 1915 milli sec
>> dim: 2048, elapsed time: 14691 milli sec
>> dim: 4096, elapsed time: 115860 milli sec

Also, the result I got from the first implementation was so slow that I thought there was no point implementing a fast algorithm in Java. But the new version was surprisingly fast, and no GC is involved during the calculation since no intermediate objects are created. Compared to the previous version, the dim 256 case is 13757/20 = 688 times faster. Also, if we closely examine the unit cost of each multiplication, it is (almost) constant, so performance is roughly linear in the number of multiplications.

[rec]
>> dim: 2, elapsed time: 3 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 2 milli sec
>> dim: 16, elapsed time: 32 milli sec
>> dim: 32, elapsed time: 378 milli sec
>> dim: 64, elapsed time: 494 milli sec
>> dim: 128, elapsed time: 1620 milli sec
>> dim: 256, elapsed time: 13757 milli sec

Further, if we introduce multiple threads (4 threads on 4 cores), the performance is roughly quadrupled; or maybe that is a bit exaggerated, but it is in the 3-4x range.
The number of threads should be close to the number of CPU cores to get maximum performance.

>> dim: 2, elapsed time: 0.0 milli sec, dim^p; 8, ration(time/dim^3)=0
>> dim: 4, elapsed time: 0.0 milli sec, dim^p; 64, ration(time/dim^3)=0
>> dim: 8, elapsed time: 1.0 milli sec, dim^p; 512, ration(time/dim^3)=1,953.125
>> dim: 16, elapsed time: 1.0 milli sec, dim^p; 4,096, ration(time/dim^3)=244.141
>> dim: 32, elapsed time: 3.0 milli sec, dim^p; 32,768, ration(time/dim^3)=91.553
>> dim: 64, elapsed time: 17.0 milli sec, dim^p; 262,144, ration(time/dim^3)=64.85
>> dim: 128, elapsed time: 4.0 milli sec, dim^p; 2,097,152, ration(time/dim^3)=1.907
>> dim: 256, elapsed time: 18.0 milli sec, dim^p; 16,777,216, ration(time/dim^3)=1.073
>> dim: 512, elapsed time: 105.0 milli sec, dim^p; 134,217,728, ration(time/dim^3)=0.782
>> dim: 1024, elapsed time: 602.0 milli sec, dim^p; 1,073,741,824, ration(time/dim^3)=0.561
>> dim: 2048, elapsed time: 4771.0 milli sec, dim^p; 8,589,934,592, ration(time/dim^3)=0.555
>> dim: 4096, elapsed time: 38704.0 milli sec, dim^p; 68,719,476,736, ration(time/dim^3)=0.563
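As noted above, the thread count should be close to the core count; rather than hard-coding 4, the JVM can report the available cores at runtime (a trivial sketch of my own):

```java
public class CoreCount {
    public static void main(String[] args) {
        // Number of logical processors available to the JVM; a reasonable
        // default for the size of a CPU-bound thread pool.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores: " + cores);
    }
}
```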

dim^3 is the actual number of long multiplications, so this took only 0.6 seconds to perform about 1 billion multiplications; not bad. Even if we change long to double, the performance is almost the same.
And this includes the index calculations needed to traverse the matrix locations as well.
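A quick sanity check of that throughput claim, using the dim = 1024 row from the results above:

```java
public class Throughput {
    public static void main(String[] args) {
        long mults = 1024L * 1024 * 1024;   // dim^3 = 1,073,741,824 multiplications
        double seconds = 602.0 / 1000.0;    // measured elapsed time at dim = 1024
        double perSec = mults / seconds;    // about 1.8e9 multiply-adds per second
        System.out.printf("%.2f billion mult/s%n", perSec / 1e9);
    }
}
```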

A matrix of dimension 1000 can express a network of 1000 nodes, so there would be practical applications.
And with this algorithm it is very easy to utilize more cores (4, 16, 256, ...), so it scales with the increase of cores in the future.
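As a concrete example of the network interpretation (entirely my own illustration): if the matrix is the adjacency matrix of a graph, squaring it counts the length-2 paths between each pair of nodes, so fast multiplication directly accelerates this kind of graph analysis.

```java
public class PathCount {
    public static void main(String[] args) {
        // Adjacency matrix of a 3-node path graph with edges 0-1 and 1-2.
        long[][] a = {
            {0, 1, 0},
            {1, 0, 1},
            {0, 1, 0},
        };
        int n = a.length;
        long[][] a2 = new long[n][n];
        // a2 = a * a: entry (i, j) counts length-2 paths from i to j.
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    a2[i][j] += a[i][k] * a[k][j];
        // The only length-2 path from node 0 to node 2 goes via node 1.
        System.out.println(a2[0][2]); // prints 1
    }
}
```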

At this point, I checked a paper on a Strassen algorithm implementation on an NVIDIA GPGPU.

The numbers were actually quite impressive:
at dim = 2048 it took only 48 milliseconds, about 100 times faster than the 4-thread (4-core) version!

BTW, the following is my CPU spec: an Intel i5 with 4 cores and 32 GB RAM.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               1600.000
BogoMIPS:              6935.22
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-3

The following code was used for this test.

package org.zen;

import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.util.Random;

public class Matrix_v6 {
 static Random rand = new Random(); 
 static long[][] _randomMatrix(int dim) {
  final long[][] matrix = new long[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = rand.nextInt(50);
   }
  }
  return matrix;
 }
 static long[][] _zeroMatrix(int dim) {
  final long[][] matrix = new long[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = 0;
   }
  }
  return matrix;
 }
 
 //
 // AbstractMatrix
 //
 public static abstract class AbstractMatrix {
  protected int _dim;
  protected long[][] _matrix;
  
  public AbstractMatrix(long[][] matrix) {
   this._dim = matrix.length;
   this._matrix = matrix;
  }
  
  public int dim() {
   return _dim;
  }

  public SimpleMatrix add(SimpleMatrix m) {
   long[][] m0 = _matrix;
   long[][] m1 = m._matrix;
   long[][] m2 = new long[_dim][_dim];
   for (int i = 0; i < _dim; i++) {
    for (int j = 0; j < _dim; j++) {
     m2[i][j] = m0[i][j]+m1[i][j];
    }
   }
   return new SimpleMatrix(m2);
  }

  public SimpleMatrix subtract(SimpleMatrix m) {
   long[][] m0 = _matrix;
   long[][] m1 = m._matrix;
   long[][] m2 = new long[_dim][_dim];
   for (int i = 0; i < _dim; i++) {
    for (int j = 0; j < _dim; j++) {
     m2[i][j] = m0[i][j]-m1[i][j];
    }
   }
   return new SimpleMatrix(m2);
  }
  
  public String toString() {
   return toString(_matrix);
  }
  
  public static String toString(long[][] matrix) {
   int dim = matrix.length;
   StringBuilder sb = new StringBuilder();
   sb.append("(\n");
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     long v = matrix[i][j];
     if (j != 0) {
      sb.append(", ");
     }
     sb.append(v);
    }
    sb.append("\n");
   }
   sb.append(")\n");
   return sb.toString();
  }
  
  public boolean equals(Object obj) {
   if (obj instanceof SimpleMatrix) {
    SimpleMatrix m = (SimpleMatrix)obj;
    if (_dim != m._dim) {
     return false;
    }
    long[][]matrix1 = m._matrix;
    for (int i = 0; i < _dim; i++) {
     for (int j = 0; j < _dim; j++) {
      if (_matrix[i][j] != matrix1[i][j]) {
       return false;
      }
     }
    }
    return true;
   } else {
    return false;
   }
  }
 }
 
 //
 // SimpleMatrix
 //
 public static class SimpleMatrix extends AbstractMatrix {
  public SimpleMatrix(long[][] matrix) {
   super(matrix);
  }
  
  public SimpleMatrix mult(SimpleMatrix m) {
   long[][] m0 = _matrix;
   long[][] m1 = m._matrix;
   long[][] m2 = new long[_dim][_dim];
   for (int i = 0; i < _dim; i++) {
    for (int j = 0; j < _dim; j++) {
     long v = 0;
     for (int k = 0; k < _dim; k++) {
      v += m0[i][k]*m1[k][j];
     }
     m2[i][j] = v;
    }
   }
   return new SimpleMatrix(m2);
  }
 }

 //
 // RecMatrix
 //
 public static class RecMatrix extends AbstractMatrix {
  public RecMatrix(long[][] matrix) {
   super(matrix);
  }
  
  public static RecMatrix randomMatrix(int dim) {
   long[][] matrix = _randomMatrix(dim);
   return new RecMatrix(matrix);
  }

  public RecMatrix mult(RecMatrix m) {
   RecMatrix m2 = new RecMatrix(new long[_dim][_dim]);
   m2.mult0(_dim, _matrix, 0, 0, m._matrix, 0, 0);
   return m2;
  }
  public void mult0(final int dim, final long[][] m1, final int row_index1, final int column_index1, final long[][] m2, final int row_index2, final int column_index2) {
   if (dim == 2) {
    for (int i = 0; i < 2; i++) {
     int i1 = row_index1 | i;
     for (int j = 0; j < 2; j++) {
      int j2 = column_index2 | j;
      long v = 0;
      for (int k = 0; k < 2; k++) {
       int j1 = column_index1 | k;
       int i2 = row_index2 | k;
       v += m1[i1][j1]*m2[i2][j2];
      }
      _matrix[i1][j2] += v;
     }
    }
   } else if (dim == _dim) {
    Thread[] threads = new Thread[4];
    int idx = 0;
    final int dim0 = dim >> 1;
    for (int i = 0; i < 2; i++) {
     final int r_idx1 = (i == 0) ? row_index1 : (row_index1 | dim0);
     for (int j = 0; j < 2; j++) {
      final int c_idx2 = (j == 0) ? column_index2: (column_index2 | dim0);
      threads[idx] = new Thread(new Runnable() {
       public void run() {
              for (int k = 0; k < 2; k++) {
         final int c_idx1 = (k == 0) ? column_index1: (column_index1 | dim0);
         final int r_idx2 = (k == 0) ? row_index2 : (row_index2 | dim0);
         mult0(dim0, m1, r_idx1, c_idx1, m2, r_idx2, c_idx2);
              }
          }
      });
      threads[idx++].start();
     }
    }
    for (int i = 0; i < threads.length; i++) {
        try {
            threads[i].join();
        } catch (InterruptedException e) {
        }
    }
   } else {
    final int dim0 = dim >> 1;
    for (int i = 0; i < 2; i++) {
     final int r_idx1 = (i == 0) ? row_index1 : (row_index1 | dim0);
     for (int j = 0; j < 2; j++) {
      final int c_idx2 = (j == 0) ? column_index2: (column_index2 | dim0);
      for (int k = 0; k < 2; k++) {
       final int c_idx1 = (k == 0) ? column_index1: (column_index1 | dim0);
       final int r_idx2 = (k == 0) ? row_index2 : (row_index2 | dim0);
       mult0(dim0, m1, r_idx1, c_idx1, m2, r_idx2, c_idx2);
      }
     }
    }
   }
  }
 }
 
 //
 // Tests
 //
 
 // simple_matrix_test1
 public static SimpleMatrix simple_matrix_test1(long[][] matrx1, long[][] matrx2, boolean show) {
  SimpleMatrix m1 = new SimpleMatrix(matrx1);
  SimpleMatrix m2 = new SimpleMatrix(matrx2);
  SimpleMatrix m3 = m1.mult(m2);
  if (show) {
   System.out.println("m1: "+m1);
   System.out.println("m2: "+m2);
   System.out.println("m3: "+m3);
  }
  return m3;
 }
 
 public static long simple_matrix_test1(int dim) {
  long[][] matrx1 = _randomMatrix(dim);
  long[][] matrx2 = _randomMatrix(dim);
  
  simple_matrix_test1(matrx1, matrx2, false);

  long end = System.currentTimeMillis();
  return end;
 }
 public static void simple_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = simple_matrix_test1(dim);
   System.out.println(">> dim: "+dim+", elapsed time: "+(endTime - startTime)+" milli sec");
  }
 }
 
 // rec_matrix_test1
 public static RecMatrix rec_matrix_test1(long[][] matrx1, long[][] matrx2, boolean show) {
  RecMatrix rm1 = new RecMatrix(matrx1);
  RecMatrix rm2 = new RecMatrix(matrx2);
  RecMatrix rm3 = rm1.mult(rm2);
  if (show) {
   System.out.println("m1: "+rm1);
   System.out.println("m2: "+rm2);
   System.out.println("m3: "+rm3);
  }
  return rm3;
 }
 
 public static long rec_matrix_test1(int dim) {
  long[][] matrx1 = _randomMatrix(dim);
  long[][] matrx2 = _randomMatrix(dim);
  
  rec_matrix_test1(matrx1, matrx2, false);

  long end = System.currentTimeMillis();
  return end;
 }
 
 private static String formatBigDecimal(BigDecimal bd){
     DecimalFormat df = new DecimalFormat();
     return df.format(bd);
 }
 
 public static void rec_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = rec_matrix_test1(dim);
   double duration = endTime - startTime;
   BigDecimal b_duration = new BigDecimal(duration);
   BigDecimal dim_p3=new BigDecimal(dim).multiply(new BigDecimal(dim)).multiply(new BigDecimal(dim));
   BigDecimal b_r = b_duration.divide(new BigDecimal(dim)).divide(new BigDecimal(dim)).divide(new BigDecimal(dim)).multiply(new BigDecimal(1000000));
   System.out.println(">> dim: "+dim+", elapsed time: "+duration+" milli sec, dim^p; "+formatBigDecimal(dim_p3)+", ration(time/dim^3)="+formatBigDecimal(b_r));
   //System.out.println(">> dim: "+dim+", elapsed time: "+duration+" milli sec, dim^3: "+dim_p3);
  }
 }
 
 // verify_matrix_test1
 public static void verify_matrix_test1(int dim, boolean show) {
  long[][] matrx1 = _randomMatrix(dim);
  long[][] matrx2 = _randomMatrix(dim);
  
  SimpleMatrix sm3 = simple_matrix_test1(matrx1, matrx2, show);
  RecMatrix rm3 = rec_matrix_test1(matrx1, matrx2, show);
  SimpleMatrix srm3 = new SimpleMatrix(rm3._matrix);
  if (!sm3.equals(srm3)) {
   System.out.println("!!verify_matrix_test1: not equals ");
   System.out.println("m1: "+new SimpleMatrix(matrx1));
   System.out.println("m2: "+new SimpleMatrix(matrx2));
   System.out.println("sm3: "+sm3);
   System.out.println("srm3:"+srm3);
   throw new RuntimeException();
  } else if (show) {
   System.out.println("m1: "+new SimpleMatrix(matrx1));
   System.out.println("m2: "+new SimpleMatrix(matrx2));
   System.out.println("m3: "+sm3);
  }
 }
 
 public static void verify_test_loop(int max, boolean show) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   verify_matrix_test1(dim, show);
  }
 }
 
 public static void test1(int max) {
  /*
  System.out.println("verify_test_loop: "+max);
  verify_test_loop(max, false);
  
  System.out.println("simple_test_loop: "+max);
  simple_test_loop(max);
  */
  System.out.println("rec_test_loop: "+max);
  rec_test_loop(max);
 }
 
 public static void main(String[] args) {
  int max = 12;
  test1(max);
  System.out.println(">> done");
  //verify_matrix_test1(2);
  //verify_matrix_test1(4);
  //verify_test_loop(max);
  
  //simple_test_loop(max);
  
  //rec_matrix_test1(4);
  //rec_test_loop(max);
 }
}

Java implementation of Recursive Matrix Multiplication Algorithm (Part 1)

I tried to make a test case which can be used to evaluate a language's capability.
Since I was also interested in the concurrent nature of the recursive matrix algorithm, I decided to use it as such a sample.
But this subject itself is interesting and has a long history with much research behind it.

The first time I found the algorithm was in a paper for open-source hardware:

http://www.adapteva.com/wp-content/uploads/2013/07/adapteva_ipdps.pdf

http://www.parallella.org/board/

But later I recognized that the idea of recursive matrix multiplication is rooted in Strassen's fast matrix multiplication, which allows multiplying in O(n^(log_2 7)) versus the ordinary O(n^3). (See the book Modern Computer Algebra.)
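For reference (standard textbook analysis, not from the paper above), the divide-and-conquer recurrences behind these two bounds are:

```latex
% plain 2x2 block recursion: 8 half-size products
T(n) = 8\,T(n/2) + O(n^2) \;\Longrightarrow\; T(n) = O\!\left(n^{\log_2 8}\right) = O(n^3)

% Strassen: only 7 half-size products
T(n) = 7\,T(n/2) + O(n^2) \;\Longrightarrow\; T(n) = O\!\left(n^{\log_2 7}\right) \approx O(n^{2.807})
```

Both follow from the master theorem; the extra block additions Strassen needs only affect the O(n^2) term, not the exponent.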

But these days, rather than reducing the number of multiplications from 8 to 7, it would be more effective to employ as many cores as possible.

And this algorithm provides a quite clean separation of jobs to delegate to multiple cores. That is the more interesting aspect of the algorithm, and actually more effective than reducing the number of multiplications (which may provide about a 20% improvement).
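To illustrate that separation (a minimal sketch of my own, not the code from this post — the class name `BlockMult` and the fixed 4-thread pool are my choices): the 2x2 block formula C(i,j) = A(i,0)*B(0,j) + A(i,1)*B(1,j) has eight independent block products, so each can be delegated to a thread pool.

```java
import java.util.*;
import java.util.concurrent.*;

public class BlockMult {
    // Naive O(n^3) product, used for the half-size blocks.
    static long[][] mult(long[][] a, long[][] b) {
        int n = a.length;
        long[][] c = new long[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    // Copy out the (bi, bj) quadrant of m (dim must be even).
    static long[][] quad(long[][] m, int bi, int bj) {
        int h = m.length / 2;
        long[][] q = new long[h][h];
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++)
                q[i][j] = m[bi * h + i][bj * h + j];
        return q;
    }

    // C = A*B via the 2x2 block formula; the 8 block products run in parallel.
    static long[][] blockMult(long[][] a, long[][] b) throws Exception {
        int n = a.length, h = n / 2;
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<long[][]>> prods = new ArrayList<>();
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++) {
                    final long[][] aik = quad(a, i, k), bkj = quad(b, k, j);
                    prods.add(pool.submit(() -> mult(aik, bkj)));
                }
        long[][] c = new long[n][n];
        int p = 0;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                long[][] s0 = prods.get(p++).get();  // k = 0 term
                long[][] s1 = prods.get(p++).get();  // k = 1 term
                for (int r = 0; r < h; r++)
                    for (int q = 0; q < h; q++)
                        c[i * h + r][j * h + q] = s0[r][q] + s1[r][q];
            }
        pool.shutdown();
        return c;
    }

    public static void main(String[] args) throws Exception {
        long[][] a = {{1, 2}, {3, 4}};
        long[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(blockMult(a, b)));
        // [[19, 22], [43, 50]]
    }
}
```

Applied recursively, each level again splits its eight products, which is exactly the clean job separation mentioned above.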

Originally I was considering writing this in Go, but Go's syntax is not comfortable for me, so I decided to do it in Java first.
In a sense, the key point of implementing a fast algorithm in Java is to avoid creating intermediate objects, since otherwise GC eventually takes more time than the calculation time saved.
So what we need to do is write an objectless object-oriented program.
I think this is an interesting subject in itself, and it is somewhat related to Go's object-oriented style.
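A minimal sketch of what that "objectless" style means (my illustration, not this post's actual code): the caller supplies the destination array, so a multiply allocates nothing and leaves no garbage for the GC.

```java
public class MultInto {
    // Allocation-free multiply: writes a*b into dst, which the caller owns.
    // No intermediate matrix objects are created per operation.
    static void multInto(long[][] a, long[][] b, long[][] dst) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                long s = 0;  // accumulator stays on the stack, no boxing
                for (int k = 0; k < n; k++) s += a[i][k] * b[k][j];
                dst[i][j] = s;
            }
        }
    }

    public static void main(String[] args) {
        long[][] a = {{1, 2}, {3, 4}};
        long[][] b = {{5, 6}, {7, 8}};
        long[][] c = new long[2][2];  // allocated once, reusable across calls
        multInto(a, b, c);
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
        // 19 22 43 50
    }
}
```

The object-oriented version below returns a fresh matrix from every add/subtract/mult, which is exactly the allocation pattern this style avoids.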

The first implementation in Java was really object oriented. Although I could use a more extreme style to avoid casting, this level of implementation is the most common and easiest to compare with other languages where generics are not so strong (or not supported). Later I may add the code of such an extreme version as well.

This first implementation follows the idea of the original algorithm closely.
Namely, we treat each inner matrix block as an element of the coefficient ring of the matrix (a 2x2 matrix over a coefficient ring R).


package org.zen;

import java.util.Random;

public class Matrix_v3 {


 public static interface IRingClass {
  IRing create();
  
  IRing zero();

  IRing unit();
 }

 public static interface IRing {
  IRing add(IRing r);

  IRing subtract(IRing r);

  IRing mult(IRing r);
 }

 public static interface IMatrix extends IRing {
  int dim();

  IRing get(int row, int column);

  void set(int row, int column, IRing r);
 }

 public static class IntRingClass implements IRingClass {
  static final IRing zero = new IntRing(0);
  static final IRing unit = new IntRing(1);

  @Override
  public IRing create() {
   return new IntRing(0);
  }

  public IRing create(int i) {
   return new IntRing(i);
  }

  @Override
  public IRing zero() {
   return zero;
  }

  @Override
  public IRing unit() {
   return unit;
  }
 }

 public static class IntRing implements IRing {
  final int i;

  public IntRing(int i) {
   this.i = i;
  }

  @Override
  public IRing add(IRing r) {
   IntRing ir = (IntRing) r;
   return new IntRing(i + ir.i);
  }

  @Override
  public IRing subtract(IRing r) {
   IntRing ir = (IntRing) r;
   return new IntRing(i - ir.i);
  }

  @Override
  public IRing mult(IRing r) {
   IntRing ir = (IntRing) r;
   return new IntRing(i * ir.i);
  }
  
  @Override
  public boolean equals(Object obj) {
   return (obj instanceof IntRing) && ((IntRing) obj).i == i;
  }

  @Override
  public int hashCode() {
   return i;
  }

  @Override
  public String toString() {
   return ""+i;
  }
 }

 public static abstract class AbstractMatrixClass implements IRingClass {
  protected final IRingClass coefficientRing;
  protected final int dim;

  protected AbstractMatrix zero = null;
  protected AbstractMatrix unit = null;

  public AbstractMatrixClass(IRingClass coefficientRing, int dim) {
   this.coefficientRing = coefficientRing;
   this.dim = dim;
  }
  
  void init() {
   this.zero = (AbstractMatrix)create();
   this.unit = (AbstractMatrix)create();
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     zero.set(i, j, coefficientRing.zero());
     unit.set(i, j, (i == j) ? coefficientRing.unit(): coefficientRing.zero());
    }
   }
  }

  @Override
  public IRing zero() {
   if (zero == null) {
    init();
   }
   return zero;
  }

  @Override
  public IRing unit() {
   if (unit == null) {
    init();
   }
   return unit;
  }

 }

 public static abstract class AbstractMatrix implements IMatrix {
  
  final AbstractMatrixClass _matrixClass;
  
  public AbstractMatrix(final AbstractMatrixClass _matrixClass) {
   this._matrixClass = _matrixClass;
  }

  @Override
  public int dim() {
   return _matrixClass.dim;
  }

  @Override
  public IRing add(IRing m) {
   AbstractMatrix m0 = (AbstractMatrix) this;
   AbstractMatrix m1 = (AbstractMatrix) m;
   AbstractMatrix m2 = (AbstractMatrix)_matrixClass.create();
   for (int i = 0; i < dim(); i++) {
    for (int j = 0; j < dim(); j++) {
     IRing r = m0.get(i, j).add(m1.get(i, j));
     m2.set(i, j, r);
    }
   }
   return m2;
  }

  @Override
  public IRing subtract(IRing m) {
   AbstractMatrix m0 = (AbstractMatrix) this;
   AbstractMatrix m1 = (AbstractMatrix) m;
   AbstractMatrix m2 = (AbstractMatrix)_matrixClass.create();
   for (int i = 0; i < dim(); i++) {
    for (int j = 0; j < dim(); j++) {
     IRing r = m0.get(i, j).subtract(m1.get(i, j));
     m2.set(i, j, r);
    }
   }
   return m2;
  }

  @Override
  public IRing mult(IRing m) {
   AbstractMatrix m0 = (AbstractMatrix) this;
   AbstractMatrix m1 = (AbstractMatrix) m;
   AbstractMatrix m2 = (AbstractMatrix)_matrixClass.create();
   for (int i = 0; i < dim(); i++) {
    for (int j = 0; j < dim(); j++) {
     IRing r = _matrixClass.coefficientRing.zero();
     for (int k = 0; k < dim(); k++) {
      r = r.add(m0.get(i, k).mult(m1.get(k, j)));
     }
     m2.set(i, j, r);
    }
   }
   return m2;
  }

  protected void rangeCheck(int row, int column) {
   if (row < 0 || row >= dim()) {
    throw new RuntimeException();
   }
   if (column < 0 || column >= dim()) {
    throw new RuntimeException();
   }
  }
 }

 public static class SimpleMatrixClass extends AbstractMatrixClass {
  
  public SimpleMatrixClass(final IRingClass coefficientRing, int dim) {
   super(coefficientRing, dim);
  }

  @Override
  public AbstractMatrix create() {
   return new SimpleMatrix(this);
  }
  
  public static SimpleMatrix create(int[][] matrix) {
   int dim = matrix.length;
    SimpleMatrixClass simpleMatrixClass = new SimpleMatrixClass(new IntRingClass(), dim);
    SimpleMatrix sm = new SimpleMatrix(simpleMatrixClass);
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     sm.set(i, j, new IntRing(matrix[i][j]));
    }
   }
   return sm;
  }
  
  public SimpleMatrix randomMatrix() {
   int[][] matrix = _randomMatrix(dim);
   SimpleMatrix sm = new SimpleMatrix(this);
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     sm.set(i, j, new IntRing(matrix[i][j]));
    }
   }
   return sm;
  }  
 }
 
 public static class SimpleMatrix extends AbstractMatrix {
  protected final IRing[][] _matrix;

  public SimpleMatrix(final AbstractMatrixClass _matrixClass) {
   super(_matrixClass);
   _matrix = new IRing[_matrixClass.dim][_matrixClass.dim];
   for (int i = 0; i < _matrixClass.dim; i++) {
    for (int j = 0; j < _matrixClass.dim; j++) {
     _matrix[i][j] = _matrixClass.coefficientRing.create();
    }
   }
  }
  public SimpleMatrix(final AbstractMatrixClass _matrixClass, IRing[][] _matrix) {
   super(_matrixClass);
   this._matrix = _matrix;
  }

  @Override
  public IRing get(int row, int column) {
   rangeCheck(row, column);
   return _matrix[row][column];
  }

  @Override
  public void set(int row, int column, IRing r) {
   rangeCheck(row, column);
   _matrix[row][column] = r;
  }
  
  @Override
  public String toString() {
   int dim = _matrix.length;
   StringBuilder sb = new StringBuilder();
   sb.append("(\n");
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     String s = _matrix[i][j].toString();
     if (j != 0) {
      sb.append(", ");
     }
     sb.append(s);
    }
    sb.append("\n");
   }
   sb.append(")\n");
   return sb.toString();
  }
  
  @Override
  public boolean equals(Object obj) {
   if (obj instanceof SimpleMatrix) {
    SimpleMatrix sm = (SimpleMatrix)obj;
    int dim = sm.dim();
    if (dim() != dim) {
     return false;
    }
    for (int i = 0; i < dim; i++) {
     for (int j = 0; j < dim; j++) {
       if (!_matrix[i][j].equals(sm._matrix[i][j])) return false;
     }
    }
    return true;
   } else {
    return false;
   }
  }
 }

 static Random rand = new Random(); 
 static int[][] _randomMatrix(int dim) {
  final int[][] matrix = new int[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = rand.nextInt(50);
   }
  }
  return matrix;
 }
 static int[][] _zeroMatrix(int dim) {
  final int[][] matrix = new int[dim][dim];
  for (int i = 0; i < dim; i++) {
   for (int j = 0; j < dim; j++) {
    matrix[i][j] = 0;
   }
  }
  return matrix;
 }
 
 public static class RecMatrixClass extends SimpleMatrixClass {
  int flat_dim;
  public RecMatrixClass(int flat_dim) {
   super((flat_dim == 2)?new IntRingClass():new RecMatrixClass(flat_dim >> 1), 2);
   this.flat_dim = flat_dim;
   if (flat_dim < 2) {
    throw new RuntimeException(">> RecMatrixClass flat_dim: "+flat_dim);
   }
  }
  
  public RecMatrix randomMatrix() {
   return new RecMatrix(this, _randomMatrix(flat_dim), flat_dim, 0, 0);
  }  
  
  @Override
  public AbstractMatrix create() {
   return new RecMatrix(this, new int[flat_dim][flat_dim], flat_dim, 0, 0);
  }

  public AbstractMatrix create(int[][] matrix, int dim, int row_index, int column_index) {
   return new RecMatrix(this, matrix, dim, row_index, column_index);
  }

 }
 
 public static class RecMatrix extends SimpleMatrix {
  public RecMatrix(final AbstractMatrixClass _matrixClass, int[][] matrix, int dim, int row_index, int column_index) {
   super(_matrixClass);
   
   if (_matrixClass.coefficientRing instanceof SimpleMatrixClass) {
    RecMatrixClass rmCls = (RecMatrixClass)_matrixClass.coefficientRing;
    for (int i = 0; i < 2; i++) {
     for (int j = 0; j < 2; j++) {
      final int dim0 = dim >> 1;
      final int row_index0 = (i == 0)?row_index:row_index|dim0;
      final int column_index0 = (j == 0)?column_index:column_index|dim0;
      set(i, j, rmCls.create(matrix, dim0, row_index0, column_index0));
     }
    }
   } else if (_matrixClass.coefficientRing instanceof IntRingClass) {
    IntRingClass irCls = (IntRingClass)_matrixClass.coefficientRing;
    for (int i = 0; i < 2; i++) {
     for (int j = 0; j < 2; j++) {
      set(i, j, irCls.create(matrix[row_index|i][column_index|j]));
     }
    }
   }
  }

  public String toString() {
   int[][] rep = getMatrixRep();
   int dim = rep.length;
   StringBuilder sb = new StringBuilder();
   sb.append("(\n");
   for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
     int v = rep[i][j];
     if (j != 0) {
      sb.append(", ");
     }
     sb.append(v);
    }
    sb.append("\n");
   }
   sb.append(")\n");
   return sb.toString();
  }
  
  public int[][] getMatrixRep() {
   int flat_dim = ((RecMatrixClass)_matrixClass).flat_dim;
   int[][] matrix = new int[flat_dim][flat_dim];
   setMatrixRep(matrix, flat_dim, 0, 0);
   return matrix;
  }
  
  void setMatrixRep(int[][] matrix, int dim, int row_index, int column_index) {
   for (int i = 0; i < _matrixClass.dim; i++) {
    for (int j = 0; j < _matrixClass.dim; j++) {
     IRing ring = get(i, j);
     if (ring instanceof RecMatrix) {
      final int dim0 = dim >> 1;
      final int row_index0 = (i == 0)?row_index:row_index|dim0;
      final int column_index0 = (j == 0)?column_index:column_index|dim0;
      ((RecMatrix)ring).setMatrixRep(matrix, dim0, row_index0, column_index0);
     } else if (ring instanceof IntRing) {
      IntRing intRing = (IntRing)ring;
      matrix[row_index+i][column_index+j] = intRing.i;
     } else {
      throw new RuntimeException(">>  setMatrixRep unsupported ring: "+ring);
     }
    }
   }
  }
 }
 
 //
 // Tests
 //
 public static long simple_matrix_test1(int dim) {
   SimpleMatrixClass simpleMatrixClass = new SimpleMatrixClass(new IntRingClass(), dim);
   SimpleMatrix sm1 = simpleMatrixClass.randomMatrix();
   SimpleMatrix sm2 = simpleMatrixClass.randomMatrix();
  SimpleMatrix sm3 = (SimpleMatrix)sm1.mult(sm2);
  
  long end = System.currentTimeMillis();
  /*
  System.out.println("sm1: "+sm1);
  System.out.println("sm2: "+sm2);
  System.out.println("sm3: "+sm3);
  */
  return end;
 }
 public static long rec_matrix_test1(int dim) {
  RecMatrixClass recMatrixClass = new RecMatrixClass(dim);
  RecMatrix rm1 = recMatrixClass.randomMatrix();
  RecMatrix rm2 = recMatrixClass.randomMatrix();
  RecMatrix rm3 = (RecMatrix)rm1.mult(rm2);
  
  long end = System.currentTimeMillis();
  /*
  System.out.println("rm1: "+rm1);
  System.out.println("rm2: "+rm2);
  System.out.println("rm3: "+rm3);
  */
  return end;
 }
 
 public static void simple_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = simple_matrix_test1(dim);
   System.out.println(">> dim: "+dim+", elapsed time: "+(endTime - startTime)+" milli sec");
  }
 }
 public static void rec_test_loop(int max) {
  int dim = 1;
  for (int i = 1; i <= max; i++) {
   dim = dim*2;
   long startTime = System.currentTimeMillis();
   long endTime = rec_matrix_test1(dim);
   System.out.println(">> dim: "+dim+", elapsed time: "+(endTime - startTime)+" milli sec");
  }
 }
 
 public static void main(String[] args) {
  int max = 8;
  //simple_test_loop(max);
  rec_test_loop(max);
  
 }
}

Here are the execution results:
[simple]
>> dim: 2, elapsed time: 2 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 1 milli sec
>> dim: 16, elapsed time: 2 milli sec
>> dim: 32, elapsed time: 6 milli sec
>> dim: 64, elapsed time: 13 milli sec
>> dim: 128, elapsed time: 31 milli sec
>> dim: 256, elapsed time: 227 milli sec

------------------
[rec]
>> dim: 2, elapsed time: 3 milli sec
>> dim: 4, elapsed time: 0 milli sec
>> dim: 8, elapsed time: 2 milli sec
>> dim: 16, elapsed time: 32 milli sec
>> dim: 32, elapsed time: 378 milli sec
>> dim: 64, elapsed time: 494 milli sec
>> dim: 128, elapsed time: 1620 milli sec
>> dim: 256, elapsed time: 13757 milli sec

The main point of this approach is that there is no special implementation of the multiplication method for this recursive matrix; the algorithm is reflected in how it creates the coefficient ring elements recursively.
But the performance is quite low.

Next, I will show a new version which improves the performance significantly, and the effectiveness of using 4 threads (indeed 4 times faster).

Dart 1.0

Dart released 1.0, but it looks like something of a political gesture; it seems to have nothing to do with the quality of the system.
There are still many basic unresolved problems, like mirrors, await, etc.
Essentially, certain power users cannot use Dart at this level of quality.
Also, the editor sometimes suppresses internal errors and shows bad error messages.
Anyway, they would have needed to show some progress to the public; that would be the only reason they released 1.0.

Also, the last-minute announcement of deprecating Web Components and replacing them with the unproven Polymer is a nonsensical move.
If it is experimental in nature, it is better to avoid pushing such technology. This too would have come from a political decision to provide some sort of solution for the component-based web development which Dart is supposed to address. But just as JavaScript does not provide such a framework, Dart does not actually have to provide such a 'solution'. It doesn't have to be part of a Dart release at all.

Since Dart is a new language, it may take some time to understand the best way to utilize its new language features for component-based web development.
These things may be delegated to the community while we wait to see which new solutions emerge.

Also, in general, when we develop with JavaScript, we don't see any fragments of HTML. So Polymer's focus on the HTML file looks like a very old-fashioned approach.
It will be an obstacle to utilizing the full capabilities of the language to design component-based libraries.

So I see a lot of questionable things in Dart right now.
But if I wait for Dart 2.0, it may become what I expected of 1.0.
So I may just need to wait another year for it to mature.

At least, as long as Dart does not provide a proper synchronization feature for server-side applications, it seems a waste of time to develop anything on the server side right now.

Thursday, November 7, 2013

Bad choice for server-side Dart?

I investigated isolates to see whether one isolate can wait for another to finish and then continue the rest of its job.
But Dart's concurrency is really restrictive; there seems to be no way to achieve this.
Considering the many poor asynchronous designs in current Dart, server-side Dart does not seem like a good option right now.

Maybe I will try Go as well.
Rob Pike and Thompson, who are behind Go, have been working in this area.

http://doc.cat-v.org/plan_9/4th_edition/papers/sleep

Of course, Java is still OK.
If we define some mapping tool from Dart annotations to JPA annotations, it is still not such a bad option. But even if we use Java on the server side, lazy evaluation on the Dart client is still not possible. As long as the expected workload on the client side is low, this may not be a big problem, though.
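For contrast, here is a minimal sketch (my own, hypothetical class name `WaitDemo`) of what Java makes trivial and Dart isolates do not: block until another thread's job finishes, then continue with its result.

```java
import java.util.concurrent.*;

public class WaitDemo {
    static int waitForJob() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Integer> job = pool.submit(() -> 6 * 7); // runs on another thread
        int result = job.get();                          // blocks until the job is done
        pool.shutdown();
        return result;                                   // continue with the result
    }

    public static void main(String[] args) throws Exception {
        System.out.println("result: " + waitForJob()); // result: 42
    }
}
```

The blocking `Future.get()` is exactly the "wait, then continue" step that has no counterpart in Dart's isolate API.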

Tuesday, November 5, 2013

video:Dart Today and Beyond with Google's Gilad Bracha

Dart Today and Beyond with Google's Gilad Bracha 

This video was taken a year ago, just after the M1 release. It explains some of the ideas behind Dart, so it is worth watching.

In that video, he says he wants to reduce asynchronicity. In fact, mirrors dropped Future from the API, but the io library added many more async APIs, so in total Dart gained a lot more asynchronicity.

Anyway, he seems to be a language-purist kind of guy; it would be better to pair him with someone with broader application development experience to lead the language design, so that it won't drift into a romanticist's dream world.

 

Monday, November 4, 2013

Dart's Future evaluation order

Dart's Future evaluation order is not so clear from the source code, so I tested it with the sample code below.
This also tests when stream evaluation happens. The listen callback started only after the end of the main code line and after all the pending futures had run. This means listen may create futures internally, but they do not run at the time listen is invoked.

import "dart:async";

main() {
    Stream stream = new Stream.periodic(const Duration(milliseconds: 1), (x) => x);
    int receivedCount = 0;
    var subscription;
    print("<1>");
    int a = 0;
    new Future.value(1).then((v) {
      print(">> 1 v: ${v}");
      new Future.value(10).then((v) {
        print(">> 10 v: ${v}");
        
      });
    });
    
    bool finished = false;
    subscription = stream.listen((data) {
      //expect(data, receivedCount);
      receivedCount++;
      print("receivedCount: ${receivedCount}");
      finished = true;
      new Future.value(30).then((v) {
        print(">> 30 v: ${v}");
        
      });
      if (receivedCount == 5) subscription.cancel();
      
    });
    new Future.value(2).then((v) {
      print(">> 2 v: ${v}");
      new Future.value(20).then((v) {
        print(">> 20 v: ${v}");
        
      });
    });
    /*
    while (!finished) {
      
    }
    */
    print("<2>");
}


<1>
<2>
>> 1 v: 1
>> 2 v: 2
>> 10 v: 10
>> 20 v: 20
receivedCount: 1
receivedCount: 2
>> 30 v: 30
>> 30 v: 30
receivedCount: 3
receivedCount: 4
>> 30 v: 30
>> 30 v: 30
receivedCount: 5
>> 30 v: 30

If we add the while loop:

import "dart:async";

main() {
    Stream stream = new Stream.periodic(const Duration(milliseconds: 1), (x) => x);
    int receivedCount = 0;
    var subscription;
    print("<1>");
    int a = 0;
    new Future.value(1).then((v) {
      print(">> 1 v: ${v}");
      new Future.value(10).then((v) {
        print(">> 10 v: ${v}");
        
      });
    });
    
    bool finished = false;
    subscription = stream.listen((data) {
      //expect(data, receivedCount);
      receivedCount++;
      print("receivedCount: ${receivedCount}");
      finished = true;
      new Future.value(30).then((v) {
        print(">> 30 v: ${v}");
        
      });
      if (receivedCount == 5) subscription.cancel();
      
    });
    new Future.value(2).then((v) {
      print(">> 2 v: ${v}");
      new Future.value(20).then((v) {
        print(">> 20 v: ${v}");
        
      });
    });

    while (!finished) {
      
    }

    print("<2>");
}

The output will be just <1>.
<1>


This means that future evaluation happens only after the main code line has ended (with the busy while loop, the main line never ends, so no future ever runs).
And futures are evaluated in the order they were created.

Friday, November 1, 2013

Big restriction: Dart's Stream does not support synchronous mode

When I tried to support a getter property of another referenced entity, I came to recognize that it is NOT possible in current Dart.

This is a ridiculous restriction.
It was supported until M3 deprecated the previous synchronous Stream classes.
I think this was a pretty stupid decision.
It must have been made by young, inexperienced people who believe such a silver-bullet approach can solve everything. It probably does solve everything they knew.
That is the problem with a lack of experience.

The whole Dart development somehow resembles the early stage of Java around 1995. Probably many of Dart's developers don't remember it.

Java was introduced as a language for applets, and marketed as such; then it came to be used mainly on the server side. The adoption can be attributed to several reasons: one aspect is that it was free and provided a lot of libraries. Compared to C/C++, these were big differences. Also, from a language point of view, Java supported classes, object-oriented features, and garbage collection (and the absence of pointers). So compared to C/C++, it was a significantly higher-level language, good enough for most server-side applications.

(But it was really a backward step compared to other advanced modular languages, like Mesa/Cedar, Ada, Modula-2, etc. They never became known to wider audiences, which may also have helped Java proliferate.)

For Dart, the current marketing buzz is about a browser-side language (to replace JavaScript, eventually), but as a language it also has the unique feature of integrating object and functional idioms in a single language.
Although there are many such languages (like Scala), there is no other language which can support both the browser and the server side.

I think these two aspects are quite important. Although the current marketing focus is on the browser side (Polymer, dart:io, etc.), it will be used more on the server side.
Since the combination of object and functional style is very powerful, once people realize how productive it is, they will never want to write the equivalent huge amount of code in Java. Also, sharing the same classes on both sides is a big plus.

The bottom line is that Dart should target such server-side development with a much higher priority. Right now, since many people are not yet developing significant applications, they will not complain; but eventually, as more people start using Dart, this kind of biased API design will bring a lot of complaints.