[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user bdrillard commented on the issue: https://github.com/apache/spark/pull/19518

This PR was addressed by #19811, closing this one.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19518

@bdrillard can you close this pr?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19518

Can one of the admins verify this patch?
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

The good news is that [the issue](https://github.com/janino-compiler/janino/issues/33) in janino has been quickly fixed. The bad news is that there is no official date for the next release yet.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

> I think ldc is 2 bytes and ldc_w is 3 bytes?

You are right, thanks, updated.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19518

> 4 or 5 bytes: others (ldc or ldc_w)

I think ldc is 2 bytes and ldc_w is 3 bytes?
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

In #19811, first, I will take a hybrid approach of outer (flat) global variables and arrays. The threshold for the number of global variables would be configurable. I will give high priority to placing primitive variables in the outer class, for performance.

> I think it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I think also that this is not trivial to achieve at all.

It would be great if we could do this. For now, #19811 will not address this.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and an array, using janinoc for the target Java file. The performance is not much different from the previous one. In summary, the following are the performance results (**smaller is better**).

- 1: array
- 0.90: inner global variables
- 0.81: flat global variables

WDYT? Any comments are very appreciated. Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.

```
$ cat /proc/cpuinfo | grep "model name" | uniq
model name : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ python myInstance.py > MyInstance.java && janinoc MyInstance.java && javac Test.java && java -Xmx16g Test
Result(us): Array
0: 484251.446 1: 483374.255 2: 483956.692 3: 482498.241 4: 483602.261 5: 482654.567 6: 482896.671 7: 483458.625 8: 483194.317 9: 483387.234
10: 484103.729 11: 483536.493 12: 483790.828 13: 483590.991 14: 483993.488 15: 483455.164 16: 484040.009 17: 483225.837 18: 483126.520 19: 484105.989
20: 484988.935 21: 483766.245 22: 483667.930 23: 483271.499 24: 483071.606 25: 483174.438 26: 483602.474 27: 483210.405 28: 483907.061 29: 483071.964
BEST: 482498.241000, AVG: 483532.530
Result(us): InnerVars
0: 437016.533 1: 436125.481 2: 436360.534 3: 435857.758 4: 436166.243 5: 437089.913 6: 436168.359 7: 435570.397 8: 435550.848 9: 435256.088
10: 435252.679 11: 435765.156 12: 435646.739 13: 437303.993 14: 435315.530 15: 435752.545 16: 434857.606 17: 436776.190 18: 435444.877 19: 435657.649
20: 436248.147 21: 436322.998 22: 437214.262 23: 435907.223 24: 435431.025 25: 435274.317 26: 435412.202 27: 435670.321 28: 436494.045 29: 436347.838
BEST: 434857.606, AVG: 435975.250
Result(us): Vars
0: 353983.048 1: 354067.690 2: 353138.178 3: 354093.115 4: 354067.180 5: 352750.571 6: 353672.510 7: 355179.115 8: 353296.750 9: 354522.113
10: 355221.301 11: 355178.172 12: 353859.319 13: 353539.817 14: 352703.352 15: 353923.981 16: 354442.744 17: 355523.145 18: 354849.122 19: 354082.888
20: 354673.504 21: 355526.218 22: 355264.029 23: 355455.492 24: 355520.322 25: 353923.520 26: 353796.600 27: 355021.849 28: 355800.387 29: 353810.567
BEST: 352703.352, AVG: 354362.887
```
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

In the int case, the following are the Java bytecode lengths for loading a value:

- 1 byte: -1, 0, 1, 2, 3, 4, 5 (iconst_?)
- 2 bytes: -128 ~ -2, 6 ~ 127 (bipush)
- 3 bytes: -32768 ~ -129, 128 ~ 32767 (sipush)
- 4 or 5 bytes: others (ldc or ldc_w)
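The size buckets above can be sketched as a small classifier. This is a hypothetical helper (not Spark or janino code); per the JVM spec, `ldc` is a 2-byte and `ldc_w` a 3-byte instruction, and the choice between them depends on the constant-pool index rather than on the value:

```python
def int_load_encoding(value, cp_index=1):
    """Classify how an int constant is loaded in Java bytecode.

    Returns (mnemonic, instruction length in bytes). Values outside the
    sipush range need a constant-pool entry; whether ldc (2 bytes) or
    ldc_w (3 bytes) is used depends on the pool index, not the value.
    """
    if -1 <= value <= 5:
        return ("iconst", 1)   # iconst_m1 .. iconst_5, no operand
    if -128 <= value <= 127:
        return ("bipush", 2)   # opcode + 1-byte immediate
    if -32768 <= value <= 32767:
        return ("sipush", 3)   # opcode + 2-byte immediate
    return ("ldc", 2) if cp_index <= 255 else ("ldc_w", 3)
```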
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19518

The PR looks good. Although it can relieve the constant pool pressure, I have a question: does it mean that every time a constant in the short range is used in the code, we add 1 byte to the bytecode, because sipush is followed by a two-byte value? I'm afraid it may be negative for some cases, like a Java program with many short constants.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

Here is [a PR](https://github.com/janino-compiler/janino/pull/34) to fix a problem regarding `sipush`.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

Next, I analyzed the usage of constant pool entries and Java bytecode ops using `janinoc`. The summary is as follows:

```
array[4]     : 6 + 0 * n entries, 6-8 java bytecode ops / read access
outerInstance: 3 + 3 * n entries, 5 java bytecode ops / read access
innerInstance: 9 + 3 * n entries, 6 java bytecode ops / read access
```

```
public class CP {
  int[] a = new int[100];
  int globalVar0;
  int globalVar1;
  private Inner inner = new Inner();

  private class Inner {
    int nestedVar0;
    int nestedVar1;
  }

  void access() {
    a[4] = 0;
    a[5] = 0;
    globalVar0 = 0;
    globalVar1 = 0;
    inner.nestedVar0 = 0;
    inner.nestedVar1 = 0;
  }

  static public void main(String[] argv) {
    CP cp = new CP();
    cp.access();
  }
}
```

Java bytecode

```
void access();
  descriptor: ()V
  Code:
    stack=3, locals=1, args_size=1
       0: aload_0
       1: getfield      #12  // Field a:[I
       4: iconst_4
       5: iconst_0
       6: iastore
       7: aload_0
       8: getfield      #12  // Field a:[I
      11: iconst_5
      12: iconst_0
      13: iastore
      14: aload_0
      15: iconst_0
      16: putfield      #16  // Field globalVar0:I
      19: aload_0
      20: iconst_0
      21: putfield      #19  // Field globalVar1:I
      24: aload_0
      25: getfield      #23  // Field inner:LCP$Inner;
      28: iconst_0
      29: putfield      #28  // Field CP$Inner.nestedVar0:I
      32: aload_0
      33: getfield      #23  // Field inner:LCP$Inner;
      36: iconst_0
      37: putfield      #31  // Field CP$Inner.nestedVar1:I
      40: return
```

Constant pool

```
 #1 = Utf8        CP
 #2 = Class       #1       // CP
 #9 = Utf8        a
#10 = Utf8        [I
#11 = NameAndType #9:#10   // a:[I
#12 = Fieldref    #2.#11   // CP.a:[I
#13 = Utf8        globalVar0
#14 = Utf8        I
#15 = NameAndType #13:#14  // globalVar0:I
#16 = Fieldref    #2.#15   // CP.globalVar0:I
#17 = Utf8        globalVar1
#18 = NameAndType #17:#14  // globalVar1:I
#19 = Fieldref    #2.#18   // CP.globalVar1:I
#20 = Utf8        inner
#21 = Utf8        LCP$Inner;
#22 = NameAndType #20:#21  // inner:LCP$Inner;
#23 = Fieldref    #2.#22   // CP.inner:LCP$Inner;
#24 = Utf8        CP$Inner
#25 = Class       #24      // CP$Inner
#26 = Utf8        nestedVar0
#27 = NameAndType #26:#14  // nestedVar0:I
#28 = Fieldref    #25.#27  // CP$Inner.nestedVar0:I
#29 = Utf8        nestedVar1
#30 = NameAndType #29:#14  // nestedVar1:I
#31 = Fieldref    #25.#30  // CP$Inner.nestedVar1:I
#32 = Utf8        LineNumberTable
#33 = Utf8        Code
#34 = Utf8        main
#35 = Utf8        ([Ljava/lang/String;)V
#36 = Utf8        <init>
#37 = NameAndType #36:#8   // "<init>":()V
#38 = Methodref   #2.#37   // CP."<init>":()V
#39 = NameAndType #7:#8    // access:()V
#40 = Methodref   #2.#39   // CP.access:()V
#41 = Methodref   #4.#37   // java/lang/Object."<init>":()V
#42 = Integer     100
#43 = Utf8        (LCP;)V
#44 = NameAndType #36:#43  // "<init>":(LCP;)V
#45 = Methodref   #25.#44  // CP$Inner."<init>":(LCP;)V
#46 = Utf8        Inner
#47 = Utf8        InnerClasses
```
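As a quick cross-check of the summary above, the entry counts it implies for n variables can be tabulated (a sketch; `cp_entries` is a hypothetical helper encoding the three formulas, not code from this PR):

```python
def cp_entries(n):
    """Constant-pool entries implied by the summary formulas (n variables).

    The array layout pays only a fixed cost; both field-based layouts add
    3 entries (Utf8 + NameAndType + Fieldref) per variable.
    """
    return {
        "array":         6,          # 6 + 0 * n
        "outerInstance": 3 + 3 * n,
        "innerInstance": 9 + 3 * n,
    }
```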
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

First of all, I have to share sad news with you. `janino` does not use `sipush` for values from 128 to 32767. The current `janino` 3.0.7 uses `iconst...`, `bipush`, or `ldc`. `javac` uses `sipush` for values from 128 to 32767. In other words, if an index is greater than 127, one constant pool entry is used by bytecode compiled with `janino`. It should be fixed.

```
public class Array {
  int[] a = new int[100];
  void access() {
    a[5] = 0;
    a[6] = 0;
    a[127] = 0;
    a[128] = 0;
    a[1023] = 0;
    a[16383] = 0;
    a[32767] = 0;
    a[32768] = 0;
  }
  static public void main(String[] argv) {
    Array a = new Array();
    a.access();
  }
}
```

```
void access();
  descriptor: ()V
  flags:
  Code:
    stack=3, locals=1, args_size=1
       0: aload_0
       1: getfield #12  // Field a:[I
       4: iconst_5
       5: iconst_0
       6: iastore
       7: aload_0
       8: getfield #12  // Field a:[I
      11: bipush   6
      13: iconst_0
      14: iastore
      15: aload_0
      16: getfield #12  // Field a:[I
      19: bipush   127
      21: iconst_0
      22: iastore
      23: aload_0
      24: getfield #12  // Field a:[I
      27: ldc      #13  // int 128
      29: iconst_0
      30: iastore
      31: aload_0
      32: getfield #12  // Field a:[I
      35: ldc      #14  // int 1023
      37: iconst_0
      38: iastore
      39: aload_0
      40: getfield #12  // Field a:[I
      43: ldc      #15  // int 16383
      45: iconst_0
      46: iastore
      47: aload_0
      48: getfield #12  // Field a:[I
      51: ldc      #16  // int 32767
      53: iconst_0
      54: iastore
      55: aload_0
      56: getfield #12  // Field a:[I
      59: ldc      #17  // int 32768
      61: iconst_0
      62: iastore
      63: return
Constant pool:
 #1 = Utf8        Array
 #2 = Class       #1      // Array
 #9 = Utf8        a
#10 = Utf8        [I
#11 = NameAndType #9:#10  // a:[I
#12 = Fieldref    #2.#11  // Array.a:[I
#13 = Integer     128
#14 = Integer     1023
#15 = Integer     16383
#16 = Integer     32767
#17 = Integer     32768
```
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518

For the strategy, I'd like to give priority to primitive values being flat global variables. We also need to decide the priority among primitive types, according to which type has the largest performance difference between a flat global variable and an array, and which type is used more frequently (maybe boolean).
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19518

@viirya in general there is no benefit. There will be a benefit if we manage to declare them where they are used, but I am not sure this is feasible. In this way, they do not add any entry to the constant pool. For instance, if we have an inner class `InnerClass1` and we use `isNull_1` only there, then if we define `isNull_1` as a variable of `InnerClass1` instead of a variable of the outer class, we have no entry about it in the outer class.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19518

@mgaido91 Thanks. I looked at the constant pool you posted. It's clear. Is there any benefit to declaring the variables in the inner classes? It looks like they still occupy constant pool entries.
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19518

@viirya if you take a look at the example I posted, you can see that we are not saving either `NameAndType` or `Fieldref` entries; thus I think the only solution we have found so far to save constant pool entries is to use arrays. What may be interesting, IMHO, is to evaluate where we are using a variable. Since when we have a lot of instance variables we are very likely to also have several inner classes (for splitting the methods), I think it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I also think that this is not trivial to achieve at all. @kiszk what do you think?
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19518

> When we use an inner class approach, we still require constant pool entries for accessing instance variables (e.g. `this.inner001.globalVar5`) in one class.

@kiszk But can we still save the name/type and field name entries for the global variable?
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19518

Btw, can we configure the maximum number of global variables?
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19518

The hybrid approach sounds reasonable to me. Is there any special strategy to decide which fields become global variables and which go into the array?
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

@mgaido91 Thank you very much. I did not know such a difference. I will validate with `janinoc` 3.0.0.
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19518

@kiszk I meant that `janinoc` creates a slightly different constant pool from `javac`. I am not sure about performance, but the number of constant pool entries is definitely different. For instance, let's take this example and analyze the `Outer` class constant pool:

```
class Outer {
  private Inner innerInstance = new Inner();

  private void outerMethod() {
    innerInstance.b = 1;
    innerInstance.a = 1;
  }

  private boolean outerMethod2() {
    return innerInstance.b > innerInstance.a;
  }

  private class Inner {
    int b = 2;
    private int a = 0;
  }
}
```

if you compile it with `javac`, the constant pool will be:

```
Constant pool:
 #1 = Methodref   #9.#25  // java/lang/Object."<init>":()V
 #2 = Class       #26     // Outer$Inner
 #3 = Methodref   #2.#27  // Outer$Inner."<init>":(LOuter;LOuter$1;)V
 #4 = Fieldref    #8.#28  // Outer.innerInstance:LOuter$Inner;
 #5 = Fieldref    #2.#29  // Outer$Inner.b:I
 #6 = Methodref   #2.#30  // Outer$Inner.access$102:(LOuter$Inner;I)I
 #7 = Methodref   #2.#31  // Outer$Inner.access$100:(LOuter$Inner;)I
 #8 = Class       #32     // Outer
 #9 = Class       #33     // java/lang/Object
#10 = Class       #34     // Outer$1
#11 = Utf8        InnerClasses
#12 = Utf8        Inner
#13 = Utf8        innerInstance
#14 = Utf8        LOuter$Inner;
#15 = Utf8        <init>
#16 = Utf8        ()V
#17 = Utf8        Code
#18 = Utf8        LineNumberTable
#19 = Utf8        outerMethod
#20 = Utf8        outerMethod2
#21 = Utf8        ()Z
#22 = Utf8        StackMapTable
#23 = Utf8        SourceFile
#24 = Utf8        Outer.java
#25 = NameAndType #15:#16 // "<init>":()V
#26 = Utf8        Outer$Inner
#27 = NameAndType #15:#35 // "<init>":(LOuter;LOuter$1;)V
#28 = NameAndType #13:#14 // innerInstance:LOuter$Inner;
#29 = NameAndType #36:#37 // b:I
#30 = NameAndType #38:#39 // access$102:(LOuter$Inner;I)I
#31 = NameAndType #40:#41 // access$100:(LOuter$Inner;)I
#32 = Utf8        Outer
#33 = Utf8        java/lang/Object
#34 = Utf8        Outer$1
#35 = Utf8        (LOuter;LOuter$1;)V
#36 = Utf8        b
#37 = Utf8        I
#38 = Utf8        access$102
#39 = Utf8        (LOuter$Inner;I)I
#40 = Utf8        access$100
#41 = Utf8        (LOuter$Inner;)I
```

(please note that it creates a fake getter and a fake setter method entry for the `private` inner variable `a`). If you compile the same class with `janinoc`, instead, the constant pool will be:

```
Constant pool:
 #1 = Utf8        Outer
 #2 = Class       #1      // Outer
 #3 = Utf8        java/lang/Object
 #4 = Class       #3      // java/lang/Object
 #5 = Utf8        SourceFile
 #6 = Utf8        Outer.java
 #7 = Utf8        outerMethod$
 #8 = Utf8        (LOuter;)V
 #9 = Utf8        innerInstance
#10 = Utf8        LOuter$Inner;
#11 = NameAndType #9:#10  // innerInstance:LOuter$Inner;
#12 = Fieldref    #2.#11  // Outer.innerInstance:LOuter$Inner;
#13 = Utf8        Outer$Inner
#14 = Class       #13     // Outer$Inner
#15 = Utf8        b
#16 = Utf8        I
#17 = NameAndType #15:#16 // b:I
#18 = Fieldref    #14.#17 // Outer$Inner.b:I
#19 = Utf8        a
#20 = NameAndType #19:#16 // a:I
#21 = Fieldref    #14.#20 // Outer$Inner.a:I
#22 = Utf8        LineNumberTable
#23 = Utf8        Code
#24 = Utf8        outerMethod2$
#25 = Utf8        (LOuter;)Z
#26 = Utf8        <init>
#27 = Utf8        ()V
#28 = NameAndType #26:#27 // "<init>":()V
#29 = Methodref   #4.#28  // java/lang/Object."<init>":()V
#30 = NameAndType #26:#8  // "<init>":(LOuter;)V
#31 = Methodref   #14.#30 // Outer$Inner."<init>":(LOuter;)V
#32 = Utf8        Inner
#33 = Utf8        InnerClasses
```

(note that `a` is now considered a regular field). Thus in all our
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19518

I like the latest hybrid idea from @kiszk in terms of performance and readability. Also, this is a corner case, so I don't want to affect most regular small queries.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

@mgaido91 Thank you for your questions.

1. I am using `javac` as shown. I am sorry, but I cannot understand what you are pointing out. In this benchmark, what are the differences between `javac` and `janinoc`?
2. I agree with you.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

Based on the performance results and the usage of constant pool entries, I would like to use a hybrid approach with flat global variables and arrays. For example, the first 500 variables are stored in flat global variables, and the others are stored in arrays with 32767 elements. I think that most non-extreme cases can enjoy simple code without array accesses, and good performance. WDYT?

```
class Foo {
  int globalVars1;
  int globalVars2;
  ...
  int globalVars499;
  int globalVars500;
  int[] globalArrays1 = new int[32767];
  int[] globalArrays2 = new int[32767];
  int[] globalArrays3 = new int[32767];
  ...

  void apply1(InternalRow i) {
    globalVars1 = 1;
    globalVars2 = 1;
    ...
    globalVars499 = 1;
    globalVars500 = 1;
  }

  void apply2(InternalRow i) {
    globalArrays1[0] = 1;
    globalArrays1[1] = 1;
    ...
  }

  void apply(InternalRow i) {
    apply0(i);
    apply1(i);
    apply2(i);
    ...
  }
}
```
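The hybrid layout above amounts to a simple slot-assignment rule. A hypothetical sketch (not the PR's code; `FLAT_LIMIT`, `ARRAY_CHUNK`, and `slot_for` are illustrative names, and the 500 threshold is meant to be configurable):

```python
# Sketch of the hybrid slot assignment: the first FLAT_LIMIT variables
# become flat fields, the rest become array elements, with each array
# capped at ARRAY_CHUNK so indices stay within the sipush range.
FLAT_LIMIT = 500
ARRAY_CHUNK = 32767

def slot_for(i):
    """Map the i-th (0-based) global variable to a generated-code reference."""
    if i < FLAT_LIMIT:
        return "globalVars%d" % (i + 1)
    j = i - FLAT_LIMIT
    return "globalArrays%d[%d]" % (j // ARRAY_CHUNK + 1, j % ARRAY_CHUNK)
```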
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and an array. In summary, the following are the performance results (**smaller is better**).

- 0.65: flat global variables
- 0.91: inner global variables
- 1: array

WDYT? Any comments are very appreciated. Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.

```
$ cat /proc/cpuinfo | grep "model name" | uniq
model name : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ python myInstance.py > MyInstance.java && javac Test.java && java Test
Result(us): Array
0: 462848.969 1: 461978.693 2: 463174.459 3: 461422.763 4: 460563.915 5: 460112.262 6: 460059.957 7: 460376.230 8: 460245.445 9: 460308.775
10: 460154.955 11: 460005.629 12: 460330.584 13: 460277.612 14: 460181.360 15: 460168.843 16: 459790.137 17: 460248.481 18: 460344.471 19: 460084.529
20: 459987.263 21: 459961.639 22: 459952.447 23: 460128.518 24: 460025.783 25: 459874.303 26: 459932.685 27: 460065.736 28: 459954.526 29: 459972.679
BEST: 459790.137000, AVG: 460417.788
Result(us): InnerVars
0: 421013.480 1: 420279.235 2: 419366.157 3: 421015.934 4: 419540.049 5: 420316.650 6: 419816.612 7: 420211.140 8: 420215.864 9: 421104.657
10: 421836.430 11: 420866.894 12: 421457.850 13: 421734.506 14: 420796.010 15: 419832.910 16: 420012.167 17: 420821.800 18: 420962.178 19: 421981.676
20: 421721.257 21: 419996.594 22: 419742.884 23: 420158.066 24: 420156.773 25: 420325.231 26: 420966.914 27: 420787.147 28: 420296.789 29: 420520.843
BEST: 419366.157, AVG: 420595.157
Result(us): Vars
0: 343490.797 1: 342849.079 2: 341990.967 3: 342844.044 4: 343484.681 5: 342586.419 6: 342468.883 7: 343113.300 8: 343516.875 9: 343002.395
10: 341499.538 11: 342192.102 12: 341847.383 13: 342533.215 14: 341376.556 15: 342018.111 16: 341316.445 17: 342043.378 18: 341969.932 19: 343415.854
20: 343103.133 21: 342084.686 22: 341555.293 23: 342984.355 24: 342302.336 25: 341994.372 26: 342475.639 27: 342281.214 28: 342205.175 29: 342462.032
BEST: 341316.445, AVG: 342433.606
```
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518

can we use `32767` as the array size upper bound? e.g.

```
class Foo {
  int[] globalVars1 = new int[32767];
  int[] globalVars2 = new int[32767];
  int[] globalVars3 = new int[32767];
  ...

  void apply0(InternalRow i) {
    globalVars1[0] = 1;
    globalVars1[1] = 1;
    ...
  }

  void apply1(InternalRow i) {
    globalVars2[0] = 1;
    globalVars2[1] = 1;
    ...
  }

  void apply(InternalRow i) {
    apply0(i);
    apply1(i);
    ...
  }
}
```
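This pure-array chunking reduces to an index split. A sketch with a hypothetical `chunk_ref` helper (capping each array at 32767 keeps every in-array index at most 32766, which still fits in a sipush immediate and so needs no constant-pool entry):

```python
CHUNK = 32767  # max per-array size; largest index used is 32766

def chunk_ref(i):
    """Map the i-th (0-based) global to an array reference under pure chunking."""
    return "globalVars%d[%d]" % (i // CHUNK + 1, i % CHUNK)
```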
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19518

@kiszk thanks for your great analysis. May I ask just a couple of additional questions?

1 - In all your tests, which compiler are you using? I see that you are linking to the Oracle docs, so maybe you are using `javac` for your tests, but in my tests (made for other cases) I realized that `janinoc` works a bit differently, and what is true for `javac` may not be for `janinoc`.

2 - If the problem with the array occurs when we go beyond 32767, what about creating many arrays with a max size of 32767? I see that this is not a definitive solution and we still have some limitations, but dividing the number of constant pool entries by 32767 looks like a very good achievement to me.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

> what do you mean by this? using array can't reduce constant pool size at all?

Not at all. However, when the array index of an access is greater than 32767, the access requires a constant pool entry. This is because an integer constant of 32768 or greater uses the `ldc` Java bytecode instruction [[ref]](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ldc)[[ref]](https://cs.au.dk/~mis/dOvs/jvmspec/ref-_ldc.html).
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518

> Each access to an array element requires one constant pool entry.

what do you mean by this? using an array can't reduce constant pool size at all?
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

@cloud-fan you are right, I am updating the benchmark program and results. I realized one issue with the array approach: **limitation of constant pool entries due to the array access index**. When we use an array approach, a global variable will be accessed by `this.globalVar[12345]`. Here is the bytecode sequence. Each access to an array element (whose index is greater than 5, since `iconst_0` ... `iconst_5` do not use a constant pool entry) requires one constant pool entry. So, while we remove one constant pool entry for the global variable, we require one for the index. @bdrillard did your implementation (probably around [here](https://github.com/apache/spark/pull/19518/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR218)) avoid this issue? WDYT? cc @viirya @maropu

```
aload 0                        // load this
getfield [constant pool index] // load this.globalVar
ldc      [constant pool index] // load 12345 from constant pool and push it
iaload
```
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518

You are comparing array vs member variables; can we compare array vs inner class member variables? Also, too many classes will have overhead on the classloader; we should test some extreme cases, like 1 million variables.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19518

I'd prefer inner class approach.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518

@bdrillard @cloud-fan @maropu I created and ran a benchmark program. I think that using an array for compaction is slower than using scalar instance variables: in my case below, 20% slower in the best time. Thus, I would like to use an approach that creates inner classes to keep scalar instance variables. WDYT? Any comments are very appreciated. Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.

```
$ cat /proc/cpuinfo | grep "model name" | uniq
model name : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ python myInstance.py > MyInstance.java && javac Test.java && java Test
Result(us): Array
0: 145333.227 1: 144288.262 2: 144233.871 3: 144536.350 4: 144503.269 5: 144836.117 6: 18.053 7: 144744.725 8: 144688.652 9: 144727.823
10: 17.789 11: 144500.638 12: 144641.592 13: 144464.106 14: 144518.914 15: 144844.639 16: 144780.464 17: 144617.363 18: 144463.271 19: 144508.170
20: 144929.451 21: 144529.697 22: 144273.167 23: 144362.926 24: 144296.854 25: 144398.665 26: 144490.813 27: 144435.732 28: 144675.997 29: 144483.581
BEST: 144233.871000, AVG: 144566.806
Result(us): Vars
0: 120375.384 1: 119800.238 2: 119822.842 3: 119830.761 4: 119836.781 5: 120185.751 6: 120208.140 7: 120274.925 8: 120112.109 9: 120082.120
10: 120063.456 11: 120112.493 12: 120144.937 13: 119964.356 14: 119941.633 15: 119825.758 16: 119677.506 17: 119833.236 18: 119749.781 19: 119723.932
20: 120197.394 21: 120052.820 22: 120006.650 23: 119939.335 24: 119857.469 25: 120176.229 26: 120153.605 27: 120345.581 28: 120163.129 29: 120038.673
BEST: 119677.506, AVG: 120016.567
```

Small MyInstance.java (N = 16, M = 4)

```
class MyInstance {
  final int N = 16;
  int[] instance = new int[N];

  void accessArrays0() {
    instance[8] = instance[0];
    instance[9] = instance[1];
    instance[10] = instance[2];
    instance[11] = instance[3];
  }
  void accessArrays1() {
    instance[12] = instance[4];
    instance[13] = instance[5];
    instance[14] = instance[6];
    instance[15] = instance[7];
  }
  void accessArrays2() {
    instance[0] = instance[8];
    instance[1] = instance[9];
    instance[2] = instance[10];
    instance[3] = instance[11];
  }
  void accessArrays3() {
    instance[4] = instance[12];
    instance[5] = instance[13];
    instance[6] = instance[14];
    instance[7] = instance[15];
  }
  void accessArray() {
    accessArrays0();
    accessArrays1();
    accessArrays2();
    accessArrays3();
  }

  int instance0;
  int instance1;
  int instance2;
  int instance3;
  int instance4;
  int instance5;
  int instance6;
  int instance7;
  int instance8;
  int instance9;
  int instance00010;
  int instance00011;
  int instance00012;
  int instance00013;
  int instance00014;
  int instance00015;

  void accessVars0() {
    instance8 = instance0;
    instance9 = instance1;
    instance00010 = instance2;
    instance00011 = instance3;
  }
  void accessVars1() {
    instance00012 = instance4;
    instance00013 = instance5;
    instance00014 = instance6;
    instance00015 = instance7;
  }
  void accessVars2() {
    instance0 = instance8;
    instance1 = instance9;
    instance2 = instance00010;
    instance3 = instance00011;
  }
  void accessVars3() {
    instance4 = instance00012;
    instance5 = instance00013;
    instance6 = instance00014;
    instance7 = instance00015;
  }
  void accessVars() {
    accessVars0();
    accessVars1();
    accessVars2();
    accessVars3();
  }
}
```
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518 @bdrillard thanks!
Github user bdrillard commented on the issue: https://github.com/apache/spark/pull/19518 Thanks for giving this the attention needed to shepherd it through. I haven't had the time in the last few weeks to do the additional coding work necessary to benchmark it properly. @kiszk, if any questions come up regarding my earlier implementation as you make/review the second PR, I'm happy to clarify and can respond quickly in writing.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518 OK, I will create a new PR
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518 let's create a new PR
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518 ping @cloud-fan
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518 @cloud-fan Is it better to use this PR, or to create a new one?
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19518 yea, ok @kiszk I'll review your work.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518 @cloud-fan I want to take this over if possible. cc @maropu
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518 @kiszk @maropu do either of you want to take this over? This patch has become important now that we split code more aggressively.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518 ping @bdrillard
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19518 @bdrillard since my PR and others got merged, there are now some conflicts; could you please fix them? Thanks.
Github user bdrillard commented on the issue: https://github.com/apache/spark/pull/19518 @kiszk Ah, thanks for the link back to that discussion. I'll make modifications to the trials for better data.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518 @bdrillard I remember that we had a similar discussion about benchmarking. Could you look at [this discussion](https://github.com/apache/spark/pull/16648#discussion_r118043056)?
Github user bdrillard commented on the issue: https://github.com/apache/spark/pull/19518 @kiszk You are correct that the current implementation compacts all mutable state (where the state does not have to be _explicitly_ inlined).

To your last question, I attempted some analysis of the JVM bytecode of array versus inlined state initialized either through method calls or in loops. I posted the experiment and results: https://github.com/bdrillard/bytecode-poc If Spark has its own benchmarking tools, I'd be happy to use those to compare Catalyst-generated classes further.

On the general question of _when_ we compact state, I think some kind of threshold still makes sense. It would be best to ensure that the typical code path (for typical Dataset schemas) remains unaffected by the changes (as was the aim when generating nested classes in #18075). I've found that trying to set a global threshold for when to compact mutable state can be hard. Some state _has_ to be inlined (state that uses parameterized constructors that can't easily be initialized in loops, like the `BufferHolder` and `UnsafeRowWriter`). I've found situations where, due to code-generator flow, we began by inlining an amount of state that _could have been_ compacted, then started compacting state after a set threshold, but then began inlining state again that _could not be_ compacted, forcing us over the constant-pool limit.

It's difficult to tell whether a given piece of state will be referenced frequently or infrequently. For example, we do know that some pieces of primitive mutable state, like global booleans that are part of conditional checks, are declared globally, assigned once in one method, and then referenced only once in a separate caller method. These are excellent candidates for compaction, since they proliferate very quickly and are, in a sense, "only used once" (declared, initialized, re-assigned in a method, accessed in another method, never used again).
Other pieces of state, like row objects and JavaBean objects, will be accessed a number of times relative to how many fields they have, which isn't necessarily easy information to retrieve during code generation (we'd have to reflect on or inspect the initialization code to know how many fields such an object has). But these items are probably still good candidates for compaction in general, because of how many instances of a given type there can be.

I'm inclined to use a threshold against the names/types of the state, rather than a global threshold. Since `freshName` is always monotonically increasing from 1 for a given variable prefix, we would know when the threshold for state of that type is reached, and could begin compacting that type of state independently of (and concurrently with) the other types of state. Such a scheme would keep the usual flow of code generation as it is now, with no state compaction for typical operations, and apply state compaction only in the more extreme cases that would threaten to blow past the constant-pool limit.
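[Editor's note] The per-prefix threshold idea can be sketched roughly as below. This is a hypothetical illustration, not the actual Catalyst API: the class `MutableStatePool`, method `addState`, and the threshold value are all invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of name-prefix-based state compaction: the first
// COMPACT_THRESHOLD states of a given prefix become plain instance fields,
// and later ones are folded into a shared per-prefix array.
public class MutableStatePool {
    static final int COMPACT_THRESHOLD = 100;  // illustrative value

    private final Map<String, Integer> countPerPrefix = new HashMap<>();

    /** Returns the Java reference to use for a fresh piece of mutable state. */
    public String addState(String prefix) {
        // freshName-style counter: monotonically increasing per prefix
        int n = countPerPrefix.merge(prefix, 1, Integer::sum);
        if (n <= COMPACT_THRESHOLD) {
            // typical case: an ordinary instance variable, e.g. "isNull_1"
            return prefix + "_" + n;
        }
        // extreme case: a slot in this prefix's array, e.g. "isNullArr[0]"
        return prefix + "Arr[" + (n - COMPACT_THRESHOLD - 1) + "]";
    }
}
```

Each prefix crosses its threshold independently, so a schema with many `isNull` flags can start compacting those while, say, `value` states are still being inlined, matching the "independently/concurrently" behavior described above.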
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518 Thank you for creating a PR against the latest Spark. I think it is great to reduce the number of constant-pool entries. I have one high-level comment. IIUC, this PR **always** performs mutable state compaction; in other words, all mutable states live in arrays. I am afraid of possible performance degradation due to the increased access cost of keeping states in arrays. What do you think about putting mutable states into arrays (i.e. performing mutable state compaction) only when there are many mutable states, or only for certain mutable states that are rarely accessed? Or can we say there is no performance degradation due to mutable state compaction? What do you think?
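[Editor's note] For readers unfamiliar with the trade-off being discussed, the two generated-code shapes look roughly like this. This is a simplified, hypothetical illustration (the class and field names are invented), not actual Catalyst output:

```java
// Without compaction: each piece of mutable state is its own field, and
// each field name/descriptor contributes its own constant-pool entries.
class GeneratedWithoutCompaction {
    boolean isNull_0;
    int value_0;

    int eval() { return isNull_0 ? 0 : value_0; }
}

// With compaction: many states share a few array fields, so far fewer
// constant-pool entries are needed, but every access pays an extra
// array load (and a bounds check) on top of the field load.
class GeneratedWithCompaction {
    final boolean[] boolStates = new boolean[16];
    final int[] intStates = new int[16];

    int eval() { return boolStates[0] ? 0 : intStates[0]; }
}
```

The question raised in the comment above is whether to pay that per-access cost unconditionally, or only once the number of states (or their access frequency) makes the constant-pool savings worth it.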
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19518 Can one of the admins verify this patch?