Re: [DISCUSS][Java] Design of the algorithm module
> the lesson learned is that we should provide some features proactively
> (at least the general features), and make them good enough. Apache Flink
> was started around 2015, and Arrow's Java project was started in 2016. If
> Arrow had been available earlier, maybe Flink would have chosen it in the
> first place.
>
> When a user needs an algorithm, it may already be too late. AFAIK, most
> users will choose to implement one themselves rather than opening a JIRA
> in the community. It takes a long time to provide a PR, review the code,
> merge the code, and wait for the next release.
>
> Therefore, I think what we should do is try all means to make Arrow
> better: by providing general functionality, by making it performant, by
> making it functionally complete, and by making it easier to use. By making
> Arrow better, I believe more users will choose Arrow. When trust is
> established, more users will switch to Arrow.
>
> Best,
> Liya Fan
>
> [1]
> https://docs.google.com/document/d/1cUHb-_Pbe4NMU3Igwt4tytEmI66jQxev00IL99e2wFY/edit#heading=h.50xdeg1htedb
>
> [2] https://issues.apache.org/jira/browse/FLINK-13053
Re: [DISCUSS][Java] Design of the algorithm module
Hi Liya Fan,

> I think the algorithms would be better termed "micro-algorithms". They
> are "micro" in the sense that they do not directly compose a query
> engine, because they only provide primitive functionality (e.g. vector
> sort).
> Instead, they can be used as building blocks for query engines. The major
> benefit of the micro-algorithms is their generality: they can be used in
> a wide range of common scenarios, and in more than one query engine. In
> addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as
> we have already supported/discussed), IPC communication, data analysis,
> data mining, etc.

I agree the algorithms can be generally useful, but I still have concerns
about who is going to use them.

I think there are two categories the algorithms fall into:

1. Algorithms directly related to Arrow specification features. For these,
I agree some of the functionality will be needed as a reference
implementation. At least for existing functionality, I think there is
already sufficient coverage, and in some cases (i.e. dictionary encoding)
there is already duplicate coverage.

2. Other algorithms. I think these fall into "data analysis, data mining,
etc.", and for these I think it goes back to the question of whether
developers/users would use the given algorithms to build their own one-off
analysis, or use existing tools like Apache Spark or a SQL engine that
already incorporate the algorithms.

I'm a little disappointed that more maintainers/developers haven't given
their input on this topic. I hope some will help with the work involved in
reviewing them if they find them valuable.

Thanks,
Micah
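For readers who have not met the dictionary encoding mentioned in this message, the following is a minimal sketch in plain Java. It is an illustration only and does not use or reproduce the Arrow Java API: the names DictionaryEncodingSketch, Encoded, encode and decode are hypothetical, and plain arrays stand in for Arrow vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of dictionary encoding; not the Arrow Java API. */
final class DictionaryEncodingSketch {

    /** Distinct values in first-seen order, plus one dictionary index per input element. */
    static final class Encoded {
        final List<String> dictionary;
        final int[] indices;

        Encoded(List<String> dictionary, int[] indices) {
            this.dictionary = dictionary;
            this.indices = indices;
        }
    }

    /** Replaces each value with the position of its first occurrence in the dictionary. */
    static Encoded encode(String[] values) {
        Map<String, Integer> positions = new HashMap<>();
        List<String> dictionary = new ArrayList<>();
        int[] indices = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer position = positions.get(values[i]);
            if (position == null) {
                // First time this value is seen: append it to the dictionary.
                position = dictionary.size();
                dictionary.add(values[i]);
                positions.put(values[i], position);
            }
            indices[i] = position;
        }
        return new Encoded(dictionary, indices);
    }

    /** Rebuilds the original values by looking each index up in the dictionary. */
    static String[] decode(Encoded encoded) {
        String[] values = new String[encoded.indices.length];
        for (int i = 0; i < values.length; i++) {
            values[i] = encoded.dictionary.get(encoded.indices[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        Encoded encoded = encode(new String[] {"a", "b", "a", "c", "b"});
        System.out.println(encoded.dictionary + " " + Arrays.toString(encoded.indices));
        System.out.println(Arrays.toString(decode(encoded)));
    }
}
```

The sketch shows why such a routine counts as a "micro-algorithm" in the sense used in this thread: it is a self-contained primitive with no dependency on a query engine.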
Re: [DISCUSS][Java] Design of the algorithm module
Dear all,

I have added the draft for the fourth part of the document. This part
contains discussion of more algorithms, some of which are already in
progress. Please pay special attention to Section 4.2.1, as it contains a
general discussion about the representation of integer vectors.

Please take a look, and give your valuable feedback:

https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing

Thanks a lot for your attention.

Best,
Liya Fan
Re: [DISCUSS][Java] Design of the algorithm module
Hi Micah and Praveen,

Thanks a lot for your valuable feedback.

My thoughts on the problems:

1. About the audience of the algorithms:

I think the algorithms would be better termed "micro-algorithms". They are
"micro" in the sense that they do not directly compose a query engine,
because they only provide primitive functionality (e.g. vector sort).
Instead, they can be used as building blocks for query engines. The major
benefit of the micro-algorithms is their generality: they can be used in a
wide range of common scenarios, and in more than one query engine. In
addition, there are other common scenarios, like vector data
compression/decompression (e.g. dictionary encoding and RLE encoding, as we
have already supported/discussed), IPC communication, data analysis, data
mining, etc.

2. About performance improvements:

Code generation and template types are powerful tools. In addition, JIT is
also a powerful tool, as it can inline megamorphic virtual functions in
many scenarios, if the algorithm is implemented appropriately.
IMO, code generation is applicable to almost all scenarios to achieve good
performance, if we are willing to pay the price of code readability.
I will try to detail the principles for choosing these tools for
performance improvements later.

Best,
Liya Fan
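The trade-off discussed in this message can be made concrete with a small sketch. The code below is a hypothetical illustration in plain Java, not Arrow code: the same index sort is written once against a comparator interface, where every comparison is an interface call that the JIT may or may not inline, and once specialized for int data, which is roughly what a template or a code generator would emit per type. The names SpecializationSketch, IndexComparator, sortIndices and sortIndicesInt are assumptions made for the example, and no performance claim is intended.

```java
import java.util.Arrays;

/** Hypothetical illustration of generic versus specialized code; not the Arrow Java API. */
final class SpecializationSketch {

    /** Generic path: each comparison is a call through this interface. */
    interface IndexComparator {
        int compare(int leftIndex, int rightIndex);
    }

    /** Insertion sort over an index array, driven by any comparator. */
    static void sortIndices(int[] indices, IndexComparator comparator) {
        for (int i = 1; i < indices.length; i++) {
            int current = indices[i];
            int j = i - 1;
            while (j >= 0 && comparator.compare(indices[j], current) > 0) {
                indices[j + 1] = indices[j];
                j--;
            }
            indices[j + 1] = current;
        }
    }

    /** Specialized path: the same algorithm with the int comparison written inline. */
    static void sortIndicesInt(int[] indices, int[] data) {
        for (int i = 1; i < indices.length; i++) {
            int current = indices[i];
            int j = i - 1;
            while (j >= 0 && data[indices[j]] > data[current]) {
                indices[j + 1] = indices[j];
                j--;
            }
            indices[j + 1] = current;
        }
    }

    public static void main(String[] args) {
        int[] data = {30, 10, 20};
        int[] generic = {0, 1, 2};
        int[] specialized = {0, 1, 2};
        sortIndices(generic, (left, right) -> Integer.compare(data[left], data[right]));
        sortIndicesInt(specialized, data);
        System.out.println(Arrays.toString(generic) + " " + Arrays.toString(specialized));
    }
}
```

The generic version is written once and reads naturally; the specialized version has to be duplicated, by hand or by a generator, for every element type, which is the readability cost the message refers to.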
Re: [DISCUSS][Java] Design of the algorithm module
Hi Micah,

I agree with 1. I think what an end user would really want is a query/data
processing engine. I am not sure how easy/relevant the algorithms will be
in the absence of the engine. For example, most of these operators would
need to be pipelined and to handle memory, distribution, etc. So bundling
this along with an engine makes a lot more sense; the interfaces required
might be a bit different too for that.

Thx.
Re: [DISCUSS][Java] Design of the algorithm module
Hi Liya Fan,

Thanks again for writing this up. I think it provides a road-map for intended features. I commented on the document but I wanted to raise a few high-level concerns here as well to get more feedback from the community.

1. It isn't clear to me who the users of this will be. My perception is that in the Java ecosystem there aren't use-cases for the algorithms outside of specific compute engines. I'm not super involved in open-source Java these days so I would love to hear others' opinions. For instance, I'm not sure if Dremio would switch to using these algorithms instead of the ones they've already open-sourced [1], and Apache Spark, I believe, is only using Arrow for interfacing with Python (they similarly have their own compute pipeline). I think you mentioned in the past that these are being used internally on an engine that your company is working on, but if that is the only consumer it makes me wonder if the algorithm development might be better served as part of that engine.

2. If we do move forward with this, we also need a plan for how to optimize the algorithms to avoid virtual calls. There are two high-level approaches: template-based and (byte)code generation based. Neither is applicable in all situations, but it would be good to come to a consensus on when (and when not) to use each.

Thanks,
Micah

[1] https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external

On Tue, Sep 24, 2019 at 6:48 AM Fan Liya wrote:

> Hi Micah,
>
> Thanks for your effort and precious time.
> Looking forward to receiving more valuable feedback from you.
>
> Best,
> Liya Fan
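As an illustration of the virtual-call concern raised in point 2 of the message above, here is a minimal sketch, not code from the Arrow code base. All names (VirtualCallSketch, IndexComparator, sortIndicesGeneric, sortIndicesForInts) are hypothetical, and a plain int[] stands in for an Arrow vector. The generic method pays one interface call per comparison; the specialized method is roughly what a template-based or code-generation approach would emit for a single value type, with the comparison written inline. Whether the difference matters in practice depends on the JIT, which can often inline a call site that only ever sees one comparator implementation.

```java
import java.util.Arrays;

/** Hypothetical sketch; a plain int[] stands in for an Arrow vector. */
final class VirtualCallSketch {

    /** Generic comparator: every comparison goes through an interface call. */
    interface IndexComparator {
        int compare(int leftIndex, int rightIndex);
    }

    /** Generic path: insertion sort of an index array via the comparator. */
    static void sortIndicesGeneric(int[] indices, IndexComparator comparator) {
        for (int i = 1; i < indices.length; i++) {
            int current = indices[i];
            int j = i - 1;
            // One interface (potentially virtual) call per comparison.
            while (j >= 0 && comparator.compare(indices[j], current) > 0) {
                indices[j + 1] = indices[j];
                j--;
            }
            indices[j + 1] = current;
        }
    }

    /**
     * Specialized path: the same insertion sort with the int comparison
     * written inline, as a per-type template or generated class might do.
     */
    static void sortIndicesForInts(int[] indices, int[] values) {
        for (int i = 1; i < indices.length; i++) {
            int current = indices[i];
            int currentValue = values[current];
            int j = i - 1;
            // Direct primitive comparison, no interface dispatch.
            while (j >= 0 && values[indices[j]] > currentValue) {
                indices[j + 1] = indices[j];
                j--;
            }
            indices[j + 1] = current;
        }
    }

    public static void main(String[] args) {
        int[] values = {5, 3, 8, 1, 2};

        int[] generic = {0, 1, 2, 3, 4};
        sortIndicesGeneric(generic, (l, r) -> Integer.compare(values[l], values[r]));

        int[] specialized = {0, 1, 2, 3, 4};
        sortIndicesForInts(specialized, values);

        // Both paths should produce the same ordering of indices.
        System.out.println(Arrays.toString(generic));
        System.out.println(Arrays.toString(specialized));
    }
}
```

The trade-off is the one the message points at: the generic version is written once and works for any comparator, while the specialized version has to be produced per type, either ahead of time from a template or at run time by generating code.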
Re: [DISCUSS][Java] Design of the algorithm module
Hi Micah,

Thanks for your effort and precious time.
Looking forward to receiving more valuable feedback from you.

Best,
Liya Fan

On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield wrote:

> Hi Liya Fan,
> I started reviewing but haven't gotten all the way through it. I will try
> to leave more comments over the next few days.
>
> Thanks again for the write-up. I think it will help frame a productive
> conversation.
>
> -Micah
Re: [DISCUSS][Java] Design of the algorithm module
Hi Liya Fan,

I started reviewing but haven't gotten all the way through it. I will try to leave more comments over the next few days.

Thanks again for the write-up. I think it will help frame a productive conversation.

-Micah

On Tue, Sep 17, 2019 at 1:47 AM Fan Liya wrote:

> Hi Micah,
>
> Thanks for your kind reminder. Comments are enabled now.
>
> Best,
> Liya Fan
Re: [DISCUSS][Java] Design of the algorithm module
Hi Micah,

Thanks for your kind reminder. Comments are enabled now.

Best,
Liya Fan

On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield wrote:

> Hi Liya Fan,
> Thank you for this writeup. It doesn't look like comments are enabled on
> the document. Could you enable them?
>
> Thanks,
> Micah
Re: [DISCUSS][Java] Design of the algorithm module
Hi Liya Fan,

Thank you for this writeup. It doesn't look like comments are enabled on the document. Could you enable them?

Thanks,
Micah

On Sat, Sep 14, 2019 at 6:57 AM Fan Liya wrote:

> Dear all,
>
> We have prepared a document for discussing the requirements, design and
> implementation issues for the algorithm module of Java:
>
> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>
> So far, we have finished the initial draft for sort, search and dictionary
> encoding algorithms. Discussions for more algorithms may be added in the
> future. This document will keep evolving to reflect the latest discussion
> results in the community and the latest code changes.
>
> Please give your valuable feedback.
>
> Best,
> Liya Fan