Re: A new result set format

2019-09-07 Thread Rui, Lei
Sorry, the pictures could not be attached to the last email I sent, so I am 
adding them here as text.
The "wide" table is:
| time | root.sg_1.device_1.sensor_1 | root.sg_1.device_1.sensor_2 | root.sg_1.device_2.sensor_1 | root.sg_1.device_2.sensor_2 |
| 1 | 100 | 2.5 | 99 | 1.3 |
| ... | ... | ... | ... | ... |


The "narrow" table is:
| time | device_Id | sensor_1 | sensor_2 |
| 1 | root.sg_1.device_1 | 100 | 2.5 |
| ... | ... | ... | ... |
| 1 | root.sg_1.device_2 | 99 | 1.3 |
| ... | ... | ... | ... |
On 9/7/2019 15:51, Rui, Lei wrote:
Hi,


I will try to make this proposal more concrete from a semantic perspective.


Consider the sql "select * from root.sg_1". The following format is the "wide" 
table: 


The following format is the "narrow" table:


The levels of data from low to high are:
- sensor data, or series data, e.g., from root.sg_1.device_1.sensor_1
- device data, e.g., from root.sg_1.device_1
- storage group data, e.g., from root.sg_1


So, the sql "select * from root.sg_1" queries data at the storage group level. 
To present the results,
the wide table aligns all series data across multiple devices in the storage 
group by timestamp, 
while the narrow table aligns series data in a single device by timestamp, and 
does the same for other devices in the storage group.
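
To make the two alignments concrete, here is a minimal client-side sketch in 
Python that regroups wide rows into narrow rows (grouped by device, then sorted 
by time). The function name wide_to_narrow and the tuple-based row layout are 
assumptions made for illustration only; this is not IoTDB code.

```python
from collections import defaultdict


def wide_to_narrow(columns, rows):
    """Regroup wide rows into narrow rows (hypothetical helper).

    columns: full series paths, excluding the time column.
    rows: tuples of (time, value_1, value_2, ...), one value per column.
    """
    per_device = defaultdict(dict)  # device -> {time -> {sensor: value}}
    for row in rows:
        time, values = row[0], row[1:]
        for path, value in zip(columns, values):
            if value is None:
                continue  # a null cell in the wide table carries no data point
            device, sensor = path.rsplit(".", 1)
            per_device[device].setdefault(time, {})[sensor] = value

    sensors = sorted({path.rsplit(".", 1)[1] for path in columns})
    narrow = []
    for device in sorted(per_device):            # group by device ...
        for time in sorted(per_device[device]):  # ... then sort by time
            record = per_device[device][time]
            narrow.append((time, device) + tuple(record.get(s) for s in sensors))
    return ["time", "device_Id"] + sensors, narrow


columns = [
    "root.sg_1.device_1.sensor_1",
    "root.sg_1.device_1.sensor_2",
    "root.sg_1.device_2.sensor_1",
    "root.sg_1.device_2.sensor_2",
]
rows = [(1, 100, 2.5, 99, 1.3)]
header, narrow = wide_to_narrow(columns, rows)
print(header)
for line in narrow:
    print(line)
```

With the single wide row from the table above, this produces one narrow row 
for root.sg_1.device_1 and one for root.sg_1.device_2.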


By the way, I guess the "narrowest" table is for a single sensor's data, 
without the need to align with any other series data.


I have one question: 
Why not make full use of sql and just use "select * from root.sg_1.device_1" to 
specify the device (or the data level) they care about?
Why use "select * from root.sg_1" with a narrow table format?


Lastly, I think the better query execution efficiency that a narrow table may 
sometimes have is not the driving purpose, 
because presenting the query result in a wide table and in a narrow table are 
two different tasks.


Sincerely,
Lei Rui


From: Jialin Qiao 
Date: 9/7/2019 15:26
To: 
Subject: Re: A new result set format
Hi Julian,

He is my friend and contacted me offline, because I advertise IoTDB on my 
WeChat (like Facebook or Twitter). 

Next time I will try to have him post the issue to the mailing list himself :)

Best,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

-Original Message-
From: "Julian Feinauer" 
Sent: 2019-09-07 13:52:17 (Saturday)
To: "dev@iotdb.apache.org" 
Cc: 
Subject: Re: A new result set format

Hi Jialin,

perhaps one question about what "wanted by users" means (as I didn’t see 
anything on the list).
How do these users get in contact with you?

Julian

On 07.09.19, 04:29, "Jialin Qiao" wrote:

Hi,

As described in this issue, a new result set format is wanted by users. I'd 
like to open a discussion here.

For simplicity, I refer to the format "time, root.sg1.d1.s1, root.sg1.d2.s1" as 
the wide table, and to "time, deviceId, s1" as the narrow table. 

This issue is not only about how to organize the results, but also the query 
process. 

The narrow table has some advantages.

(1) For the wide table, we need to open a SeriesReader for each series at the 
same time, and each SeriesReader holds some ChunkMetadata. For the narrow table, 
we only need to open the SeriesReaders for one device at a time, return the 
results, and then open the SeriesReaders for the next device, which occupies 
less memory than the wide table. 
(2) Avoiding reading all series at once may also improve the query latency.
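
As a rough illustration of point (1), the following Python sketch counts how 
many readers are open at the same time under each strategy. ReaderPool, 
scan_wide and scan_narrow are hypothetical names invented for this example; 
they only model the counting argument and do not reflect the actual 
SeriesReader API.

```python
class ReaderPool:
    """Hypothetical counter for simultaneously open series readers."""

    def __init__(self):
        self.open_now = 0
        self.peak = 0

    def open(self, count):
        self.open_now += count
        self.peak = max(self.peak, self.open_now)

    def close(self, count):
        self.open_now -= count


def scan_wide(devices, pool):
    # Wide table: every series of every device must be open at once,
    # because each output row aligns all series by timestamp.
    total = sum(len(sensors) for sensors in devices.values())
    pool.open(total)
    pool.close(total)


def scan_narrow(devices, pool):
    # Narrow table: only the series of the current device are open;
    # they are closed before moving on to the next device.
    for sensors in devices.values():
        pool.open(len(sensors))
        pool.close(len(sensors))


devices = {"root.sg1.d1": ["s1", "s2"], "root.sg1.d2": ["s1", "s2"]}
wide_pool, narrow_pool = ReaderPool(), ReaderPool()
scan_wide(devices, wide_pool)
scan_narrow(devices, narrow_pool)
print(wide_pool.peak, narrow_pool.peak)  # 4 2
```

In this toy model the peak for the wide table grows with the total number of 
series, while the peak for the narrow table grows only with the largest device.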

There are also two questions:

(1) If we show results in the narrow table format to users, do we need to 
highlight the concepts of table and device? 
(2) If the answer to the first question is yes, do we need to support SQL like 
"select time, deviceId, s1, s2, s3 from root.sg1 where deviceId=d1"? This may 
involve a lot of work...

For my part, I would prefer the answer to both questions to be NO. Then we do 
not need to change the SQL grammar and only need a new query process to 
organize the result set.

Best,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

-Original Message-
From: "Jialin Qiao (Jira)" 
Sent: 2019-09-07 09:40:00 (Saturday)
To: dev@iotdb.apache.org
Cc: 
Subject: [jira] [Created] (IOTDB-203) A new result set format

Jialin Qiao created IOTDB-203:
-

Summary: A new result set format
Key: IOTDB-203
URL: https://issues.apache.org/jira/browse/IOTDB-203
Project: Apache IoTDB
Issue Type: New Feature
Reporter: Jialin Qiao


When executing a SQL statement like "select d1.s1, d2.s1 from root.sg1", the 
default result set format in IoTDB is 

"time, root.sg1.d1.s1, root.sg1.d2.s1"

1, 1, 1

2, 2, 2

However, some users want another format, in which the results are grouped by 
device and then sorted by time.

"time, deviceId, s1".

1, root.sg1.d1, 1

2, root.sg1.d2, 2



This can be done in the client, but it would be better if we supported this 
format in the server.





--
This message was sent by Atlassian Jira
(v8.3.2#803003)



Re: Binary Release of IoTDB

2019-07-17 Thread RUI, LEI
Hi, I'm here to suggest another structure like this :)


(Structure 3):
.
├── LICENSE
├── NOTICE
├── changes.txt
│
├── bin
│   ├── client
│   │   ├── export-csv.bat
│   │   ├── export-csv.sh
│   │   ├── import-csv.bat
│   │   ├── import-csv.sh
│   │   ├── run-client.bat
│   │   ├── start-client.bat
│   │   └── start-client.sh
│   └── server
│       ├── start-WalChecker.bat
│       ├── start-WalChecker.sh
│       ├── start-server.bat
│       ├── start-server.sh
│       ├── start-sync-client.bat
│       ├── start-sync-client.sh
│       ├── stop-server.bat
│       ├── stop-server.sh
│       ├── stop-sync-client.bat
│       └── stop-sync-client.sh
│
├── conf
│   ├── error_info_cn.properties
│   ├── error_info_en.properties
│   ├── iotdb-engine.properties
│   ├── iotdb-env.bat
│   ├── iotdb-env.sh
│   ├── iotdb-sync-client.properties
│   ├── logback.xml
│   └── tsfile-format.properties
│
├── lib
│   ├── client
│   │   └── *.jar
│   ├── server
│   │   └── *.jar
│   └── common
│       └── *.jar
│
├── licenses
│   └── LICENCES
│
└── grafana-connector
    ├── bin
    │   ├── start-grafana-connector.bat
    │   └── start-grafana-connector.sh
    └── iotdb-grafana-0.8.0-SNAPSHOT.war
 




-- Original Message --
From: "Justin Mclean";
Sent: Thursday, July 18, 2019, 10:38 AM
To: "dev";

Subject: Re: Binary Release of IoTDB



Hi,
There should be no need to vote on something like this; try to reach
consensus by discussion.
Thanks.
Justin

On Thu, 18 Jul 2019, 12:35 Xiangdong Huang wrote:

> Hi,
>
> any other opinion?
>
> We need to make a decision asap
>
> Because there is some divergence, do we need a vote?
>
> Best,
> ---
> Xiangdong Huang
> School of Software, Tsinghua University
>
>  黄向东
> 清华大学 软件学院
>
>
> Jialin Qiao wrote on Wed, Jul 17, 2019 at 3:39 PM:
>
> > Hi,
> >
> > I prefer the first structure that assembles all scripts in the "bin"
> > folder and all jars in the "lib" folder.
> >
> > Suppose I am a user, I would expect that the structure is as clear and
> > simple as possible.
> >
> > Thanks,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
> >
> > > -Original Message-
> > > From: "Xiangdong Huang" 
> > > Sent: 2019-07-17 14:18:10 (Wednesday)
> > > To: dev@iotdb.apache.org
> > > Cc:
> > > Subject: Re: Binary Release of IoTDB
> > >
> > > Hi,
> > >
> > > Though I also think the second structure is clearer, many database
> > > projects use structure 1, e.g., Cassandra.
> > >
> > > When using structure 2, some jars appear in both client/lib/ and
> > > server/lib/, which will enlarge the binary file.
> > >
> > > Do we need to extract them and put them into another folder?
> > > Something like:
> > > .
> > > ├── client
> > > │   └── lib
> > > ├── common
> > > │   └── lib
> > > └── server
> > >     └── lib
> > >
> > > Best,
> > > ---
> > > Xiangdong Huang
> > > School of Software, Tsinghua University
> > >
> > >  黄向东
> > > 清华大学 软件学院
> > >
> > >
> > > Julian Feinauer wrote on Tue, Jul 16, 2019 at 11:27 PM:
> > >
> > > > Hi,
> > > >
> > > > I would prefer structure 2 and I really like it.
> > > > And we should add a readme.txt with short usage instructions.
> > > >
> > > > Julian
> > > >
> > > > On 16.07.19, 13:58, "Xiangdong Huang" wrote:
> > > >
> > > > Hi,
> > > >
> > > > I think the structure of the binaries can be:
> > > >
> > > > (Structure 1):
> > > > .
> > > > ├── LICENSE
> > > > ├── NOTICE
> > > > ├── bin
> > > > │   ├── export-csv.bat
> > > > │   ├── export-csv.sh
> > > > │   ├── import-csv.bat
> > > > │   ├── import-csv.sh
> > > > │   ├── run-client.bat
> > > > │   ├── start-WalChecker.bat
> > > > │   ├── start-WalChecker.sh
> > > > │   ├── start-client.bat
> > > > │   ├── start-client.sh
> > > > │   ├── start-grafana-connector.bat
> > > > │   ├── start-grafana-connector.sh
> > > > │   ├── start-server.bat
> > > > │   ├── start-server.sh
> > > > │   ├── start-sync-client.bat
> > > > │   ├── start-sync-client.sh
> > > > │   ├── stop-server.bat
> > > > │   ├── stop-server.sh
> > > > │   ├── stop-sync-client.bat
> > > > │   └── stop-sync-client.sh
> > > > ├── changes.txt
> > > > ├── conf
> > > > │   ├── error_info_cn.properties
> > > > │   ├── error_info_en.properties
> > > > │   ├── iotdb-engine.properties
> > > > │   ├── iotdb-env.bat
> > > > │   ├── iotdb-env.sh
> > > > │   ├── iotdb-sync-client.properties
> > > > │   ├── logback.xml
> > > > │   └── tsfile-format.properties
> > > > ├── lib
> > > > │   └── *.jar
> > > > └── licenses
> > > >     └── LICENCES
> > > >
> > > > (Structure 2):
> > > > .
> > > > ├── LICENSE
> > > > ├── NOTICE
> > > > ├── changes.txt
> > > > ├── client
> > > > │   ├── bin
> > > > │   │   ├── export-csv.bat
> > > > │   │   ├── export-csv.sh
> > > > │   │   

What Is a Good Git Workflow?

2019-07-10 Thread RUI, LEI
Hi all,


I think it is worthwhile to spend some time discussing what a good Git workflow 
should be, and hopefully to reach a consensus.


Here is the thing. The branch 'feature_async_close_tsfile' that I have recently 
been working on with others was merged into the master branch a few days ago. 
When I try to examine the Git history of some code, I find that the squash 
merge was used and thus all commit history on the branch 
'feature_async_close_tsfile' is squashed into a single commit.


I understand that squash merge keeps the master branch history clean and easy 
to follow. However, is it too clean for a NOT lightweight feature branch like 
'feature_async_close_tsfile'?


Is squash merge a standard practice in every situation? Should we keep each 
development branch small enough that it can be squashed comfortably before 
being merged into the master branch?


If a development branch is inevitably large, would a rebase merge be a better 
choice than a regular merge or a squash merge, in order to keep the code 
history as simple as possible but not simpler?


Apart from the final merge choice, I think it is as important that an 
individual looks closely at his/her Git workflow to keep the commit history 
both clean and meaningful.


Sincerely,
Lei Rui

Re: Discussion: IoTDB Query on Value Columns

2019-06-24 Thread RUI, LEI
This is the picture (bmp format) in 2.1.






-- Original Message --
From: "suyue";
Sent: Monday, June 24, 2019, 10:14 PM
To: "dev";

Subject: Re: Discussion: IoTDB Query on Value Columns



This is the picture in 2.1.
 

On June 24, 2019, at 9:58 PM, RUI, LEI <1010953...@qq.com> wrote:


1. Problem Description

Suppose four data points (t,v) are written to IoTDB in the following order:

(1,1)

(2,2)

(3,3)

(1,100)

Then, given a query “select * from root where v<10”, the expected result is 
(2,2),(3,3). This is because the later inserted data point (1,100) should 
overwrite the earlier inserted data point (1,1). 

However, we find that in IoTDB the queried result is (1,100),(2,2),(3,3).

For more details, see JIRA-121.




2. IoTDB Background

2.1 data organization

In IoTDB, the above data points are split between a sequential data source and 
an unsequential data source, as shown below.



2.2 query process

The execution process of sql “select * from root where v<10” is as follows:

(1) Create a timeGenerator for the value filter “v<10”. It iteratively returns 
the timestamps that satisfy the filter.

(2) Fetch the values at the timestamps generated by the timeGenerator.

 

3. Analysis

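3.1 Notation
 
s: data source

ss: sequential data source

us: unsequential data source

s1<s2: data source s2 has higher priority than data source s1. In particular, 
us>ss, i.e., the unsequential data source always has higher priority than the 
sequential data source.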

merge(s1,s2): union data points from s1 and s2. When two data points from s1 
and s2 respectively have the same timestamp, keep the data point from the 
higher priority source.

query(s): apply the query pushdown on the data source s and return the query 
result 

 

3.2 Current Query Plan

   The current query plan in IoTDB goes like this: 

timeGenerator=merge(query(ss),query(us))



   Using the above example:

ss=((1,1),(2,2),(3,3))

us=(1,100)

query(ss)=((1,1),(2,2),(3,3))

query(us)=ϕ

timeGenerator=merge(query(ss),query(us))=((1,1),(2,2),(3,3))







   Then the values are fetched at the timestamps generated by the above 
timeGenerator. Note that in this step, we fetch values from the merged data 
source, i.e., merge(ss,us). The final result is ((1,100),(2,2),(3,3)). This is 
where the bug comes from: no post-filter is applied to the false positives 
produced by the timeGenerator.
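
The steps above can be replayed with a small Python model. This is only a 
sketch under stated assumptions: a data source is modelled as a dict from 
timestamp to value, and merge and query are hypothetical helpers that follow 
the definitions in 3.1. It is not the IoTDB implementation.

```python
def merge(low, high):
    """Union two sources; on equal timestamps the higher-priority one wins."""
    merged = dict(low)
    merged.update(high)
    return merged


def query(source, predicate):
    """Apply the value filter to a single source."""
    return {t: v for t, v in source.items() if predicate(v)}


ss = {1: 1, 2: 2, 3: 3}   # sequential data source
us = {1: 100}             # unsequential data source (higher priority)


def value_filter(v):
    return v < 10


# Current plan: timeGenerator = merge(query(ss), query(us))
time_generator = sorted(merge(query(ss, value_filter), query(us, value_filter)))

# Values are then fetched from merge(ss, us) without re-checking the filter.
merged_source = merge(ss, us)
result = [(t, merged_source[t]) for t in time_generator]
print(result)  # [(1, 100), (2, 2), (3, 3)]
```

Timestamp 1 passes the filter because of (1,1) in ss, but the value fetched 
for it is 100 from us, which is the false positive described above.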

 

3.3 Possible Solutions

We came up with several alternative solutions.

(1) timeGenerator=query(merge(ss,us))

(2) timeGenerator=query(merge(query(ss),us))

(3) timeGenerator=query(merge(query(ss),query(us)))



(1) is a simple solution. 

(2) and (3) have different advantages. 

(2): The query condition is pushed down to ss first and then applied to the 
merged result of query(ss) and us. When the selection query (corresponding to 
timeGenerator) and the projection query have the same series in common, we can 
use values of those series cached in timeGenerator to speed up the projection 
process.

(3): The query condition is pushed down to the unsequential data source too. 
Thus, data not satisfying the query condition can be filtered out at an early 
stage.
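
To compare the three plans on the example from section 1, here is a toy Python 
model. It assumes a data source is a dict from timestamp to value and that 
query simply drops the points that fail the filter; merge and query are 
hypothetical helpers, not IoTDB code.

```python
def merge(low, high):
    """Union two sources; on equal timestamps the higher-priority one wins."""
    merged = dict(low)
    merged.update(high)
    return merged


def query(source, predicate):
    """Apply the value filter to a single source."""
    return {t: v for t, v in source.items() if predicate(v)}


ss = {1: 1, 2: 2, 3: 3}   # sequential data source
us = {1: 100}             # unsequential data source (higher priority)


def f(v):
    return v < 10


plan1 = query(merge(ss, us), f)
plan2 = query(merge(query(ss, f), us), f)
plan3 = query(merge(query(ss, f), query(us, f)), f)

print(sorted(plan1))  # [2, 3]
print(sorted(plan2))  # [2, 3]
print(sorted(plan3))  # [1, 2, 3]
```

In this naive model, (1) and (2) return the expected timestamps 2 and 3. Plan 
(3), taken literally, still returns timestamp 1: filtering us before the merge 
drops (1,100), so it can no longer shadow (1,1). This is only an observation 
about the toy model; whether the intended design of (3) keeps track of 
overwritten timestamps in some other way is not stated in this message.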




3.4 Discussion

   Does anyone know of any mature solutions in other systems? Or which 
solution do you think is better, (2) or (3)?

   Looking forward to your advice.




Sincerely,

Lei Rui, Yue Su
