LSM storage refactoring

2017-09-18 Thread Ildar Absalyamov
Hi Devs,

In line with earlier major structural refactorings of storage/index-related 
code [1] I would like to propose a next step in this cleanup [2].
The main problem I tried to solve with this patch is that the code responsible 
for the LSM disk/memory component lifecycle (creation, destruction, bulkloading, 
etc.) is smeared across factory methods in the respective index implementations, 
while much of it is duplicated between the various types of index components 
(bTrees, externalBTrees, externalBTreesWithBuddyBTree, rTrees, 
antimatterRTrees, invertedIndexes, etc.). Moreover, all these different 
disk/memory component implementations have a lot in common in how they 
manage the lifecycle of their parts (main indexes, bloom filters, 
buddyBTrees/deletedKeysBTrees).

This change removes much of the boilerplate from the LSM component-handling code 
and relies on a more object-oriented design to bring the logic for a particular 
element of a component into one place.
It also introduces a composable way of assembling bulkload pipelines, allowing 
one to create a chain of operators, each responsible for bulkloading a piece of 
a component, and to easily extend this pipeline with additional operations 
(calculating stats, inferring schema, etc.).
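
To give a rough idea of the shape I have in mind (the names below are purely 
illustrative and are not the actual classes in the patch), a chained bulkload 
pipeline could look something like this:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these interfaces/names are hypothetical and do not
// necessarily match what the change under review introduces.
interface ComponentBulkLoaderStage {
    // Each stage consumes a tuple, possibly does extra work on it (e.g. updates
    // stats), and is responsible for one piece of the component (index writer,
    // bloom filter, buddy BTree, ...).
    void add(Object tuple) throws Exception;

    // Finalizes this stage (e.g. flushes pages, persists collected stats).
    void end() throws Exception;
}

// A pipeline is just an ordered chain of stages; extending it with a new
// operation (stats collection, schema inference) means appending one stage.
class BulkLoadPipeline implements ComponentBulkLoaderStage {
    private final List<ComponentBulkLoaderStage> stages = new ArrayList<>();

    BulkLoadPipeline append(ComponentBulkLoaderStage stage) {
        stages.add(stage);
        return this;
    }

    @Override
    public void add(Object tuple) throws Exception {
        for (ComponentBulkLoaderStage stage : stages) {
            stage.add(tuple);
        }
    }

    @Override
    public void end() throws Exception {
        for (ComponentBulkLoaderStage stage : stages) {
            stage.end();
        }
    }
}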

If you are interested or have an opinion on how this part of the codebase 
should be structured (or if it will break all your code in a private branch ;)), 
please have a look [2].

[1] https://asterix-gerrit.ics.uci.edu/#/c/1728/ 

[2] https://asterix-gerrit.ics.uci.edu/#/c/2014/ 

Best regards,
Ildar



Re: [VOTE] Release Apache AsterixDB 0.9.2 and Hyracks 0.3.2 (RC1)

2017-08-18 Thread Ildar Absalyamov
Sorry, that was a problem in my local dev version.
+1 in this case.

> On Aug 18, 2017, at 16:35, Ildar Absalyamov <ildar.absalya...@gmail.com> 
> wrote:
> 
> -1
> 
> Source artifacts:
> - Verified signatures and hashes
> - Verified LICENSE and NOTICE
> - Verified compilation
> - Verified license headers
> 
> Binary artifacts:
> - Verified Managix installer
> - Went through the SQL++ Primer and noticed that the web UI does not print 
> results correctly in JSON output format (which is the default option!). 
> The result is empty with the "application/json" Accept header using the 
> http://localhost:19002/query endpoint as well 
> as http://localhost:19002/query/service.
> Turns out we already have an issue for that [1]. Feels like this issue is a 
> blocker for the RC.
> 
> Also, the source Asterix artifact has a pom.xml.versionsBackup for each project 
> (leftover from the SNAPSHOT version pom), but I guess that does not affect the 
> overall validity.
> 
> [1] https://issues.apache.org/jira/browse/ASTERIXDB-1977
>> On Aug 18, 2017, at 13:15, Ian Maxon <ima...@uci.edu> wrote:
>> 
>> I've replaced the zip in the dist with the one in the maven repository
>> and the signature checks out now for me, at least. Also note the link
>> for the repository is off, it should be
>> https://repository.apache.org/content/repositories/orgapacheasterix-1038/
>> , sorry about that.
>> 
>> On Thu, Aug 17, 2017 at 12:42 PM, Ian Maxon <ima...@uci.edu> wrote:
>>> Hmm. Very interesting. I'll have to investigate. Somehow the zip must
>>> have become corrupt.
>>> 
>>> On Thu, Aug 17, 2017 at 11:24 AM, Preston Carman <prest...@apache.org> 
>>> wrote:
>>>> -1
>>>> 
>>>> I went through checking the signature and SHA1 for all the files. All
>>>> but one are correct.
>>>> 
>>>> Asterix Installer (asterix-installer-0.9.2-binary-assembly.zip) has a
>>>> BAD signature and does not match the SHA1.
>>>> 
>>>> On Fri, Aug 11, 2017 at 2:56 PM, Mike Carey <dtab...@gmail.com> wrote:
>>>>> +1 from me on the release.  I grabbed the NCService version and
>>>>> installed/ran it and went through the SQL++ Primer on it to make sure it 
>>>>> all
>>>>> works as advertised (including checking that the hinted queries indeed had
>>>>> the hinted query plans).
>>>>> 
>>>>> 
>>>>> 
>>>>> On 8/7/17 5:50 PM, Ian Maxon wrote:
>>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> Please verify and vote on the 3rd release of Apache AsterixDB
>>>>>> 
>>>>>> The change that produced this release and the change to advance the
>>>>>> version are
>>>>>> up for review here:
>>>>>> 
>>>>>> https://asterix-gerrit.ics.uci.edu/#/c/1924/
>>>>>> https://asterix-gerrit.ics.uci.edu/#/c/1925/
>>>>>> 
>>>>>> To check out the release, simply fetch the review and check out the
>>>>>> fetch head like so:
>>>>>> 
>>>>>> git fetch https://asterix-gerrit.ics.uci.edu:29418/asterixdb
>>>>>> refs/changes/24/1924/1 && git checkout FETCH_HEAD
>>>>>> 
>>>>>> 
>>>>>> AsterixDB Source
>>>>>> 
>>>>>> https://dist.apache.org/repos/dist/dev/asterixdb/apache-asterixdb-0.9.2-source-release.zip
>>>>>> 
>>>>>> https://dist.apache.org/repos/dist/dev/asterixdb/apache-asterixdb-0.9.2-source-release.zip.asc
>>>>>> 
>>>>>> https://dist.apache.org/repos/dist/dev/asterixdb/apache-asterixdb-0.9.2-source-release.zip.sha1
>>>>>> 
>>>>>> SHA1:36fae3394755e86d97540b892cda6b80ee02a770
>>>>>> 
>>>>>> Hyracks Source
>>>>>> 
>>>>>> https://dist.apache.org/repos/dist/dev/asterixdb/apache-hyracks-0.3.2-source-release.zip
>>>>>> 
>>>>>> https://dist.apache.org/repos/dist/dev/asterixdb/apache-hyracks-0.3.2-source-release.zip.asc
>>>>>> 
>>>>>> https://

Re: Nested type + open-enforced-index question.

2017-07-14 Thread Ildar Absalyamov
However, there should be a way to deal with this issue when the top-level type 
is open.

create type DBLPType as open {id: int32}
create index title_index_DBLP on DBLP(nested.one.title: string?) enforced;

When we encounter a field (“nested”) for which there is no compile-time type 
information, we should assume that the type of this field is completely open, 
i.e., {}, and pass it down the chain.
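
As a very rough sketch (assuming the ARecordType(typeName, fieldNames, fieldTypes, 
isOpen) constructor from asterix-om, and ignoring the nullability of the indexed 
field), the enforced type for the cast could then be assembled bottom-up from open 
records:

import org.apache.asterix.om.types.ARecordType;
import org.apache.asterix.om.types.BuiltinType;
import org.apache.asterix.om.types.IAType;

public class EnforcedTypeSketch {
    // Builds {"nested": {"one": {"title": string}}} where every level is open,
    // so the cast only pins down the indexed path and leaves everything else free.
    // Nullability of the indexed field ("string?") is omitted for brevity.
    public static ARecordType buildEnforcedType() {
        ARecordType leaf = new ARecordType("one", new String[] { "title" },
                new IAType[] { BuiltinType.ASTRING }, true /* open */);
        ARecordType nested = new ARecordType("nested", new String[] { "one" },
                new IAType[] { leaf }, true /* open */);
        return new ARecordType("DBLPType-enforced", new String[] { "nested" },
                new IAType[] { nested }, true /* open */);
    }
}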

> On Jul 14, 2017, at 00:09, Ildar Absalyamov <ildar.absalya...@gmail.com> 
> wrote:
> 
> Taewoo,
> 
> You’ve correctly identified the issue here: to make use of an enforced index 
> we must cast the record to a particular type, which is imposed by the index.
> 
> So, using your example, if we have an index on path “nested.one.title” the 
> indexed record must be castable to {…, “nested”: {…,”one”: {…,”title”: 
> string, …}, ...},…}.
> As you have observed, a case where there is no “nested” field in the top-level 
> type leads to an exception, because it relies on the fact that there is 
> compile-time type information for the field “nested”. This type information is 
> used to build the type for the aforementioned cast operator.
> From the perspective of the current implementation the runtime exception is a bug; 
> instead, this issue should have been caught at compile time.
> 
>> On Jul 13, 2017, at 23:10, Taewoo Kim <wangs...@gmail.com> wrote:
>> 
>> @Yingyi: thanks.
>> 
>> @Mike: Yeah. My problem is how to associate the field type information.
>> Ideally, the leaf level has the field to type hash map and the parent of it
>> has that hashmap in its record type. And its parent needs to have the
>> necessary information to reach to this record type. If we don't need any
>> pre-defined type at each level to create a multi-level enforced index, then
>> things will become more complex to me. :-) Anyway, we can discuss further
>> to finalize the field type propagation implementation.
>> 
>> Best,
>> Taewoo
>> 
>> On Thu, Jul 13, 2017 at 11:02 PM, Mike Carey <dtab...@gmail.com> wrote:
>> 
>>> Taewoo,
>>> 
>>> To clarify further what should work:
>>> - We should support nested indexes that go down multiple levels.
>>> - We should (ideally) support their use in index-NL joins.
>>> 
>>> Reflecting on our earlier conversation(s), I think I can see why you're
>>> asking this. :-) The augmented type information that'll be needed to do
>>> this completely/properly will actually have to associate types with field
>>> paths (not just with fields by name) - which is a slightly more complicated
>>> association.
>>> 
>>> Cheers,
>>> Mike
>>> 
>>> 
>>> On 7/13/17 10:54 PM, Yingyi Bu wrote:
>>> 
>>>> Hi Taewoo,
>>>> 
>>>> The first query shouldn't fail because indexnl is just a hint.
>>>> The second query should succeed because it's a valid indexing statement.
>>>> High nesting levels in open record like JSON is not uncommon.
>>>> 
>>>> Best,
>>>> Yingyi
>>>> 
>>>> 
>>>> On Thu, Jul 13, 2017 at 10:51 PM, Taewoo Kim <wangs...@gmail.com> wrote:
>>>> 
>>>>> @Mike: In order to properly deal with the enforced index on a nested-type
>>>>> field, I need to make sure whether my understanding (each nested type
>>>>> (except the leaf level) has a record type for the next level) is correct
>>>>> or not. Which one is a bug? The first one (without index) should fail? Or
>>>>> the second one (with an index) should succeed?
>>>>> 
>>>>> Best,
>>>>> Taewoo
>>>>> 
>>>>> On Thu, Jul 13, 2017 at 9:58 PM, Yingyi Bu <buyin...@gmail.com> wrote:
>>>>> 
>>>>> Indeed, it's a bug!
>>>>>> 
>>>>>> Best,
>>>>>> Yingyi
>>>>>> 
>>>>>> On Thu, Jul 13, 2017 at 9:52 PM, Mike Carey <dtab...@gmail.com> wrote:
>>>>>> 
>>>>>> Sounds like a bug to me.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 7/13/17 7:59 PM, Taewoo Kim wrote:
>>>>>>> 
>>>>>>>> Currently, I am working on a field type propagation without using
>>>>>>>> initializing the OptimizableSubTree in the current index access
>>>>>>>> method.
>>>>>>>> 
>

Re: Nested type + open-enforced-index question.

2017-07-14 Thread Ildar Absalyamov
Taewoo,

You’ve correctly identified the issue here: to make use of an enforced index we 
must cast the record to a particular type, which is imposed by the index.

So, using your example, if we have an index on path “nested.one.title” the 
indexed record must be castable to {…, “nested”: {…,”one”: {…,”title”: string, 
…}, ...},…}.
As you have observed, a case where there is no “nested” field in the top-level 
type leads to an exception, because it relies on the fact that there is 
compile-time type information for the field “nested”. This type information is 
used to build the type for the aforementioned cast operator.
From the perspective of the current implementation the runtime exception is a bug; 
instead, this issue should have been caught at compile time.

> On Jul 13, 2017, at 23:10, Taewoo Kim  wrote:
> 
> @Yingyi: thanks.
> 
> @Mike: Yeah. My problem is how to associate the field type information.
> Ideally, the leaf level has the field to type hash map and the parent of it
> has that hashmap in its record type. And its parent needs to have the
> necessary information to reach to this record type. If we don't need any
> pre-defined type at each level to create a multi-level enforced index, then
> things will become more complex to me. :-) Anyway, we can discuss further
> to finalize the field type propagation implementation.
> 
> Best,
> Taewoo
> 
> On Thu, Jul 13, 2017 at 11:02 PM, Mike Carey  wrote:
> 
>> Taewoo,
>> 
>> To clarify further what should work:
>> - We should support nested indexes that go down multiple levels.
>> - We should (ideally) support their use in index-NL joins.
>> 
>> Reflecting on our earlier conversation(s), I think I can see why you're
>> asking this. :-) The augmented type information that'll be needed to do
>> this completely/properly will actually have to associate types with field
>> paths (not just with fields by name) - which is a slightly more complicated
>> association.
>> 
>> Cheers,
>> Mike
>> 
>> 
>> On 7/13/17 10:54 PM, Yingyi Bu wrote:
>> 
>>> Hi Taewoo,
>>> 
>>> The first query shouldn't fail because indexnl is just a hint.
>>> The second query should succeed because it's a valid indexing statement.
>>> High nesting levels in open record like JSON is not uncommon.
>>> 
>>> Best,
>>> Yingyi
>>> 
>>> 
>>> On Thu, Jul 13, 2017 at 10:51 PM, Taewoo Kim  wrote:
>>> 
>>>> @Mike: In order to properly deal with the enforced index on a nested-type
>>>> field, I need to make sure whether my understanding (each nested type
>>>> (except the leaf level) has a record type for the next level) is correct
>>>> or not. Which one is a bug? The first one (without index) should fail? Or
>>>> the second one (with an index) should succeed?
>>>> 
>>>> Best,
>>>> Taewoo
>>>> 
>>>> On Thu, Jul 13, 2017 at 9:58 PM, Yingyi Bu  wrote:
>>>> 
>>>>> Indeed, it's a bug!
>>>>> 
>>>>> Best,
>>>>> Yingyi
>>>>> 
>>>>> On Thu, Jul 13, 2017 at 9:52 PM, Mike Carey  wrote:
>>>>> 
>>>>>> Sounds like a bug to me.
>>>>>> 
>>>>>> On 7/13/17 7:59 PM, Taewoo Kim wrote:
>>>>>> 
>>>>>>> Currently, I am working on a field type propagation without using
>>>>>>> initializing the OptimizableSubTree in the current index access method. I
>>>>>>> am encountering an issue with an open-type enforced index. So, I just want
>>>>>>> to make sure that my understanding is correct. It looks like we can't have
>>>>>>> an enforced-index on a completely schemaless nested field. For example, the
>>>>>>> following doesn't generate any issue.
>>>>>>> 
>>>>>>> //
>>>>>>> create type DBLPType as open {id: int32}
>>>>>>> create type CSXType as closed {id: int32}
>>>>>>> 
>>>>>>> create dataset DBLP(DBLPType) primary key id;
>>>>>>> create dataset CSX(CSXType) primary key id;
>>>>>>> 
>>>>>>> for $a in dataset('DBLP')
>>>>>>> for $b in dataset('CSX')
>>>>>>> where $a.nested.one.title /*+ indexnl */ = $b.nested.one.title
>>>>>>> return {"arec": $a, "brec": $b}
>>>>>>> //
>>>>>>> 
>>>>>>> However, the following generates an exception. So, can we assume that to
>>>>>>> create an enforced-index, except the leaf level, there should be a defined
>>>>>>> record type. For example, for this example, there should be "nested" type
>>>>>>> and "one" type.
>>>>>>> 
>>>>>>> //
>>>>>>> create type DBLPType as open {id: int32}
>>>>>>> create type CSXType as closed {id: int32}
>>>>>>> 
>>>>>>> create dataset DBLP(DBLPType) primary key id;
>>>>>>> create dataset CSX(CSXType) primary key id;
>>>>>>> 
>>>>>>> create index title_index_DBLP on DBLP(nested.one.title: string?) enforced;
>>>>>>> create index title_index_CSX on CSX(nested.one.title: string?) enforced;
>>>>>>> 
>>>>>>> for $a in dataset('DBLP')
>>>>>>> for $b in dataset('CSX')

Re: Nested type + open-enforced-index question.

2017-07-14 Thread Ildar Absalyamov
Maybe I missed something, but how is nested access on a closed type without a 
proper nested field ever valid?

create type CSXType as closed {id: int32}
create index title_index_CSX on CSX(nested.one.title: string?) enforced;

Will this index ever be anything but empty?

for $a in dataset('DBLP')
for $b in dataset('CSX')
where $a.nested.one.title /*+ indexnl */ = $b.nested.one.title
return {"arec": $a, "brec": $b}

Will this query return anything but an empty result?

To me it feels like there should be a compile-time error in both cases: during 
the index DDL and during the query.

> On Jul 13, 2017, at 22:51, Taewoo Kim  wrote:
> 
> @Mike: In order to properly deal with the enforced index on a nested-type
> field, I need to make sure whether my understanding (each nested type
> (except the leaf level) has a record type for the next level) is correct or
> not. Which one is a bug? The first one (without index) should fail? Or the
> second one (with an index) should succeed?
> 
> Best,
> Taewoo
> 
> On Thu, Jul 13, 2017 at 9:58 PM, Yingyi Bu  wrote:
> 
>> Indeed, it's a bug!
>> 
>> Best,
>> Yingyi
>> 
>> On Thu, Jul 13, 2017 at 9:52 PM, Mike Carey  wrote:
>> 
>>> Sounds like a bug to me.
>>> 
>>> 
>>> 
>>> On 7/13/17 7:59 PM, Taewoo Kim wrote:
>>> 
>>>> Currently, I am working on a field type propagation without using
>>>> initializing the OptimizableSubTree in the current index access method. I
>>>> am encountering an issue with an open-type enforced index. So, I just want
>>>> to make sure that my understanding is correct. It looks like we can't have
>>>> an enforced-index on a completely schemaless nested field. For example, the
>>>> following doesn't generate any issue.
>>>> 
>>>> //
>>>> create type DBLPType as open {id: int32}
>>>> create type CSXType as closed {id: int32}
>>>> 
>>>> create dataset DBLP(DBLPType) primary key id;
>>>> create dataset CSX(CSXType) primary key id;
>>>> 
>>>> for $a in dataset('DBLP')
>>>> for $b in dataset('CSX')
>>>> where $a.nested.one.title /*+ indexnl */ = $b.nested.one.title
>>>> return {"arec": $a, "brec": $b}
>>>> //
>>>> 
>>>> However, the following generates an exception. So, can we assume that to
>>>> create an enforced-index, except the leaf level, there should be a defined
>>>> record type. For example, for this example, there should be "nested" type
>>>> and "one" type.
>>>> 
>>>> //
>>>> create type DBLPType as open {id: int32}
>>>> create type CSXType as closed {id: int32}
>>>> 
>>>> create dataset DBLP(DBLPType) primary key id;
>>>> create dataset CSX(CSXType) primary key id;
>>>> 
>>>> create index title_index_DBLP on DBLP(nested.one.title: string?) enforced;
>>>> create index title_index_CSX on CSX(nested.one.title: string?) enforced;
>>>> 
>>>> for $a in dataset('DBLP')
>>>> for $b in dataset('CSX')
>>>> where $a.nested.one.title /*+ indexnl */ = $b.nested.one.title
>>>> return {"arec": $a, "brec": $b}
>>>> //
>>>> 
>>>> Best,
>>>> Taewoo
>>> 
>> 

Best regards,
Ildar



Re: [COMP] Few questions about Query Optimizer

2017-06-24 Thread Ildar Absalyamov
If I remember correctly, we eliminated the FieldAccessNested function in favor of 
chained FieldAccessByName/ByIndex calls. @Steven, correct me if I am wrong.

> On Jun 24, 2017, at 18:00, Yingyi Bu  wrote:
> 
> Hi Wail,
> 
>$22 should be a harmless bug -- it's related to the ordering of rules.
>For $19:  we could potentially have a rule for that.
> 
> Best,
> Yingyi
> 
> On Sat, Jun 24, 2017 at 5:50 PM, Wail Alkowaileet 
> wrote:
> 
>> Hi Devs,
>> 
>> I have few questions about the query optimizer.
>> 
>> *- Given the query:*
>> use dataverse TwitterDataverse
>> 
>> for $x in dataset Tweets
>> where $x.name = "trump"
>> let $geo := $x.geo
>> group by $name:=$x.name with $geo
>> return {"name": $name, "geo":$geo[0].coordinates.coordinates}
>> 
>> *- Logical Plan:*
>> distribute result [$$10] -- |UNPARTITIONED|
>>  project ([$$10]) -- |UNPARTITIONED|
>>assign [$$10] <- [{"name": $$name, "geo": get-item($$9,
>> 0).getField("coordinates").getField("coordinates")}] -- |UNPARTITIONED|
>>  group by ([$$name := $$x.getField("name")]) decor ([]) {
>>aggregate [$$9] <- [listify($$geo)] -- |UNPARTITIONED|
>>  nested tuple source -- |UNPARTITIONED|
>> } -- |UNPARTITIONED|
>>assign [$$geo] <- [$$x.getField("geo")] -- |UNPARTITIONED|
>>  select (eq($$x.getField("name"), "Alice")) -- |UNPARTITIONED|
>>unnest $$x <- dataset("Tweets") -- |UNPARTITIONED|
>>  empty-tuple-source -- |UNPARTITIONED|
>> 
>> *- Optimized Logical Plan:*
>> distribute result [$$10]
>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>  exchange
>>  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>project ([$$10])
>>-- STREAM_PROJECT  |PARTITIONED|
>>  assign [$$10] <- [{"name": $$name, "geo":
>> $$19.getField("coordinates")
>> }]
>>  -- ASSIGN  |PARTITIONED|
>>project ([$$name, $$19])
>>-- STREAM_PROJECT  |PARTITIONED|
>>  assign [$$19, $$22] <- [get-item($$9,
>> 0).getField("coordinates"), get-item($$9,
>> 0)]
>>  -- ASSIGN  |PARTITIONED|
>>exchange
>>-- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>  group by ([$$name := $$15]) decor ([]) {
>>aggregate [$$9] <- [listify($$geo)]
>>-- AGGREGATE  |LOCAL|
>>  nested tuple source
>>  -- NESTED_TUPLE_SOURCE  |LOCAL|
>> }
>>  -- PRE_CLUSTERED_GROUP_BY[$$15]  |PARTITIONED|
>>exchange
>>-- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>  order (ASC, $$15)
>>  -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>exchange
>>-- HASH_PARTITION_EXCHANGE [$$15]  |PARTITIONED|
>>  select (eq($$15, "Alice"))
>>  -- STREAM_SELECT  |PARTITIONED|
>>project ([$$geo, $$15])
>>-- STREAM_PROJECT  |PARTITIONED|
>>  assign [$$geo, $$15] <- [$$x.getField("geo"),
>> $$x.getField("name")]
>>  -- ASSIGN  |PARTITIONED|
>>project ([$$x])
>>-- STREAM_PROJECT  |PARTITIONED|
>>  exchange
>>  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>data-scan []<-[$$16, $$x] <-
>> TwitterDataverse.Tweets
>>-- DATASOURCE_SCAN  |PARTITIONED|
>>  exchange
>>  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>empty-tuple-source
>>-- EMPTY_TUPLE_SOURCE  |PARTITIONED|
>> 
>> *- Questions:*
>> $$22:
>> 
>>   - Why the variable $22 is produced ? Although there is no use for it. Is
>>   it just a harmless bug or there's some intuition I might be missing?
>> 
>> $$19:
>> 
>>   - It seems (sometimes) getField function calls are split. Is there a
>>   reason why is that the case? (There's another example that reproduces
>> the
>>   same behavior)
>>   - That leads to my next question, I see no rule for "FieldAccessNested"
>>   which can be exploited here to save few function calls. Can this
>> function
>>   interfere with other functions/access methods?
>> 
>> 
>> --
>> 
>> *Regards,.*
>> Wail Alkowaileet
>> 

Best regards,
Ildar



Re: Modules that could be removed?

2017-05-31 Thread Ildar Absalyamov
I have found Young-Seok’s asterix-experiment package useful for anyone who is 
doing any kind of experiments.
Can we instead make an ‘asterix-contrib’ repo and move it there, the same way 
we did with asterix-bad? 
We could also launch an automated build in Jenkins to verify that it builds against 
master, again the same way BAD works. This package does not have a lot of 
dependencies, so it will be fairly painless to maintain.

> On May 31, 2017, at 09:06, Yingyi Bu  wrote:
> 
> Hi dev,
> 
>I wonder if the following potentially obsolete modules could be moved
> out of the AsterixDB code base:
>-- asterix-experiment
>-- asterix-tools
>-- hyracks-dist
>-- hyracks-server
> 
>Any thoughts?
> 
> Best,
> Yingyi

Best regards,
Ildar



Parallel feed ingestion

2017-05-17 Thread Ildar Absalyamov
In light of Steven’s discussion about feeds in a parallel thread, I was wondering 
what would be the correct way to push parallel ingestion as far as possible in a 
multi-node/multi-partition environment.
In one of my experiments I am trying to saturate ingestion to see the effect of 
computing stats in the background.
Several things I’ve tried:
1) Open a socket adapter on all NCs:
create feed Feed using socket_adapter
(
("sockets"="NC1:10001,NC2:10001,…"),
…)

2) Connect several feeds to a single dataset:
create feed Feed1 using socket_adapter
(
("sockets"="NC1:10001"),
…)
create feed Feed2 using socket_adapter
(
("sockets"="NC2:10001"),
…)

3) Have several nodes sending data into a single socket.

In my previous experiments the parallelization did not make much difference, which 
suggested that the bottleneck was on the sender side, but I am wondering whether 
that is still the case, since a lot has happened under the hood since then.
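
For concreteness, the sender side for options 1/2 could be a small multi-threaded 
client along the following lines (this is only a sketch: hosts, ports, record shape, 
and counts are placeholders and have to match the feed DDL above):

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sender-side load generator: one thread per NC socket endpoint,
// each writing ADM records as fast as possible.
public class ParallelFeedSender {
    private static final String[] ENDPOINTS = { "NC1:10001", "NC2:10001" };
    private static final long RECORDS_PER_SENDER = 1_000_000L;

    public static void main(String[] args) throws Exception {
        List<Thread> senders = new ArrayList<>();
        for (int s = 0; s < ENDPOINTS.length; s++) {
            final int senderId = s;
            senders.add(new Thread(() -> send(ENDPOINTS[senderId], senderId)));
        }
        for (Thread t : senders) { t.start(); }
        for (Thread t : senders) { t.join(); }
    }

    private static void send(String endpoint, int senderId) {
        String[] hostPort = endpoint.split(":");
        try (Socket socket = new Socket(hostPort[0], Integer.parseInt(hostPort[1]));
                OutputStream out = socket.getOutputStream()) {
            // Offset ids per sender so primary keys stay unique across sockets.
            long base = senderId * RECORDS_PER_SENDER;
            for (long i = 0; i < RECORDS_PER_SENDER; i++) {
                String record = "{\"id\": " + (base + i) + ", \"text\": \"record\"}\n";
                out.write(record.getBytes(StandardCharsets.UTF_8));
            }
            out.flush();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}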

Best regards,
Ildar

Re: Searching for duplicates during feed ingestion.

2017-05-08 Thread Ildar Absalyamov
I believe we already support upsert feeds ;)
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/test/resources/runtimets/queries/feeds/upsert-feed/upsert-feed.1.ddl.aql
 

> On May 8, 2017, at 12:04, Jianfeng Jia  wrote:
> 
> I also observe this getting-slower problem every time we re-ingest the 
> twitter data. One difference is that duplicate keys can happen, and we 
> know those are indeed duplicate records. To skip the search, we would expect 
> “upsert” logic (just replace the old one :-) ) instead of an insert. 
> 
> Then maybe we can add some configuration in feed configuration like
> 
> create feed MessageFeed using localfs(
> ("format"="adm"),
> ("type-name"="typeX"),
> ("upsert"="true")
> );
> 
> to indicate that this feed uses the upsert logic instead of insert. 
> 
> One thing we need to confirm is whether “upsert” is actually implemented in a 
> no-search fashion. 
> Based on the way we search the components, only the most recent one will 
> be popped out, so a blind insert should be OK logically. Correct me if I 
> missed some other cases (highly likely :-)).
> 
> 
>> On May 8, 2017, at 11:05 AM, Mike Carey  wrote:
>> 
>> +0.99 from me.
>> 
>> 
>> On 5/8/17 9:50 AM, Taewoo Kim wrote:
>>> +1 for auto-generated ID case
>>> 
>>> Best,
>>> Taewoo
>>> 
>>> On Mon, May 8, 2017 at 8:57 AM, Yingyi Bu  wrote:
>>> 
 Abdullah has a pending change that disables searches if there's no
 secondary indexes [1].
 Auto-generated ID could be another case for which we can disable searches
 as well.
 
 Best,
 Yingyi
 
 [1] https://asterix-gerrit.ics.uci.edu/#/c/1711/
 
 
 On Mon, May 8, 2017 at 4:30 AM, Wail Alkowaileet 
 wrote:
 
> Hi Devs,
> 
> I'm noticing a behavior during ingestion: it is getting slower over
> time. I know that is expected behavior for LSM indexes. But what I'm
> seeing is that I can notice the drop in ingestion rate roughly after having
> 10 components (around ~13 GB). That's what I'm not sure is expected.
> 
> I tried multiple setups (increasing memory component size +
> max-mergable-component-size). All of these delayed the problem but did not
> solve it. The only part I've never changed is the bloom-filter
> false-positive rate (1%), which I want to investigate next.
> 
> So..
> What I want to suggest is: when the primary key is auto-generated, why does
> AsterixDB look for duplicates? It seems a wasteful operation to me. Also,
> can we give the user the ability to tell the index that all keys are unique?
> I know I should not trust the user .. but in certain cases, probably the
> user is certain that the key is unique. Or a more elegant solution can
> shine in the end :-)
> 
> --
> 
> *Regards,*
> Wail Alkowaileet
> 
>> 
> 

Best regards,
Ildar



Re: When is it appropirate to add reserved words to the AQL/SQL++ Grammar?

2017-04-12 Thread Ildar Absalyamov
Ian,

There is an existing API which does exactly that:
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/server/ConnectorApiServlet.java#L123
 

I have been successfully using it in my experiments.
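
For reference, a minimal sketch of invoking it over HTTP (the endpoint path and 
parameter names below are from memory and should be treated as assumptions; the 
dataverse/dataset names are placeholders — please verify against the servlet code):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: calling the connector API, which (as I recall) flushes the dataset
// before reporting its partition/file info. Path and parameters are assumptions.
public class ConnectorApiCall {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:19002/connector"
                + "?dataverseName=TwitterDataverse&datasetName=Tweets");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}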

> On Apr 12, 2017, at 13:41, Ian Maxon  wrote:
> 
> Hey all,
> I was working on a patch that would add a 'flush dataset' DDL command
> (mainly for testing), and of course this would require adding 'flush'
> as a new reserved word. What is the consensus on when this would be
> permitted at this point? There are other ways to do this, of course, I
> could expose this functionality through a separate API, and I am not
> really partial to one solution or the other.
> 
> Thanks,
> - Ian

Best regards,
Ildar



Re: Force LSM component flush & NC-CC messaging ACK

2017-01-21 Thread Ildar Absalyamov
As Mike mentioned, I need this forced flush to trigger stats collection during my 
experiments. I brought up messaging only because I noticed that if I use shutdown 
to force the flush, some messages are lost because the CC has already been shut 
down by the time they arrive.

Anyway, this ConnectorAPI indeed did exactly what I wanted. Thanks Wail!
Given that we have an API way of forcing the flush, I am not sure the 
language-level construct is needed.

> On Jan 21, 2017, at 08:51, Wail Alkowaileet <wael@gmail.com> wrote:
> 
> I remember one reason to force a flush is the Pregelix connector [1][2][3].
> 
> For the messaging framework, I believe that you probably have the same
> issue I had. I did what Till has suggested as it is guaranteed by the
> robustness of AsterixDB and not the user who might kill the process anyway.
> 
> [1]
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/servlet/ConnectorAPIServlet.java
> [2]
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/java/org/apache/asterix/util/FlushDatasetUtils.java
> [3]
> https://github.com/apache/asterixdb/blob/2f9d4c3ab4d55598fe9a14fbf28faef12bed208b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/operators/std/FlushDatasetOperatorDescriptor.java
> 
> On Sat, Jan 21, 2017 at 7:17 PM, Mike Carey <dtab...@gmail.com> wrote:
> 
>> I believe Ildar is just looking for a way to ensure, in doing experiments,
>> that things are all in disk components.  His stats-gathering extensions
>> camp on the LSM lifecycle - flushes in particular - and he wants to finish
>> that process in his testing and experiments.  Wail's schema inference stuff
>> has a similar flavor.  So the goal is to flush any lingering memory
>> components to disk for a given dataset at the end of the "experiment
>> lifecycle".
>> 
>> We have DDL to compact a dataset - which flushes AND compacts - it might
>> also be useful to have DDL to flush a dataset without also forcing
>> compaction - as a way for an administrator to release that dataset's
>> in-memory component related resources.  (Not that it's "necessary" for any
>> correctness reason - just might be nice to be able to do that.  That could
>> also be useful in scripting more user-level-oriented recovery tests.)
>> 
>> Thus, I'd likely vote for adding a harmless new DDL statement - another
>> arm of the one that supports compaction - for this.
>> 
>> Cheers,
>> 
>> Mike
>> 
>> 
>> 
>> On 1/21/17 6:21 AM, Till Westmann wrote:
>> 
>>> Hi Ildar,
>>> 
>>> On 19 Jan 2017, at 4:02, Ildar Absalyamov wrote:
>>> 
>>>> Since I was out for quite a while and a lot of things happened in the
>>>> meantime in the codebase, I wanted to clarify a couple of things.
>>>> 
>>>> I was wondering if there is any legitimate way to force the data in
>>>> in-memory components to be flushed, other than stopping the whole instance?
>>>> It used to be that choosing a different default dataverse with a “use”
>>>> statement did the trick, but that is not the case anymore.
>>>> 
>>> 
>>> Just wondering, why do you want to flush the in-memory components to disk?
>>> 
>>>> Another question is regarding CC<->NC & NC<->NC messaging. Does the
>>>> sender get some kind of ACK that the message was received by the addressee?
>>>> Say, if I send a message just before the instance shuts down, will the shutdown
>>>> hook wait until the message is delivered and processed?
>>>> 
>>> 
>>> I agree with Murtadha that it can certainly be done. However, we also
>>> need to assume that some shutdowns won’t be clean and so the messages might
>>> not be received. So it might be easier to just be able to recover from
>>> missing messages than to be able to recover *and* to synchronize on
>>> shutdown. Just a thought - maybe that’s not even an issue for your use-case.
>>> 
>>> Cheers,
>>> Till
>>> 
>> 
>> 
> 
> 
> -- 
> 
> *Regards,*
> Wail Alkowaileet

Best regards,
Ildar



Force LSM component flush & NC-CC messaging ACK

2017-01-18 Thread Ildar Absalyamov
Hi devs,

Since I was out for quite a while and a lot of things happened in the meantime in 
the codebase, I wanted to clarify a couple of things.

I was wondering if there is any legitimate way to force the data in in-memory 
components to be flushed, other than stopping the whole instance? 
It used to be that choosing a different default dataverse with a “use” statement 
did the trick, but that is not the case anymore.

Another question is regarding CC<->NC & NC<->NC messaging. Does the sender get 
some kind of ACK that the message was received by the addressee? Say, if I send 
a message just before the instance shuts down, will the shutdown hook wait until 
the message is delivered and processed?

Best regards,
Ildar



Re: About the Multiple Join Optimization on AsterixDB

2016-10-31 Thread Ildar Absalyamov
As Yingyi pointed out, we don't reorder joins because the framework for
stats and cardinalities is not there yet.
However, what we can do in the meantime is provide an interface for the
statistical information needed for join reordering, independent of the way
the stats were collected (either sampling-based or LSM-based), and work out
the details of the cost model.
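
Something along these lines (purely a sketch; none of these names exist in the
codebase yet):

// Hypothetical collection-agnostic statistics interface that a join reordering
// rule could consume; the names and methods here are illustrative only.
public interface IDatasetStatisticsProvider {
    // Estimated number of tuples in a dataset (or dataset partition).
    long getCardinality(String dataverse, String dataset);

    // Estimated number of distinct values of a field, used for join selectivity.
    long getDistinctValues(String dataverse, String dataset, String field);

    // Estimated selectivity (0..1) of an equality predicate on a field value.
    double getSelectivity(String dataverse, String dataset, String field, Object value);
}

Both a sampling-based collector and an LSM-lifecycle-based collector could then
implement the same interface and feed the same cost model.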

2016-10-31 11:30 GMT-07:00 mingda li :

> Hi Yingyi,
>
> I see. Thanks for your reply:-)
>
> Bests,
> Mingda
>
>
> On Mon, Oct 31, 2016 at 11:23 AM, Yingyi Bu  wrote:
>
> > Mingda,
> >
> >  I'm not sure how much re-ordering can be done at the Hyracks level,
> > i.e., the runtime level.
> >  In the optimizer (the asterixdb/algebricks level), we don't have
> > re-ordering for  joins, because:
> >  --- the cost model has not been added yet.  I'm not sure about the
> > timeline for this. @Ildar?
> >  --- respecting user-specified join orders is important for certain
> > cases, for example, to get stable/predictable performance (zero surprise)
> > for applications.
> >
> >  In the runtime, we have a role-reversal optimization in hybrid hash
> > join, which is a safe optimization that is not based on estimations.  You
> > can look at OptimizedHybridHashJoin.
> >
> > Best,
> > Yingyi
> >
> >
> > On Mon, Oct 31, 2016 at 11:16 AM, mingda li 
> > wrote:
> >
> > > Dear all,
> > >
> > > Hi, I am working on multiple joins at the Hyracks level. I am not sure, if I do
> > > a multiple join on AsterixDB, whether it will optimize the query by
> > > changing the join order or just execute it according to how we write the
> > > query. I think this may not be done at the Algebricks level based on rules, but
> > > I am not sure.
> > >
> > > Bests,
> > > Mingda
> > >
> >
>



-- 
Best regards,
Ildar


Re: Unsigned integers data types

2016-06-16 Thread Ildar Absalyamov
Things like Spark and Flink don’t do that either, but that is because they need 
integration with native Java types.

> On Jun 16, 2016, at 15:45, Yingyi Bu <buyin...@gmail.com> wrote:
> 
>>> Is there any database or SQL implementation supporting that?
> Ok, it turns out MySQL supports that,  while Postgres, MS SQL and Hive do
> not have that.
> 
> Best,
> Yingyi
> 
> On Thu, Jun 16, 2016 at 3:40 PM, Yingyi Bu <buyin...@gmail.com> wrote:
> 
>>>> I guess part of the reason why we do that is because Java used to lack
>> native support of unsigned integers.
>> Is there any database or SQL implementation supporting that?
>> 
>> FYI:
>> 
>> http://dba.stackexchange.com/questions/53050/why-arent-unsigned-integer-types-available-in-the-top-database-platforms
>> 
>> Best,
>> Yingyi
>> 
>> 
>> On Thu, Jun 16, 2016 at 3:27 PM, Ildar Absalyamov <
>> ildar.absalya...@gmail.com> wrote:
>> 
>>> Hi devs,
>>> 
>>> As I was generating various data distributions for statistics experiments
>>> one thing kept bothering me.
>>> All Asterix integer types (int8, int16, int32, int64) are signed. However,
>>> the majority of real use cases do not require negative integer values. It seems
>>> like we are wasting half of the data range on something which does not get
>>> used that often. I guess part of the reason why we do that is because Java
>>> used to lack native support of unsigned integers. But since Java 8 there
>>> are methods which do unsigned comparison and division (summation,
>>> subtraction, multiplication are the same in both signed and unsigned
>>> cases). So it seems like conversion to support unsigned integers would not
>>> be that difficult.
>>> 
>>> Any thoughts on whether we need unsigned integers in the type system?
>>> 
>>> Best regards,
>>> Ildar
>>> 
>>> 
>> 

Best regards,
Ildar



Unsigned integers data types

2016-06-16 Thread Ildar Absalyamov
Hi devs,

As I was generating various data distributions for statistics experiments, one 
thing kept bothering me.
All Asterix integer types (int8, int16, int32, int64) are signed. However, the 
majority of real use cases do not require negative integer values. It seems like 
we are wasting half of the data range on something which does not get used 
that often. I guess part of the reason we do that is because Java used to 
lack native support for unsigned integers. But since Java 8 there are methods 
which do unsigned comparison and division (addition, subtraction, and 
multiplication are the same in both the signed and unsigned cases). So it seems 
like a conversion to support unsigned integers would not be that difficult.
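
For example (a small standalone sketch using the helpers the JDK added in Java 8; 
the values are just illustrative):

// Java 8 unsigned helpers on java.lang.Integer (Long has the same set).
public class UnsignedOps {
    public static void main(String[] args) {
        int a = 0xFFFFFFFE;            // 4294967294 when read as unsigned
        int b = 2;

        // Signed comparison says a < b; unsigned comparison says a > b.
        System.out.println(Integer.compare(a, b));            // -1
        System.out.println(Integer.compareUnsigned(a, b));     // 1

        // Unsigned division/remainder, plus widening without sign extension.
        System.out.println(Integer.divideUnsigned(a, b));      // 2147483647
        System.out.println(Integer.remainderUnsigned(a, b));   // 0
        System.out.println(Integer.toUnsignedLong(a));         // 4294967294
        System.out.println(Integer.toUnsignedString(a));       // "4294967294"
    }
}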

Any thoughts on whether we need unsigned integers in the type system?

Best regards,
Ildar