[ML][DISCUSSION] Big Double problem

2019-06-10 Thread Ravil Galeyev
Hi Team,

I tried to run Ignite ML across the dataset with categorical features and
came across some problems.

My dataset is the Mushrooms dataset from Kaggle.
It contains only categorical features and categorical labels
(a so-called classification problem). You can find my attempt in my repo.

My goal is to build a pipeline which takes raw string values, encodes them
as numbers, and then trains a model.

The first problem is the Vectorizer.

I started with DummyVectorizer, but it supports only Double labels.

All other vectorizers have the same issue because they all inherit from
DefaultLabelVectorizer, where Double labels are hardcoded at the generic level.

I didn’t find a way to work with purely categorical data using the standard
Ignite vectorizers, so I wrote my own.

The second problem is EncoderTrainer (in my case, STRING_ENCODER).

It doesn’t encode labels; the trainer just ignores them. See EncoderTrainer.

Ignoring labels probably makes sense, but…

The third problem is ClassCastException.

There are casts of labels to Double that are “hidden” from the user in model
trainers, e.g. SVMLinearClassificationTrainer, DiscreteNaiveBayesTrainer, etc.

Feel free to use my regex \(Double\).*\.label\(\) to search other casts.
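
As a rough illustration of what the regex above matches, here is a small Java sketch. The two sample source lines are made up for the example and are not taken from the Ignite code base; only the pattern itself comes from this message.

```java
import java.util.regex.Pattern;

public class LabelCastFinder {
    // The regex from the message above, escaped for a Java string literal.
    private static final Pattern CAST = Pattern.compile("\\(Double\\).*\\.label\\(\\)");

    /** Returns true if the line contains a (Double) cast followed by a .label() call. */
    public static boolean hasLabelCast(String line) {
        return CAST.matcher(line).find();
    }

    public static void main(String[] args) {
        // Hypothetical source lines, used only to exercise the pattern.
        String[] lines = {
            "double lb = (Double) row.label();",
            "String lb = row.label().toString();"
        };
        for (String line : lines)
            System.out.println(hasLabelCast(line) + " : " + line);
    }
}
```

The same pattern can be given to any regex-capable search tool; the Java wrapper is only there to make the behaviour easy to check.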

To sum up, there is an assumption that labels are numeric values,
but in a classification problem labels can be anything.

But I didn’t find an easy way to preprocess them.
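
To make the missing preprocessing step concrete, here is a minimal sketch of the kind of label encoding I mean. It is plain Java and does not use any Ignite ML API; the class name StringLabelEncoder and the sample labels "e" and "p" are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Ignite ML: maps string labels to double codes. */
public class StringLabelEncoder {
    // Label -> code, in order of first appearance.
    private final Map<String, Double> codes = new LinkedHashMap<>();

    /** Returns the code of a label, assigning the next free index on first sight. */
    public double encode(String label) {
        Double code = codes.get(label);
        if (code == null) {
            code = (double) codes.size();
            codes.put(label, code);
        }
        return code;
    }

    /** Returns the original label for a code, or null if the code is unknown. */
    public String decode(double code) {
        for (Map.Entry<String, Double> e : codes.entrySet())
            if (e.getValue() == code)
                return e.getKey();
        return null;
    }

    public static void main(String[] args) {
        StringLabelEncoder enc = new StringLabelEncoder();
        // Sample label values, assumed for the example.
        System.out.println(enc.encode("e"));
        System.out.println(enc.encode("p"));
        System.out.println(enc.decode(1.0));
    }
}
```

With something like this applied to the labels before training, the trainers would receive Double labels and the hidden casts would no longer fail. The predicted codes can be mapped back with decode().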



If you have any questions or need details, feel free to write to me.

Best regards,

Ravil


Re: SQL query crashes Ignite.

2019-06-10 Thread Shane Duan
Thanks, Ilya. I tried lazy=true, still no luck. We also tried a
different test workflow, in which there is a table containing about 1000
rows; each row has an id column (primary key), 3 string columns and a blob
column (about 1M each). Then we have a multithreaded application (using the
Tomcat thread pool) which performs queries (where id=). Ignite
crashes with this application as well (out of memory). If the WHERE clause
is identical (all with the same id), there is no problem.

By the way, is it okay to use the Ignite JDBC driver with the Tomcat thread pool?
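
For reference, this is roughly how the connection string looked with the suggestion quoted below applied. It is only a sketch: the class and method names are made up for this message, credentials are left out, and the host, port and schema values in main() are placeholders.

```java
public class ThinJdbcUrl {
    /**
     * Builds a thin-driver URL in the same ';'-separated style as the snippet
     * quoted below, with lazy=true appended as suggested.
     */
    public static String lazyUrl(String host, int port, String schema) {
        return "jdbc:ignite:thin://" + host + ":" + port
            + ";schema=" + schema
            + ";lazy=true";
    }

    public static void main(String[] args) {
        // Placeholder values; the real host, port and schema differ.
        System.out.println(lazyUrl("localhost", 10800, "PUBLIC"));
    }
}
```

The resulting string is passed to DriverManager.getConnection() exactly as in the quoted code.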



On Fri, Jun 7, 2019 at 9:25 AM Ilya Kasnacheev wrote:

> Hello!
>
> Consider adding ;lazy=true to the connection string. Otherwise you cause an
> overly large result set to be held on heap.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Fri, Jun 7, 2019 at 19:07, Shane Duan wrote:
>
> > Thanks Denis.
> >
> > Oops, I copied the wrong section of the log file. The previous log, which
> > complains about the schema, appears for any further operation once Ignite
> > has crashed.
> >
> > Here is how I prepared the JDBC connection:
> >
> > String connectionStr = "jdbc:ignite:thin://" + hostName + ":" + portNumber + ";"
> > + "schema=" + schema + ";"
> > + "user=" + userName + ";"
> > + "password=" + password;
> >
> > // Register JDBC driver.
> > Class.forName("org.apache.ignite.IgniteJdbcThinDriver");
> >
> > // Open the JDBC connection with predefined connection endpoints.
> > Connection conn = DriverManager.getConnection(connectionStr);
> >
> >
> > The example works if it fetches just a couple of rows.
> >
> > Here is the right log.
> >
> >
> > 2019-06-06 10:42:34,402 DEBUG Client request received [reqId=0, addr=/
> > 10.212.22.67:54469, req=JdbcQueryExecuteRequest
> > [schemaName=FEATURE_TILE_CACHE, pageSize=1024, maxRows=0, sqlQry=SELECT
> id,
> > val FROM CITY WHERE id > 500, args=[], stmtType=SELECT_STATEMENT_TYPE,
> > autoCommit=true]]
> > 2019-06-06 10:42:34,407 DEBUG Set schema: FEATURE_TILE_CACHE
> > 2019-06-06 10:42:34,448 DEBUG Parsed query: `SELECT id, val FROM CITY
> WHERE
> > id > 500` into two step query: GridCacheTwoStepQuery
> > [mapQrys=[GridCacheSqlQuery [qry=SELECT
> > __Z0.ID __C0_0,
> > __Z0.VAL __C0_1
> > FROM FEATURE_TILE_CACHE.CITY __Z0
> > WHERE __Z0.ID > 500, paramIdxs=[], cols={__C0_0=GridSqlType [type=5,
> > scale=0, precision=19, displaySize=20, sql=BIGINT], __C0_1=GridSqlType
> > [type=12, scale=0, precision=2147483647, displaySize=2147483647,
> > sql=VARBINARY]}, alias=null, sort=[], partitioned=true, node=null,
> > derivedPartitions=null, hasSubQries=false]], rdc=GridCacheSqlQuery
> > [qry=SELECT
> > __C0_0 ID,
> > __C0_1 VAL
> > FROM PUBLIC.__T0, paramIdxs=[], cols=null, alias=null, sort=null,
> > partitioned=false, node=null, derivedPartitions=null, hasSubQries=false],
> > pageSize=1024, explain=false, originalSql=SELECT
> > ID,
> > VAL
> > FROM FEATURE_TILE_CACHE.CITY
> > WHERE ID > 500, distributedJoins=false, skipMergeTbl=true, local=false,
> > mvccEnabled=false, forUpdate=false]
> > 2019-06-06 10:42:34,448 DEBUG Parsed query: `SELECT id, val FROM CITY
> WHERE
> > id > 500` into two step query: GridCacheTwoStepQuery
> > [mapQrys=[GridCacheSqlQuery [qry=SELECT
> > __Z0.ID __C0_0,
> > __Z0.VAL __C0_1
> > FROM FEATURE_TILE_CACHE.CITY __Z0
> > WHERE __Z0.ID > 500, paramIdxs=[], cols={__C0_0=GridSqlType [type=5,
> > scale=0, precision=19, displaySize=20, sql=BIGINT], __C0_1=GridSqlType
> > [type=12, scale=0, precision=2147483647, displaySize=2147483647,
> > sql=VARBINARY]}, alias=null, sort=[], partitioned=true, node=null,
> > derivedPartitions=null, hasSubQries=false]], rdc=GridCacheSqlQuery
> > [qry=SELECT
> > __C0_0 ID,
> > __C0_1 VAL
> > FROM PUBLIC.__T0, paramIdxs=[], cols=null, alias=null, sort=null,
> > partitioned=false, node=null, derivedPartitions=null, hasSubQries=false],
> > pageSize=1024, explain=false, originalSql=SELECT
> > ID,
> > VAL
> > FROM FEATURE_TILE_CACHE.CITY
> > WHERE ID > 500, distributedJoins=false, skipMergeTbl=true, local=false,
> > mvccEnabled=false, forUpdate=false]
> > 2019-06-06 10:42:34,457 DEBUG Sending: [msg=GridH2QueryRequest [reqId=1,
> > caches=[-2013421729], topVer=AffinityTopologyVersion [topVer=1,
> > minorTopVer=2], parts=null, qryParts=null, pageSize=1024,
> > qrys=[GridCacheSqlQuery [qry=SELECT
> > __Z0.ID __C0_0,
> > __Z0.VAL __C0_1
> > FROM FEATURE_TILE_CACHE.CITY __Z0
> > WHERE __Z0.ID > 500, paramIdxs=[], cols={__C0_0=GridSqlType [type=5,
> > scale=0, precision=19, displaySize=20, sql=BIGINT], __C0_1=GridSqlType
> > [type=12, scale=0, precision=2147483647, displaySize=2147483647,
> > sql=VARBINARY]}, alias=null, sort=[], partitioned=true, node=null,
> > derivedPartitions=null, hasSubQries=false]], flags=2, tbls=null,
> timeout=0,
> > params=[], schemaName=FEATURE_TILE_CACHE, mvccSnapshot=null, txReq=null],
> > nodes=[TcpDiscoveryNode [id=9f8714ae-d9ba-4135-836c-ca8c2a1cc4c1,
> > addrs=[0:0:0:0:0:0:0:1%lo, 10.29.77.101, 127.0.0.1, 172.17.0.1],
> > sockAddrs=[/172.17.0.1:9820, greentea.esri.com/10.29.77.101:9820,
> > 

[jira] [Created] (IGNITE-11909) Cache.invokeAll() returns a map with BinaryObjects as keys

2019-06-10 Thread Sergey Kosarev (JIRA)
Sergey Kosarev created IGNITE-11909:
---

 Summary: Cache.invokeAll() returns a map with BinaryObjects as keys
 Key: IGNITE-11909
 URL: https://issues.apache.org/jira/browse/IGNITE-11909
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Kosarev


Preconditions:
1) AtomicityMode.Transactional
2) Key is a custom object (e.g. MyKey).

cache.invokeAll should return Map<K, EntryProcessorResult<T>>, but keys 
processed on remote node(s) are not unwrapped and are returned as BinaryObject, so we 
can get a map with mixed keys:

{code}
key.class = BinaryObjectImpl, key = 
org.apache.ignite.examples.datagrid.CacheEntryProcessorExample2$MyKey 
[idHash=151593342, hash=31459296, i=2]
key.class = MyKey, key = MyKey{i=7}
key.class = BinaryObjectImpl, key = 
org.apache.ignite.examples.datagrid.CacheEntryProcessorExample2$MyKey 
[idHash=405215542, hash=31638042, i=8]
key.class = MyKey, key = MyKey{i=1}
key.class = BinaryObjectImpl, key = 
org.apache.ignite.examples.datagrid.CacheEntryProcessorExample2$MyKey 
[idHash=1617838096, hash=31548669, i=5]
key.class = MyKey, key = MyKey{i=0}
key.class = BinaryObjectImpl, key = 
org.apache.ignite.examples.datagrid.CacheEntryProcessorExample2$MyKey 
[idHash=138776324, hash=31578460, i=6]
key.class = MyKey, key = MyKey{i=9}
key.class = MyKey, key = MyKey{i=4}
{code}

Reproducer is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11908) OOM in MVCC PDS4

2019-06-10 Thread Ivan Pavlukhin (JIRA)
Ivan Pavlukhin created IGNITE-11908:
---

 Summary: OOM in MVCC PDS4 
 Key: IGNITE-11908
 URL: https://issues.apache.org/jira/browse/IGNITE-11908
 Project: Ignite
  Issue Type: Bug
  Components: mvcc
Reporter: Ivan Pavlukhin
Assignee: Ivan Pavlukhin


Almost every time [MVCC PDS 
4|https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_MvccPds4_IgniteTests24Java8=%3Cdefault%3E=buildTypeStatusDiv]
 fails with OOM.





Re: [Code Style Check] TC issues in master

2019-06-10 Thread Maxim Muzafarov
Igniters,

It seems to me that building the [ignite-scalar] module under JDK9+ has
been successfully solved for the ~Build Apache Ignite~ suite [1] [2]
[3], but it was not configured for the [Check Code Style] suite. We
should configure it the same way (though that sounds very odd to me). I
see that we have several options here:

1. Enable the `checkstyle` profile for the ~Build Apache Ignite~ suite as
we've discussed previously [4] and forget about any duplicate
configuration once and for all. One more reason to do so is that the code
style has been violated for a few days and nobody noticed it [5].

2. Since the checkstyle plugin is not related to Scala source code (it
does not check it), we can exclude the Scala modules from the Maven build
procedure for the checkstyle suite by adding some command-line
parameters (I tested them locally, but I have no TC permissions to check them
on TC):
-pl -:ignite-scalar_2.10,-:ignite-scalar,-:ignite-visor-console,-:ignite-visor-console_2.10

3. Configure [Check Code Style] the same way as ~Build Apache Ignite~
to support builds for JDK9+.

WDYT?
Which option would be best for Apache Ignite?

[1] https://github.com/scala/bug/issues/10871
[2] https://issues.apache.org/jira/browse/IGNITE-6730
[3] https://issues.apache.org/jira/browse/IGNITE-11189
[4] http://apache-ignite-developers.2346864.n4.nabble.com/Code-inspection-tp27709p41297.html
[5] https://issues.apache.org/jira/browse/IGNITE-11899

On Fri, 7 Jun 2019 at 15:36, Nikolay Izhikov  wrote:
>
> Hello, Petr.
>
> > at least Scala does not compile
>
> How can I reproduce it?
> Do we have a ticket?
>
> On Fri, 07/06/2019 at 15:28 +0300, Petr Ivanov wrote:
> > Suite fails because Apache Ignite compilation is not supported under JDK 9+ 
> > (at least Scala does not compile).
> > Your build from [3] was triggered with JDK 11.
> >
> > > On 7 Jun 2019, at 14:57, Maxim Muzafarov  wrote:
> > >
> > > Igniters,
> > >
> > > I've noticed a few problems with Code Style Check Suite on TC in the
> > > master branch.
> > >
> > > 1. Some of the rules have been violated by previous commits to the
> > > master branch. I've created ticket [1] and have prepared PR [2] which
> > > is fixing it.
> > > Dmitry, or maybe someone else, can you take a look, please?
> > >
> > > 2. The Code Style Check suite still fails (from time to time) on TC with
> > > a compile error in the [ignite-scalar] module
> > > (java.lang.NoClassDefFoundError: javax/tools/ToolProvider). For
> > > instance, this build [3] fails and this one [4] is fully OK. However, the
> > > ~Build Apache Ignite~ suite with almost the same configuration passes
> > > normally.
> > >
> > > I'd like to create a new suite with checkstyle for debug purposes, can
> > > anyone grant permission to copy\clone\edit suites on TC? My login:
> > > maxmu...@gmail.com
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11899
> > > [2] https://github.com/apache/ignite/pull/6597
> > > [3] https://ci.ignite.apache.org/viewLog.html?buildId=4020653=IgniteTests24Java8_CheckCodeStyle=buildLog_IgniteTests24Java8=%3Cdefault%3E
> > > [4] https://ci.ignite.apache.org/viewLog.html?buildId=4021372=IgniteTests24Java8_CheckCodeStyle
> >
> >


[jira] [Created] (IGNITE-11907) Registration of continuous query should fail if nodes don't have remote filter class

2019-06-10 Thread Denis Mekhanikov (JIRA)
Denis Mekhanikov created IGNITE-11907:
-

 Summary: Registration of continuous query should fail if nodes 
don't have remote filter class
 Key: IGNITE-11907
 URL: https://issues.apache.org/jira/browse/IGNITE-11907
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7
Reporter: Denis Mekhanikov
 Attachments: ContinuousQueryRemoteFilterMissingInClassPathSelfTest.java

If one of the data nodes doesn't have a remote filter class, then registration of 
continuous queries should fail with an exception. Currently the nodes fail instead.

Reproducer is attached: 
[^ContinuousQueryRemoteFilterMissingInClassPathSelfTest.java]





Re: [DISCUSSION] Cluster read-only mode.

2019-06-10 Thread Alexey Kukushkin
I agree with Ivan's concern - do we really need the "activation" concept in
Ignite?

Activation was introduced with Ignite persistence: we must prevent both
read and write operations on a cluster with persistence enabled until the full
data set is loaded (all the nodes are started). Cluster "activation" was a hint
telling the cluster that enough nodes had started for it to have
all the data.

Then we introduced the concept of "baseline topology". It looks like
"cluster is active" is similar to "cluster has baseline topology defined".

Can we remove the concept of "activation" now and leave only "set baseline
topology"? Having duplicate concepts negatively impacts Ignite's usability,
making it unnecessarily complex.



--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/


[jira] [Created] (IGNITE-11906) Scalar examples fails on TC

2019-06-10 Thread Nikolay Izhikov (JIRA)
Nikolay Izhikov created IGNITE-11906:


 Summary: Scalar examples fails on TC
 Key: IGNITE-11906
 URL: https://issues.apache.org/jira/browse/IGNITE-11906
 Project: Ignite
  Issue Type: Improvement
Reporter: Nikolay Izhikov
Assignee: Nikolay Izhikov


Scalar examples tests fail in master.

https://ci.ignite.apache.org/viewLog.html?buildId=4085544=buildResultsDiv=IgniteTests24Java8_ScalaExamples





[RESULT] [VOTE] Release Apache Ignite 2.7.5-rc4

2019-06-10 Thread Dmitriy Pavlov
The vote for the new release candidate is now closed.



Vote result: Vote passes with 9 votes +1 (4 binding +1 votes), no 0 and no
-1.



+1 votes:

- Ilya Kasnacheev

- Nikolay Izhikov (binding)

- Denis Magda (binding)

- Andrey Gura (binding)

- Alexey Goncharuk (binding)

- Igor Sapego

- Ivan Pavlukhin

- Yuriy Babak

- Vyacheslav Daradur



Vote thread
https://lists.apache.org/thread.html/35cbc2d4c5b769155dc8aec15edd808a25c5cf48a5e12637528e931d@%3Cdev.ignite.apache.org%3E


[IEP-35] Monitoring & Profiling. Phase 2

2019-06-10 Thread Nikolay Izhikov
Hello, Igniters.

Since Phase 1 will be merged into master soon, I've created the ticket [1] for 
Phase 2.

Scope of Phase 2 (copy-pasted from the ticket):

Ability to collect lists of some internal objects that Ignite manages.
Examples of such objects:

  * Caches
  * Queries (including continuous queries)
  * Services
  * Compute tasks
  * Distributed Data Structures
  * etc...


1. Fields for each list (that doesn't currently exist in Ignite) will be 
discussed in separate tickets.
2. Metric Exporters (optionally) can support list export.

[1] https://issues.apache.org/jira/browse/IGNITE-11905


On Tue, 14/05/2019 at 16:42 +0300, Nikolay Izhikov wrote:
> Ticket for IEP.Phase1 created - https://issues.apache.org/jira/browse/IGNITE-11848
> 
> 
> On Mon, 13/05/2019 at 18:06 +0300, Nikolay Izhikov wrote:
> > Hello, Igniters.
> > 
> > We have discussed this IEP [1] with Alexey Goncharyuk, Anton Vinogradov, 
> > Andrey Gura, Alexey Scherbakov and Pavel Kovalenko.
> > 
> > Issues to address:
> > 
> > 1. Study the experience of the following libs and tools:
> > * OpenTracing
> > * OpenCensus
> > * DropWizard
> > 
> > 2. Support a histogram sensor: a sensor that collects values that fall into 
> > predefined segments.
> > 
> > 3. Use more widely used naming (like in OpenCensus?).
> > 
> > 4. Consider using OpenCensus as the default implementation for local 
> > metric storage.
> > 
> > 5. Measure the performance penalty of metrics for 5_000 caches.
> > 
> > 6. Some metrics should be part of the public API and others should not (they 
> > may be changed/removed in a release without warning).
> > 
> > My plan for Phase #1 is the following:
> > 
> > 1. Address the issues.
> > 2. Prepare public API
> > 3. Prepare PR for monitoring subsystem + existing metrics rewritten with it.
> > 4. Prepare a PR with lists of each user API.
> > 5. Collect feedback on #4.
> > 6. Design a log exposer. Consider the usage of JFR format or some other 
> > widely used, tool compatible format.
> > 
> > [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=112820392
> > 
> > On Thu, 02/05/2019 at 14:02 +0300, Nikolay Izhikov wrote:
> > > Hello, Maxim.
> > > 
> > > > How will throughput sensor values, which require an interval for the 
> > > > rate calculations, be recorded?
> > > 
> > > I answered this question in the IEP "Design principles":
> > > 
> > > ```
> > > Sensors should contain only raw values. No aggregation of numeric metrics 
> > > on Ignite side. 
> > > Min, max, avg and other functions are the matter of an external 
> > > monitoring system.
> > > ```
> > > 
> > > Throughput is a function `(S(t2) - S(t1))/(t2-t1)`
> > > where S(t) is the sensor value in some point of time t.
> > > 
> > > It seems that throughput calculation is the responsibility of an external system.
> > > 
> > > What do you think?
> > > 
> > > > It seems to me that we can add an additional parameter of 
> > > > `sensitivityLevel` to provide for the user a flexible sensor control 
> > > > (e.g., INFO, WARN, NOTICE, DEBUG).
> > > 
> > > For now, I think that all sensors and lists will be very (very!) 
> > > lightweight.
> > > So, we should be able to disable/enable them, for sure.
> > > 
> > > But we should be able to turn the whole Ignite subsystem off and on 
> > > for the case where we have strong performance limitations for a particular 
> > > workload.
> > > 
> > > So, we have two "levels" of monitoring: INFO and DEBUG (for profiling: 
> > > IEP-35 - Phase 3).
> > > For example, AFAIK we can't disable the current SQL system views (why should 
> > > we?)
> > > 
> > > On Tue, 30/04/2019 at 14:33 +0300, Maxim Muzafarov wrote:
> > > > Hello Nikolay,
> > > > 
> > > > I've looked through your PRs changes.
> > > > 
> > > > > Sensors
> > > > 
> > > > How will throughput sensor values, which require an interval for the
> > > > rate calculations, be recorded? Do we have such an example? For
> > > > instance, getAllocationRate() or getEvictionRate(). These metrics are
> > > > out of the scope of current PoC and IEP as they are not related to the
> > > > user metrics, but it is a good example of a particular metric type.
> > > > 
> > > > It seems to me that we can add an additional parameter of
> > > > `sensitivityLevel` to provide for the user a flexible sensor control
> > > > (e.g., INFO, WARN, NOTICE, DEBUG).
> > > > 
> > > > It also seems that for the sensors getValue() the completely
> > > > functional java approach can be used. Am I right?
> > > > 
> > > > On Mon, 29 Apr 2019 at 11:44, Nikolay Izhikov  
> > > > wrote:
> > > > > 
> > > > > Hello, Vyacheslav.
> > > > > 
> > > > > Thanks for the feedback!
> > > > > 
> > > > > > HttpExposer with Jetty's dependencies should be detached from the 
> > > > > > core module.
> > > > > 
> > > > > Agreed. Module hierarchy is the essence of the next steps.
> > > > > For now it is just a proof of my ideas for Ignite monitoring that we can 
> > > > > discuss.
> > > > > 
> > > > > > I like your approach with 'wrapper' for monitored objects, like 
> > > > > > 

[jira] [Created] (IGNITE-11905) [IEP-35] Monitoring Phase 2

2019-06-10 Thread Nikolay Izhikov (JIRA)
Nikolay Izhikov created IGNITE-11905:


 Summary: [IEP-35] Monitoring Phase 2
 Key: IGNITE-11905
 URL: https://issues.apache.org/jira/browse/IGNITE-11905
 Project: Ignite
  Issue Type: Improvement
Reporter: Nikolay Izhikov
Assignee: Nikolay Izhikov


Phase 2 should introduce:

Ability to collect lists of some internal objects that Ignite manages.
Examples of such objects:

* Caches
* Queries (including continuous queries)
* Services
* Compute tasks
* Distributed Data Structures
* etc...

1. Fields for each list should be discussed in separate tickets
2. Metric Exporters (optionally) can support list export.


