[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-07 Thread Kirill Shirokov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355283#comment-16355283
 ] 

Kirill Shirokov edited comment on IGNITE-6917 at 2/7/18 10:29 AM:
--

[~al.psc], so, the bottom line is that you don't have any other review 
comments, so I've moved this issue to 'Patch Available' state. Please revert it 
back if you haven't finished with reviewing.

Relevant tests have passed (the failures in the build aren't related to this 
issue).

https://ci.ignite.apache.org/viewLog.html?buildId=1075531=buildResultsDiv=IgniteTests24Java8_RunAllSql


was (Author: kirill.shirokov):
[~al.psc], so, the bottom line is that you don't have any other review 
comments, so I've moved this issue to 'Patch Available' state. Please revert it 
back if you haven't finished with reviewing.

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Affects Versions: 2.4
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>Priority: Major
>  Labels: iep-1
> Fix For: 2.4
>
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We already can batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or input stream in general case) on the client side, then transfer data to 
> the cluster, and then execute update inside the cluster, e.g. through 
> streamer.
> First of all we need to create quick and dirty prototype to assess potential 
> performance improvement. It speedup is confirmed, we should build base 
> implementation which will accept only files. But at the same time we should 
> understand how it will evolve in future: multiple file formats (probably 
> including Hadoop formarts, e.g. Parquet), escape characters, input streams, 
> etc..
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Curent implementation:
> {noformat}
> COPY 
> FROM "file.name"
> INTO .
> [(col-name, ...)]
> FORMAT  -- Only CSV format is supported in the current 
> release
> [BATCH_SIZE ]
> {noformat}
> We may want to gradually add features to this command in future to have 
> something like this:
> {noformat}
> COPY
> FROM "file.name"[CHARSET ""]
> INTO . [CREATE [IF NOT EXISTS]]
> [(col-name [] [NULLABLE] [ESCAPES], ...) [MATCH HEADER]]
> FORMAT (csv|tsv|...)
> -- CSV format options:
> [FIELDSEP='column-separators-regexp']
> [LINESEP='row-separators-regexp']
> [QUOTE='quote-chars']
> [ESCAPE='escape-char']
> [NULL='null-sequence']
> [COMMENT='single-line-comment-start-char']
> [TRIM_LINES]
> [IMPORT_EMPTY_LINES]
> [CHARSET ""]
> [ROWS -]
> --or--
> [SKIP ROWS ] [MAX ROWS ]
> [COLS -]
> --or--
> [SKIP COLS ] [MAX COLS ]
> [(MATCH | SKIP) HEADER]
> [(REPLACE|IGNORE|ABORT ON [])) DUPLICATE KEYS]
> [BATCH SIZE ( ROWS | [K|M|G|T|P])]
> [COMPRESS "codec-name" [codec options]]
> [LOCK (TABLE|ROWS)]
> [NOLOGGING]
> [BACKEND (DIRECT | STREAMER)]
> {noformat}
> h1. Implementation decisions and notes
> h2. Parsing
> * We support CSV format described in RFC 4180.
> * Custom row and column separators, quoting characters are currently hardcoded
> * Escape sequences, line comment characters are currently not supported
> * We may want to support fixed-length formats (via format descriptors) in 
> future
> * We may want to strip comments from lines (for example, starting with '#')
> * We may want to allow user to either ignore empty lines or treat them as a 
> special case of record having all default values
> * We may allow user to enable whitespace trimming from beginning and end of a 
> line
> * We may want to allow user to specify error handling strategy: e.g., only 
> one quote character is present or escape sequence is invalid.
> h2. File handling
> * File character set to be supported in future
> * Skipped/imported row number (or first/last line or skip header option), 
> skipped/imported column number (or first/last column): to be supported in 
> future
> * Line start pattern (as in MySQL): no support planned
> * We currently support only client-side import. No server-side file import.
> * We may want to support client-side stdin import in future.
> * We do not handle importing multiple files from single command
> * We don't benefit from any kind of pre-sorting pre-partitioning data on 
> client side.
> * We don't include any any 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-05 Thread Kirill Shirokov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352334#comment-16352334
 ] 

Kirill Shirokov edited comment on IGNITE-6917 at 2/5/18 1:48 PM:
-

> 52. {{BulkLoadParser#createParser}} - I think we don't need two factories, 
> one for formats is enough. Just create abstract method {{BulkLoadParser 
> createParser()}} in {{BulkLoadFormat}}. You've stated above that format and 
> parser are 1:1 related, so this seems to be the way to go.

A possible drawback of this approach is that if parser would require context 
information, which is not available during parsing SQL statement (where we 
create the corresponding BulkLoadFormat class), we will have to introduce 
setters in the BulkLoadXxxFormat class to allow it to create a parser.

Instead, the factory in BulkLoadFormat can be removed, IMO.


was (Author: kirill.shirokov):
> 52. {{BulkLoadParser#createParser}} - I think we don't need two factories, 
> one for formats is enough. Just create abstract method {{BulkLoadParser 
> createParser()}} in {{BulkLoadFormat}}. You've stated above that format and 
> parser are 1:1 related, so this seems to be the way to go.

A possible drawback of this approach is that if parser would require context 
information, which is not available during parsing SQL statement (where we 
create the corresponding BulkLoadFormat class), we will have to introduce 
setters in the BulkLoadXxxFormat class to allow it to create a parser.

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>Priority: Major
>  Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We already can batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or input stream in general case) on the client side, then transfer data to 
> the cluster, and then execute update inside the cluster, e.g. through 
> streamer.
> First of all we need to create quick and dirty prototype to assess potential 
> performance improvement. It speedup is confirmed, we should build base 
> implementation which will accept only files. But at the same time we should 
> understand how it will evolve in future: multiple file formats (probably 
> including Hadoop formarts, e.g. Parquet), escape characters, input streams, 
> etc..
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Curent implementation:
> {noformat}
> COPY 
> FROM "file.name"
> INTO .
> [(col-name, ...)]
> FORMAT  -- Only CSV format is supported in the current 
> release
> [BATCH_SIZE ]
> {noformat}
> We may want to gradually add features to this command in future to have 
> something like this:
> {noformat}
> COPY
> FROM "file.name"[CHARSET ""]
> INTO . [CREATE [IF NOT EXISTS]]
> [(col-name [] [NULLABLE] [ESCAPES], ...) [MATCH HEADER]]
> FORMAT (csv|tsv|...)
> -- CSV format options:
> [FIELDSEP='column-separators-regexp']
> [LINESEP='row-separators-regexp']
> [QUOTE='quote-chars']
> [ESCAPE='escape-char']
> [NULL='null-sequence']
> [COMMENT='single-line-comment-start-char']
> [TRIM_LINES]
> [IMPORT_EMPTY_LINES]
> [CHARSET ""]
> [ROWS -]
> --or--
> [SKIP ROWS ] [MAX ROWS ]
> [COLS -]
> --or--
> [SKIP COLS ] [MAX COLS ]
> [(MATCH | SKIP) HEADER]
> [(REPLACE|IGNORE|ABORT ON [])) DUPLICATE KEYS]
> [BATCH SIZE ( ROWS | [K|M|G|T|P])]
> [COMPRESS "codec-name" [codec options]]
> [LOCK (TABLE|ROWS)]
> [NOLOGGING]
> [BACKEND (DIRECT | STREAMER)]
> {noformat}
> h1. Implementation decisions and notes
> h2. Parsing
> * We support CSV format described in RFC 4180.
> * Custom row and column separators, quoting characters are currently hardcoded
> * Escape sequences, line comment characters are currently not supported
> * We may want to support fixed-length formats (via format descriptors) in 
> future
> * We may want to strip comments from lines (for example, starting with '#')
> * We may want to allow user to either ignore empty lines or treat them as a 
> special case of record having all default values
> * We may allow user to enable whitespace trimming from beginning and end of a 
> line
> * We may want to allow user to specify error handling strategy: e.g., only 
> one quote character is present or escape sequence is invalid.
> h2. File handling

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-02 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350605#comment-16350605
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/2/18 4:29 PM:
-

[~kirill.shirokov],

50. {{BulkLoadProcessor#close}}

>See javadoc.

I have seen it. Still this class is not a member of any hierarchy, {{close}} 
method neither overrides anything nor is overridden by anyone and thus now this 
param neither is used nor needed. Please remove it, adding it back once we need 
it won't be a big deal at all. Really, unused param in this case hardly is an 
architectural point of future extension.

51. {{BulkLoadCacheWriter}}:

>O. But what is the reason for doing that?

At the very least, to maintain consistency about names of methods having 
similar, or close-to-identical meaning. Say, all our closures have their method 
called {{apply}} while this class employs {{accept}}. I don't see any reason 
why this class shouldn't be in that hierarchy.

 


was (Author: al.psc):
[~kirill.shirokov],

50. {{BulkLoadProcessor#close}}

>See javadoc.

I have seen it. Still this class is not a member of any hierarchy, {{close}} 
method neither overrides anything nor is overridden by anyone and thus now this 
param neither is used nor needed. Please remove it, adding it back once we need 
it won't be a big deal at all. Really, unused param in this case hardly is an 
architectural point of extension.

51. {{BulkLoadCacheWriter}}:

>O. But what is the reason for doing that?

At the very least, to maintain consistency about names of methods having 
similar, or close-to-identical meaning. Say, all our closures have their method 
called {{apply}} while this class employs {{accept}}. I don't see any reason 
why this class shouldn't be in that hierarchy.

 

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>Priority: Major
>  Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We already can batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or input stream in general case) on the client side, then transfer data to 
> the cluster, and then execute update inside the cluster, e.g. through 
> streamer.
> First of all we need to create quick and dirty prototype to assess potential 
> performance improvement. It speedup is confirmed, we should build base 
> implementation which will accept only files. But at the same time we should 
> understand how it will evolve in future: multiple file formats (probably 
> including Hadoop formarts, e.g. Parquet), escape characters, input streams, 
> etc..
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Curent implementation:
> {noformat}
> COPY 
> FROM "file.name"
> INTO .
> [(col-name, ...)]
> FORMAT  -- Only CSV format is supported in the current 
> release
> [BATCH_SIZE ]
> {noformat}
> We may want to gradually add features to this command in future to have 
> something like this:
> {noformat}
> COPY
> FROM "file.name"[CHARSET ""]
> INTO . [CREATE [IF NOT EXISTS]]
> [(col-name [] [NULLABLE] [ESCAPES], ...) [MATCH HEADER]]
> FORMAT (csv|tsv|...)
> -- CSV format options:
> [FIELDSEP='column-separators-regexp']
> [LINESEP='row-separators-regexp']
> [QUOTE='quote-chars']
> [ESCAPE='escape-char']
> [NULL='null-sequence']
> [COMMENT='single-line-comment-start-char']
> [TRIM_LINES]
> [IMPORT_EMPTY_LINES]
> [CHARSET ""]
> [ROWS -]
> --or--
> [SKIP ROWS ] [MAX ROWS ]
> [COLS -]
> --or--
> [SKIP COLS ] [MAX COLS ]
> [(MATCH | SKIP) HEADER]
> [(REPLACE|IGNORE|ABORT ON [])) DUPLICATE KEYS]
> [BATCH SIZE ( ROWS | [K|M|G|T|P])]
> [COMPRESS "codec-name" [codec options]]
> [LOCK (TABLE|ROWS)]
> [NOLOGGING]
> [BACKEND (DIRECT | STREAMER)]
> {noformat}
> h1. Implementation decisions and notes
> h2. Parsing
> * We support CSV format described in RFC 4180.
> * Custom row and column separators, quoting characters are currently hardcoded
> * Escape sequences, line comment characters are currently not supported
> * We may want to support fixed-length formats (via format descriptors) in 
> future
> * We may want to strip comments from lines (for example, starting with '#')
> * We may want to allow user to either ignore empty lines or 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-02 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349007#comment-16349007
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/2/18 12:28 PM:
--

[~kirill.shirokov],

31. {{BulkLoadDataConverter}} is now empty and doesn't convert anything. I 
think we should remove it and simply pass {{UpdatePlan}} to 
{{BulkLoadProcessor}}.

32. {{BulkLoadCacheWriter}}:

32.1 Odd empty line after interface header.

32.2 Leaving this interface be does not contradict with making it extend some 
of existing Ignite interfaces. Please add another appropriate ancestor to this 
interface.

33. Regarding this:

>Does this anti-pattern makes sense here? Or we're about keeping similarity 
>with existing code and bugward compatibility?

First, providing for resilience is one of the reasons that justify "error 
hiding". We aim not to let JDBC handler or NIO listener crash.

Second, here's no "hiding" per se as exception is properly logged and we send 
the user message.

34. {{BulkLoadProcessor}}:

34.1 odd empty line after ctor header.

34.2 {{processBatch}} is still coupled with JDBC related class.

34.3 I think class name {{BulkLoadContext}} would reflect what this class does 
more correctly. In fact, it does not process anything, just holds various bits 
that help the process. In favor of this tells the fact that you even use the 
word "context" in javadoc for this class.

35. {{BulkLoadParser}}: is still coupled with JDBC related class. I have 
already suggested that we introduce some JDBC neutral enum to indicate current 
stage of the process.

36. {{JdbcRequestHandler#executeQuery}} - odd empty line after your new "if".

37. {{JdbcBulkLoadBatchAckResult}}: I suggested simply 
{{JdbcBulkLoadAckResult}}, without the word "batch", as this ack response has 
to do with the command overall and *not* with an individual batch. It's sent 
once per bulk load.

38. {{UpdateMode}}: there's a typo in javadocs you've added to all enum 
members. Please correct: com *_m_* and

39. {{UpdatePlanBuilder}}: unused new static import.

40. {{BulkLoadStreamerWriter}}: why is counter internally stored as {{int}}, 
not {{long}}?

41. {{JdbcQueryTask#call}}: I think it would be better to gracefully close 
newly created context before throwing an exception.

42. {{BulkLoadCsvFormat}}: many new members and their getters have "Re" suffix. 
It's not standard to Ignite codebase. I would either replace it to "regex" or 
"regexp", or even remove it - the user of this class can clearly see what is 
return type of those getters, and those suffixes actually don't tell anything 
new.

43. {{DdlStatementsProcessor#runDdlStatement}}: why employ log throttling here? 
Why would we want to skip some error messages?

This is it for now, thanks. Looking forward to fixes.


was (Author: al.psc):
[~kirill.shirokov],

31. {{BulkLoadDataConverter}} is now empty and doesn't convert anything. I 
think we should remove it and simply pass {{UpdatePlan}} to 
{{BulkLoadProcessor}}.

32. {{BulkLoadCacheWriter}}:

32.1 Odd empty line after interface header.

32.2 Leaving this interface be does not contradict with making it extend some 
of existing Ignite interfaces. Please add another appropriate ancestor to this 
interface.

33. Regarding this:

>Does this anti-pattern makes sense here? Or we're about keeping similarity 
>with existing code and bugward compatibility?

First, providing for resilience is one of the reasons that justify "error 
hiding". We aim not to let JDBC handler or NIO listener crash.

Second, here's no "hiding" per se as exception is properly logged and we send 
the user message.

34. {{BulkLoadProcessor}}:

34.1 odd empty line after ctor header.

34.2 {{processBatch}} is still coupled with JDBC related class.

34.3 I think class name {{BulkLoadContext}} would reflect what this class does 
more correctly. In fact, it does not process anything, just holds various bits 
that help the process. In favor of this tells the fact that you even use the 
word "context" in javadoc for this class.

35. {{BulkLoadParser}}: is still coupled with JDBC related class. I have 
already suggested that we introduce some JDBC neutral enum to indicate current 
stage of the process.

36. {{JdbcRequestHandler#executeQuery}} - odd empty line after your new "if".

37. {{JdbcBulkLoadBatchAckResult}}: I suggested simply 
{{JdbcBulkLoadAckResult}}, without the word "batch", as this ack response has 
to do with the command overall and *not* with an individual batch. It's sent 
once per bulk load.

38. \{{UpdateMode: t}}here's a typo in javadocs you've added to all enum 
members. Please correct: com*_m_*and

39. {{UpdatePlanBuilder}}: unused new static import.

40. {{BulkLoadStreamerWriter}}: why is counter internally stored as {{int}}, 
not {{long}}?

41. {{JdbcQueryTask#call}}: I think it would be better to gracefully close 
newly created 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-01 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349007#comment-16349007
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/1/18 6:00 PM:
-

[~kirill.shirokov],

31. {{BulkLoadDataConverter}} is now empty and doesn't convert anything. I 
think we should remove it and simply pass {{UpdatePlan}} to 
{{BulkLoadProcessor}}.

32. {{BulkLoadCacheWriter}}:

32.1 Odd empty line after interface header.

32.2 Leaving this interface be does not contradict with making it extend some 
of existing Ignite interfaces. Please add another appropriate ancestor to this 
interface.

33. Regarding this:

>Does this anti-pattern makes sense here? Or we're about keeping similarity 
>with existing code and bugward compatibility?

First, providing for resilience is one of the reasons that justify "error 
hiding". We aim not to let JDBC handler or NIO listener crash.

Second, here's no "hiding" per se as exception is properly logged and we send 
the user message.

34. {{BulkLoadProcessor}}:

34.1 odd empty line after ctor header.

34.2 {{processBatch}} is still coupled with JDBC related class.

34.3 I think class name {{BulkLoadContext}} would reflect what this class does 
more correctly. In fact, it does not process anything, just holds various bits 
that help the process. In favor of this tells the fact that you even use the 
word "context" in javadoc for this class.

35. {{BulkLoadParser}}: is still coupled with JDBC related class. I have 
already suggested that we introduce some JDBC neutral enum to indicate current 
stage of the process.

36. {{JdbcRequestHandler#executeQuery}} - odd empty line after your new "if".

37. {{JdbcBulkLoadBatchAckResult}}: I suggested simply 
{{JdbcBulkLoadAckResult}}, without the word "batch", as this ack response has 
to do with the command overall and *not* with an individual batch. It's sent 
once per bulk load.

38. \{{UpdateMode: t}}here's a typo in javadocs you've added to all enum 
members. Please correct: com*_m_*and

39. {{UpdatePlanBuilder}}: unused new static import.

40. {{BulkLoadStreamerWriter}}: why is counter internally stored as {{int}}, 
not {{long}}?

41. {{JdbcQueryTask#call}}: I think it would be better to gracefully close 
newly created context before throwing an exception.

42. {{BulkLoadCsvFormat}}: many new members and their getters have "Re" suffix. 
It's not standard to Ignite codebase. I would either replace it to "regex" or 
"regexp", or even remove it - the user of its class can clearly see what is 
return type of those getters, and those suffixes actually don't tell anything 
new.

43. {{DdlStatementsProcessor#runDdlStatement}}: why employ log throttling here? 
Why would we want to skip some error messages?

This is it for now, thanks. Looking forward to fixes.


was (Author: al.psc):
[~kirill.shirokov],

31. {{BulkLoadDataConverter}} is now empty and doesn't convert anything. I 
think we should remove it and simply pass {{UpdatePlan}} to 
{{BulkLoadProcessor}}.

32. {{BulkLoadCacheWriter}}:

32.1 Odd empty line after interface header.

32.2 Leaving this interface be does not contradict with making it extend some 
of existing Ignite interfaces. Please add another appropriate ancestor to this 
interface.

33. Regarding this:

>Does this anti-pattern makes sense here? Or we're about keeping similarity 
>with existing code and bugward compatibility?

First, providing for resilience is one of the reasons that justify "error 
hiding". We aim not to let JDBC handler or NIO listener crash.

Second, here's no "hiding" per se as exception is properly logged and we send 
the user message.

34. {{BulkLoadProcessor}}:

34.1 odd empty line after ctor header.

34.2 {{processBatch}} is still coupled with JDBC related class.

34.3 I think class name {{BulkLoadContext}} would reflect what this class does 
more correctly. In fact, it does not process anything, just holds various bits 
that help the process. In favor of this tells the fact that you even use the 
word "context" in javadoc for this class.

35. {{BulkLoadParser}}: is still coupled with JDBC related class. I have 
already suggested that we introduce some JDBC neutral enum to indicate current 
stage of the process.

36. {{JdbcRequestHandler#executeQuery}} - odd empty line after your new "if".

37. {{JdbcBulkLoadBatchAckResult}}: I suggested simply 
{{JdbcBulkLoadAckResult}}, without the word "batch", as this ack response has 
to do with the command overall and *not* with an individual batch. It's sent 
once per bulk load.

38. Point *26* is unanswered and I don't see related code changes either.

39. {{UpdateMode: t}}here's a typo in javadocs you've added to all enum 
members. Please correct: com*_m_*and

40. {{UpdatePlanBuilder}}: unused new static import.

41. {{BulkLoadStreamerWriter}}: why is counter internally stored as {{int}}, 
not {{long}}?

42. 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-01 Thread Kirill Shirokov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348592#comment-16348592
 ] 

Kirill Shirokov edited comment on IGNITE-6917 at 2/1/18 3:45 PM:
-

> 20. SqlParserCopySelfTest - no single test for positive cases, only negative 
> stuff. We need to cover positive cases extensively as well.

Basic set of positive tests covered in JdbcThinBulkLoadAbstractSelfTest.
 Do you think we need to duplicate some of them here?

> 21. JdbcBulkLoadContext - probably we don't need this class and it would be 
> better to keep update counter in BulkLoadContext as the notion of counter is 
> used not just in JDBC driver.

FIXED (removed).

> 22. JdbcThinStatement -
 > 22.1 let's catch and wrap any exception in sendFile

FIXED. Ok.

> 22.2 Odd empty line in the beginning of sendFile

FIXED.

> 23. CsvLineProcessorBlock - we don't handle cases when quoted strings in CSV 
> line contain commas, do we? This clearly does not seem right and I don't see 
> any tests for that either. Shouldn't we handle escaped quotes too?

Yes, Vladimir asked to make a very simple implementation for now, so my plan 
was to implement this properly in IGNITE-7537.

> 24. All *PipelineBlock classes - odd empty line after class header.

FIXED

> 25. StrListAppenderBlock - do we need list of lists here? Probably just list 
> of arrays could do?

Afterwards each record is processed (via dataConverter) in 
UpdatePlan.processRow(), which requires List. Converting it to array 
would require us to change UpdatePlan implementation.

> 26. BulkLoadContextCursor\{{}} - it's unclear to me why we return context in 
> fake "row" while all other parameters should be retrieved via getters. Can we 
> unify this? (E.g. add getters for everything AND return everything in fake 
> row - that would be consistent.)

> 27. Please add more tests for behavior in case of failure - server or client 
> disconnect during file processing, file open error on client, etc.

Vladimir said that we can postpone such tests. Filed IGNITE-7602.

> I think this is it for now. Looking forward to the fixes.

Thanks a lot!


was (Author: kirill.shirokov):
> 20. SqlParserCopySelfTest - no single test for positive cases, only negative 
> stuff. We need to cover positive cases extensively as well.

Basic set of positive tests covered in JdbcThinBulkLoadAbstractSelfTest.
Do you think we need to duplicate some of them here? 

> 21. JdbcBulkLoadContext - probably we don't need this class and it would be 
> better to keep update counter in BulkLoadContext as the notion of counter is 
> used not just in JDBC driver.

FIXED (removed).

> 22. JdbcThinStatement -
> 22.1 let's catch and wrap any exception in sendFile

FIXED. Ok.

> 22.2 Odd empty line in the beginning of sendFile

FIXED.

> 23. CsvLineProcessorBlock - we don't handle cases when quoted strings in CSV 
> line contain commas, do we? This clearly does not seem right and I don't see 
> any tests for that either. Shouldn't we handle escaped quotes too?

Yes, Vladimir asked to make a very simple implementation for now, so my plan 
was to implement this properly in IGNITE-7537.

> 24. All *PipelineBlock classes - odd empty line after class header.

FIXED

> 25. StrListAppenderBlock - do we need list of lists here? Probably just list 
> of arrays could do?

Afterwards each record is processed (via dataConverter) in 
UpdatePlan.processRow(), which requires List. Converting it to array 
would require us to change UpdatePlan implementation.

> 26. BulkLoadContextCursor\{{}} - it's unclear to me why we return context in 
> fake "row" while all other parameters should be retrieved via getters. Can we 
> unify this? (E.g. add getters for everything AND return everything in fake 
> row - that would be consistent.)

> 27. Please add more tests for behavior in case of failure - server or client 
> disconnect during file processing, file open error on client, etc.

Vladimir said that we can postpone such tests. Filed IGNITE-6702.

> I think this is it for now. Looking forward to the fixes.

Thanks a lot!

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>Priority: Major
>  Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We already can batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or input stream in general 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-01 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346774#comment-16346774
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/1/18 8:45 AM:
-

[~kirill.shirokov], please let me continue with comments.

8. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

9. {{BulkLoadParser}}:

9.1 Unused private field.

9.2 Too long line in {{processBatch}} declaration.

9.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, and these references are 
in deeper "guts". We need to decouple these somehow - one day we might want to 
support this feature in ODBC, or somewhere else, and then it'd be inevitable. 
I'd just introduce an enum for various command types.

10. {{BulkLoadFormat}}:

10.1 Unused import

10.2 Odd empty lines in the beginning of class, in switch, etc

11. {{BulkLoadParser}} and {{BulkLoadFormat}} seem to be related 1:1 - probably 
they should be single class and implementations could be merged accordingly 
too. Can you come up with a case where we would really need to separate this 
entities? Say, one parser for few formats, or one format yielding more than one 
type of parser?

12. {{BulkLoadCsvParser}}:

12.1 {{processBatch}} method looks like it could be actually defined in parent 
class, and the logic to be changed in subclasses actually lives in 
{{parseBatch}}. This is so because command types like {{CMD_FINISHED_EOF}} are 
defined on higher level. Can we rework hierarchy accordingly?

12.2 Unused parameter in ctor ( {{params}} ), do we have tests for params at 
all?

12.3 Odd empty lines after class header, method declarations

13. {{BulkLoadEntryConverter}}: do we really expect more than one 
implementation of this interface? If no, let's just replace it with 
{{IgniteClosureX}} or something like that. We have more than enough interfaces 
and abstract classes to represent lambdas. *Before refactoring this, please 
look at pt.17.3 as well*

14. {{BulkLoadCsvFormat}}: amount of useful code in this class approaches zero, 
and this further assures that parser and its format must be one.

15. {{BulkLoadContext}}: too long comment line.

16. {{BulkLoadCacheWriter}}: do we expect more than one implementation of this 
interface? If no, just remove this class and let {{BulkLoadContext}} know about 
streamer. If yes, make it inherit one of Ignite's provided classes/interfaces 
for lambdas. *Before refactoring this, please look at pt.17.3 as well*

17. {{JdbcRequestHandler}}:

17.1 Unused import

17.2 {{processBulkLoadFileBatch}}: all other methods just catch {{Exception}}, 
please do the same here.

17.3 {{processBulkLoadFileBatch}}: it looks like doing actual job (data 
conversion, feeding to streamer) is just plain wrong. Not only it's logic that 
could be reused when we'll implement same feature for other interface like 
JDBC, it's also not the area of responsibility of JDBC request handler. This 
again shows the need of refactoring of {{BulkLoadContext}}: probably we need to 
introduce a concrete class that will *a)* wrap streamer and *b)* do conversion. 
It will just accept {{List}} and do with it whatever it wants. JDBC 
must not know about this at all, it must just detect correct context and feed 
lists of objects to that new class I've mentioned above.

17.4 {{executeQuery}}: I'd move whole stuff of {{if (fieldsCur instanceof 
BulkLoadContextCursor)}} above, before {{if (results.size() == 1)}} - it would 
decrease scope depth, make code better readable and decrease overall size of 
diff (never hurts).

17.5 {{executeQuery}}: {{filesToSendResult}} - doesn't have anything at all to 
do with files.

18. {{JdbcBulkLoadContext}}: unused field {{cmdParsingResult}} (getter has 0 
usages), not to mention that field name is very misleading - what does JDBC 
have to do with parsing?

19. {{JdbcBulkLoadBatchRequestResult}}: quite weird name, I don't think we 
should mix words "request" and "result", we never do so in JDBC. Not to mention 
that we never send {{JdbcBulkLoadBatchRequest}} *before* we get 
{{JdbcBulkLoadBatchRequestResult}} :) I'd just rename this to 
{{JdbcBulkLoadAckResult}} as it's sent as an acknowledgement of {{COPY}} 
command and never else.

This is not all, more comments will follow.


was (Author: al.psc):
[~kirill.shirokov], please let me continue with comments.

8. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

9. {{BulkLoadParser}}:

9.1 Unused private field.

9.2 Too long line in {{processBatch}} declaration.

9.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, and 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-01 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346774#comment-16346774
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/1/18 8:44 AM:
-

[~kirill.shirokov], please let me continue with comments.

8. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

9. {{BulkLoadParser}}:

9.1 Unused private field.

9.2 Too long line in {{processBatch}} declaration.

9.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, and these references are 
in deeper "guts". We need to decouple these somehow - one day we might want to 
support this feature in ODBC, or somewhere else, and then it'd be inevitable. 
I'd just introduce an enum for various command types.

10. {{BulkLoadFormat}}:

10.1 Unused import

10.2 Odd empty lines in the beginning of class, in switch, etc

11. {{BulkLoadParser}} and {{BulkLoadFormat}} seem to be related 1:1 - probably 
they should be single class and implementations could be merged accordingly 
too. Can you come up with a case where we would really need to separate this 
entities? Say, one parser for few formats, or one format yielding more than one 
type of parser?

12. {{BulkLoadCsvParser}}:

12.1 {{processBatch}} method looks like it could be actually defined in parent 
class, and the logic to be changed in subclasses actually lives in 
{{parseBatch}}. This is so because command types like {{CMD_FINISHED_EOF}} are 
defined on higher level. Can we rework hierarchy accordingly?

12.2 Unused parameter in ctor ( {{params}} ), do we have tests for params at 
all?

12.3 Odd empty lines after class header, method declarations

13. {{BulkLoadEntryConverter}}: do we really expect more than one 
implementation of this interface? If no, let's just replace it with 
{{IgniteClosureX}} or something like that. We have more than enough interfaces 
and abstract classes to represent lambdas. *Before refactoring this, please 
look at pt.10.3 as well*

14. {{BulkLoadCsvFormat}}: amount of useful code in this class approaches zero, 
and this further assures that parser and its format must be one.

15. {{BulkLoadContext}}: too long comment line.

16. {{BulkLoadCacheWriter}}: do we expect more than one implementation of this 
interface?quit If no, just remove this class and let {{BulkLoadContext}} know 
about streamer. If yes, make it inherit one of Ignite's provided 
classes/interfaces for lambdas. *Before refactoring this, please look at 
pt.10.3 as well*

17. {{JdbcRequestHandler}}:

17.1 Unused import

17.2 {{processBulkLoadFileBatch}}: all other methods just catch {{Exception}}, 
please do the same here.

17.3 {{processBulkLoadFileBatch}}: it looks like doing actual job (data 
conversion, feeding to streamer) is just plain wrong. Not only it's logic that 
could be reused when we'll implement same feature for other interface like 
JDBC, it's also not the area of responsibility of JDBC request handler. This 
again shows the need of refactoring of {{BulkLoadContext}}: probably we need to 
introduce a concrete class that will *a)* wrap streamer and *b)* do conversion. 
It will just accept {{List}} and do with it whatever it wants. JDBC 
must not know about this at all, it must just detect correct context and feed 
lists of objects to that new class I've mentioned above.

17.4 {{executeQuery}}: I'd move whole stuff of {{if (fieldsCur instanceof 
BulkLoadContextCursor)}} above, before {{if (results.size() == 1)}} - it would 
decrease scope depth, make code better readable and decrease overall size of 
diff (never hurts).

17.5 {{executeQuery}}: {{filesToSendResult}} - doesn't have anything at all to 
do with files.

18. {{JdbcBulkLoadContext}}: unused field {{cmdParsingResult}} (getter has 0 
usages), not to mention that field name is very misleading - what does JDBC 
have to do with parsing?

19. {{JdbcBulkLoadBatchRequestResult}}: quite weird name, I don't think we 
should mix words "request" and "result", we never do so in JDBC. Not to mention 
that we never send {{JdbcBulkLoadBatchRequest}} *before* we get 
{{JdbcBulkLoadBatchRequestResult}} :) I'd just rename this to 
{{JdbcBulkLoadAckResult}} as it's sent as an acknowledgement of {{COPY}} 
command and never else.

This is not all, more comments will follow.


was (Author: al.psc):
[~kirill.shirokov], please let me continue with comments.

8. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

9. {{BulkLoadParser}}:

9.1 Unused private field.

9.2 Too long line in {{processBatch}} declaration.

9.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-01 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346774#comment-16346774
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/1/18 8:43 AM:
-

[~kirill.shirokov], please let me continue with comments.

8. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

9. {{BulkLoadParser}}:

9.1 Unused private field.

9.2 Too long line in {{processBatch}} declaration.

9.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, and these references are 
in deeper "guts". We need to decouple these somehow - one day we might want to 
support this feature in ODBC, or somewhere else, and then it'd be inevitable. 
I'd just introduce an enum for various command types.

10. {{BulkLoadFormat}}:

10.1 Unused import

10.2 Odd empty lines in the beginning of class, in switch, etc

11. {{BulkLoadParser}} and {{BulkLoadFormat}} seem to be related 1:1 - probably 
they should be single class and implementations could be merged accordingly 
too. Can you come up with a case where we would really need to separate this 
entities? Say, one parser for few formats, or one format yielding more than one 
type of parser?

12. {{BulkLoadCsvParser}}:

12.1\{{ processBatch}} method looks like it could be actually defined in parent 
class, and the logic to be changed in subclasses actually lives in 
{{parseBatch}}. This is so because command types like {{CMD_FINISHED_EOF}} are 
defined on higher level. Can we rework hierarchy accordingly?

12.2 Unused parameter in ctor ( {{params}} ), do we have tests for params at 
all?

12.3 Odd empty lines after class header, method declarations

13. {{BulkLoadEntryConverter}}: do we really expect more than one 
implementation of this interface? If no, let's just replace it with 
{{IgniteClosureX}} or something like that. We have more than enough interfaces 
and abstract classes to represent lambdas. *Before refactoring this, please 
look at pt.10.3 as well*

14. {{BulkLoadCsvFormat}}: amount of useful code in this class approaches zero, 
and this further assures that parser and its format must be one.

15. {{BulkLoadContext}}: too long comment line.

16. {{BulkLoadCacheWriter}}: do we expect more than one implementation of this 
interface?quit If no, just remove this class and let {{BulkLoadContext}} know 
about streamer. If yes, make it inherit one of Ignite's provided 
classes/interfaces for lambdas. *Before refactoring this, please look at 
pt.10.3 as well*

17. {{JdbcRequestHandler}}:

17.1 Unused import

17.2 {{processBulkLoadFileBatch}}: all other methods just catch {{Exception}}, 
please do the same here.

17.3 {{processBulkLoadFileBatch}}: it looks like doing actual job (data 
conversion, feeding to streamer) is just plain wrong. Not only it's logic that 
could be reused when we'll implement same feature for other interface like 
JDBC, it's also not the area of responsibility of JDBC request handler. This 
again shows the need of refactoring of {{BulkLoadContext}}: probably we need to 
introduce a concrete class that will *a)* wrap streamer and *b)* do conversion. 
It will just accept {{List}} and do with it whatever it wants. JDBC 
must not know about this at all, it must just detect correct context and feed 
lists of objects to that new class I've mentioned above.

17.4 {{executeQuery}}: I'd move whole stuff of {{if (fieldsCur instanceof 
BulkLoadContextCursor)}} above, before {{if (results.size() == 1)}} - it would 
decrease scope depth, make code better readable and decrease overall size of 
diff (never hurts).

17.5 {{executeQuery}}: {{filesToSendResult}} - doesn't have anything at all to 
do with files.

18. {{JdbcBulkLoadContext}}: unused field {{cmdParsingResult}} (getter has 0 
usages), not to mention that field name is very misleading - what does JDBC 
have to do with parsing?

19. {{JdbcBulkLoadBatchRequestResult}}: quite weird name, I don't think we 
should mix words "request" and "result", we never do so in JDBC. Not to mention 
that we never send {{JdbcBulkLoadBatchRequest}} *before* we get 
{{JdbcBulkLoadBatchRequestResult}} :) I'd just rename this to 
{{JdbcBulkLoadAckResult}} as it's sent as an acknowledgement of {{COPY}} 
command and never else.

This is not all, more comments will follow.


was (Author: al.psc):
[~kirill.shirokov], please let me continue with comments.

8. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

9. {{BulkLoadParser}}:

9.1 Unused private field.

9.2 Too long line in {{processBatch}} declaration.

9.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-01 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347177#comment-16347177
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/1/18 8:19 AM:
-

[~kirill.shirokov], please allow me to proceed.

20. {{SqlParserCopySelfTest}} - no single test for positive cases, only 
negative stuff. We need to cover positive cases extensively as well.

21. {{JdbcBulkLoadContext}} - probably we don't need this class and it would be 
better to keep update counter in {{BulkLoadContext}} as the notion of counter 
is used not just in JDBC driver.

22. {{JdbcThinStatement}} -

22.1 let's catch and wrap any exception in {{sendFile}}

22.2 Odd empty line in the beginning of {{sendFile}}

23. {{CsvLineProcessorBlock}} - we don't handle cases when quoted strings in 
CSV line contain commas, do we? This clearly does not seem right and I don't 
see any tests for that either. Shouldn't we handle escaped quotes too?

24. All {{*PipelineBlock}} classes - odd empty line after class header.

25. {{StrListAppenderBlock}} - do we need list of lists here? Probably just 
list of arrays could do?

26. {{BulkLoadContextCursor}}{{}} - it's unclear to me why we return context in 
fake "row" while all other parameters should be retrieved via getters. Can we 
unify this? (E.g. add getters for everything AND return everything in fake row 
- that would be consistent.)

27. Please add more tests for behavior in case of failure - server or client 
disconnect during file processing, file open error on client, etc.

I think this is it for now. Looking forward to the fixes.


was (Author: al.psc):
[~kirill.shirokov], please allow me to proceed.

 

1. {{SqlParserCopySelfTest}} - no single test for positive cases, only negative 
stuff. We need to cover positive cases extensively as well.

2. {{JdbcBulkLoadContext}} - probably we don't need this class and it would be 
better to keep update counter in {{BulkLoadContext}} as the notion of counter 
is used not just in JDBC driver.

3. {{JdbcThinStatement}} -

3.1 let's catch and wrap any exception in {{sendFile}}

3.2 Odd empty line in the beginning of {{sendFile}}

4. {{CsvLineProcessorBlock}} - we don't handle cases when quoted strings in CSV 
line contain commas, do we? This clearly does not seem right and I don't see 
any tests for that either. Shouldn't we handle escaped quotes too?

5. All {{*PipelineBlock}} classes - odd empty line after class header.

6. {{StrListAppenderBlock}} - do we need list of lists here? Probably just list 
of arrays could do?

7. {{BulkLoadContextCursor}}{{}} - it's unclear to me why we return context in 
fake "row" while all other parameters should be retrieved via getters. Can we 
unify this? (E.g. add getters for everything AND return everything in fake row 
- that would be consistent.)

8. Please add more tests for behavior in case of failure - server or client 
disconnect during file processing, file open error on client, etc.

I think this is it for now. Looking forward to the fixes.

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>Priority: Major
>  Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We already can batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or input stream in general case) on the client side, then transfer data to 
> the cluster, and then execute update inside the cluster, e.g. through 
> streamer.
> First of all we need to create quick and dirty prototype to assess potential 
> performance improvement. It speedup is confirmed, we should build base 
> implementation which will accept only files. But at the same time we should 
> understand how it will evolve in future: multiple file formats (probably 
> including Hadoop formarts, e.g. Parquet), escape characters, input streams, 
> etc..
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Curent implementation:
> {noformat}
> COPY 
> FROM "file.name"
> INTO .
> [(col-name, ...)]
> FORMAT  -- Only CSV format is supported in the current 
> release
> [BATCH_SIZE ]
> {noformat}
> We may want to gradually add features to this command in future to have 
> something like this:
> {noformat}
> COPY
> FROM "file.name"[CHARSET ""]
> INTO . [CREATE [IF NOT EXISTS]]
> [(col-name [] 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-02-01 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346774#comment-16346774
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 2/1/18 8:17 AM:
-

[~kirill.shirokov], please let me continue with comments.

8. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

9. {{BulkLoadParser}}:

9.1 Unused private field.

9.2 Too long line in {{processBatch}} declaration.

9.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, and these references are 
in deeper "guts". We need to decouple these somehow - one day we might work to 
support this feature in ODBC, or somewhere else, and then it'd be inevitable. 
I'd just introduce an enum for various command types.

10. {{BulkLoadFormat}}:

10.1 Unused import

10.2 Odd empty lines in the beginning of class, in switch, etc

11. {{BulkLoadParser}} and {{BulkLoadFormat}} seem to be related 1:1 - probably 
they should be single class and implementations could be merged accordingly 
too. Can you come up with a case where we would really need to separate this 
entities? Say, one parser for few formats, or one format yielding more than one 
type of parser?

12. {{BulkLoadCsvParser}}:

12.1{{ processBatch}} method looks like it could be actually defined in parent 
class, and the logic to be changed in subclasses actually lives in 
{{parseBatch}}. This is so because command types like {{CMD_FINISHED_EOF}} are 
defined on higher level. Can we rework hierarchy accordingly?

12.2 Unused parameter in ctor ( {{params}} ), do we have tests for params at 
all?

12.3 Odd empty lines after class header, method declarations

13. {{BulkLoadEntryConverter}}: do we really expect more than one 
implementation of this interface? If no, let's just replace it with 
{{IgniteClosureX}} or something like that. We have more than enough interfaces 
and abstract classes to represent lambdas. *Before refactoring this, please 
look at pt.10.3 as well*

14. {{BulkLoadCsvFormat}}: amount of useful code in this class approaches zero, 
and this further assures that parser and its format must be one.

15. {{BulkLoadContext}}: too long comment line.

16. {{BulkLoadCacheWriter}}: do we expect more than one implementation of this 
interface?quit If no, just remove this class and let {{BulkLoadContext}} know 
about streamer. If yes, make it inherit one of Ignite's provided 
classes/interfaces for lambdas. *Before refactoring this, please look at 
pt.10.3 as well*

17. {{JdbcRequestHandler}}:

17.1 Unused import

17.2 {{processBulkLoadFileBatch}}: all other methods just catch {{Exception}}, 
please do the same here.

17.3 {{processBulkLoadFileBatch}}: it looks like doing actual job (data 
conversion, feeding to streamer) is just plain wrong. Not only it's logic that 
could be reused when we'll implement same feature for other interface like 
JDBC, it's also not the area of responsibility of JDBC request handler. This 
again shows the need of refactoring of {{BulkLoadContext}}: probably we need to 
introduce a concrete class that will *a)* wrap streamer and *b)* do conversion. 
It will just accept {{List}} and do with it whatever it wants. JDBC 
must not know about this at all, it must just detect correct context and feed 
lists of objects to that new class I've mentioned above.

17.4 {{executeQuery}}: I'd move whole stuff of {{if (fieldsCur instanceof 
BulkLoadContextCursor)}} above, before {{if (results.size() == 1)}} - it would 
decrease scope depth, make code better readable and decrease overall size of 
diff (never hurts).

17.5 {{executeQuery}}: {{filesToSendResult}} - doesn't have anything at all to 
do with files.

18. {{JdbcBulkLoadContext}}: unused field {{cmdParsingResult}} (getter has 0 
usages), not to mention that field name is very misleading - what does JDBC 
have to do with parsing?

19. {{JdbcBulkLoadBatchRequestResult}}: quite weird name, I don't think we 
should mix words "request" and "result", we never do so in JDBC. Not to mention 
that we never send {{JdbcBulkLoadBatchRequest}} *before* we get 
{{JdbcBulkLoadBatchRequestResult}} :) I'd just rename this to 
{{JdbcBulkLoadAckResult}} as it's sent as an acknowledgement of {{COPY}} 
command and never else.

This is not all, more comments will follow.


was (Author: al.psc):
[~kirill.shirokov], please let me continue with comments.

1. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

2. {{BulkLoadParser}}:

2.1 Unused private field.

2.2 Too long line in {{processBatch}} declaration.

2.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-01-31 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347230#comment-16347230
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 1/31/18 5:28 PM:
--

[~kirill.shirokov],

Regarding 7.2 - yes, please disregard empty columns list. Suppose there's _key 
and _val.

Regarding empty lines - they are always added for readability but we don't add 
them right after line ending with {, be it method or class declaration or loop 
or switch. Yes, with long method declarations this may not look good, but for 
consistency let's stick with what I suggest, thanks.

Regarding 5 - ok, Vlad's always right. Still IMHO main thing is not simpler 
code but better usability here. (Just saying.)


was (Author: al.psc):
[~kirill.shirokov],

Regarding 7.2 - yes, please disregard empty columns list. Suppose there's _key 
and _val.

Regarding empty fields - they are always added for readability but we don't add 
them right after line ending with {, be it method or class declaration or loop 
or switch. Yes, with long method declarations this may not look good, but for 
consistency let's stick with what I suggest, thanks.

Regarding 5 - ok, Vlad's always right. Still IMHO main thing is not simpler 
code but better usability here. (Just saying.)

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>Priority: Major
>  Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We already can batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or input stream in general case) on the client side, then transfer data to 
> the cluster, and then execute update inside the cluster, e.g. through 
> streamer.
> First of all we need to create quick and dirty prototype to assess potential 
> performance improvement. It speedup is confirmed, we should build base 
> implementation which will accept only files. But at the same time we should 
> understand how it will evolve in future: multiple file formats (probably 
> including Hadoop formarts, e.g. Parquet), escape characters, input streams, 
> etc..
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Curent implementation:
> {noformat}
> COPY 
> FROM "file.name"
> INTO .
> [(col-name, ...)]
> FORMAT  -- Only CSV format is supported in the current 
> release
> [BATCH_SIZE ]
> {noformat}
> We may want to gradually add features to this command in future to have 
> something like this:
> {noformat}
> COPY
> FROM "file.name"[CHARSET ""]
> INTO . [CREATE [IF NOT EXISTS]]
> [(col-name [] [NULLABLE] [ESCAPES], ...) [MATCH HEADER]]
> FORMAT (csv|tsv|...)
> -- CSV format options:
> [FIELDSEP='column-separators-regexp']
> [LINESEP='row-separators-regexp']
> [QUOTE='quote-chars']
> [ESCAPE='escape-char']
> [NULL='null-sequence']
> [COMMENT='single-line-comment-start-char']
> [TRIM_LINES]
> [IMPORT_EMPTY_LINES]
> [CHARSET ""]
> [ROWS -]
> --or--
> [SKIP ROWS ] [MAX ROWS ]
> [COLS -]
> --or--
> [SKIP COLS ] [MAX COLS ]
> [(MATCH | SKIP) HEADER]
> [(REPLACE|IGNORE|ABORT ON [])) DUPLICATE KEYS]
> [BATCH SIZE ( ROWS | [K|M|G|T|P])]
> [COMPRESS "codec-name" [codec options]]
> [LOCK (TABLE|ROWS)]
> [NOLOGGING]
> [BACKEND (DIRECT | STREAMER)]
> {noformat}
> h1. Implementation decisions and notes
> h2. Parsing
> * We support CSV format described in RFC 4180.
> * Custom row and column separators, quoting characters are currently hardcoded
> * Escape sequences, line comment characters are currently not supported
> * We may want to support fixed-length formats (via format descriptors) in 
> future
> * We may want to strip comments from lines (for example, starting with '#')
> * We may want to allow user to either ignore empty lines or treat them as a 
> special case of record having all default values
> * We may allow user to enable whitespace trimming from beginning and end of a 
> line
> * We may want to allow user to specify error handling strategy: e.g., only 
> one quote character is present or escape sequence is invalid.
> h2. File handling
> * File character set to be supported in future
> * Skipped/imported row number (or first/last line or skip header option), 
> skipped/imported column number (or first/last column): to be 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-01-31 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347177#comment-16347177
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 1/31/18 5:22 PM:
--

[~kirill.shirokov], please allow me to proceed.

 

1. {{SqlParserCopySelfTest}} - no single test for positive cases, only negative 
stuff. We need to cover positive cases extensively as well.

2. {{JdbcBulkLoadContext}} - probably we don't need this class and it would be 
better to keep update counter in {{BulkLoadContext}} as the notion of counter 
is used not just in JDBC driver.

3. {{JdbcThinStatement}} -

3.1 let's catch and wrap any exception in {{sendFile}}

3.2 Odd empty line in the beginning of {{sendFile}}

4. {{CsvLineProcessorBlock}} - we don't handle cases when quoted strings in CSV 
line contain commas, do we? This clearly does not seem right and I don't see 
any tests for that either. Shouldn't we handle escaped quotes too?

5. All {{*PipelineBlock}} classes - odd empty line after class header.

6. {{StrListAppenderBlock}} - do we need list of lists here? Probably just list 
of arrays could do?

7. {{BulkLoadContextCursor}}{{}} - it's unclear to me why we return context in 
fake "row" while all other parameters should be retrieved via getters. Can we 
unify this? (E.g. add getters for everything AND return everything in fake row 
- that would be consistent.)

8. Please add more tests for behavior in case of failure - server or client 
disconnect during file processing, file open error on client, etc.

I think this is it for now. Looking forward to the fixes.


was (Author: al.psc):
[~kirill.shirokov], please allow me to proceed.

 

1. {{SqlParserCopySelfTest}} - no single test for positive cases, only negative 
stuff. We need to cover positive cases extensively as well.

2. {{JdbcBulkLoadContext}} - probably we don't need this class and it would be 
better to keep update counter in {{BulkLoadContext}} as the notion of counter 
is used not just in JDBC driver.

3. {{JdbcThinStatement}} -

3.1 let's catch and wrap any exception in {{sendFile}}

3.2 Odd empty line in the beginning of {{sendFile}}

4. {{CsvLineProcessorBlock}} - we don't handle cases when quoted strings in CSV 
line contain commas, do we? This clearly does not seem right and I don't see 
any tests for that either. Shouldn't we handle escaped quotes too?

5. All {{*PipelineBlock}} classes - odd empty line after class header.

6. {{StrListAppenderBlock}} - do we need list of lists here? Probably just list 
of arrays could do?

7. {{BulkLoadContextCursor}}{{}} - it's unclear to me why we return context in 
fake "row" while all other parameters should be retrieved via getters. Can we 
unify this? (E.g. add getters for everything AND return everything in fake row 
- that would be consistent.)

8. Please add more tests for behavior in case of failure - server or client 
disconnect during file processing, file open error on client, etc.

I think this is it for now. Looking forward for the fixes.

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>Priority: Major
>  Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We already can batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or input stream in general case) on the client side, then transfer data to 
> the cluster, and then execute update inside the cluster, e.g. through 
> streamer.
> First of all we need to create quick and dirty prototype to assess potential 
> performance improvement. It speedup is confirmed, we should build base 
> implementation which will accept only files. But at the same time we should 
> understand how it will evolve in future: multiple file formats (probably 
> including Hadoop formarts, e.g. Parquet), escape characters, input streams, 
> etc..
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Curent implementation:
> {noformat}
> COPY 
> FROM "file.name"
> INTO .
> [(col-name, ...)]
> FORMAT  -- Only CSV format is supported in the current 
> release
> [BATCH_SIZE ]
> {noformat}
> We may want to gradually add features to this command in future to have 
> something like this:
> {noformat}
> COPY
> FROM "file.name"[CHARSET ""]
> INTO . [CREATE [IF NOT EXISTS]]
> [(col-name [] 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-01-31 Thread Kirill Shirokov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347213#comment-16347213
 ] 

Kirill Shirokov edited comment on IGNITE-6917 at 1/31/18 5:18 PM:
--

So you have 3 lists with the same numbering :)

Ok, list #1:

> 1. JdbcAbstractUpdateStatementSelfTest - unused code

FIXED

> 2. org.apache.ignite.internal.jdbc2.JdbcBulkLoadSelfTest: createConnection - 
> unused param; testBulkLoadThrows - please use try-with-resources.

FIXED

> 3. org.apache.ignite.jdbc.thin.JdbcThinAbstractDmlStatementSelfTest:
 > 3.1 jdbcRun: please remove this method and use inherited method execute 
 > instead.

FIXED

> 3.2 testDOA: please rename into something more meaningful.

FIXED (=> testBasicStatement())

> 4. BulkLoadContextCursor:
 > 4.1 Please make field final.

FIXED

> 4.2 getAll - I changed the code to double call of singletonList just fine.

This time it worked.

> 4.3 getFieldName - I believe this method should throw NoSuchElementException 
> for all i except 0.

It looks more logical to throw IndexOutOfBoundsException (if you look at the 
existing implementations)

I don't think it's ever being called, but OK.

> 4.4 Odd empty line right after class header.

There's a lot of odd empty lines in this project, which don't seem logical or 
have nice appearance, but OK.

An example: you have a lot of empty lines around while and this is easily 
readable:
{noformat}
    private void parseColumnList(SqlLexer lex) {
    if (!lex.shift() || lex.tokenType() != 
SqlLexerTokenType.PARENTHESIS_LEFT)
    throw errorUnexpectedToken(lex, "(");

    while (true) {
    parseIndexColumn(lex);

    if (skipCommaOrRightParenthesis(lex))
    break;
    }
    }{noformat}
But there are no empty lines in the top of the method and if you have 5+ 
arguments and exceptions declared there, it's almost not readable.

> 5. SqlBulkLoadCommand: let's parse command in a way consistent with how it's 
> done in SQL usually - we clearly should not use ugly stuff like BATCH_SIZE 
> (with underscore instead of space) now that we are headed to our own parser. 
> Please rework parsing so that user has to write BATCH SIZE. (NB: yes, I'm 
> aware that we already have quite ugly stuff here, like INLINE_SIZE for 
> indexes, but I believe that this must be avoided as much as possible.) Also I 
> think we should support case when the user does not supply columns list (so 
> no empty brackets too) - currently it's a parsing error and there's a test 
> for it. (Please disregard what's been struck through.)

I also find it ugly and asked Vladimir about this, but he insisted on this ugly 
solution.

But to be optimistic about this, the approach with the underscore has some 
advantages:

1. More specific keywords (user has more freedom: more identifiers aren't 
clashing with keywords)
 2. A bit simpler code

> 6. UpdatePlan#processRow:
 > 6.1 Odd empty line after method declaration.

As in 4.4 it looks like I don't understand logic behind this. Or it's not about 
aestetics.

> 6.2. Shouldn't we rather check for equality here? row.size() < 
> colNames.length - like this

When we are importing CSV, every row can have its own number of columns. I 
think it's OK to silently skip extra columns. 
 Or we can handle this CSV-specific case in the code which is calling 
processRow().

> 7. UpdatePlanBuilder#planForBulkLoad:
 > 7.1 Odd empty line after method declaration.

FIXED with the same comment as in 4.4

> 7.2 Let's add VERY simple test case (like cache of  and bulk 
> load to it) - I think this may break on current implementation as we may 
> actually will to insert into _val column, and this case does not seem to be 
> handled - say, like this COPY from file to String () format csv (String is 
> table name, table has 3 columns all of which are system columns.)

Yes, I planned tests for auto-generated fields like _key and _val. But I think 
it makes sense to implement them as a part of this JIRA.

Although we decided to disallow allow empty or missing column list in the 
current implementation – this has to be handled as a separate JIRA.


was (Author: kirill.shirokov):
So you have 3 lists with the same numbering :)

Ok, list #1:

> 1. JdbcAbstractUpdateStatementSelfTest - unused code

FIXED

> 2. org.apache.ignite.internal.jdbc2.JdbcBulkLoadSelfTest: createConnection - 
> unused param; testBulkLoadThrows - please use try-with-resources.

FIXED

> 3. org.apache.ignite.jdbc.thin.JdbcThinAbstractDmlStatementSelfTest:
> 3.1 jdbcRun: please remove this method and use inherited method execute 
> instead.

FIXED

> 3.2 testDOA: please rename into something more meaningful.

FIXED (=> testBasicStatement())

> 4. BulkLoadContextCursor:
> 4.1 Please make field final.

FIXED

> 4.2 getAll - I changed the code to double call of singletonList just fine.

This time it worked.

> 

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-01-31 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345541#comment-16345541
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 1/31/18 2:47 PM:
--

[~kirill.shirokov], my comments:

1. {{JdbcAbstractUpdateStatementSelfTest}} - unused code

2. {{org.apache.ignite.internal.jdbc2.JdbcBulkLoadSelfTest: createConnection}} 
- unused param; {{testBulkLoadThrows}} - please use try-with-resources.

3. {{org.apache.ignite.jdbc.thin.JdbcThinAbstractDmlStatementSelfTest:}}

3.1 {{jdbcRun}}: please remove this method and use inherited method execute 
instead.

3.2 {{testDOA}}: please rename into something more meaningful.

4. {{BulkLoadContextCursor:}}

4.1 Please make field final.

4.2 {{getAll}} - I changed the code to double call of {{singletonList}} just 
fine.

4.3 {{getFieldName}} - I believe this method should throw 
{{NoSuchElementException}} for all {{i}} except {{0}}.

4.4 Odd empty line right after class header.

5. {{SqlBulkLoadCommand}}: let's parse command in a way consistent with how 
it's done in SQL usually - we clearly should not use ugly stuff like 
{{BATCH_SIZE}} (with underscore instead of space) now that we are headed to our 
own parser. Please rework parsing so that user has to write {{BATCH SIZE}}. 
(NB: yes, I'm aware that we already have quite ugly stuff here, like 
{{INLINE_SIZE}} for indexes, but I believe that this must be avoided as much as 
possible.) -Also I think we should support case when the user does not supply 
columns list (so no empty brackets too) - currently it's a parsing error and 
there's a test for it.- _(Please disregard what's been struck through.)_

6. {{UpdatePlan#processRow}}:

6.1 Odd empty line after method declaration.

6.2. Shouldn't we rather check for equality here? {{row.size() < 
colNames.length}} - like this

7. {{UpdatePlanBuilder#planForBulkLoad}}:

7.1 Odd empty line after method declaration.

7.2 Let's add VERY simple test case (like cache of  and bulk 
load to it) - I think this may break on current implementation as we may 
actually will to insert into _val column, and this case does not seem to be 
handled - say, like this {{COPY from file to String () format csv}} (String is 
table name, table has 3 columns all of which are system columns.)

 

This is not all, more comments will follow.

 

 


was (Author: al.psc):
[~kirill.shirokov], my comments:

1. {{JdbcAbstractUpdateStatementSelfTest}} - unused code

2. {{org.apache.ignite.internal.jdbc2.JdbcBulkLoadSelfTest: createConnection}} 
- unused param; {{testBulkLoadThrows}} - please use try-with-resources.

3. {{org.apache.ignite.jdbc.thin.JdbcThinAbstractDmlStatementSelfTest:}}

3.1 {{jdbcRun}}: please remove this method and use inherited method execute 
instead.

3.2 {{testDOA}}: please rename into something more meaningful.

4. {{BulkLoadContextCursor:}}

4.1 Please make field final.

4.2 {{getAll}} - I changed the code to double call of {{singletonList}} just 
fine.

4.3 {{getFieldName}} - I believe this method should throw 
{{NoSuchElementException}} for all {{i}} except {{0}}.

4.4 Odd empty line right after class header.

5. {{SqlBulkLoadCommand}}: let's parse command in a way consistent with how 
it's done in SQL usually - we clearly should not use ugly stuff like 
{{BATCH_SIZE}} (with underscore instead of space) now that we are headed to our 
own parser. Please rework parsing so that user has to write {{BATCH SIZE}}. 
(NB: yes, I'm aware that we already have quite ugly stuff here, like 
{{INLINE_SIZE}} for indexes, but I believe that this must be avoided as much as 
possible.) Also I think we should support case when the user does not supply 
columns list (so no empty brackets too) - currently it's a parsing error and 
there's a test for it.

6. {{UpdatePlan#processRow}}:

6.1 Odd empty line after method declaration.

6.2. Shouldn't we rather check for equality here? {{row.size() < 
colNames.length}} - like this

7. {{UpdatePlanBuilder#planForBulkLoad}}:

7.1 Odd empty line after method declaration.

7.2 Let's add VERY simple test case (like cache of  and bulk 
load to it) - I think this may break on current implementation as we may 
actually will to insert into _val column, and this case does not seem to be 
handled - say, like this {{COPY from file to String () format csv}} (String is 
table name, table has 3 columns all of which are system columns.)

 

This is not all, more comments will follow.

 

 

> SQL: implement COPY command for efficient data loading
> --
>
> Key: IGNITE-6917
> URL: https://issues.apache.org/jira/browse/IGNITE-6917
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Vladimir Ozerov
>Assignee: Kirill Shirokov
>

[jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

2018-01-31 Thread Alexander Paschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346774#comment-16346774
 ] 

Alexander Paschenko edited comment on IGNITE-6917 at 1/31/18 1:13 PM:
--

[~kirill.shirokov], please let me continue with comments.

1. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

2. {{BulkLoadParser}}:

2.1 Unused private field.

2.2 Too long line in {{processBatch}} declaration.

2.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, and these references are 
in deeper "guts". We need to decouple these somehow - one day we might work to 
support this feature in ODBC, or somewhere else, and then it'd be inevitable. 
I'd just introduce an enum for various command types.

3. {{BulkLoadFormat}}:

3.1 Unused import

3.2 Odd empty lines in the beginning of class, in switch, etc

4. {{BulkLoadParser}} and {{BulkLoadFormat}} seem to be related 1:1 - probably 
they should be single class and implementations could be merged accordingly 
too. Can you come up with a case where we would really need to separate this 
entities? Say, one parser for few formats, or one format yielding more than one 
type of parser?

5. {{BulkLoadCsvParser}}:

{{5.1 processBatch}} method looks like it could be actually defined in parent 
class, and the logic to be changed in subclasses actually lives in 
{{parseBatch}}. This is so because command types like {{CMD_FINISHED_EOF}} are 
defined on higher level. Can we rework hierarchy accordingly?

5.2 Unused parameter in ctor ( {{params}} ), do we have tests for params at all?

5.3 Odd empty lines after class header, method declarations

6. {{BulkLoadEntryConverter}}: do we really expect more than one implementation 
of this interface? If no, let's just replace it with {{IgniteClosureX}} or 
something like that. We have more than enough interfaces and abstract classes 
to represent lambdas. *Before refactoring this, please look at pt.10.3 as well*

7. {{BulkLoadCsvFormat}}: amount of useful code in this class approaches zero, 
and this further assures that parser and its format must be one.

8. {{BulkLoadContext}}: too long comment line.

9. {{BulkLoadCacheWriter}}: do we expect more than one implementation of this 
interface?quit If no, just remove this class and let {{BulkLoadContext}} know 
about streamer. If yes, make it inherit one of Ignite's provided 
classes/interfaces for lambdas. *Before refactoring this, please look at 
pt.10.3 as well*

10. {{JdbcRequestHandler}}:

10.1 Unused import

10.2 {{processBulkLoadFileBatch}}: all other methods just catch {{Exception}}, 
please do the same here.

10.3 {{processBulkLoadFileBatch}}: it looks like doing actual job (data 
conversion, feeding to streamer) is just plain wrong. Not only it's logic that 
could be reused when we'll implement same feature for other interface like 
JDBC, it's also not the area of responsibility of JDBC request handler. This 
again shows the need of refactoring of {{BulkLoadContext}}: probably we need to 
introduce a concrete class that will *a)* wrap streamer and *b)* do conversion. 
It will just accept {{List}} and do with it whatever it wants. JDBC 
must not know about this at all, it must just detect correct context and feed 
lists of objects to that new class I've mentioned above.

10.4 {{executeQuery}}: I'd move whole stuff of {{if (fieldsCur instanceof 
BulkLoadContextCursor)}} above, before {{if (results.size() == 1)}} - it would 
decrease scope depth, make code better readable and decrease overall size of 
diff (never hurts).

10.5 {{executeQuery}}: {{filesToSendResult}} - doesn't have anything at all to 
do with files.

11. {{JdbcBulkLoadContext}}: unused field {{cmdParsingResult}} (getter has 0 
usages), not to mention that field name is very misleading - what does JDBC 
have to do with parsing?

12. {{JdbcBulkLoadBatchRequestResult}}: quite weird name, I don't think we 
should mix words "request" and "result", we never do so in JDBC. Not to mention 
that we never send {{JdbcBulkLoadBatchRequest}} *before* we get 
{{JdbcBulkLoadBatchRequestResult}} :) I'd just rename this to 
{{JdbcBulkLoadAckResult}} as it's sent as an acknowledgement of {{COPY}} 
command and never else.

This is not all, more comments will follow.


was (Author: al.psc):
[~kirill.shirokov], please let me continue with comments.

1. {{BulkLoadParameters}} - do we really need this class? It just has two 
fields. For now I see no point in it. Do you expect many more new fields to 
come?

2. {{BulkLoadParser}}:

2.1 Unused private field.

2.2 Too long line in {{processBatch}} declaration.

2.3 It seems plain wrong that arguments of {{processBatch}} are related with 
JDBC. JDBC is a connectivity interface, nothing else, and these