Hi,

Indeed, the -e parameter isn't documented for the bulk loader on the documentation page. This is definitely not intentional.

It also does indeed appear that there are some functional differences in the way the escape character is defined between psql (where there is no default escape character) and the bulk loader (where backslash is the default escape character).

Also, the ignore-errors flag should truly ignore errors -- I think the reason it isn't doing that here is that the error is being thrown from the CSV parsing library, and not from Phoenix itself. However, I agree that this shouldn't stop the bulk load tool. Could you please log these issues in the Phoenix jira (https://issues.apache.org/jira/browse/PHOENIX)?

A potential work-around (although I'm not sure if it will work) is to supply an empty character string as the escape character for the bulk loader, as follows: -e '' (that's the -e flag, followed by two single quotes). That might resolve this issue in the short term, although I'm not sure.
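
To spell that out with the table and file names from your example below, the full command would look something like the following. This is only a sketch based on the CsvBulkLoadTool invocation shown on the bulk load documentation page -- the exact jar name depends on your Phoenix version and installation, and as I said I'm not sure that an empty escape value will actually be accepted:

    hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -g -t A -i a.csv -e ''
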
- Gabriel

On Fri, Dec 9, 2016 at 2:28 AM, rubysina <[email protected]> wrote:
> ok. thank you.
>
> but there's no parameter -e on page
> http://phoenix.apache.org/bulk_dataload.html
> and, why doesn't the -g,--ignore-errors parameter work? if there are some
> lines ended with backslash, just ignore them, why fail?
>
> there's always something wrong in txt files. why not ignore it? how?
>
> and, if using the -e parameter, what character should I use?
> seems that I must find a special character, but I don't know which is
> correct.
> actually, I don't want to use any escape character.
> is there any special option like "escape off" or something else, so I can
> load anything without treating any character as an escape letter?
>
> some other products, like greenplum, do have such an interesting setting
> when bulk loading a txt file: escape: 'OFF'
>
> -----------------------------------------------------------------
> quote from http://phoenix.apache.org/bulk_dataload.html
>
> The following parameters can be used with the MapReduce loader.
>
> Parameter             Description
> -i,--input            Input CSV path (mandatory)
> -t,--table            Phoenix table name (mandatory)
> -a,--array-delimiter  Array element delimiter (optional)
> -c,--import-columns   Comma-separated list of columns to be imported
> -d,--delimiter        Input delimiter, defaults to comma
> -g,--ignore-errors    Ignore input errors
> -o,--output           Output path for temporary HFiles (optional)
> -s,--schema           Phoenix schema name (optional)
> -z,--zookeeper        Zookeeper quorum to connect to (optional)
> -it,--index-table     Index table name to load (optional)
>
> --------------------------------
>
> From: Gabriel Reid <[email protected]>
> Subject: Re: Error with lines ended with backslash when Bulk Data Loading
> Date: 2016-12-09 02:06 (+0800)
> List: [email protected]
>
> Hi
>
> Backslash is the default escape character that is used for parsing CSV
> data when running a bulk import, so it has a special meaning. You can
> supply a different (custom) escape character with the -e or --escape flag
> on the command line so that parsing your CSV files that include
> backslashes like this will run properly.
>
> - Gabriel
>
> ----- Original Message -----
> From: "rubysina" <[email protected]>
> To: "user" <[email protected]>
> Subject: Error with lines ended with backslash when Bulk Data Loading
> Date: 2016-12-08 16:11
>
> hi, I'm new to phoenix sql and here's a little problem.
>
> I'm following this page http://phoenix.apache.org/bulk_dataload.html
> I just found that the MapReduce importer could not load a file with lines
> ended with backslash, even with the -g parameter, i.e. ignore-errors:
> "java.io.IOException: EOF whilst processing escape sequence"
>
> but it's OK if the line contains a backslash that is not at the end of
> the line,
> and there's no problem when using psql.py to load the same file.
>
> why? how?
>
> thank you.
>
> -----------------------------------------------------------------------------------------------
> for example:
>
> create table a(a char(100) primary key)
>
> echo \\>a.csv
> cat a.csv
> \
> hdfs dfs -put a.csv
> ...JsonBulkLoadTool -g -t a -i a.csv
> -- error
> 16/12/08 15:44:21 INFO mapreduce.Job: Task Id : attempt_1481093434027_0052_m_000000_0, Status : FAILED
> Error: java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: EOF whilst processing escape sequence
>     at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:202)
>     at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:74)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.RuntimeException: java.io.IOException: EOF whilst processing escape sequence
>     at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398)
>     at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407)
>     at com.google.common.collect.Iterators.getNext(Iterators.java:890)
>     at com.google.common.collect.Iterables.getFirst(Iterables.java:781)
>     at org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:109)
>     at org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:91)
>     at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:161)
>     ... 9 more
>
> echo \\a>a.csv
> cat a.csv
> \a
> hdfs dfs -rm a.csv
> hdfs dfs -put a.csv
> ...JsonBulkLoadTool -g -t a -i a.csv
> -- success
>
> echo \\>a.csv
> cat a.csv
> \
> psql.py -t A zoo a.csv
> CSV Upsert complete. 1 rows upserted
> -- success
>
> thank you.
