Can you recommend one? Thanks.
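
(Not a library recommendation, but a possible workaround while evaluating parsers: the overly wide rows can be filtered out before the CSV data source ever sees them, by reading the file as plain text first. This is only a minimal sketch, assuming Spark 2.1, a comma delimiter with no quoted fields, and hypothetical paths and limits:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("filter-wide-rows").getOrCreate()

  // Hypothetical values: adjust to the real schema and paths.
  val maxColumns = 20480
  val inputPath  = "/path/to/input.csv"
  val stagedPath = "/path/to/staged.csv"

  // Read every line as plain text, so no CSV parser limit applies, and keep
  // only lines whose delimiter count stays within the column limit.
  // NOTE: a naive comma count - it does not account for quoted fields.
  spark.read.textFile(inputPath)
    .filter(line => line.count(_ == ',') < maxColumns)
    .write.text(stagedPath)

  // The staged file can then go through the normal CSV parsing step.
  val df = spark.read.option("mode", "DROPMALFORMED").csv(stagedPath)

(The extra pass over the data costs I/O, but it sidesteps the parser's column limit entirely.)
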
On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:

> You can change the CSV parser library.
>
> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>
> I did add mode -> DROPMALFORMED, but it still couldn't ignore the row, because the error is raised from the CSV library that Spark uses.
>
> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>
>> The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns. Choose mode "DROPMALFORMED".
>>
>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> Hi Takeshi, Jörn Franke,
>>
>> The problem is that even if I increase maxColumns, some lines still have more columns than the value I set, and a large limit costs a lot of memory. So I just want to skip any line that has more columns than the maxColumns I set.
>>
>> Regards,
>> Chanh
>>
>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>
>>> Is it not enough to set `maxColumns` in CSV options?
>>>
>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>
>>> // maropu
>>>
>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> The Spark CSV data source should be able to.
>>>>
>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files. One problem I am facing is that if one row of the CSV file has more columns than maxColumns (default is 20480), the parsing process stops:
>>>>
>>>> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
>>>> com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 2
>>>> Hint: Number of columns processed may have exceeded limit of 2 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
>>>> Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
>>>> Parser Configuration: CsvParserSettings:
>>>>
>>>> I did some investigation in the univocity <https://github.com/uniVocity/univocity-parsers> library, but the way it handles this case is to throw an error, which is why Spark stops the process.
>>>>
>>>> How can I skip the invalid row and just continue parsing the next valid one? Are there any libraries that can replace univocity for this job?
>>>>
>>>> Thanks & regards,
>>>> Chanh
>>>> --
>>>> Regards,
>>>> Chanh
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>> --
>> Regards,
>> Chanh
>
> --
> Regards,
> Chanh

--
Regards,
Chanh
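
(For reference, the options discussed in this thread can be set roughly like below. A minimal sketch against Spark 2.1; option names follow the CSVOptions source linked above, and the paths, the maxColumns value, and the spark-avro package coordinates are assumptions:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

  // Raise maxColumns above the widest expected row and drop rows that fail
  // to parse. Note that, as reported above, exceeding maxColumns raises a
  // TextParsingException inside the univocity parser before DROPMALFORMED
  // can discard the row, so the limit still has to cover the widest line.
  val df = spark.read
    .option("header", "true")          // assumption: the file has a header row
    .option("mode", "DROPMALFORMED")   // skip rows that cannot be parsed
    .option("maxColumns", "40000")     // hypothetical value above the widest row
    .csv("/path/to/input.csv")

  // Assumption: the external com.databricks:spark-avro package for Spark 2.1.
  df.write.format("com.databricks.spark.avro").save("/path/to/output")
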