Can you recommend one? Thanks.
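
(Not a library recommendation, but a possible workaround while evaluating parsers: the overly wide rows can be filtered out before the CSV data source ever sees them, by reading the file as plain text first. This is only a minimal sketch, assuming Spark 2.1, a comma delimiter with no quoted fields, and hypothetical paths and limits:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("filter-wide-rows").getOrCreate()

  // Hypothetical values: adjust to the real schema and paths.
  val maxColumns = 20480
  val inputPath  = "/path/to/input.csv"
  val stagedPath = "/path/to/staged.csv"

  // Read every line as plain text, so no CSV parser limit applies, and keep
  // only lines whose delimiter count stays within the column limit.
  // NOTE: a naive comma count - it does not account for quoted fields.
  spark.read.textFile(inputPath)
    .filter(line => line.count(_ == ',') < maxColumns)
    .write.text(stagedPath)

  // The staged file can then go through the normal CSV parsing step.
  val df = spark.read.option("mode", "DROPMALFORMED").csv(stagedPath)

(The extra pass over the data costs I/O, but it sidesteps the parser's column limit entirely.)
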
On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:

> You can change the CSV parser library.
>
> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>
> I did add mode -> DROPMALFORMED, but it still couldn't ignore the row, because the error is raised from the CSV library that Spark uses.
>
> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>
>> The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns. Choose mode "DROPMALFORMED".
>>
>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> Hi Takeshi, Jörn Franke,
>>
>> The problem is that even if I increase maxColumns, some lines still have more columns than the value I set, and a large limit costs a lot of memory. So I just want to skip any line that has more columns than the maxColumns I set.
>>
>> Regards,
>> Chanh
>>
>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>
>>> Is it not enough to set `maxColumns` in CSV options?
>>>
>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>
>>> // maropu
>>>
>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> The Spark CSV data source should be able to.
>>>>
>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files. One problem I am facing is that if one row of the CSV file has more columns than maxColumns (default is 20480), the parsing process stops:
>>>>
>>>> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
>>>> com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 2
>>>> Hint: Number of columns processed may have exceeded limit of 2 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
>>>> Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
>>>> Parser Configuration: CsvParserSettings:
>>>>
>>>> I did some investigation in the univocity <https://github.com/uniVocity/univocity-parsers> library, but the way it handles this case is to throw an error, which is why Spark stops the process.
>>>>
>>>> How can I skip the invalid row and just continue parsing the next valid one? Are there any libraries that can replace univocity for this job?
>>>>
>>>> Thanks & regards,
>>>> Chanh
>>>> --
>>>> Regards,
>>>> Chanh
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>> --
>> Regards,
>> Chanh
>
> --
> Regards,
> Chanh

--
Regards,
Chanh
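
(For reference, the options discussed in this thread can be set roughly like below. A minimal sketch against Spark 2.1; option names follow the CSVOptions source linked above, and the paths, the maxColumns value, and the spark-avro package coordinates are assumptions:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

  // Raise maxColumns above the widest expected row and drop rows that fail
  // to parse. Note that, as reported above, exceeding maxColumns raises a
  // TextParsingException inside the univocity parser before DROPMALFORMED
  // can discard the row, so the limit still has to cover the widest line.
  val df = spark.read
    .option("header", "true")          // assumption: the file has a header row
    .option("mode", "DROPMALFORMED")   // skip rows that cannot be parsed
    .option("maxColumns", "40000")     // hypothetical value above the widest row
    .csv("/path/to/input.csv")

  // Assumption: the external com.databricks:spark-avro package for Spark 2.1.
  df.write.format("com.databricks.spark.avro").save("/path/to/output")
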