If your records contain embedded newlines, the files become unsuitable for splitting. This means that the only parallelism available in a CTAS statement comes from having multiple files.
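If there are only a few large files, one way to get more of them is to pre-split the CSV by record rather than by line. The following is a minimal sketch, not something Drill provides: the function name `split_csv_records` and the `part_NNN.csv` file names are hypothetical, and it assumes the input is standard quoted CSV in UTF-8 with no header row. It relies on Python's `csv` module, which keeps quoted newlines inside a field.

```python
import csv
from pathlib import Path


def split_csv_records(src, out_dir, n_parts):
    """Distribute CSV records round-robin over n_parts files.

    Records are parsed with the csv module, so a quoted field that
    contains newlines is kept whole and never cut across two files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: part_000.csv, part_001.csv, ...
    paths = [out_dir / f"part_{i:03d}.csv" for i in range(n_parts)]
    handles = [p.open("w", newline="", encoding="utf-8") for p in paths]
    try:
        writers = [csv.writer(h) for h in handles]
        # newline="" lets the csv reader see embedded newlines unchanged.
        with open(src, newline="", encoding="utf-8") as f:
            for i, row in enumerate(csv.reader(f)):
                writers[i % n_parts].writerow(row)
    finally:
        for h in handles:
            h.close()
    return paths
```

For example, `split_csv_records("texts.csv", "texts_split", 16)` would write 16 files into `texts_split/`. This only increases the number of files available to a CTAS; it does not change how the embedded newlines themselves are parsed.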
Do you have a fair number of files?

> On Feb 1, 2016, at 7:26, Nicolas Paris <nipari...@gmail.com> wrote:
>
> Hello Abdel,
>
> I am creating Parquet files from those CSV files (CREATE TABLE syntax).
> Basically, I have a text column, with a maximum of 50k characters,
> containing newlines (the texts were extracted from PDFs). I have
> millions of tuples of text. I am subsetting texts containing some patterns
> (LIKE '%foo%' or regex; sadly, I haven't found any mention of regex in the
> documentation (an equivalent of the PostgreSQL "~" operator)).
> Usually I use PostgreSQL or MonetDB to mine the texts, but I am
> benchmarking/studying Apache Drill too.
>
> Thanks,
>
> 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <adene...@maprtech.com>:
>
>> Hey Nicolas,
>>
>> What kind of queries are you running on your CSV file?
>>
>> On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <nipari...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I am trying to import a CSV containing large texts. They contain the
>>> newline character "\n".
>>> Apache Drill complains about that. There is a JIRA issue opened on
>>> https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
>>>
>>> Is there a workaround? (Other than removing \n from the texts.)
>>>
>>> Thanks in advance
>>
>> --
>> Abdelhakim Deneche
>> Software Engineer
>> <http://www.mapr.com/>