[ 
https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-5272:
----------------------------------

    Assignee: Paul Rogers

> Text file reader is inefficient
> -------------------------------
>
>                 Key: DRILL-5272
>                 URL: https://issues.apache.org/jira/browse/DRILL-5272
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> Observations from inspecting ScanBatch and CompliantTextReader:
> Every batch holds about five implicit vectors. Their values are repeated for 
> every row, which can greatly increase the size of the incoming data.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16 
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] - 
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
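
To give a rough sense of the cost, here is a minimal sketch (not Drill code) that counts how many reallocations a buffer needs if it starts at 8 bytes and doubles each time it fills, as the 8 -> 16 log line above suggests. The class name, the method name, the 4096-row batch size, and the "one 4-byte offset per row plus one" sizing are all assumptions made for illustration.

```java
// Hypothetical model of a buffer that doubles when full. Not Drill's allocator.
final class ReallocationSketch {

    /**
     * Counts the doubling steps needed to grow a buffer from initialBytes
     * until its capacity is at least requiredBytes.
     */
    static int countReallocations(long initialBytes, long requiredBytes) {
        if (initialBytes <= 0) {
            throw new IllegalArgumentException("initialBytes must be positive");
        }
        int reallocations = 0;
        long capacity = initialBytes;
        while (capacity < requiredBytes) {
            capacity *= 2;      // each doubling stands for one reallocation and copy
            reallocations++;
        }
        return reallocations;
    }

    public static void main(String[] args) {
        long rows = 4096;                  // assumed batch size, illustration only
        long required = (rows + 1) * 4L;   // assumed: 4-byte offsets, rows + 1 entries

        // Starting small, as in the log line, versus sizing the buffer up front.
        System.out.println("from 8 bytes: " + countReallocations(8, required));
        System.out.println("pre-sized:    " + countReallocations(required, required));
    }
}
```

Under this model, a buffer that is sized for the expected row count before the first write needs no reallocation at all, which may be one way to avoid the repeated growth reported above.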
> Whether because of the above or some other issue, memory grows in the scan 
> batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query. 
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long 
> file name and path for every record. Since the values are common to every 
> record, perhaps we can use the same data copy for each, but have the offset 
> vector for each record just point to the single copy.
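
The single-copy idea in the last paragraph can be sketched as follows. This is only an illustration under stated assumptions: it uses a hypothetical layout in which each row stores a 4-byte start offset and all rows share one value length, and it compares that against a per-row copy with rows + 1 offsets. None of the class or method names come from Drill, and whether Drill's existing vectors can be adapted this way is not established here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "store the value once, point every row at it".
final class SharedValueSketch {

    /** Bytes used when the value is copied per row: data plus (rows + 1) 4-byte offsets. */
    static long perRowCopyBytes(String value, int rows) {
        long valueBytes = value.getBytes(StandardCharsets.UTF_8).length;
        return valueBytes * rows + 4L * (rows + 1);
    }

    /** Bytes used when one copy is stored and each row keeps a 4-byte start offset. */
    static long sharedCopyBytes(String value, int rows) {
        long valueBytes = value.getBytes(StandardCharsets.UTF_8).length;
        return valueBytes + 4L * rows;
    }

    /** Column that holds the bytes once; every row's start offset refers to that copy. */
    static final class SharedColumn {
        private final byte[] data;
        private final int[] starts;   // all zero: every row points at the same bytes

        SharedColumn(String value, int rows) {
            this.data = value.getBytes(StandardCharsets.UTF_8);
            this.starts = new int[rows];
        }

        String get(int row) {
            int start = starts[row];
            return new String(data, start, data.length - start, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        String path = "/data/logs/2017/02/file.csv";   // made-up path for illustration
        int rows = 1000;
        System.out.println("per-row copy: " + perRowCopyBytes(path, rows) + " bytes");
        System.out.println("shared copy:  " + sharedCopyBytes(path, rows) + " bytes");
        System.out.println("row 42 reads: " + new SharedColumn(path, rows).get(42));
    }
}
```

In this model the saving grows with both the path length and the row count, since the per-row layout pays for the value bytes on every record while the shared layout pays for them once.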



