Neal Richardson created ARROW-14705:
---------------------------------------

             Summary: [C++] unify_schemas can't handle int64 + double, affects CSV dataset
                 Key: ARROW-14705
                 URL: https://issues.apache.org/jira/browse/ARROW-14705
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, R
            Reporter: Neal Richardson


This came from a Twitter question: "how can I make arrow's csv reader not make
int64 for integers". It turns out to originate from a scenario where some CSVs
in a directory have only integer values in a column while others have decimals
in that same column, and you can't use them together in a dataset.
{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ds_dir <- tempfile()
dir.create(ds_dir)
cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))

ds <- open_dataset(ds_dir, format = "csv")
ds
#> FileSystemDataset with 2 csv files
#> a: int64

## It just picked the schema of the first file
collect(ds)
#> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
#> ../src/arrow/csv/converter.cc:492  decoder_.Decode(data, size, quoted, &value)
#> ../src/arrow/csv/parser.h:123  status
#> ../src/arrow/csv/converter.cc:496  parser.VisitColumn(col_index, visit)
#> ../src/arrow/csv/reader.cc:462  internal::UnwrapOrRaise(maybe_decoded_arrays)
#> ../src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
#> ../src/arrow/record_batch.cc:318  ReadNext(&batch)
#> ../src/arrow/record_batch.cc:329  ReadAll(&batches)

## Let's try again and tell it to unify schemas. Should result in a float64 type
ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
#> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
#> ../src/arrow/type.cc:1621  fields_[i]->MergeWith(field)
#> ../src/arrow/type.cc:1684  AddField(field)
#> ../src/arrow/type.cc:1755  builder.AddSchema(schema)
#> ../src/arrow/dataset/discovery.cc:251  Inspect(options.inspect_options)
{code}
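
For discussion, here is a rough sketch of the merge behavior I'd expect when unifying: identical types merge to themselves, int64 + double widens to double, and anything else is still an error. This is not Arrow code. `ColType`, `PromoteTypes`, `UnifyInto`, and the map-based `Schema` are hypothetical stand-ins made up for illustration, and it uses only the C++17 standard library.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for a column type (not arrow::DataType).
enum class ColType { Int64, Double, String };

// Hypothetical promotion rule: equal types merge to themselves and
// int64 + double widens to double. Every other pair is unmergeable.
// Caveat: int64 -> double loses precision for magnitudes above 2^53.
std::optional<ColType> PromoteTypes(ColType a, ColType b) {
  if (a == b) return a;
  const bool int_and_double =
      (a == ColType::Int64 && b == ColType::Double) ||
      (a == ColType::Double && b == ColType::Int64);
  if (int_and_double) return ColType::Double;
  return std::nullopt;
}

// Hypothetical schema: field name -> type.
using Schema = std::map<std::string, ColType>;

// Merge `incoming` into `unified`. Returns false at the first incompatible
// field; fields merged before that point stay modified in `unified`.
bool UnifyInto(Schema& unified, const Schema& incoming) {
  for (const auto& [name, type] : incoming) {
    auto it = unified.find(name);
    if (it == unified.end()) {
      unified.emplace(name, type);  // new field, nothing to reconcile
      continue;
    }
    const auto merged = PromoteTypes(it->second, type);
    if (!merged) return false;
    it->second = *merged;
  }
  return true;
}
```

With the two files above, the first schema would be `{a: Int64}` and the second `{a: Double}`, so unifying them under this rule would give `{a: Double}` instead of failing. Whether that widening should be the default or opt-in is an open question, given the precision caveat.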



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
