Well, I am not so sure about the use cases, but what about using StreamingContext.fileStream? https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
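To make the suggestion concrete, here is a minimal sketch of how `fileStream` could be used to watch a folder of PDFs and produce (filename, byte_array) pairs. Note the hedges: Hadoop does not ship a whole-file input format, so the `WholeFileInputFormat` below is a hypothetical helper one would have to write (by subclassing `FileInputFormat` with `isSplitable = false` so each PDF becomes a single record); the directory path, app name, and batch interval are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, IOUtils, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical helper (Hadoop provides no whole-file format out of the box):
// reads each file as one (path, bytes) record by disabling input splitting.
class WholeFileInputFormat extends FileInputFormat[Text, BytesWritable] {
  override def isSplitable(context: JobContext, filename: Path): Boolean = false

  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
    new RecordReader[Text, BytesWritable] {
      private var processed = false
      private val key = new Text()
      private val value = new BytesWritable()
      private var fileSplit: FileSplit = _
      private var conf: Configuration = _

      override def initialize(s: InputSplit, ctx: TaskAttemptContext): Unit = {
        fileSplit = s.asInstanceOf[FileSplit]
        conf = ctx.getConfiguration
      }

      override def nextKeyValue(): Boolean = {
        if (processed) return false
        val path = fileSplit.getPath
        val in = path.getFileSystem(conf).open(path)
        try {
          // Read the whole file into one record.
          val bytes = new Array[Byte](fileSplit.getLength.toInt)
          IOUtils.readFully(in, bytes, 0, bytes.length)
          key.set(path.toString)
          value.set(bytes, 0, bytes.length)
        } finally {
          IOUtils.closeStream(in)
        }
        processed = true
        true
      }

      override def getCurrentKey: Text = key
      override def getCurrentValue: BytesWritable = value
      override def getProgress: Float = if (processed) 1.0f else 0.0f
      override def close(): Unit = ()
    }
}

object PdfStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("pdf-ingest"), Seconds(30))

    // Monitor a folder; pick up only newly arriving *.pdf files.
    val pdfs = ssc.fileStream[Text, BytesWritable, WholeFileInputFormat](
      "hdfs:///incoming/pdfs",                        // placeholder directory
      (path: Path) => path.getName.endsWith(".pdf"),
      newFilesOnly = true,
      conf = new Configuration())

    // (filename, byte_array), as asked for in the original question.
    val named = pdfs.map { case (name, bytes) => (name.toString, bytes.copyBytes()) }

    named.foreachRDD(rdd => rdd.count())  // placeholder sink; replace with real output
    ssc.start()
    ssc.awaitTermination()
  }
}
```

This avoids both the custom receiver and the avro/parquet wrapping step: the folder itself is the stream source, and each micro-batch contains only the files that arrived since the last one.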
> On 19 Nov 2018, at 09:22, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
>
>> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
>> Why does it have to be a stream?
>
> Right now I manage the pipelines as Spark batch processing. Moving to
> streaming would bring some improvements, such as:
> - simplification of the pipeline
> - more frequent data ingestion
> - better resource management (?)
>
>> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
>> Why does it have to be a stream?
>>
>>> On 18 Nov 2018, at 23:29, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
>>>
>>> Hi
>>>
>>> I have PDFs to load into Spark, in at least <filename, byte_array>
>>> format. I have considered some options:
>>>
>>> - Spark Streaming does not provide a native file stream for binaries of
>>>   variable size (binaryRecordsStream expects a constant record size), so
>>>   I would have to write my own receiver.
>>>
>>> - Structured Streaming can process avro/parquet/orc files containing
>>>   PDFs, but that makes things more complicated than monitoring a simple
>>>   folder of PDFs.
>>>
>>> - Kafka is not designed to handle messages larger than ~100 KB, so it is
>>>   not a good option for this stream pipeline.
>>>
>>> Does somebody have a suggestion?
>>>
>>> Thanks,
>>>
>>> --
>>> nicolas
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> nicolas