And you would have to write your own input format, but that is not too complicated (and probably advisable for the PDF case anyway).
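A minimal sketch (plain Scala, no Spark or Hadoop dependencies; the object and method names are illustrative, not part of any API) of the whole-file read such a custom input format would perform: pair each PDF's filename with its full byte content. In a real Hadoop input format this logic would sit inside a RecordReader that treats each file as a single unsplittable record.

```scala
import java.io.File
import java.nio.file.Files

object WholeFileSketch {
  /** Read every *.pdf directly under `dir` as (filename, byte_array). */
  def loadPdfBytes(dir: String): Seq[(String, Array[Byte])] =
    new File(dir).listFiles()
      .filter(f => f.isFile && f.getName.endsWith(".pdf"))
      .sortBy(_.getName)                                  // deterministic order
      .map(f => (f.getName, Files.readAllBytes(f.toPath))) // one record per file
      .toSeq
}
```

Reading each file whole is what makes variable-size binaries workable: no record-length assumption is needed, at the cost of holding one document in memory at a time.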
> On 20.11.2018 at 08:06, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Well, I am not so sure about the use cases, but what about using
> StreamingContext.fileStream?
> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
>
>> On 19.11.2018 at 09:22, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
>>
>>> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
>>> Why does it have to be a stream?
>>
>> Right now I manage the pipelines as Spark batch processing. Moving to
>> streaming would bring some improvements, such as:
>> - simplification of the pipeline
>> - more frequent data ingestion
>> - better resource management (?)
>>
>>>> On 18.11.2018 at 23:29, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
>>>>
>>>> Hi
>>>>
>>>> I have PDFs to load into Spark in at least <filename, byte_array>
>>>> format. I have considered some options:
>>>>
>>>> - Spark Streaming does not provide a native file stream for binary
>>>>   records of variable size (binaryRecordsStream expects a constant
>>>>   record length), so I would have to write my own receiver.
>>>>
>>>> - Structured Streaming can process avro/parquet/orc files containing
>>>>   PDFs, but this is more complicated than monitoring a simple folder
>>>>   of PDFs.
>>>>
>>>> - Kafka is not designed to handle messages larger than about 100 KB,
>>>>   so it is not a good option for this pipeline.
>>>>
>>>> Does somebody have a suggestion?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> nicolas
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
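The second option in the thread, packing many small PDFs into container files that a structured stream then processes, can be illustrated without Spark. Below is a hedged sketch in plain Scala using a toy length-prefixed format as a stand-in for avro/parquet/orc; `PdfContainer`, `pack`, and `unpack` are illustrative names, not any library's API.

```scala
import java.io.{DataInputStream, DataOutputStream, EOFException, FileInputStream, FileOutputStream}

object PdfContainer {
  /** Write (filename, byte_array) records into one container file. */
  def pack(records: Seq[(String, Array[Byte])], path: String): Unit = {
    val out = new DataOutputStream(new FileOutputStream(path))
    try records.foreach { case (name, bytes) =>
      out.writeUTF(name)         // filename
      out.writeInt(bytes.length) // payload length prefix
      out.write(bytes)           // payload
    } finally out.close()
  }

  /** Read all (filename, byte_array) records back from a container file. */
  def unpack(path: String): Seq[(String, Array[Byte])] = {
    val in = new DataInputStream(new FileInputStream(path))
    val buf = scala.collection.mutable.ArrayBuffer.empty[(String, Array[Byte])]
    try {
      while (true) {
        val name  = in.readUTF()
        val bytes = new Array[Byte](in.readInt())
        in.readFully(bytes)
        buf += ((name, bytes))
      }
    } catch { case _: EOFException => () } finally in.close()
    buf.toSeq
  }
}
```

The trade-off the thread points out stands: the container adds a packing step in front of the folder of PDFs, which is exactly the extra complication compared with monitoring the folder directly.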