Re: streaming pdf
And you have to write your own input format, but this is not so complicated
(and probably recommended anyway for the PDF case).

> On 20.11.2018 at 08:06, Jörn Franke wrote:
>
> Well, I am not so sure about the use cases, but what about using
> StreamingContext.fileStream?
> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
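Such a whole-file input format is indeed short to write. A minimal sketch follows; the class names `WholeFileInputFormat` and `WholeFileRecordReader` are illustrative (they are not shipped with Spark or Hadoop), and it assumes each PDF fits in memory on one executor:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, IOUtils, NullWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Emits each file as a single (NullWritable, BytesWritable) record.
class WholeFileInputFormat extends FileInputFormat[NullWritable, BytesWritable] {

  // A PDF cannot be parsed from the middle, so never split files.
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[NullWritable, BytesWritable] = new WholeFileRecordReader
}

class WholeFileRecordReader extends RecordReader[NullWritable, BytesWritable] {
  private var split: FileSplit = _
  private var conf: Configuration = _
  private var value: BytesWritable = _
  private var processed = false

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    split = inputSplit.asInstanceOf[FileSplit]
    conf = context.getConfiguration
  }

  override def nextKeyValue(): Boolean = {
    if (processed) return false
    val path = split.getPath
    val in = path.getFileSystem(conf).open(path)
    try {
      // getLength.toInt caps single files at 2 GB, fine for typical PDFs.
      val bytes = new Array[Byte](split.getLength.toInt)
      IOUtils.readFully(in, bytes, 0, bytes.length)
      value = new BytesWritable(bytes)
    } finally in.close()
    processed = true
    true
  }

  override def getCurrentKey: NullWritable = NullWritable.get
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = ()
}
```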
Re: streaming pdf
Well, I am not so sure about the use cases, but what about using
StreamingContext.fileStream?
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
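A sketch of what that could look like, assuming a custom whole-file Hadoop input format (here called `WholeFileInputFormat`, which is not part of Spark and would have to be written separately); the directory path is illustrative:

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("pdf-stream")
val ssc = new StreamingContext(conf, Seconds(30))

// Monitor a directory; each new PDF arrives as one (key, bytes) record.
val pdfs = ssc.fileStream[NullWritable, BytesWritable, WholeFileInputFormat](
  "hdfs:///landing/pdfs")

// BytesWritable buffers may be reused by Hadoop, so copy out the bytes
// before doing anything else with them.
pdfs.map { case (_, bytes) => bytes.copyBytes() }
    .foreachRDD { rdd =>
      rdd.foreach { raw =>
        // parse one PDF here, e.g. extract its text, then write out
      }
    }

ssc.start()
ssc.awaitTermination()
```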
Re: streaming pdf
> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
> Why does it have to be a stream?

Right now I manage the pipelines as spark batch processing. Moving to
stream would add some improvements, such as:
- simplification of the pipeline
- more frequent data ingestion
- better resource management (?)

--
nicolas
Re: streaming pdf
Why does it have to be a stream?

> On 18.11.2018 at 23:29, Nicolas Paris wrote:
>
> Hi
>
> I have pdf to load into spark with at least format. I have considered
> some options:
>
> - spark streaming does not provide a native file stream for binary with
>   variable size (binaryRecordsStream specifies a constant size) and I
>   would have to write my own receiver.
>
> - Structured streaming allows to process avro/parquet/orc files
>   containing pdfs, but this makes things more complicated than
>   monitoring a simple folder containing pdfs.
>
> - Kafka is not designed to handle messages > 100KB, and for this reason
>   it is not a good option to use in the stream pipeline.
>
> Does somebody have a suggestion?
>
> Thanks,
>
> --
> nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
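Worth noting for later readers: Spark 3.0 (released after this thread) added a `binaryFile` data source that reads whole files as rows with `path`, `modificationTime`, `length`, and `content` (raw bytes) columns, which covers this case without any avro/parquet wrapping. A minimal batch sketch (the directory path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pdf-batch").getOrCreate()

// One row per PDF; the content column holds the raw file bytes.
val pdfs = spark.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.pdf")
  .load("/landing/pdfs")
```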