Re: streaming pdf

2018-11-19 Thread Jörn Franke
And you would have to write your own input format, but this is not so complicated 
(probably recommended anyway for the PDF case)
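
For the record, a minimal sketch of what such an input format could look like. The class name, method bodies, and buffering strategy are assumptions for illustration, not an existing library class: each PDF becomes one (path, bytes) record and files are never split.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{BytesWritable, IOUtils, Text}
    import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

    // Hypothetical whole-file input format: one PDF file = one (path, bytes) record.
    class WholePdfInputFormat extends FileInputFormat[Text, BytesWritable] {

      // Never split a PDF across records.
      override def isSplitable(context: JobContext, file: Path): Boolean = false

      override def createRecordReader(split: InputSplit,
          context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
        new RecordReader[Text, BytesWritable] {
          private var fileSplit: FileSplit = _
          private var conf: Configuration = _
          private var processed = false
          private val key = new Text()
          private val value = new BytesWritable()

          override def initialize(split: InputSplit, ctx: TaskAttemptContext): Unit = {
            fileSplit = split.asInstanceOf[FileSplit]
            conf = ctx.getConfiguration
          }

          // Read the whole file into memory exactly once.
          override def nextKeyValue(): Boolean = {
            if (processed) return false
            val path = fileSplit.getPath
            val in = path.getFileSystem(conf).open(path)
            try {
              val bytes = new Array[Byte](fileSplit.getLength.toInt)
              IOUtils.readFully(in, bytes, 0, bytes.length)
              key.set(path.toString)
              value.set(bytes, 0, bytes.length)
            } finally {
              in.close()
            }
            processed = true
            true
          }

          override def getCurrentKey: Text = key
          override def getCurrentValue: BytesWritable = value
          override def getProgress: Float = if (processed) 1.0f else 0.0f
          override def close(): Unit = ()
        }
    }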

> On 20.11.2018 at 08:06, Jörn Franke wrote:
> 
> Well, I am not so sure about the use cases, but what about using 
> StreamingContext.fileStream?
> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-


Re: streaming pdf

2018-11-19 Thread Jörn Franke
Well, I am not so sure about the use cases, but what about using 
StreamingContext.fileStream?
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
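
For illustration, a rough sketch of wiring a custom whole-file input format into fileStream. WholePdfInputFormat is the hypothetical class sketched in the follow-up above; the directory, batch interval, and processing step are assumptions.

    import org.apache.hadoop.io.{BytesWritable, Text}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object PdfFolderStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("pdf-folder-stream")
        val ssc = new StreamingContext(conf, Seconds(60))

        // Watches the directory and emits (path, bytes) pairs for newly arrived PDFs.
        val pdfs = ssc.fileStream[Text, BytesWritable, WholePdfInputFormat]("hdfs:///landing/pdfs")

        pdfs.foreachRDD { rdd =>
          rdd.foreach { case (path, bytes) =>
            // plug the actual PDF parsing in here, e.g. on bytes.copyBytes()
            println(s"${path.toString}: ${bytes.getLength} bytes")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }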




Re: streaming pdf

2018-11-19 Thread Nicolas Paris
On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
> Why does it have to be a stream?
> 

Right now I manage the pipelines as Spark batch processing. Moving to
streaming would add some improvements, such as:
- simplification of the pipeline
- more frequent data ingestion
- better resource management (?)



-- 
nicolas




Re: streaming pdf

2018-11-18 Thread Jörn Franke
Why does it have to be a stream?





streaming pdf

2018-11-18 Thread Nicolas Paris
Hi

I have PDFs to load into Spark with at least 
format. I have considered some options:

- Spark Streaming does not provide a native file stream for binary records of
  variable size (binaryRecordsStream requires a constant record length), so I
  would have to write my own receiver.

- Structured Streaming allows processing avro/parquet/orc files
  containing PDFs, but this makes things more complicated than
  monitoring a simple folder containing PDFs (a rough sketch of this
  option follows after this list).

- Kafka is not designed to handle messages > 100KB, and for this reason
  it is not a good option to use in the streaming pipeline.
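
A minimal sketch of the second option, assuming an upstream job already wraps the PDFs into parquet files with (path: string, content: binary) columns; the paths, column names, and the stand-in extraction UDF are all assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.{BinaryType, StringType, StructType}

    object PdfParquetStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("pdf-parquet-stream").getOrCreate()

        // File sources in Structured Streaming need an explicit schema.
        val schema = new StructType()
          .add("path", StringType)
          .add("content", BinaryType)

        // Stand-in for real PDF text extraction (replace with an actual PDF library).
        val extractText = udf((bytes: Array[Byte]) => s"${bytes.length} bytes")

        val pdfs = spark.readStream
          .schema(schema)
          .parquet("/landing/pdf-parquet")   // folder monitored for new parquet files

        val query = pdfs
          .withColumn("text", extractText(col("content")))
          .writeStream
          .format("parquet")
          .option("path", "/output/pdf-text")
          .option("checkpointLocation", "/output/checkpoints/pdf-text")
          .start()

        query.awaitTermination()
      }
    }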

Does somebody have a suggestion?

Thanks,

-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org