Handling imperfect data

2020-04-07 Thread Cameron Bateman
I am trying to create a pipeline that takes in PDF files, parses them
using Tika, and processes the data.  A problem I have is that Tika sometimes
fails to convert certain pieces of text correctly.

I can detect when this happens and would like to fork the output of my
pipeline: for correctly converted PDF files, I want to continue processing
the data.  For the ones that have errors, I'd like to dump the intermediate
XML data to a directory and raise an alert.  For those files, I will
manually fix the file and effectively restart the pipeline from where it
failed, as if it had been correct in the first place.

Is there any facility for this sort of handling of imperfect data
inputs?  I see that I can use MultiOutputReceiver and TupleTags to
fork the data, but I'm a little at a loss as to how to proceed.

Thanks,
Cameron
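
A minimal sketch of the fork described above, written in plain Java rather
than as Beam code so that it runs without any SDK on the classpath.
ParseRouter, Parsed and the looksBroken check are hypothetical names, not
part of Tika or Beam. In a Beam pipeline the same decision would presumably
sit inside a DoFn that emits to two TupleTags through the
MultiOutputReceiver mentioned in the question; here the clean documents are
simply returned and the broken ones are written as XML files into a
quarantine directory for manual repair.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper: forks parsed documents into a clean list and a quarantine directory. */
class ParseRouter {

    /** A parsed document: a name plus its intermediate XML (both fields are assumptions). */
    static final class Parsed {
        public final String name;
        public final String xml;

        Parsed(String name, String xml) {
            this.name = name;
            this.xml = xml;
        }
    }

    /**
     * Returns the documents that pass the check. Every document that fails it
     * is written to quarantineDir as "<name>.xml" and left out of the result.
     */
    static List<Parsed> route(List<Parsed> docs,
                              Predicate<Parsed> looksBroken,
                              Path quarantineDir) throws IOException {
        Files.createDirectories(quarantineDir);
        List<Parsed> good = new ArrayList<>();
        for (Parsed doc : docs) {
            if (looksBroken.test(doc)) {
                // Dump the intermediate XML so it can be fixed by hand later.
                Path target = quarantineDir.resolve(doc.name + ".xml");
                Files.write(target, doc.xml.getBytes(StandardCharsets.UTF_8));
            } else {
                good.add(doc);
            }
        }
        return good;
    }
}
```

Raising the alert and re-reading repaired files are left out. A repaired
file could re-enter at whatever step consumes the returned list, which is
what makes the restart look as if the parse had succeeded in the first
place.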


Re: Using Self signed root ca for https connection in ElasticsearchIO

2020-04-07 Thread Kenneth Knowles
Hi Mohil,

Thanks for the detailed report. I think most people are at reduced capacity
right now. Filing a Jira would be helpful for tracking this.

Since I am writing, I will add a quick guess, but we should move to Jira.
It seems this has more to do with Dataflow than ElasticSearch. The default
for staging files is to scan the classpath. To do more, or to fix any
problem with the autodetection, you will need to specify --filesToStage on
the command line or setFilesToStage in Java code. Am I correct that this
symptom is confirmed?

Kenn

On Mon, Apr 6, 2020 at 5:04 PM Mohil Khare wrote:

> Any update on this? Shall I open a Jira for this support?
>
> Thanks and regards
> Mohil
>
> On Sun, Mar 22, 2020 at 9:36 PM Mohil Khare wrote:
>
>> Hi,
>> This is Mohil from Prosimo, a small Bay Area based stealth mode startup.
>> We use Beam (version 2.19) with Google Dataflow in our analytics
>> pipeline, with Kafka and PubSub as sources and GCS, BigQuery and
>> ElasticSearch as sinks.
>>
>> We want to use our private self signed root CA for TLS connections
>> between our internal services, viz. Kafka, ElasticSearch, Beam etc. We are
>> able to set up a secure TLS connection between Beam and Kafka by putting
>> the self signed root certificate in keystore.jks and truststore.jks and
>> transferring them to the worker VMs running KafkaIO, using KafkaIO's read
>> via withConsumerFactoryFn().
>>
>> We want to do something similar with ElasticsearchIO: update
>> its worker VMs' truststore with our self signed root certificate so that
>> when ElasticsearchIO connects using HTTPS, it can connect successfully
>> without an SSL handshake failure. So far we couldn't find any way to do so
>> with ElasticsearchIO. We tried various possible workarounds, such as:
>>
>> 1. Using JvmInitializer to initialise the JVM with the truststore via
>> System.setProperty for javax.net.ssl.trustStore,
>> 2. Transferring our jar to GCP's App Engine, where we start the jar with
>> -Djavax.net.ssl.trustStore and then trigger the Beam job from there.
>> 3. Setting the ElasticsearchIO flag withTrustSelfSigned to true (I don't
>> think it will work because, looking at the source code, it looks like it
>> depends on keystorePath)
>>
>> But nothing worked. When we logged in to the worker VMs, it looked like our
>> trustStore never made it to the worker VMs. All ElasticsearchIO connections
>> failed with the following exception:
>>
>> sun.security.validator.ValidatorException: PKIX path building failed:
>> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
>> valid certification path to requested target
>> sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
>>
>>
>> Right now, to unblock ourselves, we have added a proxy with the Letsencrypt
>> root CA between Beam and the Elasticsearch cluster, and Beam's
>> ElasticsearchIO connects successfully to the proxy using the Letsencrypt
>> root certificate. We don't want to use the Letsencrypt root certificate for
>> internal services as it expires every three months.  Is there a way, just
>> like with KafkaIO, to use a self signed root certificate with
>> ElasticsearchIO? Or is there a way to update the Java cacerts on the worker
>> VMs where the Beam job is running?
>>
>> Looking forward to some suggestions soon.
>>
>> Thanks and Regards
>> Mohil Khare
>>
>
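
For reference, a minimal JDK-only sketch of what trusting a private root CA
involves on the JVM side: load a JKS truststore from a local file and build
an SSLContext that trusts only the certificates in it. TrustStoreContext and
fromTrustStore are hypothetical names, and the thread does not establish
whether or how ElasticsearchIO can be handed such a context. The sketch also
assumes the truststore file is already present on the machine that runs the
code, which is the part that failed on the worker VMs above.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

/** Hypothetical helper: builds an SSLContext backed by a custom JKS truststore. */
class TrustStoreContext {

    /** Loads the truststore at the given local path and returns a TLS context that trusts it. */
    static SSLContext fromTrustStore(Path trustStorePath, char[] password) throws Exception {
        // Read the truststore (for example truststore.jks) from local disk.
        KeyStore trustStore = KeyStore.getInstance("JKS");
        try (InputStream in = Files.newInputStream(trustStorePath)) {
            trustStore.load(in, password);
        }

        // Derive trust managers from the certificates in that truststore only.
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        // No client key material and the default SecureRandom.
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, tmf.getTrustManagers(), null);
        return context;
    }
}
```

Setting javax.net.ssl.trustStore has the same precondition: the property
names a path on the local filesystem, so it only helps if the file exists at
that path on every worker.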


Side input of size around 50Mb causing long GC pause

2020-04-07 Thread Kiran Hurakadli
I am facing an issue related to side inputs and have described it at this link:
https://stackoverflow.com/questions/60900937/side-input-of-size-around-50mb-causing-long-gc-pause
Any help would be appreciated.
-- 
Regards,
*Kiran M Hurakadli.*