On Wed, Jun 25, 2014 at 4:16 PM, boci <boci.b...@gmail.com> wrote:

> Hi guys, thanks for the directions. Now I have some questions:
> - in local (test) mode I want to use ElasticClient.local to create the ES
> connection, but in production I want to use ElasticClient.remote. To do
> this I want to pass the ElasticClient to mapPartitions, or what is the
> best practice?
>
In this case you probably want to create the ElasticClient inside of
mapPartitions (since it isn't serializable), and if you want to use a
different client in local mode, just have a flag that controls what type of
client you create.
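
A minimal sketch of what I mean (assuming elastic4s, and assuming
ElasticClient.remote takes a host and port; the host name is made up):

import com.sksamuel.elastic4s.ElasticClient

// A plain Boolean is serializable, so it is safe to capture in the
// closure; the client itself is created on the worker, once per partition.
val useLocalEs = sc.isLocal  // or read your own config flag

val enriched = input.mapPartitions { records =>
  val client =
    if (useLocalEs) ElasticClient.local
    else ElasticClient.remote("es-host", 9300)  // hypothetical host/port
  records.map { record =>
    // ... enrich `record` with lookups through `client` ...
    record
  }
}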

> - my stream output is written into Elasticsearch. How can I
> test output.saveAsHadoopFile[ESOutputFormat]("-") in a local environment?
>
> - After storing the enriched data in ES, I want to generate aggregated
> data (EsInputFormat). How can I test this locally?
>
I think the simplest thing to do would be to use the same client in both
modes and just start a single-node Elasticsearch cluster locally for tests.
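
For example, with elasticsearch-hadoop's EsOutputFormat the write side can
stay identical in both environments; only es.nodes changes. This is a
sketch that assumes `output` is a pair RDD of writables, and the index/type
names are placeholders:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}
import org.elasticsearch.hadoop.mr.EsOutputFormat

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputFormat(classOf[EsOutputFormat])
// In tests this points at your local single-node cluster;
// in production, at the real one. Nothing else needs to change.
jobConf.set("es.nodes", "localhost:9200")
jobConf.set("es.resource", "tweets/tweet")  // hypothetical index/type
// EsOutputFormat ignores the path, but the old API requires one
FileOutputFormat.setOutputPath(jobConf, new Path("-"))
output.saveAsHadoopDataset(jobConf)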

>
> Thanks guys
>
> b0c1
>
>
>
>
> ----------------------------------------------------------------------------------------------------------------------------------
> Skype: boci13, Hangout: boci.b...@gmail.com
>
>
> On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> So I'm giving a talk at the Spark Summit on using Spark & Elasticsearch,
>> but for now, if you want to see a simple demo which uses Elasticsearch for
>> geo input, you can take a look at my quick & dirty implementation,
>> TopTweetsInALocation (
>> https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala
>> ). This approach uses the EsInputFormat, which avoids the difficulty of
>> having to manually create Elasticsearch clients.
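>>
>> The read side of that looks roughly like this (a sketch only; the index
>> name and query are placeholders for your setup):
>>
>> import org.apache.hadoop.conf.Configuration
>> import org.apache.hadoop.io.{MapWritable, Text}
>> import org.elasticsearch.hadoop.mr.EsInputFormat
>>
>> val conf = new Configuration()
>> conf.set("es.nodes", "localhost:9200")   // your cluster
>> conf.set("es.resource", "tweets/tweet")  // hypothetical index/type
>> conf.set("es.query", "?q=*")             // optional filter query
>>
>> // each record comes back as (document id, map of the document's fields)
>> val esRDD = sc.newAPIHadoopRDD(conf,
>>   classOf[EsInputFormat[Text, MapWritable]],
>>   classOf[Text], classOf[MapWritable])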
>>
>> This approach might not work for your data, e.g. if you need to create a
>> query for each record in your RDD. If this is the case, you could instead
>> look at using mapPartitions and setting up your Elasticsearch connection
>> inside it, so that you can then re-use the client for all of the queries
>> on each partition. This approach avoids having to serialize the
>> Elasticsearch connection because it stays local to your function.
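>>
>> A rough sketch of that pattern (elastic4s-flavoured, so treat the query
>> DSL and all of the names as assumptions, not a definitive API):
>>
>> import scala.concurrent.Await
>> import scala.concurrent.duration._
>> import com.sksamuel.elastic4s.ElasticClient
>> import com.sksamuel.elastic4s.ElasticDsl._
>>
>> val results = rdd.mapPartitions { keywords =>
>>   // one client per partition, re-used for every query on it
>>   val client = ElasticClient.remote("es-host", 9300)  // hypothetical
>>   keywords.map { keyword =>
>>     val resp = Await.result(
>>       client.execute { search in "tweets/tweet" query keyword },
>>       10.seconds)
>>     (keyword, resp.getHits.getTotalHits)
>>   }
>> }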
>>
>> Hope this helps!
>>
>> Cheers,
>>
>> Holden :)
>>
>>
>> On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi <mayur.rust...@gmail.com>
>> wrote:
>>
>>> It's not used as the default serializer because of some compatibility
>>> issues and the requirement to register the classes.
>>>
>>> Which part are you getting as non-serializable? You need to serialize a
>>> class if you are sending it to Spark workers inside a map, reduce,
>>> mapPartitions, or any of the other operations on an RDD.
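>>>
>>> For reference, enabling Kryo and registering your classes looks roughly
>>> like this (the registrator and record class are placeholder names):
>>>
>>> import com.esotericsoftware.kryo.Kryo
>>> import org.apache.spark.SparkConf
>>> import org.apache.spark.serializer.KryoRegistrator
>>>
>>> class MyRegistrator extends KryoRegistrator {
>>>   override def registerClasses(kryo: Kryo) {
>>>     kryo.register(classOf[MyRecord])  // hypothetical class you ship
>>>   }
>>> }
>>>
>>> val conf = new SparkConf()
>>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>>   .set("spark.kryo.registrator", classOf[MyRegistrator].getName)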
>>>
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>>
>>>> I'm afraid persisting a connection across two tasks is a dangerous act,
>>>> as they can't be guaranteed to be executed on the same machine. Your ES
>>>> server may think it's a man-in-the-middle attack!
>>>>
>>>> I think it's possible to invoke a static method that gives you a
>>>> connection from a local 'pool', so nothing will sneak into your closure,
>>>> but it's too complex and there should be a better option.
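>>>>
>>>> Something like this is what I have in mind (just a sketch, and it
>>>> assumes your client is safe to share within one worker JVM):
>>>>
>>>> import com.sksamuel.elastic4s.ElasticClient
>>>>
>>>> object EsClientPool {
>>>>   // initialised lazily, at most once per worker JVM;
>>>>   // nothing here is captured inside a task closure
>>>>   lazy val client = ElasticClient.remote("es-host", 9300)  // hypothetical
>>>> }
>>>>
>>>> rdd.map { record =>
>>>>   val client = EsClientPool.client  // resolved locally on the worker
>>>>   // ... use `client` to look up `record` ...
>>>>   record
>>>> }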
>>>>
>>>> I've never used Kryo before; if it's that good, perhaps we should use
>>>> it as the default serializer.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/ElasticSearch-enrich-tp8209p8222.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>>
>
>


-- 
Cell : 425-233-8271
