You can just add elasticsearch-hadoop as a dependency to your project to use the ESInputFormat and ESOutputFormat ( https://github.com/elasticsearch/elasticsearch-hadoop ). Some other basics here: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
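As a quick sketch, the sbt dependency would look something like the following (the version number is a guess for the artifact current at the time; check Maven Central for the right one):

```scala
// build.sbt fragment: pull in elasticsearch-hadoop so that
// EsInputFormat / EsOutputFormat are on the classpath.
// Version is a placeholder; check Maven Central for the current release.
libraryDependencies += "org.elasticsearch" % "elasticsearch-hadoop" % "2.0.0"
```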
For testing, yes, I think you will need to start ES in local mode (just ./bin/elasticsearch) and use the default config (host = localhost, port = 9200).

On Thu, Jun 26, 2014 at 9:04 AM, boci <boci.b...@gmail.com> wrote:

> That's okay, but hadoop has ES integration. What happens if I run
> saveAsHadoopFile without hadoop (or must I pull up hadoop
> programmatically, if I can)?
>
> b0c1
>
> Skype: boci13, Hangout: boci.b...@gmail.com
>
> On Thu, Jun 26, 2014 at 1:20 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> On Wed, Jun 25, 2014 at 4:16 PM, boci <boci.b...@gmail.com> wrote:
>>
>>> Hi guys, thanks for the direction. Now I have some problems/questions:
>>> - In local (test) mode I want to use ElasticClient.local to create the
>>> ES connection, but in production I want to use ElasticClient.remote. To
>>> do this I want to pass the ElasticClient to mapPartitions, or what is
>>> the best practice?
>>
>> In this case you probably want to create the ElasticClient inside of
>> mapPartitions (since it isn't serializable), and if you want to use a
>> different client in local mode, just have a flag that controls which
>> type of client you create.
>>
>>> - My stream output is written to Elasticsearch. How can I test
>>> output.saveAsHadoopFile[ESOutputFormat]("-") in a local environment?
>>> - After storing the enriched data in ES, I want to generate aggregated
>>> data (EsInputFormat). How can I test that locally?
>>
>> I think the simplest thing to do would be to use the same client in both
>> modes and just start a single-node Elasticsearch cluster.
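A sketch of the pattern Holden describes, assuming the elastic4s-style ElasticClient API mentioned in the question (the `es-host` host name and the `useLocal` flag are placeholders, not anything from the thread):

```scala
import com.sksamuel.elastic4s.ElasticClient
import org.apache.spark.rdd.RDD

// Sketch: build the (non-serializable) client inside mapPartitions and
// switch on a flag between local and remote mode. elastic4s-style API
// assumed; "es-host" and useLocal are placeholders.
def enrich(rdd: RDD[String], useLocal: Boolean): RDD[String] =
  rdd.mapPartitions { records =>
    val client =
      if (useLocal) ElasticClient.local
      else ElasticClient.remote("es-host", 9300)
    // Materialize the partition before closing the client, since the
    // iterator is otherwise evaluated lazily after close().
    val out = records.map { r =>
      // ... query or index with `client` here ...
      r
    }.toList
    client.close()
    out.iterator
  }
```

The flag can come from your own config; the point is only that the client is constructed per partition, on the worker, never shipped in the closure.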
>>> Thanks guys
>>>
>>> b0c1
>>>
>>> On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>
>>>> So I'm giving a talk at the Spark Summit on using Spark &
>>>> ElasticSearch, but for now, if you want to see a simple demo which uses
>>>> Elasticsearch for geo input, you can take a look at my quick & dirty
>>>> implementation, TopTweetsInALocation (
>>>> https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala
>>>> ). This approach uses the ESInputFormat, which avoids the difficulty of
>>>> having to manually create Elasticsearch clients.
>>>>
>>>> This approach might not work for your data, e.g. if you need to create
>>>> a query for each record in your RDD. If this is the case, you could
>>>> instead look at using mapPartitions and setting up your Elasticsearch
>>>> connection inside of that, so you could then re-use the client for all
>>>> of the queries on each partition. This approach avoids having to
>>>> serialize the Elasticsearch connection because it is local to your
>>>> function.
>>>>
>>>> Hope this helps!
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>
>>>>> It's not used as the default serializer due to some compatibility
>>>>> issues and the requirement to register the classes.
>>>>>
>>>>> Which part are you getting as non-serializable? You need to serialize
>>>>> that class if you are sending it to Spark workers inside a map,
>>>>> reduce, mapPartitions, or any of the operations on an RDD.
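For the EsInputFormat side, a minimal local-mode read might look like the sketch below (the `tweets/tweet` index/type is a placeholder; `es.nodes` matches the default single-node setup mentioned earlier):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.spark.SparkContext
import org.elasticsearch.hadoop.mr.EsInputFormat

// Sketch: read documents from a locally running Elasticsearch node.
// "tweets/tweet" is a placeholder index/type.
val sc = new SparkContext("local[*]", "es-read-test")
val conf = new Configuration()
conf.set("es.resource", "tweets/tweet")  // placeholder index/type
conf.set("es.nodes", "localhost:9200")   // the default local node
val esRDD = sc.newAPIHadoopRDD(conf,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])
println(esRDD.count())
```

This is essentially what TopTweetsInALocation does, minus the geo query.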
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>> On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>>>>
>>>>>> I'm afraid persisting a connection across two tasks is a dangerous
>>>>>> act, as they can't be guaranteed to be executed on the same machine.
>>>>>> Your ES server may think it's a man-in-the-middle attack!
>>>>>>
>>>>>> I think it's possible to invoke a static method that gives you a
>>>>>> connection from a local 'pool', so nothing will sneak into your
>>>>>> closure, but that's too complex and there should be a better option.
>>>>>>
>>>>>> I've never used Kryo before; if it's that good, perhaps we should use
>>>>>> it as the default serializer.
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/ElasticSearch-enrich-tp8209p8222.html
>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> --
>>>> Cell : 425-233-8271
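Peng's "static method / local pool" idea can be sketched as a JVM-local lazy singleton: each worker JVM builds one client on first use, and nothing non-serializable is captured by the closure (elastic4s-style API assumed; the host is a placeholder):

```scala
import com.sksamuel.elastic4s.ElasticClient
import org.apache.spark.rdd.RDD

// A JVM-local singleton: referencing ClientPool.client inside a closure
// does not serialize the client; each worker resolves it in its own JVM
// and reuses it across tasks.
object ClientPool {
  lazy val client: ElasticClient = ElasticClient.remote("es-host", 9300)
}

def writeOut(rdd: RDD[String]): Unit =
  rdd.foreachPartition { records =>
    val c = ClientPool.client  // created once per worker JVM
    records.foreach { r =>
      // ... index r with c ...
    }
  }
```

This keeps one connection per executor rather than per partition, at the cost of never explicitly closing it; whether that trade-off is acceptable depends on your ES server's connection limits.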