That's okay, but Hadoop has ES integration. What happens if I run saveAsHadoopFile without Hadoop? (Or must I start Hadoop programmatically, if I even can?)
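[For reference, a hedged sketch of the saveAsHadoopFile question: Spark's Hadoop output methods only need the Hadoop client jars on the classpath, not a running Hadoop cluster, so elasticsearch-hadoop's EsOutputFormat can write straight to ES. The index name ("tweets/tweet"), node address, and document fields below are placeholders, not anything from the thread:]

```scala
// Sketch only: write an RDD to Elasticsearch through the Hadoop
// OutputFormat layer with no Hadoop cluster running. Assumes the
// elasticsearch-hadoop jar is on the classpath; "tweets/tweet" and
// "localhost:9200" are placeholder values.
import org.apache.hadoop.io.{MapWritable, NullWritable, Text}
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import org.elasticsearch.hadoop.mr.EsOutputFormat

def writeToEs(sc: SparkContext): Unit = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.setOutputFormat(classOf[EsOutputFormat])
  jobConf.set("es.resource", "tweets/tweet") // target index/type (placeholder)
  jobConf.set("es.nodes", "localhost:9200")  // single local node for testing

  val docs = sc.parallelize(Seq("hello", "world")).map { text =>
    val doc = new MapWritable()
    doc.put(new Text("message"), new Text(text))
    (NullWritable.get(), doc)
  }
  docs.saveAsHadoopDataset(jobConf)
}
```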
b0c1
----------------------------------------------------------------------------------------------------------------------------------
Skype: boci13, Hangout: boci.b...@gmail.com


On Thu, Jun 26, 2014 at 1:20 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> On Wed, Jun 25, 2014 at 4:16 PM, boci <boci.b...@gmail.com> wrote:
>
>> Hi guys, thanks for the direction. Now I have some problems/questions:
>>
>> - In local (test) mode I want to use ElasticClient.local to create the
>> ES connection, but in production I want to use ElasticClient.remote. For
>> this, should I pass the ElasticClient to mapPartitions, or what is the
>> best practice?
>
> In this case you probably want to make the ElasticClient inside of
> mapPartitions (since it isn't serializable), and if you want to use a
> different client in local mode, just have a flag that controls what type
> of client you create.
>
>> - My stream output is written into Elasticsearch. How can I test
>> output.saveAsHadoopFile[ESOutputFormat]("-") in a local environment?
>>
>> - After storing the enriched data in ES, I want to generate aggregated
>> data (EsInputFormat). How can I test that locally?
>
> I think the simplest thing to do would be to use the same client in both
> modes and just start a single-node Elasticsearch cluster.
>
>> Thanks guys
>>
>> b0c1
>>
>> On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> So I'm giving a talk at the Spark Summit on using Spark &
>>> Elasticsearch, but for now, if you want to see a simple demo which uses
>>> Elasticsearch for geo input, you can take a look at my quick & dirty
>>> implementation in TopTweetsInALocation (
>>> https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala
>>> ).
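[A hedged sketch of Holden's advice above: build the ElasticClient inside mapPartitions (it isn't serializable) and use a flag to pick a local client for tests vs. a remote one in production. It assumes the elastic4s API mentioned in the thread; the function name, host, port, and enrichment logic are all placeholders:]

```scala
// Sketch only: the flag is the only thing captured by the closure; the
// client is created on the worker, once per partition, so it is never
// serialized. Assumes elastic4s's ElasticClient.local / .remote.
import com.sksamuel.elastic4s.ElasticClient
import org.apache.spark.rdd.RDD

def enrich(records: RDD[String], useLocalEs: Boolean): RDD[String] =
  records.mapPartitions { iter =>
    val client =
      if (useLocalEs) ElasticClient.local              // embedded node for tests
      else ElasticClient.remote("es-host", 9300)       // placeholder host/port
    iter.map { record =>
      // ... query or index `record` with `client` here ...
      record
    }
  }
```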
>>> This approach uses the ESInputFormat, which avoids the difficulty of
>>> having to manually create Elasticsearch clients.
>>>
>>> This approach might not work for your data, e.g. if you need to create
>>> a query for each record in your RDD. If this is the case, you could
>>> instead look at using mapPartitions and setting up your Elasticsearch
>>> connection inside of that, so you could then re-use the client for all
>>> of the queries on each partition. This approach avoids having to
>>> serialize the Elasticsearch connection because it will be local to your
>>> function.
>>>
>>> Hope this helps!
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi
>>> <mayur.rust...@gmail.com> wrote:
>>>
>>>> It's not used as the default serializer because of some compatibility
>>>> issues and the requirement to register the classes.
>>>>
>>>> Which part are you getting as non-serializable? You need to serialize
>>>> a class if you are sending it to Spark workers inside a map, reduce,
>>>> mapPartitions, or any of the other operations on an RDD.
>>>>
>>>> Mayur Rustagi
>>>> Ph: +1 (760) 203 3257
>>>> http://www.sigmoidanalytics.com
>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>
>>>> On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>>>
>>>>> I'm afraid persisting a connection across two tasks is a dangerous
>>>>> act, as they can't be guaranteed to be executed on the same machine.
>>>>> Your ES server may think it's a man-in-the-middle attack!
>>>>>
>>>>> I think it's possible to invoke a static method that gives you a
>>>>> connection from a local 'pool', so nothing will sneak into your
>>>>> closure, but that's too complex and there should be a better option.
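[A hedged sketch of Peng's "static method / local pool" idea: a per-JVM lazy singleton on each worker, so the closure captures only a flag and never the client. It again assumes the elastic4s API from the thread; the object name, host, and port are placeholders:]

```scala
// Sketch only: the object lives in each worker JVM, so the client is
// constructed locally on first use and shared by all tasks in that JVM;
// nothing "sneaks into the closure". Lifecycle (closing the client) is
// left out for brevity.
import com.sksamuel.elastic4s.ElasticClient

object EsClientPool {
  @volatile private var client: ElasticClient = _

  def get(useLocal: Boolean): ElasticClient = synchronized {
    if (client == null) {
      client =
        if (useLocal) ElasticClient.local
        else ElasticClient.remote("es-host", 9300) // placeholder host/port
    }
    client
  }
}
```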
>>>>> I've never used Kryo before; if it's that good, perhaps we should use
>>>>> it as the default serializer.
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/ElasticSearch-enrich-tp8209p8222.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>
>>> --
>>> Cell : 425-233-8271
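[For reference, a hedged sketch of enabling Kryo, per Mayur's note that it is not the default serializer partly because classes must be registered. The registrator and registered class below are hypothetical examples, not from the thread:]

```scala
// Sketch only: register your record classes with Kryo and point Spark at
// the registrator via configuration. Spark 1.x-era settings.
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Byte]]) // register your own record classes here
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
```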