But I don't see how that works here with Phoenix or an HBase coprocessor. Remember, we are joining two big data sets here: one is the big file in HDFS, the other is the records in HBase. The driving force comes from the Hadoop cluster.
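For the lookup side of such a join (and Ayan's question 3 below), the usual way to avoid opening one HBase connection per record is to open one connection per partition and reuse it, as Spark's mapPartitions allows. A minimal, self-contained sketch of that pattern follows; all names are mine, and an in-memory map stands in for the real HBase table so the sketch runs on its own:

```java
import java.util.*;

public class PartitionLookup {

    // Stand-in for an HBase connection; real code would wrap
    // org.apache.hadoop.hbase.client.Connection / Table instead.
    static class FakeHBase {
        private final Map<String, Map<String, String>> table;
        FakeHBase(Map<String, Map<String, String>> table) { this.table = table; }
        Map<String, String> get(String rowKey) { return table.get(rowKey); }
        void close() { /* release the connection once per partition */ }
    }

    // One connection serves the whole partition of incoming row keys,
    // instead of one connection per record.
    static List<Map<String, String>> lookupPartition(
            Map<String, Map<String, String>> store, List<String> rowKeys) {
        FakeHBase hbase = new FakeHBase(store);   // opened once
        List<Map<String, String>> results = new ArrayList<>();
        for (String key : rowKeys) {
            results.add(hbase.get(key));          // null means: record absent
        }
        hbase.close();                            // closed once
        return results;
    }
}
```

With a real connector the same shape applies: the per-partition function would issue batched `Get`s against the shared connection rather than probing an in-memory map.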
On Thu, Sep 3, 2015 at 11:37 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> If you use Pig or Spark, you increase the complexity significantly from an
> operations-management perspective. Spark should be evaluated from a platform
> perspective, if it makes sense for you. If you can do it directly with
> HBase/Phoenix, or with an HBase coprocessor alone, then that should be
> preferred. Otherwise you pay more for maintenance and development.
>
> On Thu, Sep 3, 2015 at 17:16, Tao Lu <taolu2...@gmail.com> wrote:
>
>> Yes, Ayan, your approach will work.
>>
>> Alternatively, use Spark and write a Scala/Java function which implements
>> logic similar to your Pig UDF.
>>
>> Both approaches look similar.
>>
>> Personally, I would go with the Spark solution: it will be slightly
>> faster, and easier if you already have a Spark cluster set up on top of
>> your Hadoop cluster in your infrastructure.
>>
>> Cheers,
>> Tao
>>
>> On Thu, Sep 3, 2015 at 1:15 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Thanks for your info. I am planning to implement a Pig UDF to do record
>>> lookups. Kindly let me know if this is a good idea.
>>>
>>> Best
>>> Ayan
>>>
>>> On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> You may check whether it makes sense to write a coprocessor doing the
>>>> upsert for you, if one does not exist already. Maybe Phoenix for HBase
>>>> supports this already.
>>>>
>>>> Another alternative, if the records do not have a unique id, is to put
>>>> them into a text index engine such as Solr or Elasticsearch, which in
>>>> this case does fast matching with relevancy scores.
>>>>
>>>> You can also use Spark and Pig here. However, I am not sure Spark is
>>>> suitable for these one-row lookups. The same holds for Pig.
>>>>
>>>> On Wed, Sep 2, 2015 at 23:53, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> Hello group
>>>>
>>>> I am trying to use Pig or Spark in order to achieve the following:
>>>>
>>>> 1. Write a batch process which will read from a file.
>>>> 2. Look up HBase to see if the record exists. If so, compare the
>>>> incoming values with HBase and update the fields which do not match.
>>>> Else create a new record.
>>>>
>>>> My questions:
>>>> 1. Is this a good use case for Pig or Spark?
>>>> 2. Is there any way to read HBase for each incoming record in Pig
>>>> without writing MapReduce code?
>>>> 3. In the case of Spark, I think we have to connect to HBase for every
>>>> record. Is there any other way?
>>>> 4. What is the best connector for HBase which gives this functionality?
>>>>
>>>> Best
>>>>
>>>> Ayan
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>> --
>> ------------------------------------------------
>> Thanks!
>> Tao

--
------------------------------------------------
Thanks!
Tao
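The compare-and-update step Ayan describes (step 2 of the batch process above) can be sketched independently of any connector. A minimal sketch, with all names my own: given the existing HBase row (null if absent) and the incoming record, return the row that should be written back, overwriting only the fields whose values differ.

```java
import java.util.*;

public class UpsertMerge {
    // Returns the row to write back: the incoming record as-is when the
    // row is new, otherwise the existing row with differing fields
    // overwritten by the incoming values.
    static Map<String, String> merge(Map<String, String> existing,
                                     Map<String, String> incoming) {
        if (existing == null) {
            return new HashMap<>(incoming);       // new record: insert as-is
        }
        Map<String, String> merged = new HashMap<>(existing);
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            // overwrite only fields that differ (or are missing)
            if (!Objects.equals(merged.get(e.getKey()), e.getValue())) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }
}
```

In a real pipeline this function would sit between the HBase `Get` and the `Put`; note that an HBase `Put` of only the changed columns achieves the same effect, since unmentioned columns keep their old values.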