Hi Dawid, that's great! Of course, whenever you can. I have actually found that the saveToPhoenix() function provided is not so bad to use.
Thanks!

On 16 June 2015 at 15:08, Dawid Wysakowicz <[email protected]> wrote:

> Hi Yiannis,
> I've resolved the issue when I ran the code on a bigger set of data. I
> will try to post the code when I polish it a bit. The partitions should be
> sorted with a KeyValue sorter before bulkSaving them.
>
> 2015-06-16 15:10 GMT+02:00 Yiannis Gkoufas <[email protected]>:
>
>> Hi,
>>
>> I didn't realize that I only sent this to Dawid.
>> Resending to the entire list in case someone else has encountered this
>> error before:
>>
>> 15/06/10 23:45:16 WARN TaskSetManager: Lost task 34.48 in stage 0.0 (TID
>> 816, iriclusnd20): java.io.IOException: Added a key not lexically larger
>> than previous
>> key=\x00\x17\x083661310846GMP\x00\x00\x00\x01E\xF3jH@\x010GEN\x00\x00\x01M\xDF\xA6!\xFF\x04,
>> lastkey=\x00\x17\x1E7359530994GMP\x00\x00\x00\x01@\xD4\xFE\xC0\xC0\x010_0\x00\x00\x01M\xDF\xA6!\xFF\x04
>>     at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:202)
>>     at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:288)
>>     at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:253)
>>     at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:935)
>>     at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:196)
>>     at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:149)
>>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
>>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> I get the above error multiple times.
>> The HDFS path is fine; there is no error about that.
>>
>> Thanks!
>>
>> On 11 June 2015 at 17:49, Dawid <[email protected]> wrote:
>>
>>> Hi,
>>> Your code seems OK to me. The only difference from what I do is that I
>>> explicitly pass an HDFS path to bulkSave; I am not sure how "/bulk" is
>>> resolved.
>>> I am a beginner with Spark, HBase, Phoenix etc., but if you'd like to
>>> use this code I could try to investigate your problem. I need the full
>>> stack trace, though.
>>>
>>> On 11.06.2015 00:53, Yiannis Gkoufas wrote:
>>>
>>> Hi Dawid,
>>>
>>> Yes, I have been using your code. Probably I am invoking the classes in
>>> a wrong way:
>>>
>>> val data = readings.map(e => e.split(","))
>>>   .map(e => (e(0), e(1).toLong, e(2).toDouble, e(3).toDouble))
>>> val tableName = "TABLE"
>>> val columns = Seq("SMID", "DT", "US", "GEN")
>>> val zkUrl = Some("localhost:2181")
>>> val functions = new ExtendedProductRDDFunctions(data)
>>> val hfiles = functions.toHFile(tableName, columns, new Configuration, zkUrl)
>>> val loader = new BulkPhoenixLoader(hfiles)
>>> loader.bulkSave(tableName, "/bulk", None)
>>>
>>> Does the above seem correct to you?
>>>
>>> Thanks a lot!
>>>
>>> On 10 June 2015 at 19:13, Dawid <[email protected]> wrote:
>>>
>>>> Thx a lot, James. That's the case.
>>>>
>>>> On 10.06.2015 19:50, James Taylor wrote:
>>>>
>>>>> Dawid,
>>>>> It might be timestamp related. Check the timestamp of the rows/cells
>>>>> you imported from the HBase shell. Are the timestamps later than the
>>>>> server timestamp? In that case, you wouldn't see that data. If this is
>>>>> the case, you can try specifying the CURRENT_SCN property at
>>>>> connection time with a timestamp later than the timestamp of the
>>>>> rows/cells to verify.
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Wed, Jun 10, 2015 at 10:14 AM, Dawid <[email protected]> wrote:
>>>>>
>>>>>> Yes, that's right: I have generated HFiles that I managed to load, so
>>>>>> they are visible in HBase. I can't make them visible to Phoenix.
>>>>>>
>>>>>> What I noticed today: I have rows loaded from the generated HFiles
>>>>>> and rows upserted through sqlline. When I run 'DELETE FROM TABLE',
>>>>>> only the upserted ones disappear. The rows loaded from HFiles still
>>>>>> persist in HBase.
>>>>>>
>>>>>> Yiannis, how do you generate the HFiles? You can see my code here:
>>>>>> https://gist.github.com/dawidwys/3aba8ba618140756da7c
>>>>>>
>>>>>> On 10.06.2015 17:57, Yiannis Gkoufas wrote:
>>>>>>
>>>>>> Hi Dawid,
>>>>>>
>>>>>> I am trying to do the same thing, but I hit a wall while writing the
>>>>>> HFiles, getting the following error:
>>>>>>
>>>>>> java.io.IOException: Added a key not lexically larger than previous
>>>>>> key=\x00\x168675230967GMP\x00\x00\x00\x01=\xF4h)\xE0\x010GEN\x00\x00\x01M\xDE.\xB4T\x04,
>>>>>> lastkey=\x00\x168675230967GMP\x00\x00\x00\x01=\xF5\x0C\xF5`\x010_0\x00\x00\x01M\xDE.\xB4T\x04
>>>>>>
>>>>>> You have reached the point where you are generating the HFiles and
>>>>>> loading them, but you don't see any rows in the table?
>>>>>> Is that correct?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 8 June 2015 at 18:09, Dawid <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, I did. I also tried to execute some upserts using sqlline after
>>>>>>> importing the HFiles. The rows from the upserts are visible both in
>>>>>>> sqlline and in the hbase shell, but the rows imported from HFiles
>>>>>>> are only in the hbase shell.
>>>>>>>
>>>>>>> On 08.06.2015 19:06, James Taylor wrote:
>>>>>>>
>>>>>>>> Dawid,
>>>>>>>> Perhaps a dumb question, but did you execute a CREATE TABLE
>>>>>>>> statement in sqlline for the tables you're importing into? Phoenix
>>>>>>>> needs to be told the schema of the table (i.e. it's not enough to
>>>>>>>> just create the table in HBase).
>>>>>>>> Thanks,
>>>>>>>> James
>>>>>>>>
>>>>>>>> On Mon, Jun 8, 2015 at 10:02 AM, Dawid <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Any suggestions? Some clues what to check?
>>>>>>>>>
>>>>>>>>> On 05.06.2015 23:21, Dawid wrote:
>>>>>>>>>
>>>>>>>>> Yes, I can see it in hbase-shell.
>>>>>>>>>
>>>>>>>>> Sorry for the bad links; I hadn't used private repositories on
>>>>>>>>> GitHub before, so I moved the files to a gist:
>>>>>>>>> https://gist.github.com/dawidwys/3aba8ba618140756da7c
>>>>>>>>> I hope it works this time.
>>>>>>>>>
>>>>>>>>> On 05.06.2015 23:09, Ravi Kiran wrote:
>>>>>>>>>
>>>>>>>>> Hi Dawid,
>>>>>>>>> Do you see the data when you run a simple scan or count of the
>>>>>>>>> table in the HBase shell?
>>>>>>>>>
>>>>>>>>> FYI, the links lead me to a 404: File not found.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Ravi
>>>>>>>>>
>>>>>>>>> On Fri, Jun 5, 2015 at 1:17 PM, Dawid <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I was trying to write some utilities to bulk load data through
>>>>>>>>>> HFiles from Spark RDDs, following the pattern of CSVBulkLoadTool.
>>>>>>>>>> I managed to generate some HFiles and load them into HBase, but I
>>>>>>>>>> can't see the rows using sqlline. I would be more than grateful
>>>>>>>>>> for any suggestions.
>>>>>>>>>>
>>>>>>>>>> The classes can be accessed at:
>>>>>>>>>> https://github.com/dawidwys/gate/blob/master/src/main/scala/pl/edu/pw/elka/phoenix/BulkPhoenixLoader.scala
>>>>>>>>>> https://github.com/dawidwys/gate/blob/master/src/main/scala/pl/edu/pw/elka/phoenix/ExtendedProductRDDFunctions.scala
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>>
>>>>>>>>>> Dawid Wysakowicz
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>> Dawid
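For anyone else hitting the "Added a key not lexically larger than previous" error in this thread: HFileOutputFormat2 rejects any cell whose key is not strictly larger than the one written before it, so, as Dawid says, each partition has to be sorted by key before bulkSaving. A minimal sketch of that idea in Spark (the helper name is made up for illustration, and the key type is assumed to carry an Ordering matching HBase's unsigned-byte comparison; the actual code in Dawid's gist may differ):

```scala
import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// HFileWriter checks that every appended cell's key is lexically larger
// than the previous one, so the (rowkey, cell) pairs must arrive at the
// writer sorted within each partition.
def sortedForBulkSave[K: Ordering: ClassTag, V: ClassTag](
    pairs: RDD[(K, V)],
    partitioner: Partitioner): RDD[(K, V)] =
  // One shuffle does both jobs: repartition by the table's region
  // boundaries and sort each partition by key in the same pass.
  pairs.repartitionAndSortWithinPartitions(partitioner)
```

In the invocation Yiannis posted, something like this would sit between toHFile and bulkSave, with a partitioner derived from the target table's region start keys (the role HFileOutputFormat2's configureIncrementalLoad plays on the MapReduce side).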

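To make James's CURRENT_SCN suggestion concrete, here is a rough sketch of opening a Phoenix JDBC connection with the SCN set later than the bulk-loaded cells' timestamps, so they become visible if the timestamp theory is right. The ZooKeeper quorum, the one-minute offset, and the table name are placeholders; and as James notes later in the thread, the table must also have been created in Phoenix via CREATE TABLE, not just in HBase, before its rows can be queried at all:

```scala
import java.sql.DriverManager
import java.util.Properties

// Phoenix reads the "CurrentSCN" connection property as the point in
// time (an HBase timestamp, ms since epoch) at which queries run.
// Setting it past the imported cells' timestamps makes those cells
// visible, confirming or ruling out the timestamp explanation.
val props = new Properties()
props.setProperty("CurrentSCN",
  (System.currentTimeMillis() + 60 * 1000L).toString)

val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181", props)
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM TABLE")
  if (rs.next()) println(s"rows visible at this SCN: ${rs.getLong(1)}")
} finally {
  conn.close()
}
```

If the count now includes the bulk-loaded rows, the fix is to generate the HFiles with cell timestamps at or before the current server time.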