Hi Yiannis, I've resolved the issue that appeared when I ran the code on a bigger set of data. I will try to post the code once I've polished it a bit. The partitions need to be sorted with a KeyValue sorter before bulkSaving them.
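For anyone hitting the same IOException below, here is a minimal sketch of what "sort the partitions before bulkSaving" can look like in Spark. It assumes the HFile-generation code produces an `RDD[(ImmutableBytesWritable, KeyValue)]`; `kvRdd` and the partitioner are hypothetical stand-ins, not part of the posted code:

```scala
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// HFileOutputFormat2 rejects any KeyValue that is not lexically larger
// than the previous one, so each partition must arrive at the writer in
// sorted key order. repartitionAndSortWithinPartitions does the shuffle
// and the per-partition sort in a single pass; ImmutableBytesWritable
// implements Comparable, so Spark picks up its ordering implicitly.
def sortForBulkLoad(kvRdd: RDD[(ImmutableBytesWritable, KeyValue)],
                    partitioner: Partitioner)
    : RDD[(ImmutableBytesWritable, KeyValue)] =
  kvRdd.repartitionAndSortWithinPartitions(partitioner)
```

In practice the partitioner should align with the target table's region boundaries (as HFileOutputFormat2's TotalOrderPartitioner setup does), otherwise the loader falls back to splitting HFiles at load time.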
2015-06-16 15:10 GMT+02:00 Yiannis Gkoufas <[email protected]>:
> Hi,
>
> didn't realize that I only sent to Dawid.
> Resending to the entire list in case someone else has encountered this
> error before:
>
> 15/06/10 23:45:16 WARN TaskSetManager: Lost task 34.48 in stage 0.0 (TID 816, iriclusnd20): java.io.IOException: Added a key not lexically larger than previous
> key=\x00\x17\x083661310846GMP\x00\x00\x00\x01E\xF3jH@\x010GEN\x00\x00\x01M\xDF\xA6!\xFF\x04,
> lastkey=\x00\x17\x1E7359530994GMP\x00\x00\x00\x01@\xD4\xFE\xC0\xC0\x010_0\x00\x00\x01M\xDF\xA6!\xFF\x04
>     at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:202)
>     at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:288)
>     at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:253)
>     at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:935)
>     at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:196)
>     at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:149)
>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> I get the above error multiple times.
> The HDFS path is fine, there is no error about that.
>
> Thanks!
> On 11 June 2015 at 17:49, Dawid <[email protected]> wrote:
>> Hi,
>> Your code seems ok to me; the only difference from what I do is that I
>> explicitly pass an hdfs path to bulkSave, and I am not sure how "/bulk" is resolved.
>> I am a beginner with spark, hbase, phoenix etc., but if you'd like to
>> use this code I could try to investigate your problem; I just need the full
>> stack trace.
>>
>> On 11.06.2015 00:53, Yiannis Gkoufas wrote:
>>
>> Hi Dawid,
>>
>> yes, I have been using your code. Probably I am invoking the classes in a
>> wrong way.
>>
>> val data = readings.map(e => e.split(","))
>>   .map(e => (e(0), e(1).toLong, e(2).toDouble, e(3).toDouble))
>> val tableName = "TABLE"
>> val columns = Seq("SMID", "DT", "US", "GEN")
>> val zkUrl = Some("localhost:2181")
>> val functions = new ExtendedProductRDDFunctions(data)
>> val hfiles = functions.toHFile(tableName, columns, new Configuration, zkUrl)
>> val loader = new BulkPhoenixLoader(hfiles)
>> loader.bulkSave(tableName, "/bulk", None)
>>
>> Does the above seem the correct way to you?
>>
>> Thanks a lot!
>>
>> On 10 June 2015 at 19:13, Dawid <[email protected]> wrote:
>>> Thx a lot James. That's the case.
>>>
>>> On 10.06.2015 19:50, James Taylor wrote:
>>>> David,
>>>> It might be timestamp related. Check the timestamp of the rows/cells
>>>> you imported from the HBase shell. Are the timestamps later than the
>>>> server timestamp? In that case, you wouldn't see that data. If this is
>>>> the case, you can try specifying the CURRENT_SCN property at
>>>> connection time with a timestamp later than the timestamp of the
>>>> rows/cells to verify.
>>>> Thanks,
>>>> James
>>>>
>>>> On Wed, Jun 10, 2015 at 10:14 AM, Dawid <[email protected]> wrote:
>>>>> Yes, that's right, I have generated HFiles that I managed to load so as
>>>>> to be visible in HBase. I can't make them 'visible' to Phoenix.
>>>>>
>>>>> What I noticed today is that I have rows loaded from the generated HFiles
>>>>> and rows upserted through sqlline; when I run 'DELETE FROM TABLE' only the
>>>>> upserted ones disappear. The ones loaded from HFiles still persist in HBase.
>>>>>
>>>>> Yiannis, how do you generate the HFiles? You can see my code here:
>>>>> https://gist.github.com/dawidwys/3aba8ba618140756da7c
>>>>>
>>>>> On 10.06.2015 17:57, Yiannis Gkoufas wrote:
>>>>>
>>>>> Hi Dawid,
>>>>>
>>>>> I am trying to do the same thing but I hit a wall while writing the HFiles,
>>>>> getting the following error:
>>>>>
>>>>> java.io.IOException: Added a key not lexically larger than previous
>>>>> key=\x00\x168675230967GMP\x00\x00\x00\x01=\xF4h)\xE0\x010GEN\x00\x00\x01M\xDE.\xB4T\x04,
>>>>> lastkey=\x00\x168675230967GMP\x00\x00\x00\x01=\xF5\x0C\xF5`\x010_0\x00\x00\x01M\xDE.\xB4T\x04
>>>>>
>>>>> You have reached the point where you are generating the HFiles, loading them,
>>>>> but you don't see any rows in the table?
>>>>> Is that correct?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On 8 June 2015 at 18:09, Dawid <[email protected]> wrote:
>>>>>>
>>>>>> Yes, I did. I also tried to execute some upserts using sqlline after
>>>>>> importing HFiles, and rows from upserts are visible both in sqlline and
>>>>>> hbase shell, but the rows imported from HFiles are only in hbase shell.
>>>>>>
>>>>>> On 08.06.2015 19:06, James Taylor wrote:
>>>>>>> Dawid,
>>>>>>> Perhaps a dumb question, but did you execute a CREATE TABLE statement
>>>>>>> in sqlline for the tables you're importing into? Phoenix needs to be
>>>>>>> told the schema of the table (i.e. it's not enough to just create the
>>>>>>> table in HBase).
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Mon, Jun 8, 2015 at 10:02 AM, Dawid <[email protected]> wrote:
>>>>>>>> Any suggestions? Some clues what to check?
>>>>>>>>
>>>>>>>> On 05.06.2015 23:21, Dawid wrote:
>>>>>>>>
>>>>>>>> Yes, I can see it in hbase-shell.
>>>>>>>>
>>>>>>>> Sorry for the bad links, I hadn't used private repositories on github
>>>>>>>> before. So I moved the files to a gist:
>>>>>>>> https://gist.github.com/dawidwys/3aba8ba618140756da7c
>>>>>>>> Hope this time it will work.
>>>>>>>>
>>>>>>>> On 05.06.2015 23:09, Ravi Kiran wrote:
>>>>>>>>
>>>>>>>> Hi Dawid,
>>>>>>>> Do you see the data when you run a simple scan or count of the table
>>>>>>>> in the HBase shell?
>>>>>>>>
>>>>>>>> FYI, the links lead me to a 404: File not found.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Ravi
>>>>>>>>
>>>>>>>> On Fri, Jun 5, 2015 at 1:17 PM, Dawid <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>> I was trying to write some utilities to bulk load data through HFiles
>>>>>>>>> from Spark RDDs.
>>>>>>>>> I tried to follow the pattern of CSVBulkLoadTool. I managed to generate
>>>>>>>>> some HFiles and load them into HBase, but I can't see the rows using
>>>>>>>>> sqlline. I would be more than grateful for any suggestions.
>>>>>>>>>
>>>>>>>>> The classes can be accessed at:
>>>>>>>>>
>>>>>>>>> https://github.com/dawidwys/gate/blob/master/src/main/scala/pl/edu/pw/elka/phoenix/BulkPhoenixLoader.scala
>>>>>>>>>
>>>>>>>>> https://github.com/dawidwys/gate/blob/master/src/main/scala/pl/edu/pw/elka/phoenix/ExtendedProductRDDFunctions.scala
>>>>>>>>>
>>>>>>>>> Thanks in advance
>>>>>>>>>
>>>>>>>>> Dawid Wysakowicz
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>> Dawid
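[Editor's sketch, following up on James's CURRENT_SCN suggestion earlier in the thread: a minimal way to verify timestamp visibility over JDBC, assuming Phoenix's "CurrentSCN" connection property and a ZooKeeper quorum on localhost:2181; the URL, the one-hour offset, and the table name are placeholders, not values confirmed by the thread.]

```scala
import java.sql.DriverManager
import java.util.Properties

// Open a Phoenix connection with CurrentSCN set later than the
// timestamps of the bulk-loaded cells; queries through this connection
// read as of that SCN, so cells stamped "in the future" relative to the
// server clock become visible.
val props = new Properties()
props.setProperty("CurrentSCN",
  (System.currentTimeMillis() + 3600000L).toString) // now + 1h, a guess
val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181", props)
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM TABLE")
  while (rs.next()) println(rs.getLong(1)) // nonzero if the rows are visible
} finally {
  conn.close()
}
```

If the count includes the bulk-loaded rows here but not on a plain connection, the cell timestamps are ahead of the server clock, which matches James's diagnosis.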
