It is dangerous to use HBase directly because the schema may change at any
time. Export the data as JSON and examine it there. To see how many events are
in the stream, export them and then count lines in bash with wc -l; each line
is one JSON event. Alternatively, import the data as a DataFrame in Spark and
use Spark SQL.
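For example, a minimal sketch of the export-and-count approach. The app id and output path are placeholders, and the pio export step assumes a running PredictionIO event server, so it is shown commented out here and the count is simulated on a small sample file:

```shell
# Export all events for an app as JSON, one event per line
# (requires a running PredictionIO deployment; app id 1 is a placeholder):
#   pio export --appid 1 --output /tmp/events

# Each line of the exported file is one JSON event, so wc -l counts events.
# Simulated with a small sample file:
printf '%s\n' \
  '{"event":"view","entityType":"user","entityId":"u1"}' \
  '{"event":"view","entityType":"user","entityId":"u2"}' \
  '{"event":"view","entityType":"user","entityId":"u3"}' > /tmp/sample_events.json

wc -l < /tmp/sample_events.json   # prints 3, the event count
```

Comparing this count against what eventsRDD.count() reports in training is a quick way to see whether events are being dropped on read.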

There is no published contract about how events are stored in HBase.


On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <[email protected]> wrote:

We are also facing the exact same issue. We have confirmed 1.5 million records 
in HBase. However, I see only 19k records being fed for training 
(eventsRDD.count()).

With Regards,

     Sachin
⚜KTBFFH⚜

On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <[email protected]> wrote:
Hi guys,

 

I have encoded some JPEG images as JSON and imported them into HBase, which
shows 6500 records. When I read that data in DataSource with PIO, however,
only some 1500 records were fed into PIO.

I use PEventStore.find(appName, entityType, eventNames), and all the records
have the same entityType and eventNames.

 

Any idea what could go wrong? The encoded string from a JPEG is very long,
hundreds of thousands of characters; could this be a reason for the data loss?

 

Thank you for looking into my question.

 

Best,

Weiguang


