1596 is how many events were accepted by the EventServer. Look at the exported 
format and compare it with the events you imported. There must be a formatting 
error or an error during import (did you check the response for each event import?)
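
For example, one way to do that comparison is a small Spark job (a sketch only; the import and export paths below are placeholders for wherever your recordFile.json and the pio export output actually live):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CompareImportExport").getOrCreate()

// Both files are JSON-lines: one event object per line.
val imported = spark.read.json("hdfs://[host]:9000/pio/applicationName/recordFile.json")
val exported = spark.read.json("hdfs://[host]:9000/pio/exported-events/")

println(s"imported: ${imported.count()}, exported: ${exported.count()}")

// Spot-check which entities were dropped (assumes both files carry an entityId field).
imported.select("entityId").except(exported.select("entityId")).show(20, truncate = false)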

Looking below I see you are importing JPEG??? This is almost always a bad idea. 
Image data is usually kept in a filesystem like HDFS with a reference kept in 
the DB; in my experience there are too many serialization questions to do 
otherwise. If your Engine requires this you are asking for the kind of trouble 
you are seeing.
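
As a rough illustration of that pattern (the field names imagePath and label are made up for the example, not any PIO contract), the record you import would carry a reference rather than the pixels:

// One batch-import record that stores an HDFS reference to the image
// instead of the base64-encoded bytes (sketch only).
val event = Map(
  "event"      -> "$set",
  "entityType" -> "image",
  "entityId"   -> "img-0001",
  "properties" -> Map(
    "imagePath" -> "hdfs://[host]:9000/images/img-0001.jpg",  // reference, not image data
    "label"     -> "cat"
  )
)
// Serialize one such object per line as JSON and feed the file to pio import.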


On Nov 28, 2017, at 7:16 PM, Huang, Weiguang <[email protected]> wrote:

Hi Pat,
 
Here is the result when we tried out your suggestion.
 
We checked the data in HBase, and the record count is exactly the same as what 
we imported into HBase, that is, 6500.
2017-11-29 10:42:19 INFO  DAGScheduler:54 - Job 0 finished: count at 
ImageDataFromHBaseChecker.scala:27, took 12.016679 s
Number of Records found : 6500
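
(For reference, a raw-row count of this kind can be done with something along these lines. This is only a sketch, not our exact ImageDataFromHBaseChecker, and the table name pio_event:events_7 is a guess at PIO's internal naming, which is not a published contract.)

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ImageDataFromHBaseChecker"))

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "pio_event:events_7")  // guessed table for appid 7

val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

println(s"Number of Records found : ${rows.count()}")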
 
We exported the data from PIO and checked, but got only 1596 – see the bottom of 
the screen output below.
$ ls -al
total 412212
drwxr-xr-x  2 root root      4096 Nov 29 02:48 .
drwxr-xr-x 23 root root      4096 Nov 29 02:48 ..
-rw-r--r--  1 root root         8 Nov 29 02:48 ._SUCCESS.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00000.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00001.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00002.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00003.crc
-rw-r--r--  1 root root         0 Nov 29 02:48 _SUCCESS
-rw-r--r--  1 root root 104699844 Nov 29 02:48 part-00000
-rw-r--r--  1 root root 104699877 Nov 29 02:48 part-00001
-rw-r--r--  1 root root 104699843 Nov 29 02:48 part-00002
-rw-r--r--  1 root root 104699863 Nov 29 02:48 part-00003
$ wc -l part-00000
399 part-00000
$ wc -l part-00001
399 part-00001
$ wc -l part-00002
399 part-00002
$ wc -l part-00003
399 part-00003
That is 399 * 4 = 1596
 
Is this data loss caused by a schema change, malformed data contents, or some 
other reason? We appreciate your thoughts.
 
Thanks,
Weiguang
From: Pat Ferrel [mailto:[email protected]] 
Sent: Wednesday, November 29, 2017 10:16 AM
To: [email protected]
Cc: [email protected]
Subject: Re: Data lost from HBase to DataSource
 
Try my suggestion with export and see if the number of events looks correct. I 
am suggesting that you may not be counting what you think you are when you query 
HBase directly.
 
 
On Nov 28, 2017, at 5:53 PM, Huang, Weiguang <[email protected]> wrote:
 
Hi Pat,
 
Thanks for your advice.  However, we are not using HBase directly. We use pio 
to import data into HBase with the command below:
pio import --appid 7 --input hdfs://[host]:9000/pio/applicationName/recordFile.json
Could things go wrong here or somewhere else?
 
Thanks,
Weiguang
From: Pat Ferrel [mailto:[email protected]] 
Sent: Tuesday, November 28, 2017 11:54 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Data lost from HBase to DataSource
 
It is dangerous to use HBase directly because the schema may change at any 
time. Export the data as JSON and examine it there. To see how many events are 
in the stream you can just export and then use bash to count lines (wc -l). Each 
line is a JSON event. Or import the data as a DataFrame in Spark and use Spark 
SQL. 
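
For example, with Spark SQL (a sketch; replace the path with wherever pio export wrote its part files):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CountExportedEvents").getOrCreate()

// Each line of the pio export output is one JSON event.
val events = spark.read.json("hdfs://[host]:9000/pio/exported-events/")
println(s"total events: ${events.count()}")

// Break the count down to spot missing groups; exported PIO events carry
// event and entityType fields.
events.groupBy("event", "entityType").count().show(truncate = false)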
 
There is no published contract about how events are stored in HBase.
 
 
On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <[email protected]> wrote:
 
We are also facing the exact same issue. We have confirmed 1.5 million records 
in HBase. However, I see only 19k records being fed for training 
(eventsRDD.count()).

With Regards,
 
     Sachin
⚜KTBFFH⚜
 
On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <[email protected]> wrote:
Hi guys,
 
I have encoded some JPEG images in JSON and imported them into HBase, which shows 
6500 records. When I read that data in the DataSource with PIO, however, only 
some 1500 records were fed into PIO.
I use PEventStore.find(appName, entityType, eventNames), and all the records 
have the same entityType and eventNames.
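
For reference, the call has roughly this shape (the entityType and event name below are placeholders, and the package assumes PIO 0.11+, where it is org.apache.predictionio):

import org.apache.predictionio.data.storage.Event
import org.apache.predictionio.data.store.PEventStore
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readEvents(appName: String, sc: SparkContext): RDD[Event] =
  PEventStore.find(
    appName    = appName,
    entityType = Some("image"),        // placeholder entityType
    eventNames = Some(List("$set"))    // placeholder event name
  )(sc)

// In the DataSource, log readEvents(appName, sc).count() and compare it
// with the line count of the pio export output.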
 
Any idea what could go wrong? The encoded string from a JPEG is very long – 
hundreds of thousands of characters. Could this be a reason for the data loss?
 
Thank you for looking into my question.
 
Best,
Weiguang
