Which version of HBase are you using? My guess is that this is caused by the libraries in the storage/hbase subproject being too old. If you are using HBase 1.2.6, running the assembly task against hbase-common, hbase-client and hbase-server 1.2.6 should work.
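To make that concrete, here is a rough sketch (not tested against your tree) of what I mean by pinning the HBase client libraries in the storage/hbase subproject's build.sbt before running the assembly task; the version value is the assumption to adjust, e.g. 1.2.6, or 1.3.1 to match the hbase-1.3.1 paths in your log:

    // Sketch only: pin the HBase client jars to the version of the cluster,
    // then rebuild the storage assembly so the jar passed to spark-submit
    // (pio-data-hbase-assembly-*.jar below) matches the HBase server.
    val hbaseVersion = "1.3.1"  // assumption: taken from the hbase-1.3.1 paths in the log

    libraryDependencies ++= Seq(
      "org.apache.hbase" % "hbase-common" % hbaseVersion,
      "org.apache.hbase" % "hbase-client" % hbaseVersion,
      "org.apache.hbase" % "hbase-server" % hbaseVersion
    )

I have also put two more sketches at the very bottom, under the quoted thread: one for checking where the 6500 vs. 1596 counts diverge, and one for Pat's suggestion of keeping the images in HDFS.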
2017-11-30 17:25 GMT+09:00 Huang, Weiguang <[email protected]>:
> Hi Pat,
>
> We have compared the format of 2 records (attached) from the json file used for import. The first one was imported and successfully read in $pio train, as we printed out its entityId in the logger; the other must not have been read into pio successfully, as its entityId is absent from the logger. But the two records have the same json format, as every record was generated by the same program.
>
> And here is a quick illustration of a record in json, with "encodedImage" shortened from its actual 262,156 characters:
>
> {"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties": {"label": "n01484850", "encodedImage": "AAABAAA…..Oynz4="}}
>
> Only "entityId" and the "properties" fields {"label", "encodedImage"} can differ from record to record.
>
> We also noticed another odd thing. After the one-time $pio import of 6500 records, we ran $pio export immediately and got 399 + 399 = 798 records in 2 exported files.
>
> After we ran $pio train for a couple of rounds, the number of records in pio increased to 399 + 399 + 399 = 1197 in 3 exported files, and may increase to 399 + 399 + 399 + 399 = 1596 after more $pio train runs.
>
> Please see below the system logger output for $pio import. It seems everything is all right.
>
> $pio import --appid 8 --input ../imageNetTemplate/data/imagenet_5_class_resized.json
>
> /opt/work/spark-2.1.1 is probably an Apache Spark development tree. Please make sure you are using at least 1.3.0.
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > > [INFO] [Runner$] Submission command: /opt/work/spark-2.1.1/bin/spark-submit > --class org.apache.predictionio.tools.imprt.FileToEvents --jars > file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-localfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-jdbc-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-elasticsearch1-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hbase-assembly-0.11.0-incubating.jar > --files > file:/opt/work/PredictionIO-0.11.0-incubating/conf/log4j.properties,file:/opt/work/hbase-1.3.1/conf/hbase-site.xml > --driver-class-path > /opt/work/PredictionIO-0.11.0-incubating/conf:/opt/work/hbase-1.3.1/conf > --driver-java-options -Dpio.log.dir=/root > file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar > --appid 8 --input > file:/opt/work/arda-data/pio-templates/dataImportTest/../imageNetTemplate/data/imagenet_5_class_resized.json > --env > PIO_STORAGE_SOURCES_HBASE_TYPE=hbase,PIO_ENV_LOADED=1,PIO_STORAGE_SOURCES_HBASE_HOSTS=Gondolin-Node-050,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_VERSION=0.11.0,PIO_FS_BASEDIR=/root/.pio_store,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost,PIO_STORAGE_SOURCES_HBASE_HOME=/opt/work/hbase-1.3.1,PIO_HOME=/opt/work/PredictionIO-0.11.0-incubating,PIO_FS_ENGINESDIR=/root/.pio_store/engines,PIO_STORAGE_SOURCES_LOCALFS_PATH=/root/.pio_store/models,PIO_STORAGE_SOURCES_HBASE_PORTS=16000,PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=predictionio,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/opt/work/elasticsearch-1.7.6,PIO_FS_TMPDIR=/root/.pio_store/tmp,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE,PIO_CONF_DIR=/opt/work/PredictionIO-0.11.0-incubating/conf,PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs > > [INFO] [log] Logging initialized @4913ms > > [INFO] [Server] jetty-9.2.z-SNAPSHOT > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > 
o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@11787b64{/storage,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@319642db{/environment,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@357bc488{/static,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@4ea17147{/,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,AVAILABLE,@Spark} > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,AVAILABLE,@Spark} > > [INFO] [ServerConnector] Started Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040} > > [INFO] [Server] Started @5086ms > > [INFO] [ContextHandler] Started > o.s.j.s.ServletContextHandler@4f114b{/metrics/json,null,AVAILABLE,@Spark} > > [INFO] [FileToEvents$] Events are imported. > > [INFO] [FileToEvents$] Done. 
> > [INFO] [ServerConnector] Stopped Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@4ea17147{/,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@357bc488{/static,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@319642db{/environment,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@11787b64{/storage,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,UNAVAILABLE,@Spark} > > [INFO] [ContextHandler] Stopped > o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,UNAVAILABLE,@Spark} > > > > Thanks for your advice. > > > > Weiguang > > > > From: Pat Ferrel [mailto:[email protected]] > Sent: Thursday, November 30, 2017 2:06 AM > To: [email protected] > Cc: Shi, Dongjie <[email protected]> > > > Subject: Re: Data lost from HBase to DataSource > > > > 1596 is how many events were accepted by the EventServer, look at the > exported format and compare with the ones you imported. There must be a > formatting error or an error when importing (did you check responses for > each event import?) > > > > Looking below I see you are importing JPEG??? 
> This is almost always a bad idea. Image data is usually kept in a filesystem like HDFS with a reference kept in the DB; there are too many serialization questions to do otherwise, in my experience. If your Engine requires this you are asking for the kind of trouble you are seeing.
>
>
> On Nov 28, 2017, at 7:16 PM, Huang, Weiguang <[email protected]> wrote:
>
> Hi Pat,
>
> Here is the result when we tried out your suggestion.
>
> We checked the data in HBase, and the count of records is exactly the same as we imported into HBase, that is 6500.
>
> 2017-11-29 10:42:19 INFO DAGScheduler:54 - Job 0 finished: count at ImageDataFromHBaseChecker.scala:27, took 12.016679 s
>
> Number of Records found : 6500
>
> We exported the data from pio and checked, but got only 1596 – see the bottom of the screen record below.
>
> $ ls -al
> total 412212
> drwxr-xr-x 2 root root 4096 Nov 29 02:48 .
> drwxr-xr-x 23 root root 4096 Nov 29 02:48 ..
> -rw-r--r-- 1 root root 8 Nov 29 02:48 ._SUCCESS.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00000.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00001.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00002.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00003.crc
> -rw-r--r-- 1 root root 0 Nov 29 02:48 _SUCCESS
> -rw-r--r-- 1 root root 104699844 Nov 29 02:48 part-00000
> -rw-r--r-- 1 root root 104699877 Nov 29 02:48 part-00001
> -rw-r--r-- 1 root root 104699843 Nov 29 02:48 part-00002
> -rw-r--r-- 1 root root 104699863 Nov 29 02:48 part-00003
> $ wc -l part-00000
> 399 part-00000
> $ wc -l part-00001
> 399 part-00001
> $ wc -l part-00002
> 399 part-00002
> $ wc -l part-00003
> 399 part-00003
> That is 399 * 4 = 1596
>
> Is this data loss caused by a schema change, bad data contents, or some other possible reason? We appreciate your thoughts.
>
> Thanks,
> Weiguang
>
> From: Pat Ferrel [mailto:[email protected]]
> Sent: Wednesday, November 29, 2017 10:16 AM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Data lost from HBase to DataSource
>
> Try my suggestion with export and see if the number of events looks correct. I am suggesting that you may not be counting what you think you are when using HBase directly.
>
>
> On Nov 28, 2017, at 5:53 PM, Huang, Weiguang <[email protected]> wrote:
>
> Hi Pat,
>
> Thanks for your advice. However, we are not using HBase directly. We use pio to import data into HBase with the command below:
>
> pio import --appid 7 --input hdfs://[host]:9000/pio/ applicationName /recordFile.json
>
> Could things go wrong here or somewhere else?
>
> Thanks,
> Weiguang
>
> From: Pat Ferrel [mailto:[email protected]]
> Sent: Tuesday, November 28, 2017 11:54 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Data lost from HBase to DataSource
>
> It is dangerous to use HBase directly because the schema may change at any time. Export the data as json and examine it there. To see how many events are in the stream you can just export, then use bash to count lines (wc -l). Each line is a JSON event. Or import the data as a dataframe in Spark and use Spark SQL.
>
> There is no published contract about how events are stored in HBase.
>
>
> On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <[email protected]> wrote:
>
> We are also facing the exact same issue. We have confirmed 1.5 million records in HBase.
> However, I see only 19k records being fed for training (eventsRDD.count()).
>
> With Regards,
> Sachin
> ⚜KTBFFH⚜
>
> On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <[email protected]> wrote:
>
> Hi guys,
>
> I have encoded some JPEG images in json and imported them into HBase, which shows 6500 records. When I read those data in DataSource with pio, however, only some 1500 records were fed into PIO.
>
> I use PEventStore.find(appName, entityType, eventNames), and all the records have the same entityType and eventNames.
>
> Any idea what could go wrong? The encoded string from each JPEG is very long, hundreds of thousands of characters; could this be a reason for the data loss?
>
> Thank you for looking into my question.
>
> Best,
> Weiguang
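Below the quoted thread, as promised, a rough sketch of how I would check where the 6500 vs. 1596 discrepancy appears. It assumes the PEventStore API mentioned in the thread and an exported JSON directory; the app name, event name, and paths are placeholders, so substitute your own:

    // Sketch only: compare what the DataSource sees with what `pio export` wrote.
    // "MyImageApp", the export path and the "imageNet" event name are assumptions.
    import org.apache.predictionio.data.store.PEventStore
    import org.apache.spark.{SparkConf, SparkContext}

    object EventCountCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("EventCountCheck"))

        // 1) What an engine's DataSource would get from PEventStore.find
        val eventsRDD = PEventStore.find(
          appName = "MyImageApp",                 // placeholder app name
          entityType = Some("JPEG"),
          eventNames = Some(Seq("imageNet"))
        )(sc)
        println(s"PEventStore.find count: ${eventsRDD.count()}")
        println(s"Distinct entityIds from PEventStore: ${eventsRDD.map(_.entityId).distinct().count()}")

        // 2) What `pio export` wrote: one JSON event per line
        val exported = sc.textFile("hdfs:///tmp/pio-export/part-*")  // placeholder export dir
        println(s"Exported lines: ${exported.count()}")

        // 3) Duplicate check: the export above grew by 399 per run, so compare
        //    total lines with distinct entityIds to see whether rows are repeated.
        val idPattern = """"entityId"\s*:\s*"?([^",}]+)""".r
        val ids = exported.flatMap(line => idPattern.findFirstMatchIn(line).map(_.group(1)))
        println(s"Distinct entityIds in export: ${ids.distinct().count()}")

        sc.stop()
      }
    }

If PEventStore.find already returns only ~1596 while HBase reports 6500, the gap is on the read path (which is why I suspect the hbase client library versions above); if the export contains many repeated entityIds, the extra rows come from duplicate writes rather than from the original 6500.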

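And on Pat's point about not putting the image bytes in the event store: a minimal sketch of the reference approach, writing the JPEG to HDFS and putting only its path in the event. The namenode URI, the file paths, and the "imagePath" property name are all hypothetical here:

    // Sketch only: keep the JPEG in HDFS, keep a reference in the event.
    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object UploadImageAndEmitEvent {
      def main(args: Array[String]): Unit = {
        val namenode = "hdfs://[host]:9000"    // placeholder, as in the earlier import command
        val fs = FileSystem.get(new URI(namenode), new Configuration())

        // Copy the local JPEG into HDFS (hypothetical paths)
        fs.copyFromLocalFile(new Path("/local/images/10004.jpg"), new Path("/pio/images/10004.jpg"))

        // One line of the json file fed to `pio import`: the same event as before,
        // but carrying a path instead of ~262k characters of base64.
        val eventLine =
          s"""{"event": "imageNet", "entityId": 10004, "entityType": "JPEG", """ +
          s""""properties": {"label": "n01484850", "imagePath": "$namenode/pio/images/10004.jpg"}}"""
        println(eventLine)
      }
    }

The training code would then read the image from HDFS by path when it actually needs the pixels, so the event store only has to move a few hundred bytes per event.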