Hi Peyman,

Sure, I will try using the two Hive tables for the conversion, roughly along
the lines of the sketch in the P.S. below. It was awesome discussing this
with you. Thanks a lot.

Shashi
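P.S. For my own notes, here is roughly what I understood the two-table trick
below to be. This is an untested sketch: I am assuming the open-source
hivexmlserde jar (com.ibm.spss.hive.serde2.xml.XmlSerDe) is on the Hive
classpath, and the table names, columns, XPaths, and HDFS location are all
made up for illustration.

  -- Table 1: external table over the raw XML landed in HDFS; the XML
  -- SerDe parses each <product> element via per-column XPath expressions.
  CREATE EXTERNAL TABLE products_xml (
    product_id STRING,
    name       STRING,
    price      DOUBLE
  )
  ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
  WITH SERDEPROPERTIES (
    "column.xpath.product_id" = "/product/@id",
    "column.xpath.name"       = "/product/name/text()",
    "column.xpath.price"      = "/product/price/text()"
  )
  STORED AS
    INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION '/data/raw/xml/products'
  TBLPROPERTIES (
    "xmlinput.start" = "<product",
    "xmlinput.end"   = "</product>"
  );

  -- Table 2: the Parquet-backed table; the INSERT ... SELECT reads the
  -- source through the SerDe and rewrites the rows as Parquet.
  CREATE TABLE products_parquet (
    product_id STRING,
    name       STRING,
    price      DOUBLE
  )
  STORED AS PARQUET;

  INSERT INTO TABLE products_parquet
  SELECT product_id, name, price
  FROM products_xml;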
On Sat, Jan 3, 2015 at 10:53 PM, Peyman Mohajerian <[email protected]> wrote:

> I would recommend, as a first step, not using Flume, but rather landing
> the data in HDFS in the source format, XML, and using Hive to convert it
> from XML to Parquet. That is much simpler than using Flume. Flume only
> makes sense if you don't care about the original file format and want to
> ingest the data fast to meet some SLA.
> Flume has a good user guide page if you google it.
> In Hive you need two tables: one that reads the XML data using an XML
> SerDe (an external table), and a second one in Parquet format. You insert
> into the second table from the source table, and that will easily do the
> format conversion.
>
> On Sat, Jan 3, 2015 at 9:16 AM, Shashidhar Rao <[email protected]> wrote:
>
>> Hi Peyman,
>>
>> I really appreciate your suggestion.
>> But say Tableau has to be used to generate the reports; Tableau works
>> great with Hive.
>>
>> Just one more question: can Flume be used to convert XML data to
>> Parquet? I will store the result in Hive as Parquet and generate
>> reports using Tableau.
>>
>> If Flume can convert XML to Parquet, do I need external tools? Can you
>> please provide me some links on how to convert XML to Parquet using
>> Flume? I ask because predictive analytics may be run on the Hive data
>> in the end phase of the project.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian <[email protected]> wrote:
>>
>>> Hi Shashi,
>>> Sure, you can use JSON instead of Parquet. I was thinking in terms of
>>> using Hive for processing the data, but if you'd like to use Drill
>>> (which I hear is a good choice), then just convert the data to JSON.
>>> You don't have to deal with Parquet or Hive in that case: just use
>>> Flume to convert the XML to JSON (there are many other ways to do
>>> that within the cluster too) and then use Drill to read and process
>>> the data.
>>>
>>> Thanks,
>>> Peyman
>>>
>>> On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <[email protected]> wrote:
>>>
>>>> Hi Peyman,
>>>>
>>>> Thanks a lot for your suggestions; I really appreciate them and they
>>>> gave me some ideas. Here is how I want to proceed:
>>>> 1. Use Flume to convert the XML to JSON/Parquet before it reaches
>>>> HDFS.
>>>> 2. Store the Parquet-converted files in Hive.
>>>> 3. Query using Apache Drill's SQL dialect.
>>>>
>>>> But one thing: can you please tell me whether it is feasible if,
>>>> instead of converting to Parquet, I convert to JSON and then store
>>>> the data in Hive in Parquet format?
>>>> The reason I want to convert to JSON is that Apache Drill works very
>>>> well with the JSON format.
>>>>
>>>> Thanks
>>>> Shashi
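[Note from Shashi: for the Drill-on-JSON route Peyman describes above, I am
picturing a query roughly like the following, since Drill can read raw JSON
files in place without any table definition. Untested, and the dfs path and
the field names are made up for illustration:

  SELECT p.name, p.manufacturer, p.price
  FROM dfs.`/data/json/products` p
  ORDER BY p.price DESC
  LIMIT 10;
]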
>>>> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <[email protected]> wrote:
>>>>
>>>>> You can land the data in HDFS as XML files and use the Hive XML
>>>>> SerDe to read the data and write it back in a more optimal format,
>>>>> e.g. ORC or Parquet (depending somewhat on your choice of Hadoop
>>>>> distro). Querying XML data directly via Hive is also doable, but
>>>>> slow. Converting to Avro is also doable, but in my experience not
>>>>> as fast as ORC or Parquet. Columnar formats give you better read
>>>>> performance, but Avro has its own strengths, e.g. it manages schema
>>>>> changes better.
>>>>> You can also convert the format before you land the data in HDFS,
>>>>> e.g. using Flume or some other tool that changes the format in
>>>>> flight.
>>>>>
>>>>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <[email protected]> wrote:
>>>>>
>>>>>> Sorry, not Hive files: I meant that converting the XML files to
>>>>>> some Avro format and storing those in Hive will be fast.
>>>>>>
>>>>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The exact number of files is not known, but it will run into
>>>>>>> millions, depending on the client, who collects terabytes of XML
>>>>>>> data every day. Basically, storing is just one part; the main
>>>>>>> part will be how to query the data (aggregations, counts) and do
>>>>>>> some analytics over it. Fast retrieval is required, e.g. for a
>>>>>>> particular year, what are the top 10 products, the top ten
>>>>>>> manufacturers, the top ten stores, etc.
>>>>>>>
>>>>>>> Will Hive be a better choice? And will converting these files to
>>>>>>> some other format work out?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Shashi
>>>>>>>
>>>>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> How many XML files are you planning to store? Perhaps it is
>>>>>>>> possible to store them directly on HDFS and save the metadata
>>>>>>>> in HBase. That sounds more reasonable to me.
>>>>>>>>
>>>>>>>> If the number of XML files is too large (millions or billions),
>>>>>>>> then you can use Hadoop map files to bundle files together,
>>>>>>>> e.g. by year or month.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Wilm
>>>>>>>>
>>>>>>>> On 03.01.2015 at 17:06, Shashidhar Rao wrote:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Can someone help me by suggesting the best way to solve this
>>>>>>>> > use case?
>>>>>>>> >
>>>>>>>> > 1. XML files keep flowing in from an external system and need
>>>>>>>> > to be stored in HDFS.
>>>>>>>> > 2. These files can be stored directly using a NoSQL database,
>>>>>>>> > e.g. any NoSQL database that supports XML; or
>>>>>>>> > 3. These files need to be processed and stored in one of the
>>>>>>>> > databases: HBase, Hive, etc.
>>>>>>>> > 4. There won't be any updates, only reads; the data has to be
>>>>>>>> > retrieved based on some queries, a dashboard has to be
>>>>>>>> > created, and bits of analytics are needed.
>>>>>>>> >
>>>>>>>> > The XML files are huge, and the expected number of nodes is
>>>>>>>> > roughly around 12.
>>>>>>>> > I am stuck on the storage part: say I convert the XML to JSON
>>>>>>>> > and store it in HBase; the processing effort from XML to JSON
>>>>>>>> > will be huge.
>>>>>>>> >
>>>>>>>> > It will be only reading and no updates.
>>>>>>>> >
>>>>>>>> > Please suggest how to store these XML files.
>>>>>>>> >
>>>>>>>> > Thanks
>>>>>>>> > Shashi
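[Note from Shashi: for the "top 10 for a particular year" queries I ask
about above, I expect something like the following against the Parquet
table from my P.S. Untested; sale_year is an invented column not in my
sketch above, and in practice it would probably be a partition column so
each query scans only one year of data:

  SELECT name AS product_name, COUNT(*) AS units_sold
  FROM products_parquet
  WHERE sale_year = 2014
  GROUP BY name
  ORDER BY units_sold DESC
  LIMIT 10;
]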
