Hi Peyman,

Really appreciate your suggestion. One thing to add, though: if Tableau has to be used to generate the reports, then Tableau works great with Hive.
Just one more question: can Flume be used to convert XML data to Parquet? I would store the converted files in Hive as Parquet and generate the reports with Tableau. If Flume can convert XML to Parquet, do I still need any external tools? Could you please point me to some links on how to convert XML to Parquet using Flume? I ask because predictive analytics may be run on the Hive data in the final phase of the project. I have pasted a few rough sketches below the quoted thread (one for the Flume conversion, one for the Hive XML SerDe route, and one for Wilm's MapFile idea) to show what I currently have in mind; please tell me if I am off track.

Thanks
Shashi

On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian <[email protected]> wrote:

> Hi Shashi,
> Sure, you can use JSON instead of Parquet. I was thinking in terms of using
> Hive for processing the data, but if you'd like to use Drill (which I heard
> is a good choice), then just convert the data to JSON. You don't have to
> deal with Parquet or Hive in that case: just use Flume to convert XML to
> JSON (there are many other choices to do that within the cluster too) and
> then use Drill to read and process the data.
>
> Thanks,
> Peyman
>
> On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <[email protected]> wrote:
>
>> Hi Peyman,
>>
>> Thanks a lot for your suggestions, really appreciated; they gave me a good
>> idea of how to proceed:
>> 1. Use Flume to convert XML to JSON/Parquet before it reaches HDFS.
>> 2. Store the Parquet-converted files in Hive.
>> 3. Query them with Apache Drill in its SQL dialect.
>>
>> One thing I would like your help with: instead of converting to Parquet,
>> could I convert to JSON and still store the data in Hive in Parquet
>> format? Is that a feasible option? The reason I want to convert to JSON
>> is that Apache Drill works very well with the JSON format.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <[email protected]> wrote:
>>
>>> You can land the data in HDFS as XML files and use the 'hive xml serde' to
>>> read the data and write it back in a more optimal format, e.g. ORC or
>>> Parquet (depending somewhat on your choice of Hadoop distro). Querying XML
>>> data directly via Hive is also doable but slow. Converting to Avro is also
>>> doable, but in my experience not as fast as ORC or Parquet. Columnar
>>> formats give you better query performance, but Avro has its own strengths,
>>> e.g. it manages schema changes better.
>>> You can also convert the format before you land the data in HDFS, e.g.
>>> using Flume or some other tool that changes the format in flight.
>>>
>>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <[email protected]> wrote:
>>>
>>>> Sorry, not Hive files: I meant that converting the XML files to some Avro
>>>> format and storing them in Hive would be fast.
>>>>
>>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The exact number of files is not known, but it will run into millions of
>>>>> files, depending on the client, who collects terabytes of XML data every
>>>>> day. Storing is just one part; the main part will be querying the data:
>>>>> aggregations, counts and some analytics on top. Fast retrieval is
>>>>> required, e.g. for a particular year, what are the top ten products, top
>>>>> ten manufacturers, top ten stores, etc.
>>>>>
>>>>> Will Hive be a better choice? And will converting these XML files to
>>>>> some other format work out?
>>>>>
>>>>> Thanks
>>>>> Shashi
>>>>>
>>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> how many XML files are you planning to store? Perhaps it is possible to
>>>>>> store them directly on HDFS and save the metadata in HBase. This sounds
>>>>>> more reasonable to me.
>>>>>>
>>>>>> If the number of XML files is too large (millions and billions), then
>>>>>> you can use Hadoop MapFiles to bundle files together, e.g. by year or
>>>>>> month.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Wilm
>>>>>>
>>>>>> On 03.01.2015 at 17:06, Shashidhar Rao wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > Can someone help me by suggesting the best way to solve this use case?
>>>>>> >
>>>>>> > 1. XML files keep flowing in from an external system and need to be
>>>>>> > stored in HDFS.
>>>>>> > 2. The files could be stored directly in a NoSQL database, e.g. any
>>>>>> > NoSQL store that supports XML, or
>>>>>> > 3. the files could be processed and stored in one of the databases
>>>>>> > such as HBase or Hive.
>>>>>> > 4. There won't be any updates, only reads; the data has to be
>>>>>> > retrieved based on some queries, a dashboard has to be created, and
>>>>>> > a bit of analytics is needed.
>>>>>> >
>>>>>> > The XML files are huge and the expected cluster size is roughly 12
>>>>>> > nodes.
>>>>>> > I am stuck on the storage part: if I convert the XML to JSON and
>>>>>> > store it in HBase, the XML-to-JSON processing will be huge.
>>>>>> >
>>>>>> > It will be read-only, no updates.
>>>>>> >
>>>>>> > Please suggest how to store these XML files.
>>>>>> >
>>>>>> > Thanks
>>>>>> > Shashi
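About the Flume part of my question: from what I have read, Flume does not turn XML into Parquet out of the box (Parquet is a columnar, file-level format that does not really fit an event-at-a-time pipe), so my working assumption is that the in-flight step would be XML to JSON through a custom interceptor, with the Parquet conversion happening later inside Hive. The sketch below is only what I imagine such an interceptor would look like; the package and class names are placeholders, and I am assuming the org.json library is on the agent's classpath. Please correct me if this is the wrong approach.

package com.example.flume;  // placeholder package

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.json.JSONException;
import org.json.XML;

// Rewrites each Flume event body from XML to JSON while the data is in flight.
public class XmlToJsonInterceptor implements Interceptor {

  @Override
  public void initialize() { /* nothing to set up */ }

  @Override
  public Event intercept(Event event) {
    String xml = new String(event.getBody(), StandardCharsets.UTF_8);
    try {
      // org.json.XML converts an XML document into a JSONObject
      String json = XML.toJSONObject(xml).toString();
      event.setBody(json.getBytes(StandardCharsets.UTF_8));
    } catch (JSONException e) {
      // leave the body untouched if the XML cannot be parsed
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override
  public void close() { /* nothing to clean up */ }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() { return new XmlToJsonInterceptor(); }

    @Override
    public void configure(Context context) { /* no properties needed */ }
  }
}

The agent config would then reference it with something like the lines below (the source and interceptor names are placeholders too):

agent.sources.src1.interceptors = xml2json
agent.sources.src1.interceptors.xml2json.type = com.example.flume.XmlToJsonInterceptor$Builder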
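About Peyman's 'hive xml serde' suggestion: this is how I understand that route would look, driven over JDBC against HiveServer2. It is only a sketch against an invented <sale> schema; I am assuming the open-source hive-xml-serde jar (com.ibm.spss.hive.serde2.xml.XmlSerDe) has been added to Hive, and the table names, XPaths and HDFS paths are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Reads raw XML files through the XML SerDe and rewrites them as Parquet.
public class XmlToParquetViaHive {

  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // 1. External table that reads the raw XML files in place.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS sales_xml ("
              + " product STRING, manufacturer STRING, store STRING, amount DOUBLE)"
              + " ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'"
              + " WITH SERDEPROPERTIES ("
              + "  'column.xpath.product'='/sale/product/text()',"
              + "  'column.xpath.manufacturer'='/sale/manufacturer/text()',"
              + "  'column.xpath.store'='/sale/store/text()',"
              + "  'column.xpath.amount'='/sale/amount/text()')"
              + " STORED AS"
              + "  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'"
              + "  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'"
              + " LOCATION '/data/xml/sales'"
              + " TBLPROPERTIES ('xmlinput.start'='<sale', 'xmlinput.end'='</sale>')");

      // 2. Rewrite the same data as Parquet for faster queries.
      //    (STORED AS PARQUET needs Hive 0.13+; older Hives need the explicit Parquet SerDe.)
      stmt.execute(
          "CREATE TABLE sales_parquet STORED AS PARQUET"
              + " AS SELECT * FROM sales_xml");
    }
  }
}

If that is roughly right, Tableau (and the predictive analytics later) would simply point at sales_parquet through the Hive connector.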
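About Wilm's MapFile suggestion from earlier in the thread, is the sketch below roughly what you meant? One MapFile per month (or year), keyed by the original file name, so the NameNode does not have to track millions of tiny XML files. The paths are just examples.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Bundles a directory of small XML files into a single Hadoop MapFile.
public class XmlToMapFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path xmlDir = new Path("/data/incoming/2015/01");      // small XML files
    Path mapFileDir = new Path("/data/mapfiles/2015-01");  // output MapFile

    // Classic MapFile.Writer constructor (deprecated in newer Hadoop releases
    // in favour of the Options-based one, but still available).
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, mapFileDir.toString(), Text.class, Text.class);
    try {
      FileStatus[] files = fs.listStatus(xmlDir);
      // MapFile keys must be appended in ascending order, so sort by file name.
      Arrays.sort(files, new Comparator<FileStatus>() {
        @Override
        public int compare(FileStatus a, FileStatus b) {
          return a.getPath().getName().compareTo(b.getPath().getName());
        }
      });

      for (FileStatus file : files) {
        byte[] content = new byte[(int) file.getLen()];
        try (FSDataInputStream in = fs.open(file.getPath())) {
          in.readFully(content);
        }
        writer.append(new Text(file.getPath().getName()),
                      new Text(new String(content, StandardCharsets.UTF_8)));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Looking an individual file up again would then be a MapFile.Reader.get() on the file name, and the per-month metadata (or an index of which MapFile holds which file) could live in HBase as you suggested.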
