Hi Peyman,

Really appreciate your suggestion. One thing to add, though: if Tableau has to be used to generate the reports, then Tableau works great with Hive.
Just one more question: can Flume be used to convert XML data to Parquet? I would store the converted files in Hive as Parquet and generate the reports with Tableau. If Flume can convert XML to Parquet, do I still need any external tools? Could you please point me to some links on how to convert XML to Parquet using Flume? I ask because predictive analytics may be run on the Hive data in the final phase of the project. I have pasted a few rough sketches below the quoted thread (one for the Flume conversion, one for the Hive XML SerDe route, and one for Wilm's MapFile idea) to show what I currently have in mind; please tell me if I am off track.

Thanks
Shashi

On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian <[email protected]> wrote:

> Hi Shashi,
> Sure, you can use JSON instead of Parquet. I was thinking in terms of using
> Hive for processing the data, but if you'd like to use Drill (which I heard
> is a good choice), then just convert the data to JSON. You don't have to
> deal with Parquet or Hive in that case: just use Flume to convert XML to
> JSON (there are many other choices to do that within the cluster too) and
> then use Drill to read and process the data.
>
> Thanks,
> Peyman
>
> On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <[email protected]> wrote:
>
>> Hi Peyman,
>>
>> Thanks a lot for your suggestions, really appreciated; they gave me a good
>> idea of how to proceed:
>> 1. Use Flume to convert XML to JSON/Parquet before it reaches HDFS.
>> 2. Store the Parquet-converted files in Hive.
>> 3. Query them with Apache Drill in its SQL dialect.
>>
>> One thing I would like your help with: instead of converting to Parquet,
>> could I convert to JSON and still store the data in Hive in Parquet
>> format? Is that a feasible option? The reason I want to convert to JSON
>> is that Apache Drill works very well with the JSON format.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <[email protected]> wrote:
>>
>>> You can land the data in HDFS as XML files and use the 'hive xml serde' to
>>> read the data and write it back in a more optimal format, e.g. ORC or
>>> Parquet (depending somewhat on your choice of Hadoop distro). Querying XML
>>> data directly via Hive is also doable but slow. Converting to Avro is also
>>> doable, but in my experience not as fast as ORC or Parquet. Columnar
>>> formats give you better query performance, but Avro has its own strengths,
>>> e.g. it manages schema changes better.
>>> You can also convert the format before you land the data in HDFS, e.g.
>>> using Flume or some other tool that changes the format in flight.
>>>
>>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <[email protected]> wrote:
>>>
>>>> Sorry, not Hive files: I meant that converting the XML files to some Avro
>>>> format and storing them in Hive would be fast.
>>>>
>>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The exact number of files is not known, but it will run into millions of
>>>>> files, depending on the client, who collects terabytes of XML data every
>>>>> day. Storing is just one part; the main part will be querying the data:
>>>>> aggregations, counts and some analytics on top. Fast retrieval is
>>>>> required, e.g. for a particular year, what are the top ten products, top
>>>>> ten manufacturers, top ten stores, etc.
>>>>>
>>>>> Will Hive be a better choice? And will converting these XML files to
>>>>> some other format work out?
>>>>>
>>>>> Thanks
>>>>> Shashi
>>>>>
>>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> how many XML files are you planning to store? Perhaps it is possible to
>>>>>> store them directly on HDFS and save the metadata in HBase. This sounds
>>>>>> more reasonable to me.
>>>>>>
>>>>>> If the number of XML files is too large (millions and billions), then
>>>>>> you can use Hadoop MapFiles to bundle files together, e.g. by year or
>>>>>> month.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Wilm
>>>>>>
>>>>>> On 03.01.2015 at 17:06, Shashidhar Rao wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > Can someone help me by suggesting the best way to solve this use case?
>>>>>> >
>>>>>> > 1. XML files keep flowing in from an external system and need to be
>>>>>> > stored in HDFS.
>>>>>> > 2. The files could be stored directly in a NoSQL database, e.g. any
>>>>>> > NoSQL store that supports XML, or
>>>>>> > 3. the files could be processed and stored in one of the databases
>>>>>> > such as HBase or Hive.
>>>>>> > 4. There won't be any updates, only reads; the data has to be
>>>>>> > retrieved based on some queries, a dashboard has to be created, and
>>>>>> > a bit of analytics is needed.
>>>>>> >
>>>>>> > The XML files are huge and the expected cluster size is roughly 12
>>>>>> > nodes.
>>>>>> > I am stuck on the storage part: if I convert the XML to JSON and
>>>>>> > store it in HBase, the XML-to-JSON processing will be huge.
>>>>>> >
>>>>>> > It will be read-only, no updates.
>>>>>> >
>>>>>> > Please suggest how to store these XML files.
>>>>>> >
>>>>>> > Thanks
>>>>>> > Shashi
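About the Flume part of my question: from what I have read, Flume does not turn XML into Parquet out of the box (Parquet is a columnar, file-level format that does not really fit an event-at-a-time pipe), so my working assumption is that the in-flight step would be XML to JSON through a custom interceptor, with the Parquet conversion happening later inside Hive. The sketch below is only what I imagine such an interceptor would look like; the package and class names are placeholders, and I am assuming the org.json library is on the agent's classpath. Please correct me if this is the wrong approach.

package com.example.flume;  // placeholder package

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.json.JSONException;
import org.json.XML;

// Rewrites each Flume event body from XML to JSON while the data is in flight.
public class XmlToJsonInterceptor implements Interceptor {

  @Override
  public void initialize() { /* nothing to set up */ }

  @Override
  public Event intercept(Event event) {
    String xml = new String(event.getBody(), StandardCharsets.UTF_8);
    try {
      // org.json.XML converts an XML document into a JSONObject
      String json = XML.toJSONObject(xml).toString();
      event.setBody(json.getBytes(StandardCharsets.UTF_8));
    } catch (JSONException e) {
      // leave the body untouched if the XML cannot be parsed
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override
  public void close() { /* nothing to clean up */ }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() { return new XmlToJsonInterceptor(); }

    @Override
    public void configure(Context context) { /* no properties needed */ }
  }
}

The agent config would then reference it with something like the lines below (the source and interceptor names are placeholders too):

agent.sources.src1.interceptors = xml2json
agent.sources.src1.interceptors.xml2json.type = com.example.flume.XmlToJsonInterceptor$Builder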
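About Peyman's 'hive xml serde' suggestion: this is how I understand that route would look, driven over JDBC against HiveServer2. It is only a sketch against an invented <sale> schema; I am assuming the open-source hive-xml-serde jar (com.ibm.spss.hive.serde2.xml.XmlSerDe) has been added to Hive, and the table names, XPaths and HDFS paths are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Reads raw XML files through the XML SerDe and rewrites them as Parquet.
public class XmlToParquetViaHive {

  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // 1. External table that reads the raw XML files in place.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS sales_xml ("
              + " product STRING, manufacturer STRING, store STRING, amount DOUBLE)"
              + " ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'"
              + " WITH SERDEPROPERTIES ("
              + "  'column.xpath.product'='/sale/product/text()',"
              + "  'column.xpath.manufacturer'='/sale/manufacturer/text()',"
              + "  'column.xpath.store'='/sale/store/text()',"
              + "  'column.xpath.amount'='/sale/amount/text()')"
              + " STORED AS"
              + "  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'"
              + "  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'"
              + " LOCATION '/data/xml/sales'"
              + " TBLPROPERTIES ('xmlinput.start'='<sale', 'xmlinput.end'='</sale>')");

      // 2. Rewrite the same data as Parquet for faster queries.
      //    (STORED AS PARQUET needs Hive 0.13+; older Hives need the explicit Parquet SerDe.)
      stmt.execute(
          "CREATE TABLE sales_parquet STORED AS PARQUET"
              + " AS SELECT * FROM sales_xml");
    }
  }
}

If that is roughly right, Tableau (and the predictive analytics later) would simply point at sales_parquet through the Hive connector.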
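About Wilm's MapFile suggestion from earlier in the thread, is the sketch below roughly what you meant? One MapFile per month (or year), keyed by the original file name, so the NameNode does not have to track millions of tiny XML files. The paths are just examples.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Bundles a directory of small XML files into a single Hadoop MapFile.
public class XmlToMapFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path xmlDir = new Path("/data/incoming/2015/01");      // small XML files
    Path mapFileDir = new Path("/data/mapfiles/2015-01");  // output MapFile

    // Classic MapFile.Writer constructor (deprecated in newer Hadoop releases
    // in favour of the Options-based one, but still available).
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, mapFileDir.toString(), Text.class, Text.class);
    try {
      FileStatus[] files = fs.listStatus(xmlDir);
      // MapFile keys must be appended in ascending order, so sort by file name.
      Arrays.sort(files, new Comparator<FileStatus>() {
        @Override
        public int compare(FileStatus a, FileStatus b) {
          return a.getPath().getName().compareTo(b.getPath().getName());
        }
      });

      for (FileStatus file : files) {
        byte[] content = new byte[(int) file.getLen()];
        try (FSDataInputStream in = fs.open(file.getPath())) {
          in.readFully(content);
        }
        writer.append(new Text(file.getPath().getName()),
                      new Text(new String(content, StandardCharsets.UTF_8)));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Looking an individual file up again would then be a MapFile.Reader.get() on the file name, and the per-month metadata (or an index of which MapFile holds which file) could live in HBase as you suggested.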
