Anand,

Thank you very much for the clarification. Could you please explain how I would add new files to the Parquet file? Today's data set won't be the same as yesterday's, since new files have been added in the meantime.
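
For example, is the idea to run the same kind of load again, but only
over the files that arrived since the last run, and write them into a
new Parquet file next to the existing ones, say one file per day? Just
to check that I understand, something like the sketch below (paths and
names are made up, and I'm guessing at the parquet-avro API):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class AppendDailyParquet {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
              + "{\"name\":\"timestamp\",\"type\":\"long\"},"
              + "{\"name\":\"values\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}]}");

            // Existing Parquet files stay untouched; today's new uploads go
            // into a new file under a dated directory, and a job simply takes
            // the directories covering the time range it needs as its input.
            Path todaysFile = new Path("/data/measurements/day=2015-04-24/part-0.parquet");
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(todaysFile)
                         .withSchema(schema)
                         .build()) {
                // ... write one GenericRecord per newly uploaded measurement ...
            }
        }
    }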

Thanks,
Marko

On Fri 24 Apr 2015 11:33:03 AM CEST, Chandra Mohan, Ananda Vel Murugan wrote:
Marko,

The Parquet file would be created once, when you load the data. You
don't have to keep your small files in HDFS just to be able to subset
the data by time range; you can store the data and its metadata in the
same Parquet file. As already pointed out, Parquet files work well with
other tools in the Hadoop ecosystem. Apart from the performance of your
MapReduce jobs, another aspect is storage efficiency: serialization
formats like Avro and Parquet compress well, so the data occupies less
space.
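
For example, the one-time load could walk over your existing small
files and write one record per measurement into a single Parquet file,
with the timestamp stored next to the values. A rough sketch with the
parquet-avro API follows; the schema, the paths and the assumption that
each small file is one vector of numbers (one per line) are only
illustrative:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class LoadMeasurementsIntoParquet {

        // data (values) and metadata (timestamp) live in the same record
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
          + "{\"name\":\"timestamp\",\"type\":\"long\"},"
          + "{\"name\":\"values\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}]}");

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("/data/measurements.parquet"))
                         .withSchema(SCHEMA)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                for (FileStatus file : fs.listStatus(new Path("/uploads"))) {
                    // parse one small file into its vector of numbers
                    List<Double> values = new ArrayList<>();
                    try (BufferedReader in = new BufferedReader(
                             new InputStreamReader(fs.open(file.getPath())))) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            values.add(Double.parseDouble(line.trim()));
                        }
                    }
                    GenericRecord rec = new GenericData.Record(SCHEMA);
                    rec.put("timestamp", file.getModificationTime());
                    rec.put("values", values);
                    writer.write(rec);
                }
            }
        }
    }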

Regards,

Anand

*From:* Alexander Alten-Lorenz [mailto:[email protected]]
*Sent:* Friday, April 24, 2015 2:49 PM
*To:* [email protected]
*Subject:* Re: Large number of small files

Marko,

Cassandra is a NoSQL DB, much like HBase is for Hadoop; I won't go
into the pros and cons here.

Parquet is a columnar storage format. At a high level it is a bit like
a NoSQL DB, but at the storage layer: it lets users "query" the data
with MR, Pig or similar tools. Additionally, Parquet works very well
with Hive and Cloudera Impala, as well as with Apache Drill (the
open-source take on Google's Dremel).
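
At the lowest level you can even read it back from plain Java and
filter on the timestamp column yourself, roughly like this (only a
sketch; the file path and field name are examples and exact
class/package names depend on the Parquet version you use):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    public class ReadTimeRange {
        public static void main(String[] args) throws Exception {
            long from = Long.parseLong(args[0]);   // start of range, epoch millis
            long to   = Long.parseLong(args[1]);   // end of range, epoch millis

            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path("/data/measurements.parquet"))
                         .build()) {
                GenericRecord rec;
                while ((rec = reader.read()) != null) {
                    long ts = (Long) rec.get("timestamp");
                    if (ts >= from && ts <= to) {
                        System.out.println(rec);   // process the measurement vector
                    }
                }
            }
        }
    }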

https://parquet.incubator.apache.org/documentation/latest/

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html

https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table


--

Alexander Alten-Lorenz
m: [email protected]
b: mapredit.blogspot.com

    On Apr 24, 2015, at 11:10 AM, Marko Dinic <[email protected]> wrote:

    Anand,

    Thank you for your answer, but wouldn't that mean that I would
    have to serialize the files each time I need to run a job? And I
    would still need to keep the original files, so the NameNode would
    still have to take care of them?

    Please correct me if I'm missing something; I'm not very
    experienced with Hadoop.

    What do you think about using Cassandra?

    Thanks

    On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel
    Murugan wrote:

    Apart from databases like Cassandra, you may check serialization
    formats like Avro or Parquet

    Regards,
    Anand

    -----Original Message-----
    From: Marko Dinic [mailto:[email protected]]
    Sent: Friday, April 24, 2015 2:23 PM
    To: [email protected]
    Subject: Large number of small files

    Hello,

    I'm not sure if this is the right place to ask this question, but
    I'm still hoping for an answer or some advice.

    A large number of small files, about 8 KB each, are uploaded. I am
    aware that this is not something you hope for when working with
    Hadoop.

    I was thinking about using HAR files and combined input, or
    sequence files. The problem is that the files are timestamped, and
    I need a different subset at different times. For example, one job
    needs to run on the files uploaded during the last 3 months, while
    the next job might consider the last 6 months. Naturally, as time
    passes, a different subset of files is needed.

    This means that I would need to make a sequence file (or a HAR)
    each time I run a job, just to get a smaller number of mappers. On
    the other hand, I still need the original files so I can subset
    them, which keeps the NameNode under constant pressure, holding
    the metadata for all of them in its memory.
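
    To be concrete, the packing step I would have to repeat before
    every job would look roughly like this (just a sketch; the paths
    are made up and the time-range check is simplified to the file
    modification time):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class PackSmallFiles {
            public static void main(String[] args) throws Exception {
                long cutoff = Long.parseLong(args[0]); // only files newer than this
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // one SequenceFile per job run, rebuilt every time
                SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(new Path("/tmp/job-input.seq")),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(BytesWritable.class));
                for (FileStatus file : fs.listStatus(new Path("/uploads"))) {
                    if (file.getModificationTime() < cutoff) {
                        continue;                     // outside the wanted range
                    }
                    byte[] content = new byte[(int) file.getLen()];
                    try (FSDataInputStream in = fs.open(file.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    writer.append(new Text(file.getPath().getName()),
                                  new BytesWritable(content));
                }
                writer.close();
            }
        }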

    How can I solve this problem?

    I was also considering using Cassandra, or something like that,
    and saving the file content in it instead of saving it as files on
    HDFS. The file content is actually a measurement, that is, a
    vector of numbers, with some metadata.

    Thanks
