Hi Shashank,

Let me make sure I understand the question. You have two large JSON data files? 
You are on a distributed Drill cluster. You want to know why you are seeing a 
billion rows in one fragment rather than the work being distributed across 
multiple fragments? Is this an accurate summary?

The key thing to know is that Drill (and most Hadoop-based systems) rely on 
files to be "block-splittable". That is, if your file is 1 GB in size, Drill 
needs to be able to read, say, blocks of 256 MB from the file so that we can 
have four Drill fragments read that single 1 GB file. This is true even if you 
store the files in S3.


CSV, Parquet, SequenceFile, and others are block-splittable. As it turns out, 
JSON is not. The reason is simple: there is no way to jump into a typical JSON 
file and scan for the start of the next record. With CSV, newlines are record 
separators. Parquet has row groups. With JSON, there may or may not be newlines 
between records, and there may or may not be newlines within records.
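
For example, a reader dropped into the middle of a typical pretty-printed JSON
file like the one below (records invented for illustration) has no reliable
marker to scan for, because newlines can appear both inside and between
records:

{
  "id": 1,
  "name": "alpha"
} { "id": 2, "name": "beta",
    "value": 20 }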

It turns out that there is an emerging standard called jsonlines [1] which 
requires that there be newlines between, but not within, JSON records. Using 
jsonlines would make JSON into a block-splittable format. Drill does not yet 
support this specialized form of JSON, but doing so would be a good enhancement 
for data files that adhere to it. Is your data in jsonlines format?
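
For reference, jsonlines data puts each record, however complex, on a single
line (again, records invented for illustration), so a reader can seek to any
block boundary, skip ahead to the next newline, and know it is at the start of
a record:

{"id": 1, "name": "alpha", "value": 10}
{"id": 2, "name": "beta", "value": 20}
{"id": 3, "name": "gamma", "value": 30}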


For now, the solution is simple: rather than storing your data in a single 
large JSON file, split the data into multiple smaller files within a single 
directory. Drill will read each file in a separate fragment, giving you the 
parallelism you want. Make each file on the order of 100 MB, say. The key is 
to ensure that you have at least as many files as you have minor fragments. The 
number of minor fragments will be 70% of your CPU count per node. If you have 
10 CPUs, say, Drill will create 7 fragments per node. Then, multiply this by 
the number of nodes. If you have 4 nodes, say, you'll have 28 minor fragments 
total. You want to have at least 28 JSON files so you can keep each fragment 
busy.
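
If it helps, here is a rough Python sketch of the kind of splitting I have in
mind. It assumes your data is already one JSON record per line (jsonlines
style); the file names, directory, and 100 MB target are just placeholders for
your own values:

import os

SRC = "big_data.json"            # hypothetical input, one JSON record per line
OUT_DIR = "split_json"           # directory you would point Drill at instead
CHUNK_BYTES = 100 * 1024 * 1024  # aim for roughly 100 MB per output file

os.makedirs(OUT_DIR, exist_ok=True)

part = 0
written = 0
out = open(os.path.join(OUT_DIR, f"part_{part:04d}.json"), "w")
with open(SRC) as src:
    for line in src:             # each line is one complete JSON record
        if written >= CHUNK_BYTES:
            out.close()          # current chunk is full; start a new file
            part += 1
            written = 0
            out = open(os.path.join(OUT_DIR, f"part_{part:04d}.json"), "w")
        out.write(line)
        written += len(line)
out.close()

If your records do not already sit one per line, you would need to parse and
re-serialize them (with Python's json module, say) rather than copying raw
lines, but the idea of capping each file at a size one fragment can scan is
the same.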

If your code generates the JSON, then you can change the code to split the data 
into smaller files. If you obtain the JSON from somewhere else, then your 
options may be more limited.

Will any of this help resolve your issue?


Thanks,
- Paul

 
[1] http://jsonlines.org/

    On Wednesday, April 15, 2020, 12:32:35 PM PDT, Shashank Sharma 
<[email protected]> wrote:  
 
 Hi folks,

I have two large JSON data sets and I am querying them on a distributed Apache
Drill system. Can anyone explain why it is making (or building) billions of
records to scan in a fragment when joining the two big data sets, by hash join
as well as merge join, with only a 60,000-record data set, through an S3
bucket distributed file system?

-- 

Shashank Sharma

Software Engineer

Phone: +91 8968101068
