Re: Optimal Config for the worker

Jihoon Son Fri, 09 Oct 2015 09:14:50 -0700

So dfs-dir-aware didn't work as expected, and you set resource.disks to a
value yourself, right? If so, I'll look into the dfs-dir-aware configuration.


Regarding space cleanup, you can delete any directory. Tajo automatically
recreates the system directories and files it needs.
Deleting data, in contrast, does not stop Tajo from working normally, but
you can no longer see the deleted data. For example, if you delete the query
detail directory, you can no longer see query details in the web UI. These
query detail directories are deleted automatically over time, so you don't
need to clean them up unless you are running low on available space.

In addition, you may want to delete Tajo's temporary data, which is stored
during query execution. The default temporary directory is
/tmp/tajo-${user.name}/tmpdir. You can delete it yourself, or set
'tajo.worker.tmpdir.cleanup-at-startup' for automatic cleanup.
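
As a minimal sketch of the automatic cleanup option, assuming the worker
settings live in tajo-site.xml and that the property takes a boolean value
(both are assumptions, so please check them against your Tajo version):

```xml
<!-- tajo-site.xml (assumed location of the worker settings) -->
<property>
  <name>tajo.worker.tmpdir.cleanup-at-startup</name>
  <!-- Assumed boolean: true clears leftover temporary data at worker startup -->
  <value>true</value>
</property>
```

As the property name suggests, the cleanup happens at worker startup, so it
should take effect the next time the workers are restarted.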

Jihoon

On Sat, Oct 10, 2015 at 12:50 AM, Odin Guillermo Caudillo Gallegos <
[email protected]> wrote:

> Hi.
> I set dfs-dir-aware to true, but the performance wasn't what I expected.
> So for testing purposes, I left it with resource.disks.
> About the HDFS space cleaning, which directories can I delete from my
> Hadoop?
> For example, is there a problem if I delete the query detail? Can I delete
> other folders?
> Thanks
>
> 2015-10-09 10:15 GMT-05:00 Jihoon Son <[email protected]>:
>
>> Hi Odin, yes you can make your query faster.
>>
>> First of all, you can increase the disk resource for Tajo workers by
>> setting '*tajo.worker.resource.disks*'. This disk resource determines the
>> number of tasks that are executed in parallel: a higher disk resource
>> allows more tasks to run at the same time. For example, given 10 tasks,
>> each of which reads data from HDFS, a Tajo worker with a disk resource of
>> 1 will execute those tasks one by one. With a disk resource of 2, two
>> tasks can be executed simultaneously, which can improve performance.
>> However, as you may know, if too many tasks access a single disk at the
>> same time, there will be a lot of random accesses, which make query
>> performance worse.
>> So, I recommend using the actual number of physical disks for this
>> configuration. Or, if you have already configured multiple disks for HDFS,
>> Tajo can automatically detect them and use them for the worker's disk
>> resource if you set '*tajo.worker.resource.dfs-dir-aware*' to true. Please
>> refer to
>> http://tajo.apache.org/docs/devel/configuration/worker_configuration.html
>> for more information.
>> After changing configuration values, you need to restart your Tajo
>> cluster.
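>>
>> As a minimal sketch of the two options above, assuming a worker with 2
>> physical disks (an illustrative number) and that these properties go in
>> tajo-site.xml (an assumption about the file name):
>>
>> ```xml
>> <!-- Option 1: set the disk resource by hand (2 is illustrative) -->
>> <property>
>>   <name>tajo.worker.resource.disks</name>
>>   <value>2</value>
>> </property>
>>
>> <!-- Option 2 (instead of option 1): let Tajo detect the HDFS data disks
>> <property>
>>   <name>tajo.worker.resource.dfs-dir-aware</name>
>>   <value>true</value>
>> </property>
>> -->
>> ```
>>
>> Option 2 is left commented out so that only one of the two settings is
>> active at a time.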
>>
>> In addition, I *strongly recommend* enabling
>> '*dfs.datanode.hdfs-blocks-metadata.enabled*' for your HDFS. With this
>> configuration, Tajo can achieve higher data locality when assigning its
>> tasks to workers. This will improve Tajo's performance significantly. You
>> need to restart HDFS after configuring this, too.
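>>
>> A minimal sketch of that setting, assuming it goes in hdfs-site.xml on
>> the datanodes:
>>
>> ```xml
>> <!-- hdfs-site.xml (assumed location); restart HDFS after changing it -->
>> <property>
>>   <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
>>   <value>true</value>
>> </property>
>> ```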
>>
>> Best regards,
>> Jihoon
>>
>> On Fri, Oct 9, 2015 at 11:43 PM, Odin Guillermo Caudillo Gallegos <
>> [email protected]> wrote:
>>
>>> Hi.
>>> I ran a select count on data in HDFS, which returned a total of
>>> almost 17 million records.
>>> The count took 2 minutes.
>>> I have the following config for the worker:
>>>
>>> <property>
>>>   <name>tajo.worker.resource.memory-mb</name>
>>>   <value>4096</value>
>>>   <description>Available memory size (MB)</description>
>>> </property>
>>>
>>> <property>
>>>   <name>tajo.worker.resource.disks</name>
>>>   <value>1</value>
>>>   <description>Available disk capacity (usually number of disks)</description>
>>> </property>
>>>
>>> <property>
>>>   <name>tajo.worker.tmpdir.locations</name>
>>>   <value>/tmp/tajo-11/tmpdir,/tmp/tajo-11/tmpdir1,/tmp/tajo-11/tmpdir2</value>
>>>   <description>A base for other temporary directories.</description>
>>> </property>
>>>
>>> Is there any way to give the query more power to make it faster?
>>> Do I need to change any other configuration?
>>>
>>>
>
