Re: Optimal Config for the worker

Jihoon Son Fri, 09 Oct 2015 19:20:07 -0700

Not yet.
We are currently working to support authentication (
https://issues.apache.org/jira/browse/TAJO-600).
I expect that the 0.12 release will include the basic authentication feature.


Thanks!
Jihoon

On Sat, Oct 10, 2015 at 2:43 AM, Odin Guillermo Caudillo Gallegos <
[email protected]> wrote:

> Good, thank you for the tips about cleaning.
> Is there any way to configure tajo with Kerberos or a security tool like
> Sentry already?
>
> 2015-10-09 11:13 GMT-05:00 Jihoon Son <[email protected]>:
>
>> You mean dfs-dir-aware doesn't work, so you set resource.disks to a
>> value yourself, right? If so, I'll check the dfs-dir-aware configuration.
>>
>> Regarding space cleaning, you can delete any directory. Any system
>> directories and files that tajo needs will be recreated automatically.
>> Deleting data, in contrast, does not stop tajo from working normally, but
>> you can no longer see the deleted data. For example, if you delete the query
>> detail directory, you can no longer see query details on the web UI. These
>> query detail directories are deleted automatically over time, so you don't
>> need to clean them up unless you are running low on available space.
>>
>> In addition, you may want to delete tajo's temporary data, which is stored
>> during query execution. The default temporary directory is created at
>> /tmp/tajo-${user.name}/tmpdir. You can delete it yourself, or set
>> 'tajo.worker.tmpdir.cleanup-at-startup' for automatic cleanup.
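>>
>> As a minimal sketch of the automatic cleanup option, assuming the property
>> goes in conf/tajo-site.xml like the other worker settings and takes a
>> boolean value (both are assumptions, so please check them against the
>> documentation for your version):
>>
>> ```xml
>> <!-- Assumed location: conf/tajo-site.xml. Assumed to accept true/false. -->
>> <property>
>>   <name>tajo.worker.tmpdir.cleanup-at-startup</name>
>>   <value>true</value>
>> </property>
>> ```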
>>
>> Jihoon
>>
>> On Sat, Oct 10, 2015 at 12:50 AM, Odin Guillermo Caudillo Gallegos <
>> [email protected]> wrote:
>>
>>> Hi.
>>> I set dfs-dir-aware to true, but the performance wasn't what I
>>> expected. So, for testing purposes, I left it with resource.disks.
>>> About the hdfs space cleaning, which directories can I delete from my
>>> hadoop?
>>> For example, is there a problem if I delete the query detail? Can I delete
>>> other folders?
>>> Thanks
>>>
>>> 2015-10-09 10:15 GMT-05:00 Jihoon Son <[email protected]>:
>>>
>>>> Hi Odin, yes you can make your query faster.
>>>>
>>>> First of all, you can increase the disk resource for tajo workers by
>>>> setting 'tajo.worker.resource.disks'. This disk resource is related to
>>>> the number of tasks executed in parallel: a higher disk resource
>>>> increases that number. For example, given 10 tasks, each of which reads
>>>> data from hdfs, a tajo worker with a disk resource of 1 will execute those
>>>> tasks one by one. With a disk resource of 2, two tasks can be executed
>>>> simultaneously, which can improve performance.
>>>> However, as you may know, if too many tasks access a single disk at the
>>>> same time, there will be a lot of random accesses, which make query
>>>> performance worse.
>>>> So, I recommend using the real number of physical disks for this
>>>> configuration. Or, if you have already configured multiple disks for hdfs,
>>>> tajo can automatically detect them and use them as the worker's disk
>>>> resource if you set 'tajo.worker.resource.dfs-dir-aware' to true. Please
>>>> refer to
>>>> http://tajo.apache.org/docs/devel/configuration/worker_configuration.html
>>>> for more information.
>>>> After changing configuration values, you need to restart your tajo
>>>> cluster.
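>>>>
>>>> For example, a minimal sketch for the worker's tajo-site.xml, assuming a
>>>> machine with 4 physical disks (the value 4 is only an illustration, so
>>>> use your own disk count):
>>>>
>>>> ```xml
>>>> <!-- Illustrative value: set this to your real number of physical disks. -->
>>>> <property>
>>>>   <name>tajo.worker.resource.disks</name>
>>>>   <value>4</value>
>>>> </property>
>>>>
>>>> <!-- Alternative: let tajo take the disk count from the hdfs data dirs. -->
>>>> <property>
>>>>   <name>tajo.worker.resource.dfs-dir-aware</name>
>>>>   <value>true</value>
>>>> </property>
>>>> ```
>>>>
>>>> The two properties are shown together only for illustration. They are
>>>> alternatives, so you would normally pick one of them.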
>>>>
>>>> In addition, I *strongly recommend* enabling
>>>> 'dfs.datanode.hdfs-blocks-metadata.enabled' for your HDFS. With this
>>>> configuration, tajo can achieve higher data locality when assigning its
>>>> tasks to workers. This will improve tajo's performance significantly. You
>>>> need to restart your hdfs after configuring this, too.
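>>>>
>>>> A sketch of the corresponding entry, assuming it goes in the usual
>>>> hdfs-site.xml on your datanodes:
>>>>
>>>> ```xml
>>>> <!-- Assumed location: hdfs-site.xml on each datanode. -->
>>>> <property>
>>>>   <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
>>>>   <value>true</value>
>>>> </property>
>>>> ```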
>>>>
>>>> Best regards,
>>>> Jihoon
>>>>
>>>> On Fri, Oct 9, 2015 at 11:43 PM, Odin Guillermo Caudillo Gallegos <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi.
>>>>> I ran a select count on data stored in hdfs, which returned a total of
>>>>> almost 17 million records.
>>>>> The count took 2 minutes.
>>>>> This is my current config for the worker:
>>>>>
>>>>> <property>
>>>>>   <name>tajo.worker.resource.memory-mb</name>
>>>>>   <value>4096</value>
>>>>>   <description>Available memory size (MB)</description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>>   <name>tajo.worker.resource.disks</name>
>>>>>   <value>1</value>
>>>>>   <description>Available disk capacity (usually number of disks)</description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>>   <name>tajo.worker.tmpdir.locations</name>
>>>>>   <value>/tmp/tajo-11/tmpdir,/tmp/tajo-11/tmpdir1,/tmp/tajo-11/tmpdir2</value>
>>>>>   <description>A base for other temporary directories.</description>
>>>>> </property>
>>>>>
>>>>> Is there any way to give the query more power to make it faster?
>>>>> Do I need to set any other configuration?
>>>>>
