[jira] [Updated] (IMPALA-7642) Optimize UDF jar handling in Catalog

2018-10-01 Thread Balazs Jeszenszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balazs Jeszenszky updated IMPALA-7642:
--
Attachment: test.html

> Optimize UDF jar handling in Catalog
> 
>
> Key: IMPALA-7642
> URL: https://issues.apache.org/jira/browse/IMPALA-7642
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Affects Versions: Impala 3.0
>Reporter: Miklos Szurap
>Priority: Major
>
> 1. Optimize UDF jar loading
> During startup and global invalidate metadata calls, 
> [CatalogServiceCatalog.loadJavaFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L956]
>  is called for each database. It calls 
> [extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L68]
>  for each function found in HMS, and each call downloads the function's UDF 
> jar file from HDFS to the localLibraryPath (file:///tmp). UDFs are often not 
> packaged separately but bundled into all-in-one fat jars, which can be 
> 10-50 MB in size. A database sometimes holds hundreds of functions (usually 
> belonging to the same project) that all point to the same UDF jar. In that 
> case the method downloads the same jar hundreds of times, each time 
> "extracting the function" and deleting the local file.
> The suggested improvement:
> - create a local "cache" in CatalogServiceCatalog.loadJavaFunctions() as a 
> HashMap (map of jarUri -> localJarPath)
> - pass this cache to FunctionUtils.extractFunctions(), which checks whether 
> the cache already contains the jarUri; if not, it downloads the jar and puts 
> it into the cache (and does everything else needed)
> - move the FileSystemUtil.deleteIfExists(localJarPath) call from 
> extractFunctions() to loadJavaFunctions(): in a finally block, iterate over 
> the cache values, delete the local files, and then clear the cache.
> 2. Use {{Set}} instead of {{List}} for addedSignatures in 
> [FunctionUtils.extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L73]:
> The collection only tracks which function signatures have already been 
> added, so a Set is sufficient. 
> {noformat}
> if (!addedSignatures.contains(fn.signatureString())){noformat}
> With a HashSet the contains check is {{O( 1 )}} on average, compared to 
> {{O( n )}} for an ArrayList.
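
The caching suggestion in item 1 could look roughly like the sketch below. It is not the Impala code: the class name JarCacheSketch, the download() helper and the downloads counter are hypothetical stand-ins, and the "download" only creates an empty temp file so that the example runs without HDFS. It only shows the shape of the idea: one map per loadJavaFunctions() call, a lookup before each download, and cleanup in a finally block.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JarCacheSketch {
  // Counts simulated downloads so the effect of the cache is observable.
  static int downloads = 0;

  // Hypothetical stand-in for copying a jar from HDFS to the local library path.
  static Path download(String jarUri) throws IOException {
    downloads++;
    return Files.createTempFile("udf-", ".jar");
  }

  // Stand-in for extractFunctions(): reuses the local copy when the jar URI is cached.
  static Path extractFunctions(String jarUri, Map<String, Path> jarCache)
      throws IOException {
    Path localJarPath = jarCache.get(jarUri);
    if (localJarPath == null) {
      localJarPath = download(jarUri);
      jarCache.put(jarUri, localJarPath);
    }
    // The real method would now load the class from localJarPath and
    // extract the function signatures; that part is omitted here.
    return localJarPath;
  }

  // Stand-in for loadJavaFunctions(): owns the cache and cleans it up at the end.
  static void loadJavaFunctions(List<String> jarUrisOfFunctions) throws IOException {
    Map<String, Path> jarCache = new HashMap<>();
    try {
      for (String jarUri : jarUrisOfFunctions) {
        extractFunctions(jarUri, jarCache);
      }
    } finally {
      // Delete every local copy exactly once, then drop the entries.
      for (Path localJarPath : jarCache.values()) {
        Files.deleteIfExists(localJarPath);
      }
      jarCache.clear();
    }
  }

  public static void main(String[] args) throws IOException {
    // Three functions, two of which point to the same jar.
    loadJavaFunctions(Arrays.asList(
        "hdfs:///udfs/project.jar", "hdfs:///udfs/project.jar", "hdfs:///udfs/other.jar"));
    System.out.println("downloads = " + downloads); // prints "downloads = 2"
  }
}
```

Keeping the map local to one loadJavaFunctions() call means the local copies never outlive the load of a single database, so a replaced jar in HDFS is picked up on the next load.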



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-7642) Optimize UDF jar handling in Catalog

2018-10-01 Thread Balazs Jeszenszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balazs Jeszenszky updated IMPALA-7642:
--
Attachment: (was: test.html)







[jira] [Updated] (IMPALA-7642) Optimize UDF jar handling in Catalog

2018-09-28 Thread Miklos Szurap (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szurap updated IMPALA-7642:
--
Description: 
1. Optimize UDF jar loading
During startup and global invalidate metadata calls, 
[CatalogServiceCatalog.loadJavaFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L956]
 is called for each database. It calls 
[extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L68]
 for each function found in HMS, and each call downloads the function's UDF jar 
file from HDFS to the localLibraryPath (file:///tmp). UDFs are often not 
packaged separately but bundled into all-in-one fat jars, which can be 
10-50 MB in size. A database sometimes holds hundreds of functions (usually 
belonging to the same project) that all point to the same UDF jar. In that case 
the method downloads the same jar hundreds of times, each time "extracting the 
function" and deleting the local file.
The suggested improvement:
- create a local "cache" in CatalogServiceCatalog.loadJavaFunctions() as a 
HashMap (map of jarUri -> localJarPath)
- pass this cache to FunctionUtils.extractFunctions(), which checks whether the 
cache already contains the jarUri; if not, it downloads the jar and puts it into 
the cache (and does everything else needed)
- move the FileSystemUtil.deleteIfExists(localJarPath) call from 
extractFunctions() to loadJavaFunctions(): in a finally block, iterate over the 
cache values, delete the local files, and then clear the cache.

2. Use {{Set}} instead of {{List}} for addedSignatures in 
[FunctionUtils.extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L73]:
The collection only tracks which function signatures have already been added, 
so a Set is sufficient. 
{noformat}
if (!addedSignatures.contains(fn.signatureString())){noformat}
With a HashSet the contains check is {{O(1)}} on average, compared to 
{{O(n)}} for an ArrayList.
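
Item 2 can be illustrated with a small, self-contained sketch. The class name SignatureDedupSketch and the dedupe() helper are made up for this example and are not part of Impala; only the addedSignatures name follows the description above. It relies on the fact that Set.add() returns false when the element is already present, so the membership check and the insert become a single hash lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SignatureDedupSketch {
  // Returns the signatures in first-seen order with duplicates dropped.
  static List<String> dedupe(List<String> signatures) {
    Set<String> addedSignatures = new HashSet<>();
    List<String> result = new ArrayList<>();
    for (String signature : signatures) {
      // add() returns true only for a signature that was not seen before.
      if (addedSignatures.add(signature)) {
        result.add(signature);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(dedupe(Arrays.asList("f(INT)", "f(STRING)", "f(INT)")));
    // prints "[f(INT), f(STRING)]"
  }
}
```

With an ArrayList the same loop would scan the list on every contains() call, which makes it quadratic in the number of signatures in the worst case.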




[jira] [Updated] (IMPALA-7642) Optimize UDF jar handling in Catalog

2018-09-28 Thread Miklos Szurap (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szurap updated IMPALA-7642:
--
Description: 
1. Optimize UDF jar loading
During startup and global invalidate metadata calls, 
[CatalogServiceCatalog.loadJavaFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L956]
 is called for each database. It calls 
[extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L68]
 for each function found in HMS, and each call downloads the function's UDF jar 
file from HDFS to the localLibraryPath (file:///tmp). UDFs are often not 
packaged separately but bundled into all-in-one fat jars, which can be 
10-50 MB in size. A database sometimes holds hundreds of functions (usually 
belonging to the same project) that all point to the same UDF jar. In that case 
the method downloads the same jar hundreds of times, each time "extracting the 
function" and deleting the local file.
The suggested improvement:
- create a local "cache" in CatalogServiceCatalog.loadJavaFunctions() as a 
HashMap (map of jarUri -> localJarPath)
- pass this cache to FunctionUtils.extractFunctions(), which checks whether the 
cache already contains the jarUri; if not, it downloads the jar and puts it into 
the cache (and does everything else needed)
- move the FileSystemUtil.deleteIfExists(localJarPath) call from 
extractFunctions() to loadJavaFunctions(): in a finally block, iterate over the 
cache values, delete the local files, and then clear the cache.

2. Use {{Set}} instead of {{List}} for addedSignatures in 
[FunctionUtils.extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L73]:
The collection only tracks which function signatures have already been added, 
so a Set is sufficient. 
{noformat}
if (!addedSignatures.contains(fn.signatureString())){noformat}
With a HashSet the contains check is {{O( 1 )}} on average, compared to 
{{O( n )}} for an ArrayList.


