[jira] [Created] (ARROW-17721) [C++][Gandiva] Expression Evaluation Performance Improvement using Mimalloc

2022-09-14 Thread Jiangtao Peng (Jira)
Jiangtao Peng created ARROW-17721:
-

 Summary: [C++][Gandiva] Expression Evaluation Performance 
Improvement using Mimalloc
 Key: ARROW-17721
 URL: https://issues.apache.org/jira/browse/ARROW-17721
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Jiangtao Peng


Arrow uses jemalloc as its default memory allocator. For some reason, I am going 
to use mimalloc instead, but there seems to be a big performance difference 
between the two memory allocators.

Here are my steps.

I use simple compile options:
{code:java}
-DCMAKE_BUILD_TYPE=debug
-DARROW_JEMALLOC=OFF|ON
-DARROW_MIMALLOC=ON|OFF
-DARROW_GANDIVA=ON
-DARROW_GANDIVA_STATIC_LIBSTDCPP=ON
-DARROW_BUILD_TESTS=ON
{code}
 
Then I wrote a simple test case:
{code:cpp}
#include <gtest/gtest.h>
#include "arrow/memory_pool.h"
#include "arrow/status.h"

#include "gandiva/projector.h"
#include "gandiva/tests/test_util.h"
#include "gandiva/tree_expr_builder.h"

#include <chrono>
#include <iostream>

namespace gandiva {

using arrow::boolean;
using arrow::date64;
using arrow::int32;
using arrow::int64;
using arrow::utf8;

class TestUtf8Perf : public ::testing::Test {
 public:
  void SetUp() { pool_ = arrow::default_memory_pool(); }

 protected:
  arrow::MemoryPool* pool_;
};

// TestPerf is a free function, so the memory pool is passed in explicitly
// instead of being read from the fixture member.
void TestPerf(arrow::MemoryPool* pool, int64_t char_length, int64_t num_records) {
  // schema for input fields
  auto field_a = field("a", utf8());
  auto schema = arrow::schema({field_a});

  // output fields
  auto res = field("res", utf8());

  auto node_a = TreeExprBuilder::MakeField(field_a);
  auto upper_a = TreeExprBuilder::MakeFunction("upper", {node_a}, utf8());
  auto expr = TreeExprBuilder::MakeExpression(upper_a, res);

  // Build a projector for the expressions.
  std::shared_ptr<Projector> projector;
  auto status = Projector::Make(schema, {expr}, TestConfiguration(), &projector);
  ASSERT_TRUE(status.ok()) << status.message();

  // Build the input column: num_records strings of char_length characters each.
  std::string val = std::string(char_length, 'a');
  arrow::StringBuilder builder;
  for (int64_t i = 0; i < num_records; i++) {
    ASSERT_TRUE(builder.Append(val).ok());
  }
  std::shared_ptr<arrow::Array> array_a;
  ASSERT_TRUE(builder.Finish(&array_a).ok());

  // prepare input record batch
  auto in_batch = arrow::RecordBatch::Make(schema, num_records, {array_a});

  auto start_epoch = std::chrono::duration_cast<std::chrono::milliseconds>(
                         std::chrono::system_clock::now().time_since_epoch())
                         .count();
  // Evaluate expression
  arrow::ArrayVector outputs;
  status = projector->Evaluate(*in_batch, pool, &outputs);
  EXPECT_TRUE(status.ok()) << status.message();

  // Report the elapsed evaluation time in milliseconds.
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::system_clock::now().time_since_epoch())
                       .count() -
                   start_epoch
            << "ms" << std::endl;
}

TEST_F(TestUtf8Perf, TestMemoryAllocsPerf) {
  TestPerf(pool_, 20, 1);
  TestPerf(pool_, 20, 10);
  TestPerf(pool_, 200, 1);
  TestPerf(pool_, 200, 10);
  TestPerf(pool_, 2000, 1);
}

}  // namespace gandiva
{code}
This case evaluates the expression {*}upper(a){*}, where each value of *a* is 
20, 200, or 2000 characters long. The evaluation times are:
|char_length|num_records|Using mimalloc (ms)|Using jemalloc (ms)|
|20|1|29|3|
|20|10|2686|26|
|200|1|954|11|
|200|10|220153|118|
|2000|1|21162|89|

 
Is this performance gap expected? Are there other compile options I should be 
aware of? How can I improve performance when using mimalloc?
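
A note on the measurement itself: the test takes its timestamps from std::chrono::system_clock, which is a wall clock and may be adjusted while the test runs. Below is a minimal sketch of an interval timer built on std::chrono::steady_clock, which is monotonic. ElapsedMillis and BuildStrings are hypothetical names used only for this illustration, and BuildStrings is a stand-in workload, not the Gandiva projector call.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow or Gandiva): runs `fn` once and
// returns the elapsed time in milliseconds, measured on a monotonic clock.
template <typename Fn>
int64_t ElapsedMillis(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start)
      .count();
}

// Stand-in workload (an assumption for this sketch): allocates num_records
// strings of char_length characters and returns the total number of bytes.
std::size_t BuildStrings(int64_t char_length, int64_t num_records) {
  std::vector<std::string> values;
  values.reserve(static_cast<std::size_t>(num_records));
  std::size_t total = 0;
  for (int64_t i = 0; i < num_records; i++) {
    values.emplace_back(static_cast<std::size_t>(char_length), 'a');
    total += values.back().size();
  }
  return total;
}
```

In the test, the call to projector->Evaluate could be wrapped in a lambda and passed to ElapsedMillis, so that only the evaluation is inside the timed region.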



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-13306) [Java][JDBC] use ResultSetMetaData.getColumnLabel instead of ResultSetMetaData.getColumnName

2021-07-12 Thread Jiangtao Peng (Jira)
Jiangtao Peng created ARROW-13306:
-

 Summary: [Java][JDBC] use ResultSetMetaData.getColumnLabel instead 
of ResultSetMetaData.getColumnName
 Key: ARROW-13306
 URL: https://issues.apache.org/jira/browse/ARROW-13306
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Jiangtao Peng


When using the JDBC-to-Arrow utilities, a column alias sometimes does not appear 
in the final Arrow result.

For example, take the result set of this query:
{code:sql}
SELECT col AS a FROM table{code}
With PostgreSQL this works properly: the Arrow result schema contains "a". With 
MySQL, the Arrow result schema contains "col".

This is because the PostgreSQL driver uses the field label as both the column 
name and the column label ([postgres jdbc|https://github.com/pgjdbc/pgjdbc/blob/f61fbfe7b72ccf2ca0ac2e2c366230fdb93260e5/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSetMetaData.java#L144]), 
whereas the MySQL driver uses the alias as the column label and the original 
column name as the column name ([mysql jdbc|https://github.com/mysql/mysql-connector-j/blob/18bbd5e68195d0da083cbd5bd0d05d76320df7cd/src/main/user-impl/java/com/mysql/cj/jdbc/result/ResultSetMetaData.java#L176]).

"getColumnLabel" may therefore be a better fit for Arrow results than 
"getColumnName".





[jira] [Created] (ARROW-11089) [C++][Gandiva] Support list datatype for gandiva UDF

2020-12-31 Thread Jiangtao Peng (Jira)
Jiangtao Peng created ARROW-11089:
-

 Summary: [C++][Gandiva] Support list datatype for gandiva UDF 
 Key: ARROW-11089
 URL: https://issues.apache.org/jira/browse/ARROW-11089
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Jiangtao Peng


It would be useful to support the Arrow list type for Gandiva expression inputs 
and outputs.


