Re: Any standard way for min/max values per record-batch?

2021-07-19 Thread Kohei KaiGai
Hello,

Let me share our trial to support the min/max statistics per record batch.
https://github.com/heterodb/pg-strom/wiki/806:-Apache-Arrow-Min-Max-Statistics-Hint

The latest pg2arrow supports a --stat option that specifies the columns for
which min/max statistics are collected for each record batch.
The statistics, if any, are embedded in the custom_metadata[] of the Field
(in the Footer area).

The example below shows database table dump with statistics on the
lo_orderdate column.

$ pg2arrow -d postgres -o /dev/shm/flineorder_sort.arrow -t
lineorder_sort --stat=lo_orderdate --progress
RecordBatch[0]: offset=1640 length=268437080 (meta=920,
body=268436160) nitems=1303085
RecordBatch[1]: offset=268438720 length=268437080 (meta=920,
body=268436160) nitems=1303085
  :
RecordBatch[9]: offset=2415935360 length=5566 (meta=920,
body=55667968) nitems=270231

Then you can find the custom metadata on the lo_orderdate field:
min_values and max_values.
These are comma-separated integer lists with 10 elements, one per record batch.
So the first items of min_values and max_values are the min/max datum of
record-batch[0].

$ python3
Python 3.6.8 (default, Aug 24 2020, 17:57:11)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader('/dev/shm/flineorder_sort.arrow')
>>> X.schema
lo_orderkey: decimal(30, 8)
lo_linenumber: int32
lo_custkey: decimal(30, 8)
lo_partkey: int32
lo_suppkey: decimal(30, 8)
lo_orderdate: int32
  -- field metadata --
  min_values: '19920101,19920919,19930608,19940223,1994,19950730,1996' + 31
  max_values: '19920919,19930608,19940223,1994,19950730,19960417,1997' + 31
lo_orderpriority: fixed_size_binary[15]
lo_shippriority: fixed_size_binary[1]
lo_quantity: decimal(30, 8)
lo_extendedprice: decimal(30, 8)
lo_ordertotalprice: decimal(30, 8)
  :

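As a cross-check from the application side, the hints can also be consumed directly with pyarrow. Below is a minimal sketch (not part of pg2arrow or Arrow_Fdw; it assumes the comma-separated layout shown above) that parses min_values/max_values from the field metadata and yields only the record batches whose range can overlap a BETWEEN predicate:

import pyarrow as pa

def load_matching_batches(path, column, lo, hi):
    reader = pa.ipc.open_file(path)                  # RecordBatchFileReader
    meta = reader.schema.field(column).metadata
    mins = [int(v) for v in meta[b'min_values'].split(b',')]
    maxs = [int(v) for v in meta[b'max_values'].split(b',')]
    for i in range(reader.num_record_batches):
        # skip the batch when its [min, max] range cannot overlap [lo, hi]
        if maxs[i] < lo or mins[i] > hi:
            continue
        yield reader.get_batch(i)

# record batches whose lo_orderdate range overlaps [19940101, 19940630]
batches = list(load_matching_batches('/dev/shm/flineorder_sort.arrow',
                                     'lo_orderdate', 19940101, 19940630))
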
When we scan an arrow_fdw foreign table that maps an Apache Arrow file with
the above min/max statistics, it automatically checks the WHERE-clause and
skips record batches that cannot contain any surviving rows.

postgres=# EXPLAIN ANALYZE
   SELECT count(*) FROM flineorder_sort
WHERE lo_orderpriority = '2-HIGH'
  AND lo_orderdate BETWEEN 19940101 AND 19940630;

 QUERY PLAN
--
 Aggregate  (cost=33143.09..33143.10 rows=1 width=8) (actual
time=115.591..115.593 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg) on flineorder_sort
(cost=33139.52..33142.07 rows=204 width=8) (actual
time=115.580..115.585 rows=1 loops=1)
 Reduction: NoGroup
 Outer Scan: flineorder_sort  (cost=4000.00..33139.42 rows=300
width=0) (actual time=10.682..21.555 rows=2606170 loops=1)
 Outer Scan Filter: ((lo_orderdate >= 19940101) AND
(lo_orderdate <= 19940630) AND (lo_orderpriority = '2-HIGH'::bpchar))
 Rows Removed by Outer Scan Filter: 2425885
 referenced: lo_orderdate, lo_orderpriority
 Stats-Hint: (lo_orderdate >= 19940101), (lo_orderdate <=
19940630)  [loaded: 2, skipped: 8]
 files0: /dev/shm/flineorder_sort.arrow (read: 217.52MB, size:
2357.11MB)
 Planning Time: 0.210 ms
 Execution Time: 153.508 ms
(11 rows)

This EXPLAIN ANALYZE displays the Stats-Hint line.
It says Arrow_Fdw could use (lo_orderdate >= 19940101) and
(lo_orderdate <= 19940630) to check the min/max statistics; it actually
skipped 8 record batches and loaded only 2.


Our expectation is IoT/M2M-grade time-series log processing, because such
logs always contain a timestamp value for each entry, and physically close
rows tend to have similar values.
This is not limited to Apache Arrow files generated by pg2arrow: the min/max
statistics can be appended by rewriting the Footer portion, without relocating
the record batches. So we plan to provide a standalone command to attach
min/max statistics onto existing Apache Arrow files generated by other tools.

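Until that command is ready, the following Python sketch illustrates the idea (it is only an illustration, not the planned tool: pyarrow rewrites the whole file instead of patching only the Footer). It computes per-batch min/max with pyarrow.compute and stores them on the field's custom metadata in the same comma-separated form:

import pyarrow as pa
import pyarrow.compute as pc

def attach_minmax(src, dst, column):
    reader = pa.ipc.open_file(src)
    idx = reader.schema.get_field_index(column)
    batches = [reader.get_batch(i) for i in range(reader.num_record_batches)]
    # per-record-batch min/max of the target column
    stats = [pc.min_max(b.column(idx)).as_py() for b in batches]
    mins = ','.join(str(s['min']) for s in stats)
    maxs = ','.join(str(s['max']) for s in stats)
    # put the statistics on the Field's custom metadata, then write a new file
    field = reader.schema.field(idx).with_metadata(
        {'min_values': mins, 'max_values': maxs})
    schema = reader.schema.set(idx, field)
    with pa.ipc.new_file(dst, schema) as writer:
        for b in batches:
            writer.write_batch(pa.RecordBatch.from_arrays(list(b.columns),
                                                          schema=schema))
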
Best regards,

2021-02-18 (Thu) 13:33 Kohei KaiGai :
>
> Thanks for the clarification.
>
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> For example, JSON array of min/max values as a key-value metadata
> in the Footer->Schema->Fields[]->custom_metadata?
> Even though the metadata field must be less than INT_MAX, I think it
> is enough portable and not invasive way.
>
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on

Re: Any standard way for min/max values per record-batch?

2021-02-17 Thread Kohei KaiGai
Thanks for the clarification.

> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message).  I think
> standardizing how we store statistics per batch does make sense.
>
For example, a JSON array of min/max values as key-value metadata
in the Footer->Schema->Fields[]->custom_metadata?
Even though the metadata field must be less than INT_MAX, I think it
is a sufficiently portable and non-invasive way.

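To make the shape concrete, such a value could be built like the snippet below (purely a hypothetical illustration of the proposed encoding; the key name "stats" and the grouping are my own choice, nothing standardized):

import json
import pyarrow as pa

# one JSON document per field; one array element per record batch
stats = json.dumps({"min_values": [19920101, 19920919, 19930608],
                    "max_values": [19920919, 19930608, 19940223]})
field = pa.field("lo_orderdate", pa.int32()).with_metadata({"stats": stats})
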
> We unfortunately can't add anything to field-node without breaking
> compatibility.  But  another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
What does the parallel list mean?
If we had a standardized binary structure, like DictionaryBatch,
to store statistics including min/max values, that would of course make
more sense than text-encoded key-value metadata.

Best regards,

2021-02-18 (Thu) 12:37 Micah Kornfield :
>
> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message).  I think
> standardizing how we store statistics per batch does make sense.
>
> We unfortunately can't add anything to field-node without breaking
> compatibility.  But  another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
> On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai  wrote:
>
> > Hello,
> >
> > Does Apache Arrow have any standard way to embed min/max values of the
> > fields
> > per record-batch basis?
> > It looks FieldNode supports neither dedicated min/max attribute nor
> > custom-metadata.
> > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> >
> > If we embed an array of min/max values into the custom-metadata of the
> > Field-node,
> > we may be able to implement.
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> >
> > What I like to implement is something like BRIN index at PostgreSQL.
> > http://heterodb.github.io/pg-strom/brin/
> >
> > This index contains only min/max values for a particular block ranges, and
> > query
> > executor can skip blocks that obviously don't contain the target data.
> > If we can skip 9990 of 1 record batch by checking metadata on a query
> > that
> > tries to fetch items in very narrow timestamps, it is a great
> > acceleration more than
> > full file scans.
> >
> > Best regards,
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei 
> >



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Any standard way for min/max values per record-batch?

2021-02-17 Thread Kohei KaiGai
Hello,

Does Apache Arrow have any standard way to embed min/max values of the fields
on a per-record-batch basis?
It looks like FieldNode supports neither a dedicated min/max attribute nor
custom metadata.
https://github.com/apache/arrow/blob/master/format/Message.fbs#L28

If we embed an array of min/max values into the custom-metadata of the
Field node, we may be able to implement it there.
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344

What I would like to implement is something like the BRIN index in PostgreSQL.
http://heterodb.github.io/pg-strom/brin/

This index contains only min/max values for particular block ranges, and the
query executor can skip blocks that obviously don't contain the target data.
If we can skip 9990 of 10000 record batches by checking metadata on a query
that tries to fetch items within a very narrow timestamp range, that is a far
greater acceleration than a full file scan.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Pcap2Arrow - Packet capture and data conversion tool to Apache Arrow on the fly

2021-02-15 Thread Kohei KaiGai
Hello,

Let me share my recent works below:
https://github.com/heterodb/pg-strom/wiki/804:-Pcap2Arrow

This standalone command-line tool captures network packets from network
interface devices, converts them into the Apache Arrow data format according
to a pre-defined data schema for each supported protocol (TCP, UDP, ICMP over
IPv4/IPv6), and then writes out the destination files.

It internally uses PF_RING [*1] to support fast network interface cards
(> 10Gb) and to minimize packet loss by utilizing multiple CPU cores.
Although I confirmed that Pcap2Arrow can write out the captured network
packets at a rate of more than 50Gb/s, my test cases use artificial and
biased traffic patterns.
If you can test the software in your environment, that would help us
improve it.
[*1] https://www.ntop.org/products/packet-capture/pf_ring/

As you may know, network traffic data tends to grow very large, so it is
not easy to import it into database systems for analytics. Once we convert
it into Apache Arrow, we don't need to import the captured data again; we
just map the files prior to analytics.

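For instance, a capture file can be memory-mapped and read with pyarrow, with no import step at all (a sketch only; the file name is hypothetical and the actual column set depends on the schema Pcap2Arrow generates for each protocol):

import pyarrow as pa

# memory-map one capture file and read it in place
source = pa.memory_map('/tmp/capture_tcp4.arrow', 'r')   # hypothetical file name
table = pa.ipc.open_file(source).read_all()
print(table.schema)
print(table.num_rows)
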
Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Re: Human-readable version of Arrow Schema?

2020-01-08 Thread Kohei KaiGai
Hello,

pg2arrow [*1] has a '--dump' mode that prints out the schema definition of a
given Apache Arrow file.
Does that work for you?

$ ./pg2arrow --dump ~/hoge.arrow
[Footer]
{Footer: version=V4, schema={Schema: endianness=little,
fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
type={Decimal: precision=11, scale=7}, children=[],
custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
children=[{Field: name="x", nullable=true, type={Int32}, children=[],
custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
{Field: name="d", nullable=true, type={Utf8},
dictionary={DictionaryEncoding: id=0, indexType={Int32},
isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
nullable=true, type={Timestamp: unit=us}, children=[],
custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
children=[], custom_metadata=[]}, {Field: name="random",
nullable=true, type={Float64}, children=[], custom_metadata=[]}],
custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
bodyLength=128}], recordBatches=[{Block: offset=1232,
metaDataLength=648 bodyLength=386112}]}
[Dictionary Batch 0]
{Block: offset=920, metaDataLength=184 bodyLength=128}
{Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
length=6, nodes=[{FieldNode: length=6, null_count=0}],
buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
{Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
[Record Batch 0]
{Block: offset=1232, metaDataLength=648 bodyLength=386112}
{Message: version=V4, body={RecordBatch: length=3000,
nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
{FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
{FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
null_count=0}, {FieldNode: length=3000, null_count=0}],
buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
offset=12416, length=24000}, {Buffer: offset=36416, length=384},
{Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
offset=97600, length=12032}, {Buffer: offset=109632, length=0},
{Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
offset=217664, length=12032}, {Buffer: offset=229696, length=384},
{Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
offset=266112, length=96000}, {Buffer: offset=362112, length=0},
{Buffer: offset=362112, length=24000}]}, bodyLength=386112}

[*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow

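For what it's worth, pyarrow can also print a reasonably human-readable schema from the same file, which may be enough as a quick check (a small sketch using the file from the example above):

import os
import pyarrow as pa

reader = pa.ipc.open_file(os.path.expanduser('~/hoge.arrow'))
print(reader.schema)                      # one line per field, nested types indented
for key, value in (reader.schema.metadata or {}).items():
    print(key.decode(), value[:60])       # e.g. the "sql_command" custom metadata
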
2019-12-07 (Sat) 6:26 Christian Hudon :
>
> Hi,
>
> For the uses I would like to make of Arrow, I would need a human-readable
> and -writable version of an Arrow Schema, that could be converted to and
> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> see anything to that effect, with the closest being the ToString() method
> on DataType instances, but which is meant for debugging only. (I need an
> expression of an Arrow Schema that people can read, and that can live
> outside of the code for a particular operation.)
>
> Is a text representation of an Arrow Schema something that is being worked
> on now? If not, would you folks be interested in me putting up an initial
> proposal for discussion? Any design constraints I should pay attention to,
> then?
>
> Thanks,
>
>   Christian
> --
>
>
> │ Christian Hudon
>
> │ Applied Research Scientist
>
>Element AI, 6650 Saint-Urbain #500
>
>Montréal, QC, H2S 3G9, Canada
>Elementai.com



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Re: How about inet4/inet6/macaddr data types?

2019-04-30 Thread Kohei KaiGai
Hello Wes,

@ktou also introduced me to your work.
As long as the custom_metadata format that declares the custom datatype is
well defined in the specification or a document somewhere, independent of the
library implementation, it looks sufficient to me.
Your UUID example uses the FixedSizeBinary raw data type to wrap the UUID and
puts "arrow_extension_name=uuid" and "arrow_extension_data=uuid-type-unique-code"
on the custom_metadata of Field "f0", right?
If it is documented somewhere, people can reproduce the custom datatype in
their applications, and other folks can also read it.

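As a purely illustrative sketch of how an inet4 type could ride on that mechanism in today's pyarrow (the class name and the extension name 'pg_strom.inet4' are my own assumptions, not an agreed convention):

import pyarrow as pa

class Inet4Type(pa.ExtensionType):
    """IPv4 address + optional netmask, stored as FixedSizeBinary(5)."""
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(5), 'pg_strom.inet4')

    def __arrow_ext_serialize__(self):
        return b''                  # no extra parameters needed

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return Inet4Type()

pa.register_extension_type(Inet4Type())
# The IPC layer records the extension name/metadata on the field's
# custom_metadata, so other readers can recognize (or safely ignore) it.
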
Thanks,

2019-04-30 (Tue) 23:47 Wes McKinney :
>
> hi Kohei,
>
> Since the introduction of arrow::ExtensionType in ARROW-585 [1] we
> have a well-defined method of creating new data types without having
> to manually interact with the custom_metadata Schema information. Can
> you have a look at that and see if it meets your requirements? This
> can be a useful way of extending the Arrow format for your use cases
> while the community may discuss formally adding new logical types to
> the format (or not).
>
> In the unit tests you can see a UUID type I have defined and
> serialized through Arrow's binary protocol machinery
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
>
> Thanks
> Wes
>
> [1]: 
> https://github.com/apache/arrow/commit/a79cc809883192417920b501e41a0e8b63cd0ad1
>
> On Tue, Apr 30, 2019 at 1:34 AM Kohei KaiGai  wrote:
> >
> > Hello,
> >
> > It is an proposition to add new logical types for the Apache Arrow data 
> > format.
> >
> > As Melik-Adamyan said, it is quite easy to convert 5-bytes
> > FixedSizeBinary to PostgreSQL's inet
> > data type by the Arrow_Fdw module (an extension of PostgreSQL
> > responsible to data conversion),
> > however, it is not obvious for readers whether it is network-address
> > or just a bunch of small binary.
> >
> > https://www.postgresql.org/docs/11/sql-importforeignschema.html
> > PostgreSQL has IMPORT FOREIGN SCHEMA command; that allows to define a
> > foreign table
> > according to schema information of the external data source.
> > In case of Arrow_Fdw, we can define a foreign table without manual
> > listing of columns with data
> > types as follows:
> >
> >   IMPORT FOREIGN SCHEMA foo FROM arrow_fdw INTO public
> >   OPTIONS (file '/opt/nvme/foo.arrow');
> >
> > In this case, Schema definition in the 'foo.arrow' can tell PostgreSQL
> > how many columns are
> > defined and its name, data types and so on. However, PostgreSQL may be
> > confusing to convert
> > the FixedSizeBinary (width=5) without any metadata support. It may be
> > 'inet4' data type, and
> > it also may be 'char(5)'.
> >
> > One idea is utilization of custom_metadata field in the Field-node. We
> > may be able to mark it is
> > a network address, not a blob. However, I didn't find out
> > specification of the custom_metadata.
> >
> > I expect network address is widely used for log-data processing area,
> > and not small number of
> > application will support it. If so, it is not too niche requirement
> > for a new logical data type definition
> > in the Apache Arrow data format.
> >
> > Best regards,
> >
> > 2019-04-30 (Tue) 15:13 Micah Kornfield :
> > >
> > > Hi KaiGai Kohei,
> > > Can you clarify if you are looking for advice on modelling these types or
> > > proposing to add new logical types to the Arrow specification?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Monday, April 29, 2019, Kohei KaiGai  wrote:
> > >
> > > > Hello folks,
> > > >
> > > > How about your opinions about network address types support in Apache
> > > > Arrow data format?
> > > > Network address always appears at network logs massively generated by
> > > > any network facilities,
> > > > and it is a significant information when people analyze their backward
> > > > logs.
> > > >
> > > > I'm working on Apache Arrow format mapping on PostgreSQL.
> > > > http://heterodb.github.io/pg-strom/arrow_fdw/
> > > >
> > > > This extension allows to read Arrow files as if PostgreSQL's table
> > > > using foreign table.
> > > > Data types of Arrow shall be mapped to relevant PostgreSQL's data type
> > > > according to the above
> > > > documentation.
> > > >
> > > > https://www.postgresql.org/docs/current/datatype-net-types.

Re: How about inet4/inet6/macaddr data types?

2019-04-30 Thread Kohei KaiGai
Hello,

This is a proposal to add new logical types to the Apache Arrow data format.

As Melik-Adamyan said, it is quite easy for the Arrow_Fdw module (a PostgreSQL
extension responsible for the data conversion) to convert a 5-byte
FixedSizeBinary to PostgreSQL's inet data type; however, it is not obvious to
readers whether such a column is a network address or just a small binary blob.

https://www.postgresql.org/docs/11/sql-importforeignschema.html
PostgreSQL has the IMPORT FOREIGN SCHEMA command, which allows a foreign table
to be defined according to the schema information of the external data source.
In the case of Arrow_Fdw, we can define a foreign table without manually
listing the columns and data types, as follows:

  IMPORT FOREIGN SCHEMA foo FROM arrow_fdw INTO public
  OPTIONS (file '/opt/nvme/foo.arrow');

In this case, the Schema definition in 'foo.arrow' tells PostgreSQL how many
columns are defined, plus their names, data types and so on. However,
PostgreSQL has no way to decide how to convert a FixedSizeBinary (width=5)
without any metadata support: it may be an 'inet4' data type, but it may also
be a 'char(5)'.

One idea is to utilize the custom_metadata field in the Field node. We may be
able to mark that the column is a network address, not a blob. However, I could
not find a specification for the custom_metadata contents.

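For illustration only, such a marker could be as simple as a key-value pair on the Field (the key name "pg_type" below is hypothetical, not any agreed specification):

import pyarrow as pa

# a 5-byte fixed-width column tagged so that a reader such as Arrow_Fdw
# could map it to PostgreSQL's inet type instead of char(5)
addr = pa.field('addr', pa.binary(5)).with_metadata({'pg_type': 'inet4'})
schema = pa.schema([addr])
print(schema.field('addr').metadata)      # {b'pg_type': b'inet4'}
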
I expect network addresses are widely used in the log-data processing area,
and no small number of applications would support them. If so, it is not too
niche a requirement to justify a new logical data type definition in the
Apache Arrow data format.

Best regards,

2019-04-30 (Tue) 15:13 Micah Kornfield :
>
> Hi KaiGai Kohei,
> Can you clarify if you are looking for advice on modelling these types or
> proposing to add new logical types to the Arrow specification?
>
> Thanks,
> Micah
>
> On Monday, April 29, 2019, Kohei KaiGai  wrote:
>
> > Hello folks,
> >
> > How about your opinions about network address types support in Apache
> > Arrow data format?
> > Network address always appears at network logs massively generated by
> > any network facilities,
> > and it is a significant information when people analyze their backward
> > logs.
> >
> > I'm working on Apache Arrow format mapping on PostgreSQL.
> > http://heterodb.github.io/pg-strom/arrow_fdw/
> >
> > This extension allows to read Arrow files as if PostgreSQL's table
> > using foreign table.
> > Data types of Arrow shall be mapped to relevant PostgreSQL's data type
> > according to the above
> > documentation.
> >
> > https://www.postgresql.org/docs/current/datatype-net-types.html
> > PostgreSQL supports some network address types and operators.
> > For example, we can put a qualifier like:   WHERE addr <<= inet
> > '192.168.1.0/24' , to find out all
> > the records in the subnet of '192.168.1.0/24'.
> >
> > Probably, these three data types are now sufficient for most network
> > logs: inet4, inet6 and macaddr.
> > * inet4 is 32bit + optional 8bit (for netmask) fixed length array
> > * inet6 is 128bit + optional 8bit (for netmask) fixed length array
> > * macaddr is 48bit fixed length array.
> >
> > I don't favor to map the inetX types on flexible length Binary data
> > type, because it takes 32bit offset
> > to indicate 32 or 40bit value, inefficient so much, even though
> > PostgreSQL allows to mix inet4/inet6
> > data types in a same column.
> >
> > Thanks,
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei 
> >



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


How about inet4/inet6/macaddr data types?

2019-04-29 Thread Kohei KaiGai
Hello folks,

What are your opinions about supporting network address types in the Apache
Arrow data format?
Network addresses always appear in the logs massively generated by network
facilities, and they are significant information when people analyze their
historical logs.

I'm working on Apache Arrow format mapping on PostgreSQL.
http://heterodb.github.io/pg-strom/arrow_fdw/

This extension allows Arrow files to be read as if they were PostgreSQL
tables, via a foreign table.
Arrow data types are mapped to the relevant PostgreSQL data types according
to the above documentation.

https://www.postgresql.org/docs/current/datatype-net-types.html
PostgreSQL supports some network address types and operators.
For example, we can add a qualifier like WHERE addr <<= inet '192.168.1.0/24'
to find all the records within the subnet '192.168.1.0/24'.

Probably these three data types are sufficient for most network logs: inet4,
inet6 and macaddr.
* inet4 is a 32-bit address + an optional 8-bit netmask, as a fixed-length array (5 bytes)
* inet6 is a 128-bit address + an optional 8-bit netmask, as a fixed-length array (17 bytes)
* macaddr is a 48-bit address, as a fixed-length array (6 bytes)

I don't favor mapping the inetX types onto the variable-length Binary data
type, because it takes a 32-bit offset to indicate a 32- or 40-bit value,
which is quite inefficient, even though PostgreSQL allows inet4/inet6 values
to be mixed in the same column.

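Expressed with today's building blocks, the fixed-width mapping I have in mind would look roughly like the sketch below (column names are arbitrary; whether these become first-class logical types is exactly the question):

import pyarrow as pa

# proposed widths: inet4 = 4B address + 1B netmask, inet6 = 16B address + 1B netmask,
# macaddr = 6B; all fixed width, so no per-value 32-bit offsets are needed
schema = pa.schema([
    pa.field('src_inet4', pa.binary(5)),
    pa.field('dst_inet6', pa.binary(17)),
    pa.field('src_mac',   pa.binary(6)),
])
print(schema)
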
Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Re: Format specification document?

2019-01-05 Thread Kohei KaiGai
Hello McKinney,

After posting my first message, I found a significant piece of documentation here:
https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#example

Contrary to my expectation, the FlatBuffers mechanism has a quite different
on-memory structure.
So, let's review the Apache Arrow file binary according to that documentation...

00  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
10  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
20  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00

The first 8 bytes are the signature "ARROW1\0\0", and the following 4 bytes
are the length of the metadata (this length prefix is not part of the
FlatBuffer itself). Then we fetch 0x0010 (int) at 0x000c.
It indicates that the root table is at 0x000c + 0x0010.

An int value at 0x001c is 0x000a. It means the vtable structure begins at
0x001c - 0x000a = 0x0012.
0x0012  0a 00  --> vtable length = 10bytes (5 items)
0x0014  0e 00  --> table length = 14 bytes; including the negative
offset (4bytes)
0x0016  06 00  --> table 0x001c + 0x0006 is metadata version (short)
0x0018  05 00  --> table 0x001c + 0x0005 is message header (byte)
0x001a  08 00  --> table 0x001c + 0x0008 is header offset (int)
0x001c  0a 00 00 00  --> negative offset to the vtable

So we know this file uses the Apache Arrow V4 metadata format, and the header
begins at 0x0024 + 0x0010.

20  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
30  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00

Next, 0x0034 is the position of the current table. It indicates that the
vtable is at 0x0034 - 0x000a.

0x002a  0a 00  --> vtable length = 10bytes (5items)
0x002c  0c 00  --> table length = 12 bytes; including the negative offset (4bytes)
0x002e  00 00  --> Schema::endianness is default (0 = little endian)
0x0030  04 00  --> Schema::fields[]
0x0032  08 00  --> Schema::custom_metadata[]

It says Schema::fields[] begins at 0x0038 + 0x03ec = 0x0424, and also says
Schema::custom_metadata[] begins at 0x003c + 0x0004 = 0x0040.

From 0x0040:
40  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
50  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
60  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
70  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
80  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
90  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c

The binary from 0x0060 is a cstring ("pandas\0"), and the binary from
0x006c is also a cstring containing JSON.

The location indicated by 0x0040 holds the number of vector elements,
so this metadata contains one key-value pair.
The next int word points to the sub-table at 0x0050. Its vtable is below:
0x0048  08 00  --> vtable length = 8bytes (4items)
0x004a  0c 00  --> table length  = 12bytes; including the negative
offset (4bytes)
0x004c  04 00  --> cstring offset (key) is at 0x0050 + 0x0004
0x004e  08 00  --> cstring offset (value) is at 0x0050 + 0x0008

The key is at 0x0054 + 0x0008. Here there is an int value, 0x0006: the cstring
length is 6 bytes, and the next byte (0x0060) begins the cstring body ("pandas\0").
The value is at 0x0058 + 0x0010. Here there is an int value, 0x03b4 (= 948 bytes),
and the next byte (0x006c) begins the cstring body ("{"pandas_version" ... ).

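The same walk can be scripted. Below is a minimal sketch that re-traces these offsets with plain struct unpacking (it assumes the pre-0.15 encapsulation used by this file, i.e. no 0xFFFFFFFF continuation marker before the metadata length, and it ignores absent/default fields):

import struct

def read_message_header(path):
    with open(path, 'rb') as f:
        buf = f.read()
    assert buf[:6] == b'ARROW1'                       # 6-byte magic + 2 padding bytes
    meta_len = struct.unpack_from('<i', buf, 8)[0]    # 0x058c in the dump above
    root = 12 + struct.unpack_from('<i', buf, 12)[0]  # Message root table (0x001c)
    vtab = root - struct.unpack_from('<i', buf, root)[0]
    # vtable layout: [vtable size, table size, field offsets...]
    ver_off  = struct.unpack_from('<h', buf, vtab + 4)[0]   # MetadataVersion (short)
    type_off = struct.unpack_from('<h', buf, vtab + 6)[0]   # header type (byte)
    version = struct.unpack_from('<h', buf, root + ver_off)[0]
    header_type = buf[root + type_off]
    return meta_len, version, header_type

print(read_message_header('/tmp/sample.arrow'))   # e.g. (1420, 3, 1) -> V4, Schema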

I didn't follow the entire data file, but this makes things much clearer to me.
Best regards,

2019-01-06 (Sun) 8:50 Wes McKinney :
>
> hi Kohei,
>
> On Thu, Jan 3, 2019 at 7:14 PM Kohei KaiGai  wrote:
> >
> > Hello,
> >
> > I'm now trying to understand the Apache Arrow format for my application.
> > Is there a format specification document including meta-data layout?
> >
> > I checked out the description at:
> > https://github.com/apache/arrow/tree/master/docs/source/format
> > https://github.com/apache/arrow/tree/master/format
> >
> > The format/IPC.rst says an arrow file has the format below:
> >
> > 
> > 
> > 
> > 
> > 
> > 
> >
> > Then, STREAMING FORMAT begins from SCHEMA-message.
> > The message chunk has the format below:
> >
> > 
> > 
> > 
> > 
> >
> > I made an arrow file using pyarrow [*1]. It has the following binary.
> >
> > [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> > 00  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> > 10  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> > 20  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> > 30  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> > 40  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> > 50  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> > 60  70 61 6e 64 61 73 00 00 

Format specification document?

2019-01-03 Thread Kohei KaiGai
Hello,

I'm now trying to understand the Apache Arrow format for my application.
Is there a format specification document that includes the metadata layout?

I checked out the description at:
https://github.com/apache/arrow/tree/master/docs/source/format
https://github.com/apache/arrow/tree/master/format

The format/IPC.rst says an arrow file has the format below:

<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<STREAMING FORMAT>
<FOOTER>
<FOOTER SIZE: int32>
<magic number "ARROW1">

Then, the STREAMING FORMAT begins with a SCHEMA message.
The message chunk has the format below:

<metadata_size: int32>
<metadata_flatbuffer: bytes>
<padding>
<message body>

I made an arrow file using pyarrow [*1]. It contains the following binary data.

[kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
00  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
10  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
20  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
30  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
40  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
50  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
60  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
70  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
80  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
90  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
e0  20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
f0  22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61

The first 64 bits are the signature "ARROW1\0\0", and the next 32 bits are
0x058c (=1420), which is reasonable for the SCHEMA-message length.
The next 32 bits are 0x0010 (=16). It may be the metadata_size of the FlatBuffer.
The IPC.rst does not mention the FlatBuffer metadata, so I tried to skip the
next 16 bytes, expecting the message body to begin at 0x20.
However, the first 16 bits (version) are 0x0001 (=V2), the next byte is 0x03
(= RecordBatch, not Schema!), and the following 64 bits are 0x0a0010(!).
Obviously I'm understanding something incorrectly.

Is there any documentation that introduces the detailed layout of the arrow format?

Thanks,

[*1] Steps to make a sample arrow file
$ python3.5
>>> import pyarrow as pa
>>> import pandas as pd
>>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000", con="postgresql://localhost/postgres")
>>> Y = pa.Table.from_pandas(X)
>>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
>>> f.write_table(Y)
>>> f.close()

-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei