Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-28 Thread Serge Kovaleff
Hi Iliia,

I would take a look into BSON http://bsonspec.org/
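For context, every BSON document begins with a little-endian int32 giving its total byte length, so documents can be streamed back-to-back with no delimiter. Here is a stdlib-only sketch of that length-prefix framing idea (JSON stands in for the payload; this is not a real BSON codec):

```python
import json
import struct

def frame(doc):
    """Length-prefix a JSON-encoded document (BSON-style int32 framing)."""
    body = json.dumps(doc).encode("utf-8")
    # Prefix counts itself, like BSON's leading int32 does.
    return struct.pack("<i", len(body) + 4) + body

def unframe(stream):
    """Split a byte stream back into the documents it was framed from."""
    docs, pos = [], 0
    while pos < len(stream):
        (total,) = struct.unpack_from("<i", stream, pos)
        docs.append(json.loads(stream[pos + 4:pos + total]))
        pos += total
    return docs

stream = b"".join(frame(d) for d in [{"id": 1}, {"id": 2}])
print(unframe(stream))  # [{'id': 1}, {'id': 2}]
```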

Cheers,
Serge Kovaleff


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-28 Thread Maksym Iarmak


Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-28 Thread Maksym Iarmak
Hi guys,

I suggest taking a look at how Swift and Ceph handle such things.




Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-28 Thread Chmouel Boudjnah
Maksym Iarmak wrote:
 I suggest taking a look, how Swift and Ceph do such things.
In Swift (and in Ceph via the radosgw, which implements the Swift API) we
use POST and PUT, which have been working relatively well.

Chmouel



Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-28 Thread Dmitriy Ukhlov

Hi Iliia,
I guess if we are talking about Cassandra batch loading, the fastest way is
to generate SSTables locally and load them into Cassandra via JMX or
sstableloader:

http://www.datastax.com/dev/blog/bulk-loading
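For illustration, once the SSTables have been generated under the usual keyspace/table directory layout, sstableloader is driven from the command line (its -d option takes a comma-separated list of initial contact points). A small sketch of building that invocation from Python; the host names and directory below are hypothetical:

```python
def sstableloader_cmd(initial_hosts, sstable_dir):
    """Build an sstableloader command line.

    sstable_dir must end in .../<keyspace>/<table>, the layout the
    tool expects; -d takes comma-separated contact points.
    """
    return ["sstableloader", "-d", ",".join(initial_hosts), sstable_dir]

cmd = sstableloader_cmd(["node1", "node2"], "/tmp/sstables/magnetodb/data")
# On a host with the Cassandra tools installed, one would then run:
# subprocess.run(cmd, check=True)
print(cmd)
```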

If you want to implement bulk load via the MagnetoDB layer (not directly to
Cassandra), you could use a plain TCP socket and implement your own binary
protocol (using BSON, for example). HTTP is a text protocol, so a TCP socket
lets you avoid the overhead of base64 encoding. In my opinion, combining HTTP
with BSON is a doubtful solution, because you end up with two-phase encoding
and decoding: 1) object to BSON, 2) BSON to base64, 3) base64 to BSON,
4) BSON to object, instead of just 1) object to JSON, 2) JSON to object in
the case of HTTP + JSON.
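The base64 cost is easy to quantify: base64 emits 4 output bytes for every 3 input bytes, roughly 33% inflation before the data even reaches the wire. For instance:

```python
import base64
import os

payload = os.urandom(3 * 1024)          # 3 KiB of binary data
encoded = base64.b64encode(payload)     # text-safe representation

print(len(payload), len(encoded))       # 3072 4096
print(len(encoded) / len(payload))      # 4/3 overhead, ~1.33
```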


HTTP streaming, as far as I know, is an asynchronous flavor of HTTP. You can
expect some performance gain from skipping the generation of an HTTP response
on the server side, and the wait for that response on the client side, for
each chunk. But you still need to send almost the same amount of data, so if
network throughput is your bottleneck, it doesn't help; if the server side is
your bottleneck, it doesn't help either.
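One way to keep the streaming-POST approach while controlling chunk size is to feed the HTTP client a generator of newline-delimited JSON; with the requests library, a generator body is sent with chunked transfer encoding. A sketch (the endpoint URL is made up):

```python
import json

def ndjson_chunks(rows, chunk_bytes=64 * 1024):
    """Yield newline-delimited JSON in roughly chunk_bytes-sized pieces,
    suitable as a streaming POST body."""
    buf, size = [], 0
    for row in rows:
        line = json.dumps(row) + "\n"
        buf.append(line)
        size += len(line)
        if size >= chunk_bytes:
            yield "".join(buf).encode("utf-8")
            buf, size = [], 0
    if buf:
        yield "".join(buf).encode("utf-8")

# With requests, a generator body triggers chunked transfer encoding:
# requests.post("http://magnetodb.example/v1/bulk", data=ndjson_chunks(rows))
```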


Also note that, in any case, MagnetoDB's Cassandra storage currently converts
your data into a CQL query, which is also text. It would be nice to implement
the MagnetoDB BatchWriteItem operation via Cassandra SSTable generation and
loading through sstableloader, but unfortunately, as far as I know, that
functionality is implemented only in the Java world.


--
Best regards,
Dmitriy Ukhlov
Mirantis Inc.



Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-28 Thread Aleksandr Chudnovets
Dmitriy Ukhlov wrote:

 I guess if we are talking about Cassandra batch loading, the fastest way
 is to generate SSTables locally and load them into Cassandra via JMX or
 sstableloader: http://www.datastax.com/dev/blog/bulk-loading

Good idea, Dmitriy. IMHO bulk load is a back-end-specific task, so using
specialized tools seems like a good idea to me.

Regards,
Alexander Chudnovets


Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-28 Thread Romain Hardouin
Bulk loading with sstableloader is blazingly fast (the price to pay is that
it's not portable, of course).
It is also network-efficient thanks to SSTable compression. If the network is
not the limiting factor, then LZ4 will be great.






[openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

2014-03-27 Thread Illia Khudoshyn
Hi, Openstackers,

I'm currently working on adding bulk data load functionality to MagnetoDB.
This functionality implies inserting huge amounts of data (billions of
rows, gigabytes of data). The data being uploaded is a set of JSON's (for
now). The question I'm interested in is a way of data transportation. For
now I do streaming HTTP POST request from the client side with
gevent.pywsgi on the server side.
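On the server side, a minimal WSGI app in that setup could consume the request body incrementally rather than buffering it whole. A sketch assuming newline-delimited JSON input (the storage write is stubbed out; function and field names are illustrative only):

```python
import json

def bulk_load_app(environ, start_response):
    """Consume a streamed bulk-load POST body line by line."""
    stream = environ["wsgi.input"]
    count = 0
    for line in stream:          # reads the body incrementally, not all at once
        if line.strip():
            json.loads(line)     # parse the row; a real app would write it to storage here
            count += 1
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps({"rows_loaded": count}).encode("utf-8")]

# Runs under any WSGI server, e.g. gevent.pywsgi:
# from gevent.pywsgi import WSGIServer
# WSGIServer(("0.0.0.0", 8080), bulk_load_app).serve_forever()
```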

Could anybody suggest any (better?) approach for the transportation, please?
What are best practices for that.

Thanks in advance.

-- 

Best regards,

Illia Khudoshyn,
Software Engineer, Mirantis, Inc.



38, Lenina ave. Kharkov, Ukraine

www.mirantis.com

www.mirantis.ru



Skype: gluke_work

ikhudos...@mirantis.com