Re: [ANNOUNCEMENT] A query system for BSP processing

Leonidas Fegaras Fri, 14 Sep 2012 09:53:39 -0700

I don't have any plans for supporting such queries, but I would like to try new applications.

Leonidas

On 09/13/2012 06:20 AM, Edward J. Yoon wrote:

Just curious, is there a plan to support sophisticated queries for
unstructured spatial datasets?


On Wed, Sep 12, 2012 at 4:13 AM, Leonidas Fegaras <[email protected]> wrote:

I created a project on Github:
https://github.com/fegaras/mrql.git

Thank you for your help
Leonidas Fegaras


On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:

Yep, a subproject would be the alternative.
In this case we would give you PMC and committer rights so you can
actively work on that.
However this would make the mapreduce part more or less useless, so if you
want to go the hybrid way, feel free to submit an incubation request.

2012/9/7 Suraj Menon <[email protected]>

I think Thomas has a point. How about making it a sub-module/sub-project
of Hama for now? If/when it gains enough community support to make it a
top-level project, you can fork it as a separate project.
I am not completely aware of the procedures and requirements for getting
an external project accepted as a sub-project.
We can look into it if you are ready to take this route.

Could you please send me a link for setting up an open-source Apache
project?
If I am right this is what you are looking for -
http://incubator.apache.org/guides/proposal.html
http://incubator.apache.org/sitemap.html

Good luck,
Suraj

On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut <[email protected]> wrote:

Although I think this is a great project, I think that you will not meet
the requirements.
You need a community and a charter to get it into the incubation.

What about hosting it on Github?

2012/9/7 Leonidas Fegaras <[email protected]>

Yes, this is a great idea. I have used Git on my own server but I don't
know how to do this for ASF. Could you please send me a link for setting
up an open-source Apache project?


On 09/05/2012 10:51 AM, Edward J. Yoon wrote:

If you can open source this then I'm sure the ASF community can help
you and make this software better.

Pls feel free to ask us if you need any assistance donating source
code to the ASF or contributing to the Hama project in the future.

On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <[email protected]> wrote:

Yes sure. I have fixed the bug with the repeat stopping condition but I
have only tested pagerank on my small cluster. I still need to fix the
k-means clustering (it's a special case because you improve a fixed
number of points).
Leonidas


On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:

Shall we work together?


On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <[email protected]> wrote:

Thank you very much for your interest and for testing my system.
It seems that my release was premature: it worked for some random data
but didn't for others. It's a minor logical error that I will try to fix
in the next few days. The problem is with the stopping condition of the
repeat expression that calculates the new pagerank from the old. It must
stop only if ALL peers reach the specified precision. This is done by
having those peers that need to continue send a message to the others to
continue. It seems that now, when all peers agree at the same time, the
program works fine. But if one finishes sooner, instead of continuing
the repeat loop, it runs away to the next BSP step that follows the
repeat, then exits prematurely and the system hangs. The casting errors
are due to the run-away peers executing the wrong BSP steps and reading
the wrong messages. Queries without repeat, though, are OK.
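
The intended stopping rule can be illustrated with a small single-process
simulation in Python. This is only a sketch under stated assumptions, not
MRQL or Hama code: the function name run_repeat and the per-peer error
lists are hypothetical, and message passing and the barrier are modeled
with plain lists. The point it shows is that every peer decides from the
same set of "continue" messages, so a peer that converged early keeps
looping until no peer asks to continue.

```python
def run_repeat(errors_per_step, precision):
    """Simulate a BSP repeat loop and return the number of supersteps run.

    errors_per_step[p][k] is the (hypothetical) local error of peer p
    after iteration k. Each list is assumed to be long enough for the
    run to terminate.
    """
    num_peers = len(errors_per_step)
    step = 0
    while True:
        # Local phase: each peer checks its own precision.
        wants_more = [errors_per_step[p][step] > precision
                      for p in range(num_peers)]

        # Communication phase: a peer that needs another iteration
        # sends a "continue" message to every peer, itself included.
        inbox = [[] for _ in range(num_peers)]
        for sender in range(num_peers):
            if wants_more[sender]:
                for receiver in range(num_peers):
                    inbox[receiver].append(sender)

        # After the barrier, every peer has received the same senders,
        # so all peers reach the same decision.
        decisions = [len(inbox[p]) > 0 for p in range(num_peers)]
        assert all(d == decisions[0] for d in decisions)

        step += 1
        if not decisions[0]:
            return step


if __name__ == "__main__":
    # Peer 0 reaches the precision one iteration before peer 1,
    # but both leave the loop together.
    peer0 = [0.5, 0.05, 0.001, 0.001]
    peer1 = [0.5, 0.2, 0.05, 0.001]
    print(run_repeat([peer0, peer1], precision=0.01))
```

In the example run, peer 0 is within the precision after its third
iteration, yet it still takes part in a fourth superstep because peer 1
asked to continue.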
By the way, I had a problem exchanging large amounts of data during sync
(I discussed this with Thomas). My solution was to break a BSP superstep
into multiple substeps so that each substep can handle a max number of
messages. Of course my program has to collect all messages in a vector
in memory. When the vector is too big, it is spilled to a local file.
This moved the problem from the Hama side to my side and allowed me to
handle larger data, especially in joins. I think this problem of
exchanging large amounts of data during a superstep is currently a
weakness of Hama.
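
A rough Python sketch of the substep-and-spill idea follows. It is an
illustration under assumptions, not the MRQL implementation: the names
SpillingBuffer and exchange_in_substeps, the thresholds, and the use of
pickle with a temporary file are all hypothetical choices made for this
example, and the barrier sync between substeps is only marked by a
comment.

```python
import pickle
import tempfile


class SpillingBuffer:
    """Collects received messages, spilling to a local temporary file
    once more than max_in_memory messages are held in memory."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_file = tempfile.TemporaryFile()
        self.spilled = 0  # number of messages written to the file

    def add(self, msg):
        self.memory.append(msg)
        if len(self.memory) > self.max_in_memory:
            # Always append at the end of the file, even if a reader
            # has moved the file position.
            self.spill_file.seek(0, 2)
            for m in self.memory:
                pickle.dump(m, self.spill_file)
            self.spilled += len(self.memory)
            self.memory = []

    def __iter__(self):
        # Spilled messages come first (they arrived first), followed
        # by whatever is still held in memory.
        self.spill_file.seek(0)
        for _ in range(self.spilled):
            yield pickle.load(self.spill_file)
        yield from self.memory


def exchange_in_substeps(outgoing, max_per_substep, buffer):
    """Deliver outgoing messages in chunks of at most max_per_substep.

    Returns the number of substeps used. On a real cluster a barrier
    sync would separate the substeps; here delivery is a direct call.
    """
    substeps = 0
    for start in range(0, len(outgoing), max_per_substep):
        for msg in outgoing[start:start + max_per_substep]:
            buffer.add(msg)
        substeps += 1  # a barrier sync would happen here
    return substeps


if __name__ == "__main__":
    buf = SpillingBuffer(max_in_memory=3)
    used = exchange_in_substeps(list(range(10)), 4, buf)
    print(used, buf.spilled, list(buf))
```

Capping the messages per substep bounds what the messaging layer must
hold at once, while the spill file bounds the receiver's memory, which
is the trade-off described above: the large-data problem moves from the
framework to the application.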
Leonidas



On 08/24/2012 04:15 AM, Thomas Jungblut wrote:

BTW, should we feature this on our website?

2012/8/24 Thomas Jungblut <[email protected]>

Hi Leonidas!


I have to admit that I have known what is going on (and had to keep
silent), but I have to say: Thank you very much!
This will help many people write BSPs in an easier way.

Of course this is not as fast as native BSP code; Hive and Pig suffer
from the same problem in MR.
But it gives people the opportunity to develop faster and get their code
into production with just a minor time expense.

And I think we will gladly help you improve the BSP part of your
framework. At least I would ;)

Thanks!

2012/8/24 Edward J. Yoon <[email protected]>

Here are a few test results on Oracle BDA (40G/s InfiniBand network).

It seems slower than our PageRank example.

P.S., There are some errors so I couldn't test large-scale.
(java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
sequence ..., etc.)



== 100K nodes and 1M edges ==

*** Using 10 BSP tasks (out of a max 10). Each task will handle about
2383611 bytes of input data.

Run time: 30.384 secs

*** Using 20 BSP tasks (out of a max 20). Each task will handle about
1191805 bytes of input data.

Run time: 24.412 secs

On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon <[email protected]> wrote:

Wow, very interesting. I'm going to install and test on my large
cluster.

On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras <[email protected]> wrote:

Dear Hama users,
I am pleased to announce that the MRQL query processing system can now
evaluate SQL-like queries on a Hama cluster. MRQL is available at:

http://lambda.uta.edu/mrql/

MRQL (the Map-Reduce Query Language) is an SQL-like query language for
large-scale, distributed data analysis. MRQL is powerful enough to
express most common data analysis tasks over many different kinds of
raw data, including hierarchical data and nested collections, such as
XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
Hama. Both modes use Apache's HDFS to read and write their data.

Note that the BSP mode is currently experimental (not fine-tuned yet)
and lacks any fault tolerance (if an error occurs, the entire job must
be restarted). Due to our limited resources, MRQL has only been tested
on a small cluster (7 nodes/28 cores). We compared the BSP mode with
the MR mode by evaluating a pagerank query over a small graph (100K
nodes, 1M edges) and found that BSP mode is about 4.5 times faster
than the MR mode. Please let me know if you'd like to contribute to
this project by testing MRQL on a larger cluster.
Best regards,
Leonidas Fegaras
University of Texas at Arlington

--
Best Regards, Edward J. Yoon
@eddieyoon



