Re: [ANNOUNCEMENT] A query system for BSP processing

Edward J. Yoon Thu, 13 Sep 2012 04:20:54 -0700

Just curious, is there a plan to support sophisticated queries for
unstructured spatial datasets?


On Wed, Sep 12, 2012 at 4:13 AM, Leonidas Fegaras <[email protected]> wrote:
> I created a project on Github:
> https://github.com/fegaras/mrql.git
>
> Thank you for your help
> Leonidas Fegaras
>
>
> On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:
>
>> Yep, a subproject would be the alternative.
>> In this case we would give you PMC and committer rights so you can
>> actively work on that.
>> However this would make the mapreduce part more or less useless, so if you
>> want to go the hybrid way, feel free to submit an incubation request.
>>
>> 2012/9/7 Suraj Menon <[email protected]>
>>
>>> I think Thomas has a point. How about making it a sub-module/sub-project
>>> of Hama for now? If/when it gains enough community support to make it a
>>> top-level project, you can fork it as a separate project.
>>> I am not completely aware of the procedures and requirements for getting
>>> an external project accepted as a sub-project.
>>> We can look into it if you are ready to take this route.
>>>
>>>> Could you please send me a link for setting up an open-source Apache
>>>> project?
>>> If I am right this is what you are looking for -
>>> http://incubator.apache.org/guides/proposal.html
>>> http://incubator.apache.org/sitemap.html
>>>
>>> Good luck,
>>> Suraj
>>>
>>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
>>> <[email protected]> wrote:
>>>
>>>> Although I think this is a great project, I think that you will not meet
>>>> the requirements.
>>>> You need a community and a charter to get it into the incubation.
>>>>
>>>> What about hosting it on Github?
>>>>
>>>> 2012/9/7 Leonidas Fegaras <[email protected]>
>>>>
>>>>> Yes, this is a great idea. I have used GIT on my own server but I don't
>>>>> know how to do this for ASF. Could you please send me a link for
>>>>> setting up an open-source Apache project?
>>>>>
>>>>>
>>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>>
>>>>>> If you can open source this then I'm sure the ASF community can help
>>>>>> you and make this software better.
>>>>>>
>>>>>> Pls feel free to ask us if you need any assistance donating source
>>>>>> code to the ASF or contributing to the Hama project in the future.
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes sure. I have fixed the bug with the repeat stopping condition but I
>>>>>>> have only tested pagerank on my small cluster. I still need to fix the
>>>>>>> k-means clustering (it's a special case because you improve a fixed
>>>>>>> number of points).
>>>>>>> Leonidas
>>>>>>>
>>>>>>>
>>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>>
>>>>>>>> Shall we work together?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you very much for your interest and for testing my system.
>>>>>>>>> It seems that my release was premature: it worked for some random data
>>>>>>>>> but didn't for some others. It's a minor logical error that I will try
>>>>>>>>> to fix in the next few days. The problem is with the stopping condition
>>>>>>>>> of the repeat expression that calculates the new pagerank from the old.
>>>>>>>>> It must stop only when ALL peers reach the specified precision. This is
>>>>>>>>> done by having those peers that need to continue send a message to the
>>>>>>>>> others to continue. It seems that now when all peers agree at the same
>>>>>>>>> time, the program works fine. But if one finishes sooner, instead of
>>>>>>>>> continuing the repeat loop, it runs away to the next BSP step that
>>>>>>>>> follows the repeat, then exits prematurely and the system hangs. The
>>>>>>>>> casting errors are due to the run-away peers executing the wrong BSP
>>>>>>>>> steps and reading the wrong messages. Queries without repeat though
>>>>>>>>> are OK.
>>>>>>>>> By the way, I had a problem exchanging large amounts of data during
>>>>>>>>> sync (I discussed this with Thomas). My solution was to break a BSP
>>>>>>>>> superstep into multiple substeps so that each substep can handle a max
>>>>>>>>> number of messages. Of course my program has to collect all messages in
>>>>>>>>> a vector in memory. When the vector is too big, it is spilled to a
>>>>>>>>> local file. This moved the problem from the Hama side to my side and
>>>>>>>>> allowed me to handle larger data, especially in joins. I think this
>>>>>>>>> problem of exchanging large amounts of data during a superstep is
>>>>>>>>> currently a weakness of Hama.
>>>>>>>>> Leonidas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>>
>>>>>>>>>> 2012/8/24 Thomas Jungblut <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have to admit that I have known what is going on (and had to keep
>>>>>>>>>>> silent), but I have to say: Thank you very much!
>>>>>>>>>>> This will help many people write BSPs in an easier way.
>>>>>>>>>>>
>>>>>>>>>>> Of course this is not as fast as the native BSP code; Hive and Pig
>>>>>>>>>>> suffer from the same problems in MR.
>>>>>>>>>>> But it gives people the opportunity to develop faster and get their
>>>>>>>>>>> code in production with just a minor time expense.
>>>>>>>>>>>
>>>>>>>>>>> And I think that we will gladly help you improve the BSP part of
>>>>>>>>>>> your framework. At least I would ;)
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> 2012/8/24 Edward J. Yoon <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Here are a few of my test results on Oracle BDA (40G/s infiniband
>>>>>>>>>>>> network).
>>>>>>>>>>>>
>>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>>
>>>>>>>>>>>> P.S., There are some errors so I couldn't test large-scale.
>>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>>>>>>>>>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>>>>>>>>>>>> sequence ..., etc.)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle about
>>>>>>>>>>>> 2383611 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle about
>>>>>>>>>>>> 1191805 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
>>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Hama users,
>>>>>>>>>>>>>> I am pleased to announce that the MRQL query processing system can
>>>>>>>>>>>>>> now evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query language
>>>>>>>>>>>>>> for large-scale, distributed data analysis. MRQL is powerful enough
>>>>>>>>>>>>>> to express most common data analysis tasks over many different kinds
>>>>>>>>>>>>>> of raw data, including hierarchical data and nested collections, such
>>>>>>>>>>>>>> as XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
>>>>>>>>>>>>>> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using
>>>>>>>>>>>>>> Apache Hama. Both modes use Apache's HDFS to read and write their
>>>>>>>>>>>>>> data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that the BSP mode is currently experimental (not fine-tuned
>>>>>>>>>>>>>> yet) and lacks any fault-tolerance (if an error occurs, the entire
>>>>>>>>>>>>>> job must be restarted). Due to our limited resources, MRQL has only
>>>>>>>>>>>>>> been tested on a small cluster (7-nodes/28-cores). We compared the
>>>>>>>>>>>>>> BSP mode with the MR mode by evaluating a pagerank query over a small
>>>>>>>>>>>>>> graph (100K nodes, 1M edges) and found that BSP mode is about 4.5
>>>>>>>>>>>>>> times faster than the MR mode. Please let me know if you'd like to
>>>>>>>>>>>>>> contribute to this project by testing MRQL on a larger cluster.
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>
>>>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>> @eddieyoon
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
