Just curious, is there a plan to support sophisticated queries for unstructured spatial datasets?
On Wed, Sep 12, 2012 at 4:13 AM, Leonidas Fegaras <[email protected]> wrote:
> I created a project on Github:
> https://github.com/fegaras/mrql.git
>
> Thank you for your help
> Leonidas Fegaras
>
> On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:
>
>> Yep, a subproject would be the alternative.
>> In this case we would give you PMC and committer rights so you can
>> actively work on that.
>> However this would make the mapreduce part more or less useless, so if
>> you want to go the hybrid way, feel free to submit an incubation request.
>>
>> 2012/9/7 Suraj Menon <[email protected]>
>>
>>> I think Thomas has a point. How about making it a sub-module/sub-project
>>> of Hama for now? If/When it gains enough community support to make it a
>>> top level project, you can fork it as a separate project.
>>> I am not completely aware of the procedures and requirements for getting
>>> an external project as a sub-project.
>>> We can look into it if you are ready to take this route.
>>>
>>>> Could you please send me a link for setting up an open-source Apache
>>>> project?
>>>
>>> If I am right this is what you are looking for -
>>> http://incubator.apache.org/guides/proposal.html
>>> http://incubator.apache.org/sitemap.html
>>>
>>> Good luck,
>>> Suraj
>>>
>>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut <[email protected]> wrote:
>>>
>>>> Although I think this is a great project, I think that you will not meet
>>>> the requirements.
>>>> You need a community and a charter to get it into the incubation.
>>>>
>>>> What about hosting it on Github?
>>>>
>>>> 2012/9/7 Leonidas Fegaras <[email protected]>
>>>>
>>>>> Yes, this is a great idea. I have used GIT on my own server but I don't
>>>>> know how to do this for ASF. Could you please send me a link for
>>>>> setting up an open-source Apache project?
>>>>>
>>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>>
>>>>>> If you can open source this then I'm sure the ASF community can help
>>>>>> you and make this software better.
>>>>>>
>>>>>> Pls feel free to ask us if you need any assistance donating source
>>>>>> code to the ASF or contributing to the Hama project in the future.
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <[email protected]> wrote:
>>>>>>
>>>>>>> Yes sure. I have fixed the bug with the repeat stopping condition but
>>>>>>> I have only tested pagerank on my small cluster. I still need to fix
>>>>>>> the k-means clustering (it's a special case because you improve a
>>>>>>> fixed number of points).
>>>>>>> Leonidas
>>>>>>>
>>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>>
>>>>>>>> Shall we work together?
>>>>>>>>
>>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thank you very much for your interest and for testing my system.
>>>>>>>>> It seems that my release was premature: It worked for some random
>>>>>>>>> data but didn't for some others. It's a minor logical error that I
>>>>>>>>> will try to fix in the next few days. The problem is with the
>>>>>>>>> stopping condition of the repeat expression that calculates the new
>>>>>>>>> pagerank from the old. It must stop if ALL peers reach the specified
>>>>>>>>> precision. This is done by having those peers that need to continue
>>>>>>>>> send a message to others to continue. It seems that now when all
>>>>>>>>> peers agree at the same time, the program works fine. But if one
>>>>>>>>> finishes sooner, instead of continuing the repeat loop, it runs away
>>>>>>>>> to the next BSP step that follows the repeat, then exits prematurely
>>>>>>>>> and the system hangs. The casting errors are due to the run-away
>>>>>>>>> peers executing the wrong BSP steps reading wrong messages. Queries
>>>>>>>>> without repeat though are OK.
>>>>>>>>> By the way, I had a problem exchanging a large amount of data during
>>>>>>>>> sync (I discussed this with Thomas). My solution was to break a BSP
>>>>>>>>> superstep into multiple substeps so that each substep can handle a
>>>>>>>>> max number of messages. Of course my program has to collect all
>>>>>>>>> messages in a vector in memory. When the vector is too big, it is
>>>>>>>>> spilled to a local file. This moved the problem from the Hama side
>>>>>>>>> to my side and allowed me to handle larger data, especially in
>>>>>>>>> joins. I think this problem of exchanging a large amount of data
>>>>>>>>> during a superstep is currently a weakness of Hama.
>>>>>>>>> Leonidas
>>>>>>>>>
>>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>>
>>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>>
>>>>>>>>>> 2012/8/24 Thomas Jungblut <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>>
>>>>>>>>>>> I have to admit that I have known what is going on (and had to
>>>>>>>>>>> keep silent), but I have to say: Thank you very much!
>>>>>>>>>>> This will help many people write BSPs in an easier way.
>>>>>>>>>>>
>>>>>>>>>>> Of course this is not as fast as the native BSP code, Hive and
>>>>>>>>>>> Pig suffer from the same problems in MR.
>>>>>>>>>>> But it gives people the opportunity to develop faster and get
>>>>>>>>>>> their code in production with just a minor time expense.
>>>>>>>>>>>
>>>>>>>>>>> And I think that we will gladly help you improve the BSP part
>>>>>>>>>>> of your framework. At least I would do ;)
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> 2012/8/24 Edward J. Yoon <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Here are a few test results on Oracle BDA (40G/s infiniband
>>>>>>>>>>>> network).
>>>>>>>>>>>>
>>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>>
>>>>>>>>>>>> P.S., There are some errors so I couldn't test large-scale.
>>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast
>>>>>>>>>>>> to hadoop.mrql.Inv and java.lang.Error: Cannot clear a
>>>>>>>>>>>> non-materialized sequence ..., etc.)
>>>>>>>>>>>>
>>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle
>>>>>>>>>>>> about 2383611 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle
>>>>>>>>>>>> about 1191805 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on my
>>>>>>>>>>>>> large cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Hama users,
>>>>>>>>>>>>>> I am pleased to announce that the MRQL query processing system
>>>>>>>>>>>>>> can now evaluate SQL-like queries on a Hama cluster. MRQL is
>>>>>>>>>>>>>> available at:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query
>>>>>>>>>>>>>> language for large-scale, distributed data analysis. MRQL is
>>>>>>>>>>>>>> powerful enough to express most common data analysis tasks
>>>>>>>>>>>>>> over many different kinds of raw data, including hierarchical
>>>>>>>>>>>>>> data and nested collections, such as XML data. MRQL can run in
>>>>>>>>>>>>>> two modes: in MR (Map-Reduce) mode using Apache Hadoop and in
>>>>>>>>>>>>>> BSP (Bulk Synchronous Parallel) mode using Apache Hama. Both
>>>>>>>>>>>>>> modes use Apache's HDFS to read and write their data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that the BSP mode is currently experimental (not
>>>>>>>>>>>>>> fine-tuned yet) and lacks any fault-tolerance (if an error
>>>>>>>>>>>>>> occurs, the entire job must be restarted). Due to our limited
>>>>>>>>>>>>>> resources, MRQL has only been tested on a small cluster
>>>>>>>>>>>>>> (7-nodes/28-cores). We compared the BSP mode with the MR mode
>>>>>>>>>>>>>> by evaluating a pagerank query over a small graph (100K nodes,
>>>>>>>>>>>>>> 1M edges) and found that BSP mode is about 4.5 times faster
>>>>>>>>>>>>>> than the MR mode. Please let me know if you'd like to
>>>>>>>>>>>>>> contribute to this project by testing MRQL on a larger
>>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>> @eddieyoon
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>> @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
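
For readers following the repeat stopping-condition discussion quoted above, here is a minimal sketch of the rule Leonidas describes: the loop may stop only when ALL peers have reached the precision, and any peer that still needs another round tells the others to continue. This is a plain-Python simulation, not MRQL or Hama code; the function name bsp_repeat, its parameters, and the "continue" message format are hypothetical and chosen only for illustration.

```python
def bsp_repeat(states, step, converged, max_supersteps=1000):
    """Simulate a BSP repeat loop that ends only when every peer has converged.

    states    -- one local value per simulated peer (hypothetical setup)
    step      -- function computing a peer's new value from its old one
    converged -- function (old, new) -> bool, the local precision test
    Returns (final_states, supersteps_executed).
    """
    states = list(states)
    n = len(states)
    for superstep in range(1, max_supersteps + 1):
        # Local computation phase: every peer performs one iteration.
        new_states = [step(s) for s in states]

        # Communication phase: a peer that has NOT converged sends a
        # "continue" message to every peer (itself included).
        inboxes = [[] for _ in range(n)]
        for sender in range(n):
            if not converged(states[sender], new_states[sender]):
                for receiver in range(n):
                    inboxes[receiver].append(("continue", sender))

        # Barrier: messages become visible to the receivers only here.
        states = new_states

        # Each peer decides from its own inbox. Every inbox holds the same
        # messages, so all peers take the same branch and none runs ahead
        # into the step that follows the repeat.
        decisions = [len(inbox) > 0 for inbox in inboxes]
        assert len(set(decisions)) <= 1, "peers disagree on continuing"
        if not any(decisions):
            return states, superstep
    raise RuntimeError("no agreement within max_supersteps")


if __name__ == "__main__":
    # Three peers that converge at different times (halving toward zero).
    final, rounds = bsp_repeat(
        [1.0, 8.0, 64.0],
        step=lambda x: x / 2,
        converged=lambda old, new: abs(old - new) < 0.1,
    )
    print(rounds, final)
```

In this sketch a peer that converges early keeps iterating until the slowest peer is done, which is the behaviour the quoted message says was missing when a peer "runs away" to the next BSP step.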
