[jira] Commented: (PIG-824) SQL interface for Pig

2010-08-01 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894455#action_12894455
 ] 

Amr Awadallah commented on PIG-824:
---

I am out of office on vacation and will be slower than usual in
responding to emails. If this is urgent then please call my cell phone
(or send an sms), otherwise I will reply to your email when I get
back.

Thanks for your patience,

-- amr


> SQL interface for Pig
> -
>
> Key: PIG-824
> URL: https://issues.apache.org/jira/browse/PIG-824
> Project: Pig
> Issue Type: New Feature
> Reporter: Olga Natkovich
> Assignee: Thejas M Nair
> Attachments: java-cup-11a-runtime.jar, java-cup-11a.jar, 
> PIG-824.1.patch, PIG-824.binfiles.tar.gz, pig_sql_beta.pdf, pigsql.patch, 
> pigsql_tutorial.txt, SQL_IN_PIG.html, students2.bin, students_attr.bin
>
>
> In the last 18 months PigLatin has gained significant popularity within the
> open source community. Many users like its data flow model, its rich type
> system, and its ability to work with any data available on HDFS or outside it.
> We have also heard from many users that having Pig speak SQL would bring in
> many more users. Having a single system that exposes multiple interfaces is a
> big advantage, as it guarantees consistent semantics, allows custom code
> reuse, and reduces the amount of maintenance. This is especially relevant for
> projects that use both interfaces for different parts of the system. For
> instance, a data warehousing system would have an ETL component that brings
> data into the warehouse and a component that analyzes the data and produces
> reports. PigLatin is uniquely suited for ETL processing, while SQL might be a
> better fit for report generation.
> To start, it would make sense to implement a subset of the SQL92 standard and
> to be as standard compliant as possible. This would include all the standard
> constructs: select, from, where, group-by + having, order by, limit, and join
> (inner + outer). Several extensions, such as support for Pig's UDFs and
> possibly streaming, multiquery, and support for Pig's complex types, would be
> helpful.
> This work is dependent on metadata support outlined in 
> https://issues.apache.org/jira/browse/PIG-823

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-833) Storage access layer

2009-08-12 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742321#action_12742321
 ] 

Amr Awadallah commented on PIG-833:
---

I am out of office until Aug 14th. I will be checking my email
intermittently. If this is urgent then please call my cell phone,
otherwise I will reply to your email when I get back.

Thanks for your patience,

-- amr


> Storage access layer
> 
>
> Key: PIG-833
> URL: https://issues.apache.org/jira/browse/PIG-833
> Project: Pig
> Issue Type: New Feature
> Reporter: Jay Tang
> Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
> PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, 
> TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a
> tabular view of data in Hadoop. Such a layer could free Pig users from
> implementing their own data storage/retrieval code. It should also include a
> columnar storage format, in order to provide fast data projection and
> CPU/space-efficient data serialization, and a schema language to manage
> physical storage metadata. Eventually it could also support predicate
> pushdown for further performance improvement. Initially, this layer could be
> a contrib project in Pig and become a Hadoop subproject later on.




[jira] Commented: (PIG-885) New UDFs for piggybank (Bin, Decode, LookupInFiles, RegexExtract, RegexMatch, HashFVN, DiffDate)

2009-07-13 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730585#action_12730585
 ] 

Amr Awadallah commented on PIG-885:
---

very nice collection, reminds me of myna :)

-- amr


> New UDFs for piggybank (Bin, Decode, LookupInFiles, RegexExtract, RegexMatch, 
> HashFVN, DiffDate)
> 
>
> Key: PIG-885
> URL: https://issues.apache.org/jira/browse/PIG-885
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.3.0
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-885.patch
>
>
> Bunch of UDFs:
> 1. Bin -- Converts a continuous value into discrete values
> 2. Decode -- Converts a given attribute or expression into another string 
> value, based on the value of the source attribute
> 3. LookupInFiles -- Checks for the existence of an expression in a series of 
> text files
> 4. RegexExtract and RegexMatch -- Similar to Perl regexes
> 5. HashFVN -- An implementation of the FNV hash
> 6. DiffDate -- Calculates the number of days between two dates
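
For readers unfamiliar with item 5, a minimal Python sketch of a 32-bit FNV-1a hash follows. The patch itself is not shown in this thread, so the variant (FNV-1a rather than FNV-1), the 32-bit width, and the function name `fnv1a_32` are assumptions made for illustration; this is not a description of what the HashFVN UDF actually does.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Return the 32-bit FNV-1a hash of data (hypothetical helper)."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte                            # FNV-1a: XOR the byte in first
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # then multiply, wrapped to 32 bits
    return h
```

Swapping the two steps inside the loop (multiply first, then XOR) would give the older FNV-1 variant instead.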




[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-23 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723412#action_12723412
 ] 

Amr Awadallah commented on PIG-856:
---


Please keep in mind that when running on a loaded system (i.e. with many 
concurrent jobs) the fair-scheduler will have a better chance of allocating 
mappers with local data to process your job if you have more replicas (not sure 
if the capacity scheduler also does that). So, while setting replicas to less 
than 3 might improve performance when yours is the only job running in the 
system, it will harm performance when you are sharing the cluster with many 
others.

Not to mention that this also affects speculative execution, etc.

-- amr

> PERFORMANCE: reduce number of replicas
> --
>
> Key: PIG-856
> URL: https://issues.apache.org/jira/browse/PIG-856
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.3.0
> Reporter: Olga Natkovich
>
> Pig currently uses the default number of replicas for data passed between MR 
> jobs, which is 3. Given the temporary nature of this data, we should never 
> need more than 2 and should explicitly set it, both to improve performance 
> and to be nicer to the name node.




[jira] Commented: (PIG-823) Hadoop Metadata Service

2009-06-10 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718313#action_12718313
 ] 

Amr Awadallah commented on PIG-823:
---

sounds good, thanks for elaborating.

> Hadoop Metadata Service
> ---
>
> Key: PIG-823
> URL: https://issues.apache.org/jira/browse/PIG-823
> Project: Pig
> Issue Type: New Feature
> Reporter: Olga Natkovich
>
> This JIRA is created to track development of a metadata system for Hadoop. 
> The goal of the system is to allow users and applications to register data 
> stored on HDFS, search for the data available on HDFS, and associate metadata 
> such as schema, statistics, etc. with a particular data unit or a data set 
> stored on HDFS. The initial goal is to provide a fairly generic, low level 
> abstraction that any user or application on HDFS can use to store and 
> retrieve metadata. Over time, higher level abstractions closely tied to 
> particular applications or tools can be developed.
> Over time, it would make sense for the metadata service to become a 
> subproject within Hadoop. For now, the proposal is to make it a contrib to 
> Pig since Pig SQL is likely to be the first user of the system.




[jira] Commented: (PIG-823) Hadoop Metadata Service

2009-06-09 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717933#action_12717933
 ] 

Amr Awadallah commented on PIG-823:
---

+1 to unified meta-data service.

-- amr






[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig

2009-06-02 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715680#action_12715680
 ] 

Amr Awadallah commented on PIG-826:
---

neat.

> DISTINCT as "Function/Operator" rather than statement/operator - High Level 
> Pig
> ---
>
> Key: PIG-826
> URL: https://issues.apache.org/jira/browse/PIG-826
> Project: Pig
> Issue Type: New Feature
> Reporter: David Ciemiewicz
>
> In SQL, a user would think nothing of doing something like:
> {code}
> select
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count
> from
>     server_logs;
> {code}
> But in Pig, we'd need to do something like the following.  And this is about 
> the most compact version I could come up with.
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> DistinctUsers = distinct (foreach Logs generate user);
> DistinctCountries = distinct (foreach Logs generate country);
> DistinctUrls = distinct (foreach Logs generate url);
> DistinctUsersCount = foreach (group DistinctUsers all) generate
>     group, COUNT(DistinctUsers) as user_count;
> DistinctCountriesCount = foreach (group DistinctCountries all) generate
>     group, COUNT(DistinctCountries) as country_count;
> DistinctUrlCount = foreach (group DistinctUrls all) generate
>     group, COUNT(DistinctUrls) as url_count;
> AllDistinctCounts = cross
>     DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;
> Report = foreach AllDistinctCounts generate
>     DistinctUsersCount::user_count,
>     DistinctCountriesCount::country_count,
>     DistinctUrlCount::url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> It would be good if there was a higher level version of Pig that permitted 
> code to be written as:
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> Report = overall Logs generate
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> I do want this in Pig and not as SQL.  I'd expect High Level Pig to generate 
> Lower Level Pig.
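
To pin down what the requested `COUNT(DISTINCT(...))` form would compute, here is a small Python sketch of the semantics. It is not Pig code and not a proposed implementation; the helper `distinct_counts` and the sample rows are hypothetical, and null handling is ignored.

```python
from typing import Dict, Iterable, Mapping, Sequence


def distinct_counts(rows: Iterable[Mapping[str, str]],
                    fields: Sequence[str]) -> Dict[str, int]:
    """Count distinct values per field in a single pass over rows."""
    seen = {field: set() for field in fields}
    for row in rows:
        for field in fields:
            seen[field].add(row[field])
    return {f"{field}_count": len(values) for field, values in seen.items()}


# Hypothetical sample data standing in for the 'log' input.
logs = [
    {"user": "alice", "country": "US", "url": "/a"},
    {"user": "bob", "country": "US", "url": "/a"},
    {"user": "alice", "country": "DE", "url": "/b"},
]
report = distinct_counts(logs, ["user", "country", "url"])
```

The result is a single row with one count per column, which is what the long Pig script above assembles through three separate distinct/group/count pipelines and a cross.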




[jira] Commented: (PIG-6) Addition of Hbase Storage Option In Load/Store Statement

2009-06-01 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715385#action_12715385
 ] 

Amr Awadallah commented on PIG-6:
-

Any progress on this?

> Addition of Hbase Storage Option In Load/Store Statement
> 
>
> Key: PIG-6
> URL: https://issues.apache.org/jira/browse/PIG-6
> Project: Pig
> Issue Type: New Feature
> Environment: all environments
> Reporter: Edward J. Yoon
> Fix For: 0.2.0
>
> Attachments: hbase-0.18.1-test.jar, hbase-0.18.1.jar, m34813f5.txt, 
> PIG-6.patch, PIG-6_V01.patch
>
>
> It needs to be able to load a full table from HBase. (Maybe difficult? I'm 
> not sure yet.)
> Also, as described below, 
> it needs to compose an abstract 2D table containing only certain data 
> filtered from the HBase array structure, using an arbitrary delimited query. 
> {code}
> A = LOAD table('hbase_table');
> or
> B = LOAD table('hbase_table') Using HbaseQuery('Query-delimited by attributes 
> & timestamp') as (f1, f2[, f3]);
> {code}
> Once testing is done on my local machines, 
> I will clarify the grammar and give more examples to help explain 
> more storage options. 
> Any advice is welcome.
