[jira] [Updated] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Chen updated PIG-3404: Attachment: PIG-3404.patch Patch for reference Improve Pig to ignore bad files or inaccessible files or folders Key: PIG-3404 URL: https://issues.apache.org/jira/browse/PIG-3404 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.11.2 Reporter: Jerry Chen Labels: Rhino Attachments: PIG-3404.patch There are use cases in Pig: * A directory is used as the input of a load operation. It is possible that one or more files in that directory are bad (for example, corrupted, or containing bad data caused by compression). * A directory is used as the input of a load operation. The current user may not have permission to access some subdirectories or files of that directory. The current Pig implementation aborts the whole job in such cases. It would be useful to have an option that allows the job to continue, ignoring the bad or inaccessible files/folders without aborting, and ideally logging or printing a warning for each such error or violation. This requirement is not trivial: for the large data sets of big analytics applications, it is not always possible to sort out the good data in advance; ignoring a few bad files may be the better choice in such situations. We propose an “ignore bad files” flag to address this problem. AvroStorage and related file formats in Pig already have this flag, but it does not cover all the cases mentioned above. We would improve PigStorage and the related text formats to support this new flag, and also improve AvroStorage and related facilities to support the concept completely. The flag is per-storage (for example, PigStorage or AvroStorage) and can be set for each load operation separately. The value of this flag is false if it is not explicitly set.
Ideally, we can also provide a global Pig parameter that forces the default value to true for all load functions even when it is not explicitly set in the LOAD statement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
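A load using the proposed per-storage flag might look roughly like the following sketch (the constructor option string and the global parameter name are purely illustrative; neither is a committed Pig API):

```pig
-- hypothetical: per-load flag passed as a storage constructor option
A = LOAD '/data/input' USING PigStorage('\t', '-ignoreBadFiles')
    AS (id:chararray, cnt:long);

-- hypothetical: global parameter flipping the default to true for all loads
-- pig -Dpig.load.ignore.bad.files=true script.pig
```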
[jira] [Commented] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777186#comment-13777186 ] Jerry Chen commented on PIG-3404: - Hi Park, sorry for the late response; I am glad that we can discuss this topic further in this JIRA. As mentioned in the JIRA description, we are taking the approach of an “ignore bad files” flag for each storage, so different storages can be controlled separately instead of through a global flag. On the other hand, in our use cases we also want to handle the situation where the current user has no permission to access some subdirectories of the input directory, which can be regarded conceptually as a “bad directory”. Another point is the ignore ratio. We currently take the even simpler approach of “ignore all” or “ignore nothing” via a flag. As you mentioned, PIG-3059 uses a threshold to control how many bad input splits can be ignored, which is a good thing. The question, though, is: how many real-world cases need a ratio other than 0 or 1? I went through the patch in PIG-3059, trying to understand how the ratio is enforced globally in a distributed MapReduce task environment. It seems that in InputErrorTracker.java you use a local variable (numErrors) for error tracking, so the ratio appears to be tracked per task. I may be missing something, but it would be very helpful if you could explain. Thank you for providing the helpful information; let's continue the discussion.
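The per-task tracking question raised in the comment above can be illustrated with a small stand-alone sketch (plain Java, not PIG-3059's actual InputErrorTracker code): each task counts records and errors in local variables, so the configured threshold bounds the error fraction per task rather than globally across the job.

```java
// Simplified, illustrative per-task error tracker (not Pig's actual class).
// A small grace count avoids failing on the very first error before the
// ratio is meaningful.
public class ErrorTracker {
    private final double maxErrorFraction; // 0.0 = ignore nothing, 1.0 = ignore all
    private final long minErrors;          // grace count before the ratio is enforced
    private long records = 0;
    private long errors = 0;

    public ErrorTracker(double maxErrorFraction, long minErrors) {
        this.maxErrorFraction = maxErrorFraction;
        this.minErrors = minErrors;
    }

    /** Call once per record read by this task. */
    public void incRecords() { records++; }

    /** Returns true if the error is tolerable; false means the task should abort. */
    public boolean incErrors() {
        errors++;
        if (errors <= minErrors) return true;
        return records > 0 && (double) errors / records <= maxErrorFraction;
    }
}
```

Because `records` and `errors` live inside one task, two tasks with very different error rates each apply the threshold only to their own input, which is the behavior the comment asks about.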
[jira] [Created] (PIG-3481) Unable to check name message
zhenyingshan created PIG-3481: - Summary: Unable to check name message Key: PIG-3481 URL: https://issues.apache.org/jira/browse/PIG-3481 Project: Pig Issue Type: Bug Reporter: zhenyingshan I am trying to run a Pig script from a Java program; I get the following error sometimes, but not all the time. Here is the snippet of the program and the exception I've got. I have the /user/root directory created in HDFS.
--
URL path = getClass().getClassLoader().getResource("cfg/concatall.py");
LOG.info("CDNResolve2Hbase: reading concatall.py file from " + path.toString());
pigServer.getPigContext().getProperties().setProperty(PigContext.JOB_NAME, "CDNResolve2Hbase");
pigServer.registerQuery("A = load '" + inputPath + "' using PigStorage('\\t') as (ip:chararray, do:chararray, cn:chararray, cdn:chararray, firsttime:chararray, updatetime:chararray);");
pigServer.registerCode(path.toString(), "jython", "myfunc");
pigServer.registerQuery("B = foreach A generate myfunc.concatall('" + extractTimestamp(inputPath) + "', ip, do, cn), cdn, SUBSTRING(firsttime, 0, 8);");
outputTable = "hbase://" + outputTable;
ExecJob job = pigServer.store("B", outputTable, "org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:cdn d:dtime')");
-
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:529)
    at com.hugedata.cdnserver.datanalysis.CDNResolve2Hbase.execute(Unknown Source)
    at com.hugedata.cdnserver.DatAnalysis.cdnResolve2Hbase(Unknown Source)
    at com.hugedata.cdnserver.task.HandleDomainNameLogTask.execute(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
    at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:264)
    at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:203)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:520)
Caused by: Failed to parse: Pig script failed to parse: line 6, column 4 pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
    ... 15 more
Caused by: line 6, column 4 pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:835)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3236)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1315)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:799)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:517)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:392)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
    ... 16 more
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:207)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:128)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:138)
    at org.apache.pig.parser.QueryParserUtils.getCurrentDir(QueryParserUtils.java:91)
    at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:827)
    ... 22 more
Caused by: java.io.IOException: Filesystem closed
[jira] [Updated] (PIG-3481) Unable to check name message
[ https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenyingshan updated PIG-3481: -- Affects Version/s: 0.11.1
[jira] [Updated] (PIG-3481) Unable to check name message
[ https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenyingshan updated PIG-3481: -- Environment: hadoop 1.0.3, hbase 0.94.1
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777343#comment-13777343 ] Brian ONeill commented on PIG-3453: --- First question: for DISTINCT within Storm, do you believe we should have a sliding time window within which we perform the distinct? There is mention that it will be stateful (since we need to keep a set in memory with which to de-dupe). Do we intend to leverage the concept of Trident State for this? (That may make sense: implement State, then on each commit/flush perform the de-duping.) Thoughts? Implement a Storm backend to Pig Key: PIG-3453 URL: https://issues.apache.org/jira/browse/PIG-3453 Project: Pig Issue Type: New Feature Reporter: Pradeep Gollakota Labels: storm There is a lot of interest around implementing a Storm backend to Pig for stream processing. The proposal and initial discussions can be found at https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777355#comment-13777355 ] Brian ONeill commented on PIG-3453: --- Also, we could perform DISTINCT using a backend storage mechanism (like Cassandra), where we first check storage to see if the tuple exists and emit only if it does not. If we first route all identical tuples to a single bolt and then check from there, that may work (eliminating the potential for two bolts to check for existence at the same time). Using backend storage would allow someone to perform a true DISTINCT operation.
[jira] [Resolved] (PIG-3481) Unable to check name message
[ https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenyingshan resolved PIG-3481. --- Resolution: Fixed Fix Version/s: 0.11.1 It turns out my static PigServer was shared across threads, and PigServer is not thread-safe. I switched to a ThreadLocal and the problem does not appear anymore.
--
private static ThreadLocal<PigServer> pigServer = new ThreadLocal<PigServer>();

public static PigServer getServer() {
    if (pigServer.get() == null) {
        try {
            printClassPath();
            Properties prop = SystemUtils.getCfg();
            pigServer.set(new PigServer(ExecType.MAPREDUCE, prop));
            return pigServer.get();
        } catch (Exception e) {
            LOG.error("error in starting PigServer:", e);
            return null;
        }
    }
    return pigServer.get();
}
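The fix above relies on the standard ThreadLocal pattern; a minimal stand-alone illustration (using StringBuilder as a stand-in for the non-thread-safe PigServer) shows why each thread ends up with its own lazily created instance:

```java
// Each thread that calls get() receives its own instance, created on first
// access; no instance is ever shared between threads.
public class PerThreadHolder {
    private static final ThreadLocal<StringBuilder> holder =
        ThreadLocal.withInitial(StringBuilder::new); // stand-in for new PigServer(...)

    public static StringBuilder get() {
        return holder.get(); // lazily initialized per thread
    }
}
```

This is the same shape as the resolved code: the explicit null check plus `set(...)` there is equivalent to the `withInitial` supplier here.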
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777408#comment-13777408 ] Jacob Perkins commented on PIG-3453: [~boneill], I haven't thought too hard about distinct yet myself. Since I'm really only thinking about Trident and not Storm in general, doing a distinct strictly within a batch is one straightforward option. Unfortunately, from a user standpoint, I think this would be (a) minimally useful and (b) confusing. Instead we could implement something like an approximate distinct using an LRU cache? Maybe even go so far as to implement a SQF (which I haven't read in its entirety yet): http://www.vldb.org/pvldb/vol6/p589-dutta.pdf? Also, what about order by? In what sense is an unbounded stream ordered? I absolutely do not want to tie the storm/trident execution engine to an external data store such as Cassandra. Pig is supposed to be backend agnostic. Maybe the -default- tap and sink can be Kafka (tap) and Cassandra (sink). Finally, it should be possible to run a pig script in storm local mode. And [~pradeepg26], I'm actually well on the way to having nested foreach working. The way I'm working it now, each LogicalExpressionPlan becomes its own Trident BaseFunction. It actually works quite nicely for now. I haven't gotten to aggregates yet. What I probably won't implement for the POC is the tap and sink.
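The LRU-cache idea mentioned above can be sketched in a few lines of plain Java (an illustration of the approach, not proposed Pig code): remember only the most recently seen keys, so the distinct is exact within the retained window and approximate beyond it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Approximate DISTINCT over a stream: duplicates arriving while their key
// is still in the LRU window are suppressed; very old repeats slip through.
public class ApproxDistinct<T> {
    private final Map<T, Boolean> seen;

    public ApproxDistinct(final int capacity) {
        // access-ordered LinkedHashMap that evicts the least recently used key
        this.seen = new LinkedHashMap<T, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<T, Boolean> e) {
                return size() > capacity;
            }
        };
    }

    /** Returns true if the tuple should be emitted (not seen recently). */
    public boolean offer(T key) {
        return seen.put(key, Boolean.TRUE) == null;
    }
}
```

Bounded memory is the trade-off: capacity caps state per task, at the cost of occasionally re-emitting a key that was evicted.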
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777421#comment-13777421 ] Brian ONeill commented on PIG-3453: --- [~thedatachef] Good points/suggestions. I'll have a look at both LRU and SQF. RE: Cassandra. Sorry, I didn't mean to imply we would create a hard dependency. I meant we could leverage the Trident State abstraction. (My team happens to own the storm-cassandra Trident State implementation: https://github.com/hmsonline/storm-cassandra.) We would query the State to see if the tuple was processed. You could just as easily plug in any persistence mechanism (e.g. https://github.com/nathanmarz/trident-memcached).
[jira] [Updated] (PIG-3458) ScalarExpression lost with multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3458: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Patch committed to branch-0.12 and trunk. Thanks Mark and Daniel for your feedback! ScalarExpression lost with multiquery optimization -- Key: PIG-3458 URL: https://issues.apache.org/jira/browse/PIG-3458 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Assignee: Koji Noguchi Fix For: 0.12.0 Attachments: pig-3458-v01.patch, pig-3458-v02.patch Our user reported an issue where their scalar result goes missing when there are two store statements. {noformat} A = load 'test1.txt' using PigStorage('\t') as (a:chararray, count:long); B = group A all; C = foreach B generate SUM(A.count) as total; store C into 'deleteme6_C' using PigStorage(','); Z = load 'test2.txt' using PigStorage('\t') as (a:chararray, id:chararray); Y = group Z by id; X = foreach Y generate group, C.total; store X into 'deleteme6_X' using PigStorage(','); {noformat} Inputs: {noformat} pig cat test1.txt a 1 b 2 c 8 d 9 pig cat test2.txt a z b y c x {noformat} Result X should contain the total count of '20', but instead it's empty: {noformat} pig cat deleteme6_C/part-r-0 20 pig cat deleteme6_X/part-r-0 x, y, z, {noformat} This works if we take out the first store C statement.
[jira] [Updated] (PIG-3458) ScalarExpression lost with multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3458: -- Component/s: parser
[jira] [Updated] (PIG-19) A=load causes parse error
[ https://issues.apache.org/jira/browse/PIG-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-19: -- Fix Version/s: (was: 0.12.0) 0.13.0 A=load causes parse error - Key: PIG-19 URL: https://issues.apache.org/jira/browse/PIG-19 Project: Pig Issue Type: Bug Components: grunt Reporter: Olga Natkovich Assignee: Gianmarco De Francisci Morales Priority: Minor Fix For: 0.13.0 Parser expects spaces around =. This should be a minor change in src/org/apache/pig/tools/grunt/GruntParser.jj -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
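The fix asked for here is tolerance for missing whitespace around `=` in an alias assignment. A minimal Python sketch of the behavior (the helper name is hypothetical; the real change would live in GruntParser.jj):

```python
def split_assignment(line):
    """Split a Grunt statement like "A=load ..." into (alias, rest),
    tolerating missing spaces around '='. Hypothetical helper, not the
    actual GruntParser.jj grammar change."""
    alias, _, rest = line.partition("=")
    return alias.strip(), rest.strip()

print(split_assignment("A=load 'data';"))    # ('A', "load 'data';")
print(split_assignment("A = load 'data';"))  # ('A', "load 'data';")
```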
[jira] [Updated] (PIG-1151) Date Conversion + Arithmetic UDFs
[ https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1151: Fix Version/s: (was: 0.12.0) 0.13.0 Date Conversion + Arithmetic UDFs Key: PIG-1151 URL: https://issues.apache.org/jira/browse/PIG-1151 Project: Pig Issue Type: New Feature Reporter: sam rash Priority: Minor Labels: patch Fix For: 0.13.0 Attachments: patch_dateudf.tar.gz I would like to offer up some very simple date UDFs I have that wrap JodaTime (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and operate on ISO8601 date strings (for piggybank). Please advise if these are appropriate. 1. Date arithmetic: takes an input string such as 2009-01-01T13:43:33.000Z (and partial ones such as 2009-01-02) and a timespan (as millis or as string shorthand), and returns an ISO8601 string that adjusts the input date by the specified timespan. DatePlus(long timeMs); // + or - number works, is the # of millis DatePlus(String timespan); // 10m = 10 minutes, 1h = 1 hour, 1172 ms, etc DateMinus(String timespan); // propose explicit minus when using string shorthand for time periods 2. Date comparison (when you don't have full strings that you can use string compare with): DateIsBefore(String dateString); // true if lhs is before rhs DateIsAfter(String dateString); // true if lhs is after rhs 3. Date trunc functions: take partial ISO8601 strings and truncate to: toMinute(String dateString); toHour(String dateString); toDay(String dateString); toWeek(String dateString); toMonth(String dateString); toYear(String dateString); If any/all are helpful, I'm happy to contribute to pig.
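The timespan shorthand and truncation semantics proposed above can be sketched in Python. The unit table and function names are assumptions based only on the examples in the proposal (`10m`, `1h`, `toDay`), not the actual piggybank API:

```python
import re

def timespan_to_millis(spec):
    """Parse shorthand like '10m', '1h', or '1172ms' into milliseconds.
    Units are assumptions from the examples in the proposal."""
    units = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}
    m = re.fullmatch(r"([+-]?\d+)(ms|s|m|h)", spec)
    if not m:
        raise ValueError("unrecognized timespan: " + spec)
    return int(m.group(1)) * units[m.group(2)]

def to_day(date_string):
    """Truncate a (partial) ISO8601 string to the day,
    e.g. '2009-01-01T13:43:33.000Z' -> '2009-01-01'."""
    return date_string[:10]

print(timespan_to_millis("10m"))               # 600000
print(timespan_to_millis("1h"))                # 3600000
print(to_day("2009-01-01T13:43:33.000Z"))      # 2009-01-01
```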
[jira] [Updated] (PIG-1967) deprecate current syntax for casting relation as scalar, to use explicit cast to tuple
[ https://issues.apache.org/jira/browse/PIG-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1967: Fix Version/s: (was: 0.12.0) 0.13.0 deprecate current syntax for casting relation as scalar, to use explicit cast to tuple Key: PIG-1967 URL: https://issues.apache.org/jira/browse/PIG-1967 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0, 0.9.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.13.0 Attachments: PIG-1967.0.patch When the feature was added in PIG-1434, there was a proposal to cast the relation to a tuple in order to use it as a scalar, but for some reason this cast was not required in the implementation. See - https://issues.apache.org/jira/browse/PIG-1434?focusedCommentId=12888449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12888449 The current syntax, which does not need this cast, seems to lead to a lot of confusion among users, who end up using this feature unintentionally. This usually happens because the user refers to the bag column(s) in the output of (co)group using a wrong name, which happens to be another relation. Often, users realize the mistake only at runtime, and new users will have trouble figuring out what went wrong. I think we should support the use of the cast as originally proposed, and deprecate the current syntax. The warning issued when the deprecated syntax is used is likely to help users realize that they have unintentionally used this feature.
[jira] [Updated] (PIG-1919) order-by on bag gives error only at runtime
[ https://issues.apache.org/jira/browse/PIG-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1919: Fix Version/s: (was: 0.12.0) 0.13.0 order-by on bag gives error only at runtime --- Key: PIG-1919 URL: https://issues.apache.org/jira/browse/PIG-1919 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Thejas M Nair Assignee: Jonathan Coveney Fix For: 0.13.0 Attachments: PIG-1919-0.patch, PIG-1919-1.patch, PIG-1919-1.patch Order-by on a bag or tuple should give error at query compile time, instead of giving an error at runtime. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2409) Pig show wrong tracking URL for hadoop 2
[ https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2409: Summary: Pig show wrong tracking URL for hadoop 2 (was: Tracking URL for hadoop 23 does not show up) Pig show wrong tracking URL for hadoop 2 Key: PIG-2409 URL: https://issues.apache.org/jira/browse/PIG-2409 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.9.2, 0.10.0, 0.11 Reporter: Daniel Dai Assignee: Daniel Dai Priority: Minor Labels: hadoop023 Fix For: 0.12.0 Pig used to show a tracking url for hadoop job: More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201112071119_0001 This information does not show up in hadoop 23. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2122) Parameter Substitution doesn't work in the Grunt shell
[ https://issues.apache.org/jira/browse/PIG-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2122: Fix Version/s: (was: 0.12.0) 0.13.0 Parameter Substitution doesn't work in the Grunt shell Key: PIG-2122 URL: https://issues.apache.org/jira/browse/PIG-2122 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0, 0.8.1, 0.12.0 Reporter: Grant Ingersoll Assignee: Daniel Dai Priority: Minor Fix For: 0.13.0 Simple param substitution and things like %declare (as copied out of the docs) don't work in the grunt shell. Start Pig with: bin/pig -x local -p time=FOO {quote} foo = LOAD '/user/grant/foo.txt' AS (a:chararray, b:chararray, c:chararray); Y = foreach foo generate *, '$time'; dump Y; {quote} Output: {quote} 2011-06-13 20:22:24,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 (1 2 3,,,$time) (4 5 6,,,$time) {quote} The same script, stored in junk.pig and run as bin/pig -x local -p time=FOO junk.pig, works: {quote} 2011-06-13 20:23:38,864 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 (1 2 3,,,FOO) (4 5 6,,,FOO) {quote} Also, things like %default don't work (nor does %declare): {quote} grunt %default DATE '20090101'; 2011-06-13 20:18:19,943 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered PATH %default at line 1, column 1. Was expecting one of: EOF cat ... fs ... sh ... cd ... cp ... copyFromLocal ... copyToLocal ... dump ... describe ... aliases ... explain ... help ... kill ... ls ... mv ... mkdir ... pwd ... quit ... register ... rm ... rmf ... set ... illustrate ... run ... exec ... scriptDone ... ... EOL ... ; ... Details at logfile: /Users/grant.ingersoll/projects/apache/pig/release-0.8.1/pig_1308002917912.log {quote}
[jira] [Updated] (PIG-2247) Pig parser does not detect multiple arguments with the same name passed to macro
[ https://issues.apache.org/jira/browse/PIG-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2247: Fix Version/s: (was: 0.12.0) 0.13.0 Pig parser does not detect multiple arguments with the same name passed to macro Key: PIG-2247 URL: https://issues.apache.org/jira/browse/PIG-2247 Project: Pig Issue Type: Bug Components: parser Affects Versions: 0.9.0 Reporter: Alan Gates Assignee: Johnny Zhang Priority: Minor Fix For: 0.13.0 Attachments: PIG-2247.patch.txt Pig accepts a macro like {code} define simple_macro(in_relation, min_gpa, min_gpa) returns c { b = filter $in_relation by gpa = $min_gpa; $c = foreach b generate age, name; }; {code} This should produce an error.
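The missing validation amounts to rejecting duplicate names in a macro's parameter list. A minimal Python sketch of that check (the function name is hypothetical, not Pig's actual parser code):

```python
def check_macro_params(params):
    """Return the duplicated parameter names the parser should reject.
    A sketch of the missing validation, not Pig's parser itself."""
    seen, dups = set(), []
    for p in params:
        if p in seen and p not in dups:
            dups.append(p)
        seen.add(p)
    return dups

# The macro from this report declares min_gpa twice.
print(check_macro_params(["in_relation", "min_gpa", "min_gpa"]))  # ['min_gpa']
```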
[jira] [Updated] (PIG-2409) Pig show wrong tracking URL for hadoop 2
[ https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2409: Fix Version/s: (was: 0.12.0) 0.13.0
[jira] [Commented] (PIG-2409) Pig show wrong tracking URL for hadoop 2
[ https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777891#comment-13777891 ] Daniel Dai commented on PIG-2409: - Hadoop 2 shows the right tracking url now. However, Pig will print a redundant message which contains a wrong url. We need to remove it in Pig on Hadoop 2.
[jira] [Updated] (PIG-2495) Using merge JOIN from a HBaseStorage produces an error
[ https://issues.apache.org/jira/browse/PIG-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2495: Fix Version/s: (was: 0.12.0) 0.13.0 Using merge JOIN from a HBaseStorage produces an error -- Key: PIG-2495 URL: https://issues.apache.org/jira/browse/PIG-2495 Project: Pig Issue Type: Bug Affects Versions: 0.9.1, 0.9.2 Environment: HBase 0.90.3, Hadoop 0.20-append Reporter: Kevin Lion Assignee: Kevin Lion Fix For: 0.13.0 Attachments: PIG-2495.patch To increase performance of my computation, I would like to use a merge join between two tables to increase speed computation but it produces an error. Here is the script: {noformat} start_sessions = LOAD 'hbase://startSession.bea00.dev.ubithere.com' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:infoid meta:imei meta:timestamp', '-loadKey') AS (sid:chararray, infoid:chararray, imei:chararray, start:long); end_sessions = LOAD 'hbase://endSession.bea00.dev.ubithere.com' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:timestamp meta:locid', '-loadKey') AS (sid:chararray, end:long, locid:chararray); sessions = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge'; STORE sessions INTO 'sessionsTest' USING PigStorage ('*'); {noformat} Here is the result of this script : {noformat} 2012-01-30 16:12:43,920 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig_1327939963919.log 2012-01-30 16:12:44,025 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://lxc233:9000 2012-01-30 16:12:44,102 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: lxc233:9001 2012-01-30 16:12:44,760 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: MERGE_JION 2012-01-30 16:12:44,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation 
threshold: 100 optimistic? false 2012-01-30 16:12:44,982 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2 2012-01-30 16:12:44,982 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2012-01-30 16:12:45,001 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job 2012-01-30 16:12:45,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:host.name=lxc233.machine.com 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.6.0_22 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Sun Microsystems Inc. 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.22/jre 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client
[jira] [Updated] (PIG-3021) Split results missing records when there are null values in the column comparison
[ https://issues.apache.org/jira/browse/PIG-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3021: --- Fix Version/s: (was: 0.12.0) 0.13.0 Moving it to 0.13 since the 0.12 branch is already cut. Split results missing records when there are null values in the column comparison Key: PIG-3021 URL: https://issues.apache.org/jira/browse/PIG-3021 Project: Pig Issue Type: Bug Affects Versions: 0.10.0 Reporter: Chang Luo Assignee: Cheolsoo Park Fix For: 0.13.0 Attachments: PIG-3021-2.patch, PIG-3021-3.patch, PIG-3021.patch Suppose a(x, y); split a into b if x==y, c otherwise; One would expect the union of b and c to be a. However, if x or y is null, the record won't appear in either b or c. To work around this, I have to change the statement to the following: split a into b if x is not null and y is not null and x==y, c otherwise;
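The surprise here comes from null-comparison semantics: a comparison involving null is neither true nor false, so a null row satisfies neither the IF branch nor OTHERWISE. A Python model of the observed behavior (a sketch of the semantics, not Pig's implementation):

```python
def split_with_nulls(rows):
    """Model SPLIT a INTO b IF x == y, c OTHERWISE with SQL-style null
    semantics: a comparison with null is neither true nor false, so the
    row lands in neither output. Sketch of the reported behavior."""
    b, c = [], []
    for x, y in rows:
        cond = None if (x is None or y is None) else (x == y)
        if cond is True:
            b.append((x, y))
        elif cond is False:
            c.append((x, y))
        # cond is None: dropped from both -- the surprise in this report
    return b, c

rows = [(1, 1), (1, 2), (None, 3)]
b, c = split_with_nulls(rows)
print(b, c)  # [(1, 1)] [(1, 2)] -- the (None, 3) row vanished
```

The workaround in the report adds explicit `is not null` guards so null rows fall into the OTHERWISE branch instead.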
[jira] [Updated] (PIG-2461) Simplify schema syntax for cast
[ https://issues.apache.org/jira/browse/PIG-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2461: Fix Version/s: (was: 0.12.0) 0.13.0 Simplify schema syntax for cast Key: PIG-2461 URL: https://issues.apache.org/jira/browse/PIG-2461 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.10.0 Reporter: Daniel Dai Fix For: 0.13.0 The syntax for casting into a bag/tuple is confusing: {code} b = foreach a generate (bag{tuple(int,double)})bag0; {code} It's pretty hard for users to get right. We should make the keywords bag/tuple optional.
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Fix Version/s: (was: 0.12.0) 0.13.0 Moving it to 0.13 for now. Add New Reserved Keywords To The Pig Docs Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.13.0 The following are reserved keywords in Pig that are not included in the 0.11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords): cube, dense, rank, returns, rollup, void Please add any that I may have overlooked.
[jira] [Updated] (PIG-2446) Fix map input bytes for hadoop 20.203+
[ https://issues.apache.org/jira/browse/PIG-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2446: Fix Version/s: (was: 0.12.0) 0.13.0 Fix map input bytes for hadoop 20.203+ Key: PIG-2446 URL: https://issues.apache.org/jira/browse/PIG-2446 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.9.2, 0.10.0, 0.11 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.13.0 As of hadoop 20.203+, HDFS_BYTES_READ changed its meaning: it no longer means the size of the input files, but the total hdfs bytes read by the job. Pig needs a way to get the map input bytes to retain the old behavior. TestPigRunner.testGetHadoopCounters tests that and is temporarily disabled for hadoop 203+.
[jira] [Updated] (PIG-2584) Command line arguments for Pig script
[ https://issues.apache.org/jira/browse/PIG-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2584: Fix Version/s: (was: 0.12.0) 0.13.0 Command line arguments for Pig script Key: PIG-2584 URL: https://issues.apache.org/jira/browse/PIG-2584 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Priority: Minor Fix For: 0.13.0 We did that for Jython embedded scripts. It is also useful in Pig scripts themselves: command line: pig a.pig student.txt output a.pig: a = load '$1' as (a0, a1); store a into '$2';
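The proposed behavior is plain positional substitution: `$1`, `$2`, ... in the script are replaced by command-line arguments. A Python sketch of that substitution step (the helper name is hypothetical):

```python
import re

def substitute_args(script, args):
    """Replace $1, $2, ... in a Pig script with positional command-line
    arguments, as proposed in this issue. Hypothetical helper."""
    return re.sub(r"\$(\d+)", lambda m: args[int(m.group(1)) - 1], script)

script = "a = load '$1' as (a0, a1); store a into '$2';"
print(substitute_args(script, ["student.txt", "output"]))
# a = load 'student.txt' as (a0, a1); store a into 'output';
```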
[jira] [Updated] (PIG-2521) explicit reference to namenode path with streaming results in an error
[ https://issues.apache.org/jira/browse/PIG-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2521: Fix Version/s: (was: 0.12.0) 0.13.0 explicit reference to namenode path with streaming results in an error -- Key: PIG-2521 URL: https://issues.apache.org/jira/browse/PIG-2521 Project: Pig Issue Type: Bug Affects Versions: 0.9.2 Reporter: Araceli Henley Priority: Minor Fix For: 0.13.0 I set this to minor because this test works with client side tables and with old style references. :: /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327441396/dotNext_baseline_15.pig :: THIS TEST FAILS. It uses an explicit reference to namenode1 (hdfs://namenode1.domain.com:8020) define CMD `perl PigStreamingDepend.pl` input(stdin) ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl', '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm'); A = load 'hdfs://namdenode1.domain.com:8020/user/hadoopqa/pig/tests/data'; B = stream A through `perl PigStreaming.pl`; C = stream B through CMD as (name, age, gpa); D = foreach C generate name, age; store D into 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out'; fs -cp hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out /user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out :: /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327441396/dotNext_baseline_1.pig :: This test PASSES. 
It uses an explicit reference to NN1(hdfs://namenode1.domain.com:8020) for load and store a = load 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); store a into 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out' ; fs -cp hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out /user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out THE REMAINING TESTS ARE IDENTICAL EXCEPT FOR THE FILE REFERNCE: explicit vs mount point :: /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327433551/dotNext_baseline_15.pig :: This test PASSES. Its the baseline for the test, it uses old style references. define CMD `perl PigStreamingDepend.pl` input(stdin) ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl', '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm'); A = load '/user/hadoopqa/pig/tests/data'; B = stream A through `perl PigStreaming.pl`; C = stream B through CMD as (name, age, gpa); D = foreach C generate name, age; store D into '/user/hadoopqa/pig/out/hadoopqa.1327433551/dotNext_baseline_15.out'; :: grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327431567/dotNext_baseline_15.pig :: This test PASSES. It uses a mount point to namenode 1( /data1 is a mount point for hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/tests/data). 
define CMD `perl PigStreamingDepend.pl` input(stdin) ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl', '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm'); A = load '/data1'; B = stream A through `perl PigStreaming.pl`; C = stream B through CMD as (name, age, gpa); D = foreach C generate name, age; store D into '/out1/user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out'; fs -cp /out1/user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out /user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2537) Output from flatten with a null tuple input generating data inconsistent with the schema
[ https://issues.apache.org/jira/browse/PIG-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2537: Fix Version/s: (was: 0.12.0) 0.13.0 Output from flatten with a null tuple input generating data inconsistent with the schema Key: PIG-2537 URL: https://issues.apache.org/jira/browse/PIG-2537 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0, 0.9.0 Reporter: Xuefu Zhang Assignee: Daniel Dai Fix For: 0.13.0 Attachments: PIG-2537-1.patch, PIG-2537-2.patch, PIG-2537-3.patch For the following pig script: grunt A = load 'file' as ( a : tuple( x, y, z ), b, c ); grunt B = foreach A generate flatten( $0 ), b, c; grunt describe B; B: {a::x: bytearray,a::y: bytearray,a::z: bytearray,b: bytearray,c: bytearray} Alias B has a clear schema. However, on the backend, if $0 happens to be null for a row, the output tuple becomes something like (null, b_value, c_value), which is obviously inconsistent with the schema. The behavior is confirmed by inspection of the Pig code. This inconsistency corrupts data because of position shifts. The expected output row should be something like (null, null, null, b_value, c_value).
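The expected semantics are that flattening a null tuple of known arity should emit that many nulls, so later fields keep their positions. A Python model of the expected behavior described above (a sketch, not Pig's actual POForEach code):

```python
def flatten_field(value, arity):
    """Flatten a tuple-typed field. When the tuple is null, emit `arity`
    nulls so subsequent fields keep their positions; emitting a single
    null (the behavior reported here) shifts the rest of the row."""
    return list(value) if value is not None else [None] * arity

row = (None, "b_value", "c_value")   # $0 is a null tuple of arity 3
out = flatten_field(row[0], 3) + [row[1], row[2]]
print(out)  # [None, None, None, 'b_value', 'c_value']
```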
[jira] [Updated] (PIG-2599) Mavenize Pig
[ https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2599: Fix Version/s: (was: 0.12.0) 0.13.0 Mavenize Pig Key: PIG-2599 URL: https://issues.apache.org/jira/browse/PIG-2599 Project: Pig Issue Type: New Feature Components: build Reporter: Daniel Dai Labels: gsoc2013 Fix For: 0.13.0 Attachments: maven-pig.1.zip Switch Pig build system from ant to maven. This is a candidate project for Google summer of code 2013. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2013 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2625) Allow use of JRuby for control flow
[ https://issues.apache.org/jira/browse/PIG-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2625: Fix Version/s: (was: 0.12.0) 0.13.0 Allow use of JRuby for control flow --- Key: PIG-2625 URL: https://issues.apache.org/jira/browse/PIG-2625 Project: Pig Issue Type: New Feature Reporter: Jonathan Coveney Fix For: 0.13.0 Much like people can use jython for iterative computation, it'd be great to use JRuby for the same -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Karn updated PIG-2417: - Attachment: PIG-2417-unicode.patch Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation. - Key: PIG-2417 URL: https://issues.apache.org/jira/browse/PIG-2417 Project: Pig Issue Type: Improvement Affects Versions: 0.12.0 Reporter: Jeremy Karn Assignee: Jeremy Karn Fix For: 0.12.0 Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, PIG-2417-9.patch, PIG-2417-e2e.patch, PIG-2417-unicode.patch, streaming2.patch, streaming3.patch, streaming.patch The goal of Streaming UDFs is to allow users to easily write UDFs in scripting languages with no JVM implementation or a limited JVM implementation. The initial proposal is outlined here: https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs. In order to implement this we need new syntax to distinguish a streaming UDF from an embedded JVM UDF. I'd propose something like the following (although I'm not sure 'language' is the best term to be using): {code}define my_streaming_udfs language('python') ship('my_streaming_udfs.py'){code} We'll also need a language-specific controller script that gets shipped to the cluster which is responsible for reading the input stream, deserializing the input data, passing it to the user written script, serializing that script output, and writing that to the output stream. Finally, we'll need to add a StreamingUDF class that extends evalFunc. This class will likely share some of the existing code in POStream and ExecutableManager (where it make sense to pull out shared code) to stream data to/from the controller script. One alternative approach to creating the StreamingUDF EvalFunc is to use the POStream operator directly. 
This would involve inserting the POStream operator instead of the POUserFunc operator whenever we encountered a streaming UDF while building the physical plan. This approach seemed problematic because there would need to be a lot of changes in order to support POStream in all of the places we want to be able use UDFs (For example - to operate on a single field inside of a for each statement). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
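The controller script's core loop described above (read the input stream, deserialize, call the user's function, serialize, write back) can be sketched in Python. The tab-separated framing and the tuple handling here are assumptions for illustration; the real protocol is defined by the StreamingUDF implementation:

```python
def controller(lines, udf):
    """Sketch of the proposed controller loop: deserialize each
    tab-separated input line, apply the user-written function, and
    serialize the result back to the output stream."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        result = udf(*fields)
        if not isinstance(result, tuple):
            result = (result,)
        yield "\t".join(str(r) for r in result)

# Hypothetical user UDF adding two integer fields.
out = list(controller(["3\t4\n"], lambda a, b: int(a) + int(b)))
print(out)  # ['7']
```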
[jira] [Updated] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2591: Fix Version/s: (was: 0.12.0) 0.13.0 Unit tests should not write to /tmp but respect java.io.tmpdir -- Key: PIG-2591 URL: https://issues.apache.org/jira/browse/PIG-2591 Project: Pig Issue Type: Bug Components: tools Reporter: Thomas Weise Assignee: Jarek Jarcec Cecho Fix For: 0.13.0 Attachments: bugPIG-2591.patch, PIG-2495.patch Several tests use /tmp but should derive temporary file location from java.io.tmpdir to avoid side effects (java.io.tmpdir is already set to a test run specific location in build.xml) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2595) BinCond only works inside parentheses
[ https://issues.apache.org/jira/browse/PIG-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2595: Fix Version/s: (was: 0.12.0) 0.13.0 BinCond only works inside parentheses - Key: PIG-2595 URL: https://issues.apache.org/jira/browse/PIG-2595 Project: Pig Issue Type: Bug Reporter: Daniel Dai Fix For: 0.13.0 Not sure if we have a Jira for this before. This script does not work: {code} a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double, instate:chararray); b = foreach a generate name, instate=='true'?gpa:gpa+1; dump b; {code} If we put bincond into parentheses, it works {code} a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double, instate:chararray); b = foreach a generate name, (instate=='true'?gpa:gpa+1); dump b; {code} Exception: ERROR 1200: file 40.pig, line 2, column 36 mismatched input '==' expecting SEMI_COLON org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. 
file 40.pig, line 2, column 36 mismatched input '==' expecting SEMI_COLON at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1541) at org.apache.pig.PigServer.registerQuery(PigServer.java:541) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:945) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:392) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) at org.apache.pig.Main.run(Main.java:599) at org.apache.pig.Main.main(Main.java:153) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: Failed to parse: file 40.pig, line 2, column 36 mismatched input '==' expecting SEMI_COLON at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:226) at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:168) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1590) ... 14 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2624) Handle recursive inclusion of scripts in JRuby UDFs
[ https://issues.apache.org/jira/browse/PIG-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2624: Fix Version/s: (was: 0.12.0) 0.13.0 Handle recursive inclusion of scripts in JRuby UDFs --- Key: PIG-2624 URL: https://issues.apache.org/jira/browse/PIG-2624 Project: Pig Issue Type: Improvement Affects Versions: 0.10.0, 0.11 Reporter: Jonathan Coveney Labels: JRuby Fix For: 0.13.0 Currently, if one script requires another (via Ruby's require), the dependency won't be properly handled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777997#comment-13777997 ] Jeremy Karn commented on PIG-2417: -- I can't reproduce the problem, but I think PIG-2417-unicode.patch should fix the encoding issue. Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation. - Key: PIG-2417 URL: https://issues.apache.org/jira/browse/PIG-2417 Project: Pig Issue Type: Improvement Affects Versions: 0.12.0 Reporter: Jeremy Karn Assignee: Jeremy Karn Fix For: 0.12.0 Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, PIG-2417-9.patch, PIG-2417-e2e.patch, PIG-2417-unicode.patch, streaming2.patch, streaming3.patch, streaming.patch The goal of Streaming UDFs is to allow users to easily write UDFs in scripting languages with no JVM implementation or a limited JVM implementation. The initial proposal is outlined here: https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs. In order to implement this, we need new syntax to distinguish a streaming UDF from an embedded JVM UDF. I'd propose something like the following (although I'm not sure 'language' is the best term to be using): {code}define my_streaming_udfs language('python') ship('my_streaming_udfs.py'){code} We'll also need a language-specific controller script that gets shipped to the cluster which is responsible for reading the input stream, deserializing the input data, passing it to the user-written script, serializing that script's output, and writing that to the output stream. Finally, we'll need to add a StreamingUDF class that extends EvalFunc. This class will likely share some of the existing code in POStream and ExecutableManager (where it makes sense to pull out shared code) to stream data to/from the controller script. 
One alternative approach to creating the StreamingUDF EvalFunc is to use the POStream operator directly. This would involve inserting the POStream operator instead of the POUserFunc operator whenever we encountered a streaming UDF while building the physical plan. This approach seemed problematic because there would need to be a lot of changes in order to support POStream in all of the places we want to be able to use UDFs (for example, to operate on a single field inside a FOREACH statement). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2628) Allow in line scripting UDF definitions
[ https://issues.apache.org/jira/browse/PIG-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2628: Fix Version/s: (was: 0.12.0) 0.13.0 Allow in line scripting UDF definitions --- Key: PIG-2628 URL: https://issues.apache.org/jira/browse/PIG-2628 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 For small udfs in scripting languages, it may be cumbersome to force users to make a script, put it on the classpath, ship it, etc. It would be great to support a syntax that allows people to declare UDFs in line (essentially, to define a snippet of code that will be interpreted as a scriptlet) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2630) Issue with setting b = a;
[ https://issues.apache.org/jira/browse/PIG-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2630: Fix Version/s: (was: 0.12.0) 0.13.0 Issue with setting b = a; --- Key: PIG-2630 URL: https://issues.apache.org/jira/browse/PIG-2630 Project: Pig Issue Type: Bug Affects Versions: 0.10.0, 0.11 Reporter: Jonathan Coveney Fix For: 0.13.0 The following gives an error: {code} a = load 'thing' as (x:int); b = a; c = join a by x, b by x; {code} Error: {code} 2012-04-03 14:02:47,434 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: line 14, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! {code} No issue with the following, however {code} a = load 'thing' as (x:int); b = foreach a generate *; c = join a by x, b by x; {code} oh and here is the log: {code} $ cat pig_1333487146863.log Pig Stack Trace --- ERROR 1200: Pig script failed to parse: line 3, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! Failed to parse: Pig script failed to parse: line 3, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! 
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:182) at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1566) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1539) at org.apache.pig.PigServer.registerQuery(PigServer.java:541) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:945) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:392) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69) at org.apache.pig.Main.run(Main.java:535) at org.apache.pig.Main.main(Main.java:153) Caused by: line 3, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! at org.apache.pig.parser.LogicalPlanBuilder.buildJoinOp(LogicalPlanBuilder.java:363) at org.apache.pig.parser.LogicalPlanGenerator.join_clause(LogicalPlanGenerator.java:11441) at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1491) at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:791) at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:509) at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:384) at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:175) ... 10 more {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Review Request 14274: PIG-2672 Optimize the use of DistributedCache
On Sept. 24, 2013, 6:30 a.m., Cheolsoo Park wrote: trunk/test/org/apache/pig/test/TestJobControlCompiler.java, line 161 https://reviews.apache.org/r/14274/diff/1/?file=355177#file355177line161 The following line is missing in the RB diff but it's in the attached patch: properties.setProperty(PigConstants.PIG_SHARED_CACHE_ENABLED_KEY, true); Just pointing it out. I realized that we do not need the PIG_SHARED_CACHE_ENABLED_KEY property for this. So, I removed this unnecessary property from the RB. I will attach the patch with the changes soon. - Aniket --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/#review26342 --- On Sept. 21, 2013, 1:21 a.m., Aniket Mokashi wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/ --- (Updated Sept. 21, 2013, 1:21 a.m.) Review request for pig, Cheolsoo Park, DanielWX DanielWX, Dmitriy Ryaboy, Julien Le Dem, and Rohini Palaniswamy. Bugs: PIG-2672 https://issues.apache.org/jira/browse/PIG-2672 Repository: pig Description --- added jar.cache.location option Diffs - trunk/src/org/apache/pig/PigConstants.java 1525188 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1525188 trunk/src/org/apache/pig/impl/PigContext.java 1525188 trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1525188 trunk/test/org/apache/pig/test/TestJobControlCompiler.java 1525188 Diff: https://reviews.apache.org/r/14274/diff/ Testing --- Thanks, Aniket Mokashi
Re: Review Request 14274: PIG-2672 Optimize the use of DistributedCache
On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote: trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java, line 1492 https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1492 If it is an HDFS path, use it as is and do not ship it to the jar cache. That will also save time and hash checks. Currently, PigServer#registerJar localizes all the jars. So, this would need some more refactoring before we can do it. I will try to solve this in a separate jira. On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote: trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java, line 1495 https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1495 Since the name of the file on hdfs is different from that of the actual file, create a symlink with the actual filename. Some users might depend on the actual file name. Rohini Palaniswamy wrote: One case I see is python scripts (jython UDFs) which do imports based on the file name. It would be the same for the other scripting languages that we support. It would be good to run the full unit and e2e tests with your patch before going for a commit. Maybe I should avoid renaming the files and just put them under /a/b/c/abcdefsha1/udf.jar. On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote: trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java, line 1509 https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1509 First do a file size comparison before calculating the checksum, for better efficiency. A size check would require stat calls to the NN; the checksum, being local, should be quicker than that. - Aniket --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/#review26364 --- On Sept. 21, 2013, 1:21 a.m., Aniket Mokashi wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/ --- (Updated Sept. 21, 2013, 1:21 a.m.) 
Review request for pig, Cheolsoo Park, DanielWX DanielWX, Dmitriy Ryaboy, Julien Le Dem, and Rohini Palaniswamy. Bugs: PIG-2672 https://issues.apache.org/jira/browse/PIG-2672 Repository: pig Description --- added jar.cache.location option Diffs - trunk/src/org/apache/pig/PigConstants.java 1525188 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1525188 trunk/src/org/apache/pig/impl/PigContext.java 1525188 trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1525188 trunk/test/org/apache/pig/test/TestJobControlCompiler.java 1525188 Diff: https://reviews.apache.org/r/14274/diff/ Testing --- Thanks, Aniket Mokashi
[jira] [Updated] (PIG-2631) Pig should allow self joins
[ https://issues.apache.org/jira/browse/PIG-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2631: Fix Version/s: (was: 0.12.0) 0.13.0 Pig should allow self joins --- Key: PIG-2631 URL: https://issues.apache.org/jira/browse/PIG-2631 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 This doesn't even have to be optimized, and can still involve a double scan of the data, but there is no reason why the following should work: {code} a = load 'thing' as (x:int); b = join a by x, (foreach a generate *) by x; {code} while this does not: {code} a = load 'thing' as (x:int); b = join a by x, a by x; {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2633) Create a SchemaBag which generates a Bag with a known Schema via code gen
[ https://issues.apache.org/jira/browse/PIG-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2633: Fix Version/s: (was: 0.12.0) 0.13.0 Create a SchemaBag which generates a Bag with a known Schema via code gen - Key: PIG-2633 URL: https://issues.apache.org/jira/browse/PIG-2633 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Assignee: Jonathan Coveney Fix For: 0.13.0 This is related to PIG-2632. The idea is to also extend that and create a known version based on a given Schema. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778025#comment-13778025 ] Rohini Palaniswamy commented on PIG-2417: - +1 for PIG-2417-unicode.patch. That worked. Committed to 0.12 and trunk. Thanks Jeremy. Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation. - Key: PIG-2417 URL: https://issues.apache.org/jira/browse/PIG-2417 Project: Pig Issue Type: Improvement Affects Versions: 0.12.0 Reporter: Jeremy Karn Assignee: Jeremy Karn Fix For: 0.12.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2834) MultiStorage requires unused constructor argument
[ https://issues.apache.org/jira/browse/PIG-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2834: Fix Version/s: (was: 0.12.0) 0.13.0 MultiStorage requires unused constructor argument - Key: PIG-2834 URL: https://issues.apache.org/jira/browse/PIG-2834 Project: Pig Issue Type: Improvement Components: data Affects Versions: 0.10.0, 0.11 Environment: Linux Reporter: Danny Antonetti Priority: Trivial Labels: newbie Fix For: 0.13.0 Attachments: MultiStorage.patch Each constructor in org.apache.pig.piggybank.storage.MultiStorage requires a constructor argument 'parentPathStr' that has no meaningful usage. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2880) Pig current releases lack a UDF charAt. This UDF returns the char value at the specified index.
[ https://issues.apache.org/jira/browse/PIG-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778036#comment-13778036 ] Daniel Dai commented on PIG-2880: - [~sunitha muralidharan], are you still working on it? Pig current releases lack a UDF charAt. This UDF returns the char value at the specified index. -- Key: PIG-2880 URL: https://issues.apache.org/jira/browse/PIG-2880 Project: Pig Issue Type: New Feature Components: piggybank Reporter: Sabir Ayappalli Labels: patch Fix For: 0.12.0 Attachments: CharAt.java.patch Current Pig releases lack a charAt UDF. This UDF returns the char value at the specified index. An index ranges from 0 to length() - 1. The first char value of the sequence is at index 0, the next at index 1, and so on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
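The semantics described in the issue (valid indexes run from 0 to length() - 1, mirroring Java's String.charAt) could be sketched in a few lines; this is an illustrative sketch of the behavior, not the attached CharAt.java.patch:

```python
def char_at(s, index):
    """Return the character of s at the given index.

    Mirrors Java's String.charAt: valid indexes run from 0 to
    len(s) - 1; anything outside that range is an error. Null (None)
    inputs are passed through, a common convention for Pig UDFs.
    """
    if s is None or index is None:
        return None
    if not 0 <= index < len(s):
        raise IndexError("index %d out of range for %r" % (index, s))
    return s[index]
```

A piggybank implementation would wrap the same check in an EvalFunc that reads its arguments from the input tuple.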
[jira] [Updated] (PIG-2687) Add relation/operator scoping to Pig
[ https://issues.apache.org/jira/browse/PIG-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2687: Fix Version/s: (was: 0.12.0) 0.13.0 Add relation/operator scoping to Pig Key: PIG-2687 URL: https://issues.apache.org/jira/browse/PIG-2687 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Priority: Minor Fix For: 0.13.0 The idea is to add a real notion of scope that can be used to manage the namespace. This would mean the addition of blocks to Pig, probably with some sort of syntax like this... {code} a = load 'thing' as (x:int, y:int); b = foreach a generate x, y, x*y as z; { a = group b by z; b = foreach a generate COUNT(b); global b; } {code} which would replace the alias b with the nested b value in the scope. This could also be used in nested foreach blocks, and macros could just become blocks as well. I am 95% sure about how to implement this... I have a failed patch attempt, and need to study a bit more about how Pig uses its logical operators. Any thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778037#comment-13778037 ] Daniel Dai commented on PIG-2641: - [~russell.jurney], are you still working on it? Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Environment: Foggy. Damn foggy. Reporter: Russell Jurney Assignee: Russell Jurney Labels: chararray, fun, happy, input, json, output, pants, pig, piggybank, string, wonderdog Fix For: 0.12.0 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch Original Estimate: 96h Remaining Estimate: 96h It is a travesty that there are no UDFs in Piggybanks that, given an arbitrary Pig datatype, return a JSON string of same. I intend to fix this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
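The shape of a toJSON function over Pig's complex types is straightforward to sketch. The mapping chosen here is an assumption for illustration: tuples become JSON arrays, bags (collections of tuples) become arrays of arrays, and maps become JSON objects; the attached PIG-2641 patches may use a different representation.

```python
import json

def to_json(value):
    """Serialize a Pig-style value (tuple, bag, or map) to a JSON string.

    Pig values are modeled as Python types for illustration:
    tuple -> JSON array, list-of-tuples (a bag) -> array of arrays,
    dict (a map) -> object; scalars pass through unchanged.
    """
    def convert(v):
        if isinstance(v, tuple):
            return [convert(x) for x in v]
        if isinstance(v, list):          # a bag: a collection of tuples
            return [convert(x) for x in v]
        if isinstance(v, dict):
            return {str(k): convert(x) for k, x in v.items()}
        return v                         # int, float, chararray, None
    return json.dumps(convert(value))
```

The recursion matters because Pig's complex types nest arbitrarily: a bag of tuples of maps serializes correctly without any special cases.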
[jira] [Updated] (PIG-2681) TestDriverPig.countStores() does not correctly count the number of stores for pig scripts using variables for the alias
[ https://issues.apache.org/jira/browse/PIG-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2681: Fix Version/s: (was: 0.12.0) 0.13.0 TestDriverPig.countStores() does not correctly count the number of stores for pig scripts using variables for the alias --- Key: PIG-2681 URL: https://issues.apache.org/jira/browse/PIG-2681 Project: Pig Issue Type: Test Components: e2e harness Affects Versions: 0.9.0, 0.9.1, 0.9.2, 0.10.0 Reporter: Araceli Henley Fix For: 0.13.0 Attachments: PIG-2681.patch For pig macros where the out parameter is referenced in a store statement, TestDriverPig.countStores() does not correctly count the number of stores. For example, the store will not be counted in: define myMacro(in1,in2) returns A { A = load '$in1' using PigStorage('$delimeter') as (intnum1000: int,id: int,intnum5: int,intnum100: int,intnum: int,longnum: long,floatnum: float,doublenum: double); store $A into '$out'; } countStores() matches with: $count += $q[$i] =~ /store\s+[a-zA-Z][a-zA-Z0-9_]*\s+into/i; Since the alias has the special character $, the pattern doesn't count it and the test fails. We need to change this to: $count += $q[$i] =~ /store\s+(\$)?[a-zA-Z][a-zA-Z0-9_]*\s+into/i; I'll submit a patch shortly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
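The effect of the proposed pattern change can be checked directly. The harness uses Perl; the same two patterns are rendered here in Python's re syntax for a self-contained demonstration:

```python
import re

# The original and the proposed countStores() patterns from the issue,
# translated from Perl's /.../i form to Python's re syntax.
OLD = re.compile(r"store\s+[a-zA-Z][a-zA-Z0-9_]*\s+into", re.IGNORECASE)
NEW = re.compile(r"store\s+(\$)?[a-zA-Z][a-zA-Z0-9_]*\s+into", re.IGNORECASE)

macro_store = "store $A into '$out';"   # store inside a macro body: alias starts with $
plain_store = "store B into 'out';"     # ordinary store statement
```

The old pattern fails on the macro's store because `$` is not in `[a-zA-Z]`, while the optional `(\$)?` group in the new pattern accepts both forms, so both stores are counted.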
[jira] [Updated] (PIG-2927) SHIP and use JRuby gems in JRuby UDFs
[ https://issues.apache.org/jira/browse/PIG-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2927: Fix Version/s: (was: 0.12.0) 0.13.0 SHIP and use JRuby gems in JRuby UDFs - Key: PIG-2927 URL: https://issues.apache.org/jira/browse/PIG-2927 Project: Pig Issue Type: New Feature Components: parser Affects Versions: 0.11 Environment: JRuby UDFs Reporter: Russell Jurney Assignee: Jonathan Coveney Priority: Minor Fix For: 0.13.0 Attachments: PIG-2927-0.patch, PIG-2927-1.patch, PIG-2927-2.patch, PIG-2927-3.patch, PIG-2927-4.patch It would be great to use JRuby gems in JRuby UDFs without installing them on all machines on the cluster. Some way to SHIP them automatically with the job would be great. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3008) Fix whitespace in Pig code
[ https://issues.apache.org/jira/browse/PIG-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3008: Fix Version/s: (was: 0.12.0) 0.13.0 Fix whitespace in Pig code -- Key: PIG-3008 URL: https://issues.apache.org/jira/browse/PIG-3008 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 Attachments: checkstyle.xml This JIRA exists mainly to get a conversation started. We've talked about it before, and it's a tricky issue. That said, some of the Pig code is super, super gnarly. We need some sort of path that will let it eventually be fixable. I posit: any file that hasn't been touched for over 6 months is eligible for a whitespace patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3055) Make it possible to register new script engines
[ https://issues.apache.org/jira/browse/PIG-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3055: Fix Version/s: (was: 0.12.0) 0.13.0 Make it possible to register new script engines --- Key: PIG-3055 URL: https://issues.apache.org/jira/browse/PIG-3055 Project: Pig Issue Type: Improvement Reporter: Greg Bowyer Assignee: Greg Bowyer Fix For: 0.13.0 Attachments: PIG-3055-Make-it-possible-to-register-a-script-engine.patch Hi shiny Pig people, I have recently been playing with getting Renjin to work as a script engine in Pig in the same manner as Jython/Ruby etc. Renjin is a re-implementation of R in Java. For now the Renjin project is in its infancy and is probably not best suited to being bundled with Pig; as such, I need to be able to extend the ScriptEngine interface and register Renjin as a suitable engine for Pig to use. At present the parts of Pig that know about script engines are not easily changed; attached is a patch that should make this possible. Thoughts? Ideas? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
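The extension point described here — letting users register an engine class for a language keyword instead of hard-coding the supported set — follows a simple registry pattern. The names below are hypothetical illustrations, not Pig's actual ScriptEngine API:

```python
class ScriptEngine:
    """Minimal base class an external engine (e.g. a Renjin bridge)
    would extend. Name and shape are illustrative assumptions."""
    def run(self, script):
        raise NotImplementedError

_ENGINES = {}

def register_engine(language, engine_cls):
    """Associate a language keyword with an engine class, so the
    framework can resolve engines it was not compiled against."""
    if not issubclass(engine_cls, ScriptEngine):
        raise TypeError("engine must extend ScriptEngine")
    _ENGINES[language] = engine_cls

def get_engine(language):
    """Instantiate the engine registered for a language keyword."""
    try:
        return _ENGINES[language]()
    except KeyError:
        raise ValueError("no script engine registered for %r" % language)
```

The benefit is exactly what the patch aims for: an out-of-tree project in its infancy can plug in without being bundled, because the core only depends on the base interface.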
[jira] [Commented] (PIG-3010) Allow UDF's to flatten themselves
[ https://issues.apache.org/jira/browse/PIG-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778045#comment-13778045 ] Daniel Dai commented on PIG-3010: - [~jcoveney], are you still working on it? Allow UDF's to flatten themselves - Key: PIG-3010 URL: https://issues.apache.org/jira/browse/PIG-3010 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Assignee: Jonathan Coveney Fix For: 0.12.0 Attachments: PIG-3010-0.patch, PIG-3010-1.patch, PIG-3010-2_nowhitespace.patch, PIG-3010-2.patch, PIG-3010-3_nows.patch, PIG-3010-3.patch, PIG-3010-4_nows.patch, PIG-3010-4.patch, PIG-3010-5_nows.patch, PIG-3010-5.patch This is something I thought would be cool for a while, so I sat down and did it because I think there are some useful debugging tools it'd help with. The idea is that if you attach an annotation to a UDF, the Tuple or DataBag you output will be flattened. This is quite powerful. A very common pattern is: a = foreach data generate Flatten(MyUdf(thing)) as (a,b,c); This would let you just do: a = foreach data generate MyUdf(thing); With the exact same result! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
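The annotation-driven behavior proposed here is Java-side, but the core idea — mark a UDF so its output tuple is spliced into the enclosing record rather than nested, making `MyUdf(thing)` behave like `Flatten(MyUdf(thing))` — can be illustrated with a decorator. This is purely a conceptual sketch, not the PIG-3010 implementation:

```python
def flattens(func):
    """Mark a UDF whose tuple result should be flattened by the caller,
    playing the role of the proposed Java annotation."""
    func.flatten = True
    return func

@flattens
def my_udf(thing):
    # Hypothetical UDF producing an (a, b, c)-style tuple.
    return (thing, thing * 2, thing * 3)

def apply_udf(record, func):
    """Apply a UDF to a record, splicing the result in when the UDF
    is marked as self-flattening."""
    result = func(record)
    if getattr(func, "flatten", False) and isinstance(result, tuple):
        return list(result)   # flattened: the tuple's fields become columns
    return [result]           # default: one nested tuple-valued field
```

The payoff matches the issue's example: the caller no longer has to wrap the call in an explicit Flatten to get (a, b, c) as separate fields.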
[jira] [Updated] (PIG-3038) Support for Credentials for UDF, Loader and Storer
[ https://issues.apache.org/jira/browse/PIG-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3038: Fix Version/s: (was: 0.12.0) 0.13.0 Support for Credentials for UDF, Loader and Storer - Key: PIG-3038 URL: https://issues.apache.org/jira/browse/PIG-3038 Project: Pig Issue Type: New Feature Affects Versions: 0.10.0 Reporter: Rohini Palaniswamy Fix For: 0.13.0 Pig does not have a clean way (APIs) to support adding Credentials (HBase token, HCat/Hive metastore token) to a Job and retrieving them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778061#comment-13778061 ] Russell Jurney commented on PIG-2641: - How long do I have to get this into 0.12? Is that still possible? Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.12.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778069#comment-13778069 ] Daniel Dai commented on PIG-2641: - Since 0.12 is already branched, we're not supposed to commit new features. Can we defer this to 0.13.0? Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.12.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3404: -- Description: There are use cases in Pig:
* A directory is used as the input of a load operation. It is possible that one or more files in that directory are bad files (for example, corrupted, or bad data caused by compression).
* A directory is used as the input of a load operation. The current user may not have permission to access some subdirectories or files of that directory.
The current Pig implementation aborts the whole Pig job in such cases. It would be useful to have an option that allows the job to continue and ignore the bad or inaccessible files/folders without aborting, ideally logging or printing a warning for each such error or violation. This requirement is not trivial: for the big data sets of large analytics applications, it is not always possible to sort out the good data for processing, and ignoring a few bad files may be the better choice in such situations. We propose an "ignore bad files" flag to address this problem. AvroStorage and related file formats in Pig already have this flag, but it does not cover all the cases mentioned above. We would improve PigStorage and the related text formats to support this new flag, and improve AvroStorage and related facilities to completely support the concept. The flag is "Storage"-based (for example, PigStorage or AvroStorage) and can be set for each load operation separately. The value of this flag will be false if it is not explicitly set. Ideally, we can provide a global Pig parameter which forces the default value to true for all load functions even when it is not explicitly set in the LOAD statement.
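A sketch of how the proposed per-load flag might look; the '-ignoreBadFiles' option string and the global property name below are illustrative assumptions, not a shipped API:
{code}
-- hypothetical per-load flag on PigStorage
A = LOAD '/data/events' USING PigStorage(',', '-ignoreBadFiles');

-- hypothetical global default, forcing the flag on for every LOAD:
-- pig -Dpig.load.ignore.bad.files=true script.pig
{code}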
[jira] [Commented] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778119#comment-13778119 ] Koji Noguchi commented on PIG-3404: --- (I accidentally updated the description, sorry for the spam.) For this issue, can we use mapred.max.map.failures.percent (or mapreduce.map.failures.maxpercent in 2.*)? Improve Pig to ignore bad files or inaccessible files or folders Key: PIG-3404 URL: https://issues.apache.org/jira/browse/PIG-3404 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.11.2 Reporter: Jerry Chen Labels: Rhino Attachments: PIG-3404.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
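The property Koji mentions can be tried from a Pig script with the `set` command, which passes Hadoop job properties through to the launched jobs; the 10% threshold below is an illustrative value:
{code}
-- allow up to 10% of map tasks (e.g. those hitting a bad file)
-- to fail without failing the whole job
set mapred.max.map.failures.percent 10;
-- Hadoop 2.x property name: mapreduce.map.failures.maxpercent
{code}
Note this tolerates failed map tasks of any cause, not just bad input files, so it is coarser than the proposed per-load flag.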
[jira] [Created] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.
Raviteja Chirala created PIG-3482: - Summary: Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory. Key: PIG-3482 URL: https://issues.apache.org/jira/browse/PIG-3482 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11.1 Environment: RHEL 6.0 Reporter: Raviteja Chirala Priority: Minor Fix For: 0.12.1
[jira] [Updated] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.
[ https://issues.apache.org/jira/browse/PIG-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raviteja Chirala updated PIG-3482: -- Description: When we run mapper-only jobs, all of the intermediate (compressed) outputs go to the user directory instead of /tmp. On small datasets this shouldn't be a problem, but when I run on large datasets (say, more than 100 TB), it takes up so much disk space that it even exceeds the 100 GB disk space quota (setSpaceQuota). The problem happens before cleanup. Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory. --- Key: PIG-3482 URL: https://issues.apache.org/jira/browse/PIG-3482 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11.1 Environment: RHEL 6.0 Reporter: Raviteja Chirala Priority: Minor Fix For: 0.12.1
[jira] [Updated] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead of creating in user directory.
[ https://issues.apache.org/jira/browse/PIG-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raviteja Chirala updated PIG-3482: -- Summary: Mapper only Jobs are not creating intermediate files in /tmp/, instead of creating in user directory. (was: Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.) Mapper only Jobs are not creating intermediate files in /tmp/, instead of creating in user directory. -- Key: PIG-3482 URL: https://issues.apache.org/jira/browse/PIG-3482 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11.1 Environment: RHEL 6.0 Reporter: Raviteja Chirala Priority: Minor Fix For: 0.12.1
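As a possible workaround for the quota issue, the location of Pig's intermediate data is controlled by the `pig.temp.dir` property (its documented default is /tmp on the configured file system); a sketch, assuming the property is honored at script level, with an illustrative path:
{code}
-- redirect intermediate job output away from the user directory
set pig.temp.dir '/tmp/pig_intermediate';
{code}
The property can also be set in pig.properties or on the command line with -Dpig.temp.dir=...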
[jira] [Updated] (PIG-3083) Introduce new syntax that let's you project just the columns that come from a given :: prefix
[ https://issues.apache.org/jira/browse/PIG-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3083: Fix Version/s: (was: 0.12.0) 0.13.0 Introduce new syntax that let's you project just the columns that come from a given :: prefix - Key: PIG-3083 URL: https://issues.apache.org/jira/browse/PIG-3083 Project: Pig Issue Type: Bug Affects Versions: 0.12.0 Reporter: Jonathan Coveney Labels: PIG-3078 Fix For: 0.13.0 Attachments: pig_jira_aguin_3083.patch This is basically a more refined approach than PIG-3078, but it is also more work. That JIRA is more of a stopgap until we do something like this. The idea would be to support something like the following:
{code}
a = load 'a' as (x,y,z);
b = load 'b' as (x,y,z);
c = join a by x, b by x;
d = foreach c generate a::*;
{code}
Obviously this is useful for any case where you have relations with columns with various prefixes.
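For contrast, a sketch of the existing explicit form that the proposed `a::*` syntax would shorten, using the same relations as the example above (every disambiguated column must be enumerated by hand today):
{code}
-- current workaround: project each prefixed column individually
d = foreach c generate a::x, a::y, a::z;
{code}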
[jira] [Updated] (PIG-3087) Refactor TestLogicalPlanBuilder to be meaningful
[ https://issues.apache.org/jira/browse/PIG-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3087: Fix Version/s: (was: 0.12.0) 0.13.0 Refactor TestLogicalPlanBuilder to be meaningful Key: PIG-3087 URL: https://issues.apache.org/jira/browse/PIG-3087 Project: Pig Issue Type: Bug Reporter: Jonathan Coveney Labels: newbie Fix For: 0.13.0 Attachments: PIG-3087-0.patch I started doing this as part of another patch, but there are some bigger issues, and I don't have the time to dig in atm. That said, a lot of the tests as written don't test anything. I used more modern JUnit patterns and discovered we had a lot of tests that weren't functioning properly. Making them function properly revealed that the general buildLp pattern no longer works for many cases where grunt would throw an error, but for whatever reason no error is thrown in the tests. Any test ending in _1 is a test that previously failed and now doesn't. Some, however, don't make sense, so what really needs to be done is to figure out which should be failing and which shouldn't, and then fix buildLp accordingly. I will attach my pass at it, but it is incomplete and needs work.
[jira] [Updated] (PIG-3104) XMLLoader return Pig tuple/map/bag representation of the DOM of XML documents
[ https://issues.apache.org/jira/browse/PIG-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3104: Fix Version/s: (was: 0.12.0) 0.13.0 XMLLoader return Pig tuple/map/bag representation of the DOM of XML documents - Key: PIG-3104 URL: https://issues.apache.org/jira/browse/PIG-3104 Project: Pig Issue Type: Improvement Components: internal-udfs, piggybank Affects Versions: 0.10.0, 0.11 Reporter: Russell Jurney Assignee: Daniel Dai Fix For: 0.13.0 I want to extend Pig's existing XMLLoader to go beyond capturing the text inside a tag and to actually create a Pig mapping of the Document Object Model the XML represents. This would be similar to elephant-bird's JsonLoader. Semi-structured data can vary, so this behavior can be risky, but I want people to be able to load JSON and XML data easily in their first session with Pig.
{code}
characters = load 'example.xml' using XMLLoader('character');
describe characters
{properties:map[], name:chararray, born:datetime, qualification:chararray}
{code}
{code}
<library>
  <book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Being a Dog Is a Full-Time Job</title>
    <author id="CMS">
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character id="PP">
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
      <qualification>bold, brash and tomboyish</qualification>
    </character>
    <character id="Snoopy">
      <name>Snoopy</name>
      <born>1950-10-04</born>
      <qualification>extroverted beagle</qualification>
    </character>
    <character id="Schroeder">
      <name>Schroeder</name>
      <born>1951-05-30</born>
      <qualification>brought classical music to the Peanuts strip</qualification>
    </character>
    <character id="Lucy">
      <name>Lucy</name>
      <born>1952-03-03</born>
      <qualification>bossy, crabby and selfish</qualification>
    </character>
  </book>
</library>
{code}
[jira] [Commented] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag
[ https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778177#comment-13778177 ] Daniel Dai commented on PIG-3377: - [~jadler], are you still working on it? New AvroStorage throws NPE when storing untyped map/array/bag - Key: PIG-3377 URL: https://issues.apache.org/jira/browse/PIG-3377 Project: Pig Issue Type: Bug Components: internal-udfs Reporter: Cheolsoo Park Assignee: Joseph Adler Fix For: 0.12.0 The following example demonstrates the issue:
{code}
a = LOAD 'foo' AS (m:map[]);
STORE a INTO 'bar' USING AvroStorage();
{code}
This fails with the following error:
{code}
java.lang.NullPointerException
  at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462)
  at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335)
  at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472)
{code}
Similarly, an untyped bag causes the following error:
{code}
Caused by: java.lang.NullPointerException
  at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722)
  ...
  at org.apache.avro.Schema.getElementType(Schema.java:256)
  at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491)
{code}
The problem is that AvroStorage cannot derive the output schema from an untyped map/bag/tuple. When the type is not defined, it should be assumed to be bytearray.
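Since the description attributes the NPE to the missing type, one possible workaround until the fix lands is to declare an explicit type so AvroStorage can derive an output schema; the chararray value type below is an illustrative assumption, not a confirmed fix:
{code}
-- typed map instead of untyped map[], sidestepping the untyped-schema path
a = LOAD 'foo' AS (m:map[chararray]);
STORE a INTO 'bar' USING AvroStorage();
{code}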
[jira] [Updated] (PIG-3326) Add PiggyBank to Maven Repository
[ https://issues.apache.org/jira/browse/PIG-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3326: Fix Version/s: (was: 0.12.0) 0.13.0 Add PiggyBank to Maven Repository - Key: PIG-3326 URL: https://issues.apache.org/jira/browse/PIG-3326 Project: Pig Issue Type: New Feature Components: piggybank Reporter: Aaron Mitchell Priority: Minor Fix For: 0.13.0 PiggyBank should be uploaded to the apache maven repository.
[jira] [Updated] (PIG-3254) Fail a failed Pig script quicker
[ https://issues.apache.org/jira/browse/PIG-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3254: Fix Version/s: (was: 0.12.0) 0.13.0 Fail a failed Pig script quicker Key: PIG-3254 URL: https://issues.apache.org/jira/browse/PIG-3254 Project: Pig Issue Type: Improvement Reporter: Daniel Dai Fix For: 0.13.0 Credit to [~asitecn]. Currently Pig can launch several mapreduce jobs simultaneously. When one mapreduce job fails, we need to wait for the simultaneous mapreduce jobs to finish. In addition, we could potentially launch additional jobs which are doomed to fail. However, this is unnecessary in some cases:
* If stop.on.failure==true, we can kill the parallel jobs and fail the whole script
* If stop.on.failure==false, and no store could succeed, we can also kill the parallel jobs and fail the whole script
Considering that simultaneous jobs may take a long time to finish, this could significantly improve the turnaround in some cases.
[jira] [Updated] (PIG-3232) Refactor Pig so that configurations use PigConfiguration wherever possible
[ https://issues.apache.org/jira/browse/PIG-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3232: Fix Version/s: (was: 0.12.0) 0.13.0 Refactor Pig so that configurations use PigConfiguration wherever possible -- Key: PIG-3232 URL: https://issues.apache.org/jira/browse/PIG-3232 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0
[jira] [Updated] (PIG-3143) Enable TOKENIZE to use any configurable Lucene Tokenizer, if a config parameter is set and the JARs included
[ https://issues.apache.org/jira/browse/PIG-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3143: Fix Version/s: (was: 0.12.0) 0.13.0 Enable TOKENIZE to use any configurable Lucene Tokenizer, if a config parameter is set and the JARs included Key: PIG-3143 URL: https://issues.apache.org/jira/browse/PIG-3143 Project: Pig Issue Type: Improvement Components: impl, internal-udfs Affects Versions: 0.11 Reporter: Russell Jurney Fix For: 0.13.0 I'll do this in time for 12. TOKENIZE is literally useless as is. See: http://thedatachef.blogspot.com/2011/04/lucene-text-tokenization-udf-for-apache.html https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
[jira] [Updated] (PIG-3157) Move LENGTH from Piggybank to builtin, make LENGTH work for multiple types similar to SIZE
[ https://issues.apache.org/jira/browse/PIG-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3157: Fix Version/s: (was: 0.12.0) 0.13.0 Move LENGTH from Piggybank to builtin, make LENGTH work for multiple types similar to SIZE -- Key: PIG-3157 URL: https://issues.apache.org/jira/browse/PIG-3157 Project: Pig Issue Type: Improvement Components: internal-udfs, piggybank Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 LENGTH needs to be a builtin.
[jira] [Updated] (PIG-3133) Revamp algebraic interface to actually return classes
[ https://issues.apache.org/jira/browse/PIG-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3133: Fix Version/s: (was: 0.12.0) 0.13.0 Revamp algebraic interface to actually return classes - Key: PIG-3133 URL: https://issues.apache.org/jira/browse/PIG-3133 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 The current algebraic interface is a bit weird to work with. It would make a lot more sense to let people return Class<? extends EvalFunc<Tuple>> or what have you, or even a FuncSpec, but the current string-based approach circumvents the whole point of using Java and is annoying. I think we should have abstract EFInitial, EFIntermediate, and EFFinal classes which implement the exec function for the user, but in terms of a simpler, clearer interface. This way, if people really want the old way they can have it, but we can present them something less ugly. This would also be a good time to clarify the contracts of Algebraics and simplify them (the initial function's "a tuple which contains a bag which contains 1 tuple" is super whack). If anyone wants to work on this, let me know, because this is the sort of thing I will probably bang out when procrastinating something else.
[jira] [Updated] (PIG-3146) Can't 'import re' in Pig 0.10/0.10.1: ImportError: No module named re
[ https://issues.apache.org/jira/browse/PIG-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3146: Fix Version/s: (was: 0.12.0) 0.13.0 Can't 'import re' in Pig 0.10/0.10.1: ImportError: No module named re - Key: PIG-3146 URL: https://issues.apache.org/jira/browse/PIG-3146 Project: Pig Issue Type: Bug Affects Versions: 0.10.0, 0.10.1 Reporter: Russell Jurney Fix For: 0.13.0 Caused by:
{code}
Traceback (most recent call last):
  File "udfs.py", line 20, in <module>
    import re
ImportError: No module named re
{code}
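The traceback comes from a Python UDF file evaluated under Jython at registration time; a minimal script that exercises that path (the file and function names are hypothetical, for illustration only):
{code}
-- udfs.py begins with 'import re', which triggers the ImportError on 0.10.x
register 'udfs.py' using jython as udfs;
a = load 'input' as (line:chararray);
b = foreach a generate udfs.my_func(line);
{code}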
[jira] [Updated] (PIG-3176) Pig can't use $HOME in Grunt
[ https://issues.apache.org/jira/browse/PIG-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3176: Fix Version/s: (was: 0.12.0) 0.13.0 Pig can't use $HOME in Grunt Key: PIG-3176 URL: https://issues.apache.org/jira/browse/PIG-3176 Project: Pig Issue Type: Bug Components: grunt, parser Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 Pig needs to know the user's home directory, to let this easily be set, etc.
[jira] [Updated] (PIG-3165) sh command cannot run mongo client
[ https://issues.apache.org/jira/browse/PIG-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3165: Fix Version/s: (was: 0.12.0) 0.13.0 sh command cannot run mongo client -- Key: PIG-3165 URL: https://issues.apache.org/jira/browse/PIG-3165 Project: Pig Issue Type: Bug Components: grunt, tools Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Priority: Critical Fix For: 0.13.0 One often needs to drop an old MongoDB store when replacing it. Ex:
{code}
store answer into 'mongodb://localhost/agile_data.hourly_from_reply_probs' using MongoStorage();
{code}
Before doing that, you would likely want to run a mongo command from bash via grunt to drop it:
{code}
sh mongo --eval 'db.hourly_from_reply_probs.drop();'
{code}
However, in this case grunt acts as though the command never returns. Crap!
[jira] [Updated] (PIG-3177) Fix Pig project SEO so latest, 0.11 docs show when you google things
[ https://issues.apache.org/jira/browse/PIG-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3177: Fix Version/s: (was: 0.12.0) 0.13.0 Fix Pig project SEO so latest, 0.11 docs show when you google things Key: PIG-3177 URL: https://issues.apache.org/jira/browse/PIG-3177 Project: Pig Issue Type: Bug Components: site Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Priority: Critical Fix For: 0.13.0 http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/piggybank/storage/SequenceFileLoader.html The 0.7.0 docs are what everyone references. FOR POOPS SAKES.
[jira] [Updated] (PIG-3111) ToAvro to convert any Pig record to an Avro bytearray
[ https://issues.apache.org/jira/browse/PIG-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3111: Fix Version/s: (was: 0.12.0) 0.13.0 ToAvro to convert any Pig record to an Avro bytearray - Key: PIG-3111 URL: https://issues.apache.org/jira/browse/PIG-3111 Project: Pig Issue Type: New Feature Components: data, internal-udfs Affects Versions: 0.12.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 I want to create a ToAvro() builtin that converts arbitrary Pig fields, including complex types (bags, tuples, maps), to Avro format as bytearrays. This would enable storing Avro records in arbitrary data stores, for example HBaseAvroStorage in PIG-2889. See PIG-2641 for ToJson. This points to a greater need for customizable/pluggable serialization that plugs into storefuncs and does serialization independently. For example, we might do these operations:
{code}
a = load 'my_data' as (some_schema);
b = foreach a generate ToJson(*);
c = foreach a generate ToAvro(*);
store b into 'hbase://JsonValueTable' using HBaseStorage(...);
store c into 'hbase://AvroValueTable' using HBaseStorage(...);
{code}
I'll make a ticket for pluggable serialization separately.
[jira] [Updated] (PIG-3214) New/improved mascot
[ https://issues.apache.org/jira/browse/PIG-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3214: Fix Version/s: (was: 0.12.0) 0.13.0 New/improved mascot --- Key: PIG-3214 URL: https://issues.apache.org/jira/browse/PIG-3214 Project: Pig Issue Type: Wish Components: site Affects Versions: 0.11 Reporter: Andrew Musselman Priority: Minor Fix For: 0.13.0 Attachments: apache-pig-14.png, apache-pig-yellow-logo.png, newlogo1.png, newlogo2.png, newlogo3.png, newlogo4.png, newlogo5.png, new_logo_7.png, pig_6.JPG, pig_6_lc_g.JPG, pig-logo-10.png, pig-logo-11.png, pig-logo-12.png, pig-logo-13.png, pig-logo-8a.png, pig-logo-8b.png, pig-logo-9a.png, pig-logo-9b.png, pig_logo_new.png Request to change pig mascot to something more graphically appealing.
[jira] [Updated] (PIG-3227) SearchEngineExtractor does not work for bing
[ https://issues.apache.org/jira/browse/PIG-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3227: Fix Version/s: (was: 0.12.0) 0.13.0 SearchEngineExtractor does not work for bing Key: PIG-3227 URL: https://issues.apache.org/jira/browse/PIG-3227 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.13.0 Attachments: SearchEngineExtractor_Bing.patch org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor Extracts a search engine from a URL, but it does not work for Bing -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3190) Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
[ https://issues.apache.org/jira/browse/PIG-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3190: Fix Version/s: (was: 0.12.0) 0.13.0 Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization --- Key: PIG-3190 URL: https://issues.apache.org/jira/browse/PIG-3190 Project: Pig Issue Type: Bug Components: internal-udfs Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 Attachments: PIG-3190-2.patch, PIG-3190-3.patch, PIG-3190.patch TOKENIZE is literally useless. The Lucene Standard/Snowball tokenizers, as used by varaha, are much more useful for actual tasks: https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3188) pig.script.submitted.timestamp not always consistent for jobs launched in a given script
[ https://issues.apache.org/jira/browse/PIG-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3188: Fix Version/s: (was: 0.12.0) 0.13.0 pig.script.submitted.timestamp not always consistent for jobs launched in a given script Key: PIG-3188 URL: https://issues.apache.org/jira/browse/PIG-3188 Project: Pig Issue Type: Bug Reporter: Bill Graham Assignee: Bill Graham Fix For: 0.13.0 {{pig.script.submitted.timestamp}} is set in {{MapReduceLauncher.launchPig()}} when an MR plan is launched. Some scripts (e.g. those with an exec in the middle) will cause multiple plans to be launched. In these cases, jobs launched from the same script can have different {{pig.script.submitted.timestamp}} values, which is a bug. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL
[ https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3228: Fix Version/s: (was: 0.12.0) 0.13.0 SearchEngineExtractor throws an exception on a malformed URL Key: PIG-3228 URL: https://issues.apache.org/jira/browse/PIG-3228 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.13.0 Attachments: SearchEngineExtractor_Malformed.patch This UDF throws an exception on any MalformedURLException This change is consistent with SearchTermExtractor's handling of MalformedURLException, which also catches the exception and returns null -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions
[ https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3229: Fix Version/s: (was: 0.12.0) 0.13.0 SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions --- Key: PIG-3229 URL: https://issues.apache.org/jira/browse/PIG-3229 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.13.0 Attachments: SearchEngineExtractor_Counter.patch, SearchTermExtractor_Counter.patch SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors The patch for SearchEngineExtractor is really only relevant if the following bug is accepted https://issues.apache.org/jira/browse/PIG-3228 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (PIG-3469) Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing.
[ https://issues.apache.org/jira/browse/PIG-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarek Jarcec Cecho reassigned PIG-3469: --- Assignee: Jarek Jarcec Cecho Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing. --- Key: PIG-3469 URL: https://issues.apache.org/jira/browse/PIG-3469 Project: Pig Issue Type: Bug Affects Versions: 0.11 Environment: Apache Pig version 0.11.0-cdh4.4.0 Happens in both local execution environment (os x) and cluster environment (linux) Reporter: Christon DeWan Assignee: Jarek Jarcec Cecho Run this script in the local execution environment (affects cluster mode too):
{noformat}
%declare DATA_EXISTS /tmp/test_data_exists.tsv
%declare DATA_MISSING /tmp/test_data_missing.tsv
%declare DUMMY `bash -c '(for (( i=0; \$i < 10; i++ )); do echo \$i; done) > /tmp/test_data_exists.tsv; true'`
exists = LOAD '$DATA_EXISTS' AS (a:long);
missing = LOAD '$DATA_MISSING' AS (a:long);
missing = FOREACH ( GROUP missing BY a ) GENERATE $0 AS a, COUNT_STAR($1);
joined = JOIN exists BY a, missing BY a USING 'skewed';
STORE joined INTO '/tmp/test_out.tsv';
{noformat}
Results in NullPointerException which halts entire pig execution, including unrelated jobs. Expected: only dependencies of the error'd LOAD statement should fail. Error:
{noformat}
2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:848)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:294)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
at org.apache.pig.PigServer.execute(PigServer.java:1241)
at org.apache.pig.PigServer.executeBatch(PigServer.java:335)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:137)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.adjustNumReducers(JobControlCompiler.java:868)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:480)
... 17 more
{noformat}
Script above is as small as I can make it while still reproducing the issue. Removing the group-foreach causes the join to fail harmlessly (not stopping pig execution), as does using the default join. Did not occur on 0.10.1. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3469) Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing.
[ https://issues.apache.org/jira/browse/PIG-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778225#comment-13778225 ] Jarek Jarcec Cecho commented on PIG-3469: - I believe that I do have understanding of this issue, will upload patch after running all tests. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3452) Framework to fail-fast jobs based on exceptions in UDFs (useful for assert)
[ https://issues.apache.org/jira/browse/PIG-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-3452: Fix Version/s: (was: 0.12.0) 0.13.0 Framework to fail-fast jobs based on exceptions in UDFs (useful for assert) --- Key: PIG-3452 URL: https://issues.apache.org/jira/browse/PIG-3452 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.11.1 Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.13.0 The idea is to add an exception type to Pig that UDFs can throw to indicate an unexpected, unrecoverable problem. If we see n of these exceptions, the Pig client can kill the job and abort itself. n can be configured via configuration properties at runtime. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
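The mechanism described above can be sketched roughly as follows; the names UnrecoverableUdfException and FailFastMonitor are hypothetical illustrations of the idea (a dedicated unrecoverable-exception type plus a configurable threshold n), not Pig's actual API:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: UDFs throw this for unexpected, unrecoverable problems.
class UnrecoverableUdfException extends RuntimeException {
    UnrecoverableUdfException(String msg) { super(msg); }
}

// Counts unrecoverable failures and signals abort once n of them are seen.
class FailFastMonitor {
    private final int threshold;                       // n, read from configuration at runtime
    private final AtomicInteger seen = new AtomicInteger();

    FailFastMonitor(int threshold) { this.threshold = threshold; }

    // Record one failure; returns true once the job should be killed.
    boolean record(UnrecoverableUdfException e) {
        return seen.incrementAndGet() >= threshold;
    }
}
```

In practice the count would travel through a Hadoop counter so the client can observe failures across tasks and abort the whole job.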
[jira] [Updated] (PIG-3421) Script jars should be added to extra jars instead of pig's job.jar
[ https://issues.apache.org/jira/browse/PIG-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-3421: Fix Version/s: (was: 0.12.0) 0.13.0 Script jars should be added to extra jars instead of pig's job.jar -- Key: PIG-3421 URL: https://issues.apache.org/jira/browse/PIG-3421 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.13.0 Currently, for all the script engines, pig adds script jars to pig's job jar even without consulting the skipJars list. Ideally, we should add these to extraJars so that they can benefit from distributed cache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3427) Columns pruning does not work with DereferenceExpression
[ https://issues.apache.org/jira/browse/PIG-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-3427: Fix Version/s: (was: 0.12.0) 0.13.0 Columns pruning does not work with DereferenceExpression Key: PIG-3427 URL: https://issues.apache.org/jira/browse/PIG-3427 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.13.0 The following script does not push the projection:
{code}
a = load 'something' as (a0, a1);
b = group a all;
c = foreach b generate COUNT(a.a0);
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-3300) Optimize partition filter pushdown
[ https://issues.apache.org/jira/browse/PIG-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi resolved PIG-3300. - Resolution: Duplicate Optimize partition filter pushdown -- Key: PIG-3300 URL: https://issues.apache.org/jira/browse/PIG-3300 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Rohini Palaniswamy Assignee: Aniket Mokashi When an AND/OR condition involves a combination of partition and non-partition columns, like (pcond1 and npcond1) or (pcond2 and npcond2), push the partition filter (pcond1 or pcond2) to the LoadFunc. We will still apply the whole filter condition to the loaded data. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
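The rewrite described here can be illustrated with a small sketch (this is not Pig's planner code): represent the filter as an OR of AND-branches, keep only the partition-column conjuncts in each branch, and give up if any branch has none, since such a branch would match every partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class PartitionPushdownSketch {
    // orOfAnds: each inner list is one AND-branch of the OR.
    // isPartitionCond: true if a conjunct touches only partition columns.
    // Returns the pushable filter in the same OR-of-ANDs shape.
    static List<List<String>> pushable(List<List<String>> orOfAnds,
                                       Predicate<String> isPartitionCond) {
        List<List<String>> pushed = new ArrayList<>();
        for (List<String> branch : orOfAnds) {
            List<String> kept = new ArrayList<>();
            for (String conjunct : branch) {
                if (isPartitionCond.test(conjunct)) {
                    kept.add(conjunct);
                }
            }
            // A branch with no partition conjunct matches all partitions,
            // so no filter can safely be pushed to the LoadFunc at all.
            if (kept.isEmpty()) {
                return Collections.emptyList();
            }
            pushed.add(kept);
        }
        return pushed;
    }
}
```

For (pcond1 and npcond1) or (pcond2 and npcond2) this yields (pcond1) or (pcond2), and the full original filter is still applied to the loaded rows afterwards, so the pushdown only over-approximates.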
[jira] [Updated] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2641: Fix Version/s: (was: 0.12.0) 0.13.0 Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Environment: Foggy. Damn foggy. Reporter: Russell Jurney Assignee: Russell Jurney Labels: chararray, fun, happy, input, json, output, pants, pig, piggybank, string, wonderdog Fix For: 0.13.0 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch Original Estimate: 96h Remaining Estimate: 96h It is a travesty that there are no UDFs in Piggybank that, given an arbitrary Pig datatype, return a JSON string of same. I intend to fix this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Subscription: PIG patch available
Issue Subscription Filter: PIG patch available (13 issues) Subscriber: pigdaily
Key Summary
PIG-3470 Print configuration variables in grunt https://issues.apache.org/jira/browse/PIG-3470
PIG-3451 EvalFunc<T> ctor reflection to determine value of type param T is brittle https://issues.apache.org/jira/browse/PIG-3451
PIG-3449 Move JobCreationException to org.apache.pig.backend.hadoop.executionengine https://issues.apache.org/jira/browse/PIG-3449
PIG-3441 Allow Pig to use default resources from Configuration objects https://issues.apache.org/jira/browse/PIG-3441
PIG-3434 Null subexpression in bincond nullifies outer tuple (or bag) https://issues.apache.org/jira/browse/PIG-3434
PIG-3388 No support for Regex for row filter in org.apache.pig.backend.hadoop.hbase.HBaseStorage https://issues.apache.org/jira/browse/PIG-3388
PIG-3325 Adding a tuple to a bag is slow https://issues.apache.org/jira/browse/PIG-3325
PIG-3292 Logical plan invalid state: duplicate uid in schema during self-join to get cross product https://issues.apache.org/jira/browse/PIG-3292
PIG-3257 Add unique identifier UDF https://issues.apache.org/jira/browse/PIG-3257
PIG-3117 A debug mode in which pig does not delete temporary files https://issues.apache.org/jira/browse/PIG-3117
PIG-3088 Add a builtin udf which removes prefixes https://issues.apache.org/jira/browse/PIG-3088
PIG-3021 Split results missing records when there is null values in the column comparison https://issues.apache.org/jira/browse/PIG-3021
PIG-2672 Optimize the use of DistributedCache https://issues.apache.org/jira/browse/PIG-2672
You may edit this subscription at: https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Fix Version/s: (was: 0.13.0) 0.12.0 Add New Reserved Keywords To The Pig Docs - Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.12.0 Attachments: PIG-3370-1.patch The following are reserved keywords in Pig that are not included in the 11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords) cube, dense, rank, returns, rollup, void Please add to any that I may have overlooked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Status: Patch Available (was: Open) Add New Reserved Keywords To The Pig Docs - Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.12.0 Attachments: PIG-3370-1.patch The following are reserved keywords in Pig that are not included in the 11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords) cube, dense, rank, returns, rollup, void Please add to any that I may have overlooked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Attachment: PIG-3370-1.patch The patch adds dense, returns, rollup, and void to the reserved keywords. Add New Reserved Keywords To The Pig Docs - Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.13.0 Attachments: PIG-3370-1.patch The following are reserved keywords in Pig that are not included in the 11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords) cube, dense, rank, returns, rollup, void Please add to any that I may have overlooked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3483) Document ASSERT keyword
Cheolsoo Park created PIG-3483: -- Summary: Document ASSERT keyword Key: PIG-3483 URL: https://issues.apache.org/jira/browse/PIG-3483 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.12.0 Reporter: Cheolsoo Park Fix For: 0.12.0 PIG-3367 added a new keyword ASSERT, so we need to document it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3367) Add assert keyword (operator) in pig
[ https://issues.apache.org/jira/browse/PIG-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778454#comment-13778454 ] Cheolsoo Park commented on PIG-3367: [~aniket486], would you mind updating the documentation - PIG-3483? Add assert keyword (operator) in pig Key: PIG-3367 URL: https://issues.apache.org/jira/browse/PIG-3367 Project: Pig Issue Type: New Feature Components: parser Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.12.0 Attachments: PIG-3367-2.patch, PIG-3367.patch The assert operator can be used for data validation. With assert you can write a script as follows:
{code}
a = load 'something' as (a0:int, a1:int);
assert a by a0 > 0, 'a cant be negative for reasons';
{code}
This script will fail if the assert is violated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3484) Make the size of pig.script property configurable
Cheolsoo Park created PIG-3484: -- Summary: Make the size of pig.script property configurable Key: PIG-3484 URL: https://issues.apache.org/jira/browse/PIG-3484 Project: Pig Issue Type: Improvement Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 Some applications (e.g. Lipstick) use the pig.script property to display the script. But since its size is limited by a hard-coded max, it's not always possible to store an entire script. It would be nicer if the size of pig.script is configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
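A minimal sketch of the proposal, assuming a property named pig.script.max.size with default 10,240 (the name and default mentioned in the attachment comment for this issue); the helper class is hypothetical, not Pig's actual implementation:

```java
import java.util.Properties;

// Hypothetical helper: bound the script text stored in the pig.script
// job property by a configurable maximum instead of a hard-coded one.
class ScriptPropertySketch {
    static String forJobConf(String script, Properties conf) {
        // Read the configurable cap, falling back to the old hard-coded limit.
        int max = Integer.parseInt(conf.getProperty("pig.script.max.size", "10240"));
        return script.length() > max ? script.substring(0, max) : script;
    }
}
```

Applications like Lipstick that read pig.script back out would then simply raise the property in their site configuration to capture longer scripts.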
[jira] [Commented] (PIG-3390) Make pig working with HBase 0.95
[ https://issues.apache.org/jira/browse/PIG-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778475#comment-13778475 ] Ashish Singh commented on PIG-3390: --- It looks like this patch only works with HBase compiled against hadoop1, as there is no dependency defined for hbase-hadoop2-compat. Make pig working with HBase 0.95 Key: PIG-3390 URL: https://issues.apache.org/jira/browse/PIG-3390 Project: Pig Issue Type: New Feature Affects Versions: 0.11 Reporter: Jarek Jarcec Cecho Assignee: Jarek Jarcec Cecho Fix For: 0.12.0 Attachments: PIG-3390.patch, PIG-3390.patch, PIG-3390.patch HBase 0.95 changed its API in an incompatible way. The following APIs that {{HBaseStorage}} in Pig uses are no longer available:
* {{Mutation.setWriteToWAL(Boolean)}}
* {{Scan.write(DataOutput)}}
In addition, HBase is no longer available as one monolithic archive with the entire functionality, but was broken down into smaller pieces such as {{hbase-client}}, {{hbase-server}}, ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3484) Make the size of pig.script property configurable
[ https://issues.apache.org/jira/browse/PIG-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3484: --- Attachment: PIG-3484-1.patch The attached patch adds a new property pig.script.max.size. The default value is 10,240. Make the size of pig.script property configurable - Key: PIG-3484 URL: https://issues.apache.org/jira/browse/PIG-3484 Project: Pig Issue Type: Improvement Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 Attachments: PIG-3484-1.patch Some applications (e.g. Lipstick) use the pig.script property to display the script. But since its size is limited by a hard-coded max, it's not always possible to store an entire script. It would be nicer if the size of pig.script is configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3484) Make the size of pig.script property configurable
[ https://issues.apache.org/jira/browse/PIG-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3484: --- Status: Patch Available (was: Open) Make the size of pig.script property configurable - Key: PIG-3484 URL: https://issues.apache.org/jira/browse/PIG-3484 Project: Pig Issue Type: Improvement Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 Attachments: PIG-3484-1.patch Some applications (e.g. Lipstick) use the pig.script property to display the script. But since its size is limited by a hard-coded max, it's not always possible to store an entire script. It would be nicer if the size of pig.script is configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3485) Remove CastUtils.bytesToMap() method from LoadCaster interface
Cheolsoo Park created PIG-3485: -- Summary: Remove CastUtils.bytesToMap() method from LoadCaster interface Key: PIG-3485 URL: https://issues.apache.org/jira/browse/PIG-3485 Project: Pig Issue Type: Task Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 PIG-1876 added typed map and annotated the following method as {{deprecated}} in 0.9:
{code}
@Deprecated
public Map<String, Object> bytesToMap(byte[] b) throws IOException;
{code}
We should remove it and replace it with the new method that takes type information:
{code}
public Map<String, Object> bytesToMap(byte[] b, ResourceFieldSchema fieldSchema) throws IOException;
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
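For illustration only (this is not Pig's actual CastUtils code), a rough parser for PigStorage's textual map representation, [key#value,...], shows what bytesToMap does; the typed variant would additionally use the supplied field schema to cast each value from String to the declared value type:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BytesToMapSketch {
    // Parse "[k1#v1,k2#v2]" into a map. Values stay as Strings here; with a
    // field schema, each value would be cast to the schema's value type.
    // Nested maps and escaped separators are ignored in this rough sketch.
    static Map<String, Object> bytesToMap(byte[] b) {
        Map<String, Object> m = new LinkedHashMap<>();
        String s = new String(b);
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            return m;                         // not a textual map; nothing to parse
        }
        String body = s.substring(1, s.length() - 1);
        if (body.isEmpty()) {
            return m;                         // empty map "[]"
        }
        for (String entry : body.split(",")) {
            String[] kv = entry.split("#", 2);
            m.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return m;
    }
}
```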